jkfrench
asked on
Problem matching a list with multi-line items
I'm trying to break a chunk of text containing a numbered list into individual items, but I'm having trouble with items that continue onto a second line. I'm hoping more pairs of eyes will see what I'm doing wrong.
Data looks like this:
1. This is the first item
2. This is a longer second item that wraps down
to the next line, but it's still one item.
3. This is the last time which is also long and
wraps to the next line as well
Perl code looks like this:
$/ = undef;
$_ = <>;
while (/^ \d+ [ \.]* (.*?) $/gmsx) {
print "line[$1]\n";
}
Current output looks like this:
line[This is the first item]
line[This is a longer second item that wraps down]
line[This is the last time which is also long and ]
As you can see, I'm missing the second line of items two and three. I've tried various combinations of options and pattern greediness to get what I want, but so far no success.
ASKER
farzanj,
Thanks for the reply, but I don't see any difference between your example and the code I posted.
And I realize there is no number on the second line. I was trying to get the (.*?) pattern to match across lines to pick up the rest of the item. It looks like the $ is anchoring the pattern to the end of the line, although I was trying to use the 's' option to allow . to match newline.
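A minimal sketch of why the original pattern drops the continuation lines, assuming the list is held in a string (the inline $text is a shortened stand-in for the posted data). With /m, $ can match before every newline, so the lazy (.*?) is satisfied at the first line end even though /s lets . cross newlines. Making the capture greedy overshoots in the other direction.

```perl
use strict;
use warnings;

# Shortened stand-in for the posted data (an assumption for illustration).
my $text = "1. First item\n2. Second item that wraps\nonto a second line\n";

# Lazy capture: stops at the first position where $ can match, which is
# the end of the marker's own line.
my @lazy = $text =~ /^ \d+ [ \.]* (.*?) $/gmsx;

# Greedy capture: runs to the end of the input and swallows every item.
my @greedy = $text =~ /^ \d+ [ \.]* (.*) $/gmsx;

print "lazy:   line[$_]\n" for @lazy;
print "greedy: ", scalar(@greedy), " match\n";
```

Neither quantifier alone helps; the pattern needs an explicit description of where an item ends.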
Sorry, I replaced + with *
$/ = undef;
$_ = <>;
while (/^ \d+ [ \.]* (.*?) $/gmsx) {
print "line[$1]\n";
}
SORRY ONCE AGAIN. Problem with my clipboard.
$/ = undef;
$_ = <>;
while (/^\d* [ \.]* (.*?) $/gmsx) {
print "line[$1]\n";
}
ASKER
No problem. That gets the additional text, but as separate items. So I'm now getting:
line[This is the first item]
line[This is a longer second item that wraps down]
line[to the next line, but it's still one item.]
line[This is the last time which is also long and ]
line[wraps to the next line as well]
but am trying to get:
line[This is the first item]
line[This is a longer second item that wraps down to the next line, but it's still one item.]
line[This is the last time which is also long and wraps to the next line as well]
OK, here is the idea. Change the input record separator to
$/= '\d+';
This solves part of the problem.
For the second part, you have to delete the newline character from each line before printing.
And yes, you can keep the \d+ that I had changed to \d*.
$/ is a string, not a regex.
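A small sketch of that point, assuming an in-memory filehandle in place of a real input file (read_records is a hypothetical helper name). Setting $/ to '\d+' makes Perl look for the three literal characters backslash, d, plus, which never occur in the data, so nothing is split. A regex split gives the intended behaviour.

```perl
use strict;
use warnings;

my $text = "1. one\n2. two\nwrapped\n";

# Read $data through an in-memory filehandle using the given separator.
sub read_records {
    my ($separator, $data) = @_;
    local $/ = $separator;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# '\d+' is taken literally, so the whole input comes back as one record.
my @literal = read_records('\d+', $text);
print scalar(@literal), " record(s) with \$/ = '\\d+'\n";

# split takes a real regex; grep drops the empty field before "1. ".
my @items = grep { length } split /^\d+[. ]*/m, $text;
print scalar(@items), " item(s) with split\n";
```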
s/\n/ /g,print "line[$_]\n" for /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx;
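The one-liner is dense, so here is the same pattern spread out with comments, as a sketch. The split_items name and the inline copy of the data are assumptions for illustration; the regex itself is unchanged apart from layout.

```perl
use strict;
use warnings;

# Split a numbered list into items, joining wrapped lines with a space.
sub split_items {
    my ($text) = @_;
    my @items = $text =~ /
        ^ \d+ [ .]*        # item marker at the start of a line, e.g. "2. "
        ( [\s\S]*? )       # item body, lazily, newlines included
        (?= \n\d | \Z )    # stop before a newline+digit or at end of input
    /gmx;
    s/\n/ /g for @items;   # unwrap continuation lines
    return @items;
}

my $text = <<'END';
1. This is the first item
2. This is a longer second item that wraps down
to the next line, but it's still one item.
3. This is the last time which is also long and
wraps to the next line as well
END

print "line[$_]\n" for split_items($text);
```

The lookahead is what ends each item. One caveat: a continuation line that happens to begin with a digit would be taken as the start of a new item.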
PS: Where I wrote:
$line =~ s/\n/ /sg;
the 's' modifier is not required. It could simply be:
$line =~ s/\n/ /g;
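A quick check of that claim, as a sketch with a made-up sample string. The /s modifier only changes what . matches; \n in a pattern is a literal newline with or without it.

```perl
use strict;
use warnings;

my $line = "wraps down\nto the next line";

# Apply the substitution to copies, once with /s and once without.
(my $with_s    = $line) =~ s/\n/ /sg;
(my $without_s = $line) =~ s/\n/ /g;

print $with_s eq $without_s ? "identical\n" : "different\n";
```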
> Have you tested that code, ozo?
Yes, it produces
line[This is the first item]
line[This is a longer second item that wraps down to the next line, but it's still one item.]
line[This is the last time which is also long and wraps to the next line as well]
Have you tested it?
Hi ozo,
Well, when I test it like this:
perl -ne 's/\n/ /g,print "line[$_]\n" for /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx' inputfile
I get this:
line[This is the first item]
line[This is a longer second item that wraps down]
line[This is the last time which is also long and]
What am I doing wrong?
perl -0777 -ne 's/\n/ /g,print "line[$_]\n" for /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx'
-0777 has the effect of $/ = undef;
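To see why the switch matters, here is a sketch that imitates both modes inside one script rather than on the command line. The sample text and the items_in helper are assumptions for illustration. Splitting on /^/m hands the pattern one line at a time, as plain -n does; passing the whole string corresponds to -0777.

```perl
use strict;
use warnings;

my $text = "1. one\n2. two wraps\nonto this line\n";
my $re   = qr/^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/mx;

# Return the item bodies matched in each record, with newlines unwrapped.
sub items_in {
    my @found = map { /$re/g } @_;
    s/\n/ /g for @found;
    return @found;
}

# Like plain -n: each record is a single line, so a continuation line is
# never part of the string being matched.
my @per_line = items_in(split /^/m, $text);

# Like -0777 (or $/ = undef): the pattern sees the whole input at once.
my @slurped = items_in($text);

print "per line: line[$_]\n" for @per_line;
print "slurped:  line[$_]\n" for @slurped;
```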
Thanks ozo,
Sorry - wasn't thinking. Should have realised that's what was needed.
BTW, in your code, should this:
[ .]*
be this:
[ \.]*
?
Thanks ozo. Sorry - I forgot about that, too!
jkfrench, please take note of that. If you decide to use my solution, you can also change this:
@lines = split(/\d+[\. ]*/, <>);
to:
@lines = split(/\d+[. ]*/, <>);
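The reason the backslash can go: inside a bracketed character class an unescaped . is a literal dot, not "any character". A short sketch to confirm the two classes accept the same characters (same_class is a hypothetical helper name):

```perl
use strict;
use warnings;

# Test one character against the unescaped and the escaped class.
sub same_class {
    my ($char) = @_;
    my $plain   = $char =~ /\A[. ]\z/  ? 1 : 0;
    my $escaped = $char =~ /\A[\. ]\z/ ? 1 : 0;
    return ($plain, $escaped);
}

for my $char ('.', ' ', 'x', '7') {
    my ($plain, $escaped) = same_class($char);
    print "'$char': [. ] => $plain, [\\. ] => $escaped\n";
}
```

If . meant "any character" inside the class, 'x' and '7' would match the unescaped version; they do not.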
ASKER
Thank you all for your help. I'm using ozo's one-liner, because it also works if there is a number embedded in the text. Since tel2's solution also worked for the data I posted, I'm awarding assist points.
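A sketch of the embedded-number case, using a made-up two-item list. The unanchored split treats every run of digits as a separator, while the one-liner's pattern only starts an item at digits that begin a line.

```perl
use strict;
use warnings;

# Hypothetical sample with a number inside the first item.
my $text = "1. Buy 3 apples\n2. Eat them\n";

# Unanchored split: the embedded 3 also splits the first item.
my @unanchored = grep { length } split /\d+[. ]*/, $text;

# Anchored match: only digits at the start of a line begin an item.
my @anchored = $text =~ /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx;

print scalar(@unanchored), " pieces from the split\n";
print "line[$_]\n" for @anchored;
```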
Hi ozo,
Can you please explain your use of "([\s\S]*?)". It obviously works, but it just looks a tad strange to say "match zero or more spaces or non-spaces". I'm guessing it's something to do with avoiding newlines?
Hi jkf,
Thanks for the points. I suggest you be careful in future about giving people lower marks for failing to meet unspecified requirements, but if you are interested in an adjusted version which caters for that new requirement, here's one at no extra charge:
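$/ = undef;                        # slurp mode: read the whole input at once
@lines = split(/\d+\. */, <>);     # split on item markers such as "2. "
foreach $line (@lines)
{
    next unless $started++;        # skip the empty field before the first marker
    chop $line;                    # remove the last character (the trailing newline)
    $line =~ s/\n/ /g;             # join wrapped lines with a space
    print "line[$line]\n";
}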
For the future, I would also suggest you provide your expected output in your original post, so that experts like farzanj don't waste your time or theirs posting solutions until they know they work.
Thanks.
/[\s\S]/ matches any character including \n
I could have used /./s
but I also wanted to use a . that doesn't match \n elsewhere in the same regex
I could have used (?s:.) and (?-s:.) but [\s\S] and . seemed simpler
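The three spellings can be compared side by side in a short sketch (the two-line sample string is an assumption). The last match mixes a line-bound . with [\s\S] in one pattern, which is the situation described above.

```perl
use strict;
use warnings;

my $text = "ab\ncd";

my ($plain_dot)  = $text =~ /\A(.*)/;        # . stops at the newline
my ($class)      = $text =~ /\A([\s\S]*)/;   # [\s\S] crosses it
my ($scoped_dot) = $text =~ /\A((?s:.)*)/;   # so does . inside (?s:...)

# Both kinds in one pattern, with no /s on the regex as a whole.
my ($first, $rest) = $text =~ /\A(.*)([\s\S]*)/;

for my $pair (['.*', $plain_dot], ['[\s\S]*', $class], ['(?s:.)*', $scoped_dot]) {
    (my $shown = $pair->[1]) =~ s/\n/\\n/g;   # make the newline visible
    print "$pair->[0] captured \"$shown\"\n";
}
```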
Thanks heaps, ozo!