Solved

Problem matching a list with multi-line items

Posted on 2011-03-23
Last Modified: 2012-05-11
I'm trying to break a chunk of text containing a numbered list into individual items, but am having trouble with items that continue to the second line. I'm hoping more pairs of eyes will see what I'm doing wrong.

Data looks like this:
1. This is the first item
2. This is a longer second item that wraps down
to the next line, but it's still one item.
3. This is the last time which is also long and
wraps to the next line as well


Perl code looks like this:
$/ = undef;
$_ = <>;

while (/^ \d+ [ \.]* (.*?) $/gmsx) {
  print "line[$1]\n";
}


Current output looks like this:
line[This is the first item]
line[This is a longer second item that wraps down]
line[This is the last time which is also long and ]


As you can see I'm missing the second line of items two and three. I've tried various combinations of options and pattern greediness to get what I want, but so far no success.

Question by:jkfrench
 
LVL 31

Expert Comment

by:farzanj
ID: 35202343
Obviously.  You don't have a number in the second line.

Try this.
$/ = undef;
$_ = <>;

while (/^ \d+ [ \.]* (.*?) $/gmsx) {
  print "line[$1]\n";
}

 

Author Comment

by:jkfrench
ID: 35202479
farzanj,

Thanks for the reply, but I don't see any difference between your example and the code I posted.

And I realize there is no number on the second line. I was trying to get the (.*?) pattern to match across lines to pick up the rest of the item. It looks like the $ is anchoring the pattern to the end of the line, although I was trying to use the 's' option to allow . to match newline.
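The behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming the sample data from the question is held in a string (the sub name `first_lines` is made up for illustration). With /m, $ may match before any newline, so the lazy (.*?) is satisfied at the end of the first line even though /s allows . to cross lines.

```perl
use strict;
use warnings;

# Sample data from the question, held in a string for this sketch.
my $text = <<'END';
1. This is the first item
2. This is a longer second item that wraps down
to the next line, but it's still one item.
3. This is the last time which is also long and
wraps to the next line as well
END

# Hypothetical helper: applies the original pattern and collects captures.
# Because of /m, $ matches before every newline, and the lazy (.*?)
# stops at the first place $ can match: the end of the numbered line.
sub first_lines {
    my ($input) = @_;
    my @found;
    while ($input =~ /^ \d+ [ \.]* (.*?) $/gmsx) {
        push @found, $1;
    }
    return @found;
}

# Prints three lines, each holding only the first line of its item.
print "line[$_]\n" for first_lines($text);
```

Making the quantifier greedy does not help on its own: with /s, a greedy (.*) would run to the end of the input and swallow every later item.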
 
LVL 31

Expert Comment

by:farzanj
ID: 35202534
Sorry,  I replaced + with *
 
LVL 31

Expert Comment

by:farzanj
ID: 35202540

$/ = undef;
$_ = <>;

while (/^ \d+ [ \.]* (.*?) $/gmsx) {
  print "line[$1]\n";
}

 
LVL 31

Expert Comment

by:farzanj
ID: 35202550
SORRY ONCE AGAIN.  Problem with my clipboard.

$/ = undef;
$_ = <>;

while (/^\d* [ \.]* (.*?) $/gmsx) {
  print "line[$1]\n";
}
 

Author Comment

by:jkfrench
ID: 35202589
No problem. That gets the additional text, but as separate items. So I'm now getting:

line[This is the first item]
line[This is a longer second item that wraps down]
line[to the next line, but it's still one item.]
line[This is the last time which is also long and ]
line[wraps to the next line as well]


but am trying to get:
line[This is the first item]
line[This is a longer second item that wraps down to the next line, but it's still one item.]
line[This is the last time which is also long and wraps to the next line as well]

 
LVL 31

Expert Comment

by:farzanj
ID: 35202683
Ok.

Here is the idea. Change the input record separator to

$/= '\d+';

This solves part of the problem.

For the second part, you have to delete the newline character from the line before printing.
 
LVL 31

Expert Comment

by:farzanj
ID: 35202689
And yes, you can keep the \d+ that I had changed to \d*.
 
LVL 84

Expert Comment

by:ozo
ID: 35203052
$/ is a string, not a regex.

s/\n/ /g,print "line[$_]\n" for /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx;
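The point about $/ can be checked with a small sketch. It assumes an in-memory filehandle and a made-up helper, `count_records`, purely for illustration. Setting $/ to '\d+' makes Perl look for the literal characters backslash, d, plus, not for a run of digits.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the records readline returns from a string
# when $/ is set to the given separator.
sub count_records {
    my ($data, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

my $data = "1. first\n2. second\n3. third\n";

# '\d+' is taken as three literal characters that never occur in the data,
# so the whole input comes back as a single record.
print count_records($data, '\d+'), "\n";   # 1

# A literal newline separator splits the same data into three records.
print count_records($data, "\n"), "\n";    # 3
```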
 
LVL 12

Assisted Solution

by:tel2
tel2 earned 150 total points
ID: 35203120
Have you tested that code, ozo?

This seems to produce the output you've specified, jk.

$/ = undef;
@lines = split(/\d+[\. ]*/, <>);

foreach $line (@lines)
{
        $lineno ++;
        next if $lineno == 1;
        chop $line;
        $line =~ s/\n/ /sg;
        print "line[$line]\n";
}

 
LVL 12

Expert Comment

by:tel2
ID: 35203164
PS: Where I wrote:
    $line =~ s/\n/ /sg;
the 's' modifier is not required.  It could simply be:
    $line =~ s/\n/ /g;
 
LVL 84

Expert Comment

by:ozo
ID: 35203199
> Have you tested that code, ozo?
Yes, it produces

line[This is the first item]
line[This is a longer second item that wraps down to the next line, but it's still one item.]
line[This is the last time which is also long and wraps to the next line as well]

Have you tested it?
 
LVL 12

Expert Comment

by:tel2
ID: 35203260
Hi ozo,

Well, when I test it like this:
    perl -ne 's/\n/ /g,print "line[$_]\n" for /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx' inputfile
I get this:
    line[This is the first item]
    line[This is a longer second item that wraps down]
    line[This is the last time which is also long and]

What am I doing wrong?
 
LVL 84

Expert Comment

by:ozo
ID: 35203282
perl -0777 -ne 's/\n/ /g,print "line[$_]\n" for /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx'

-0777 has the effect of $/ = undef;
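The difference between the two runs can be sketched in script form. This assumes a shortened sample held in a string and a made-up sub, `items_in`, that wraps ozo's pattern. Read line by line (as plain -ne does), a continuation line never reaches the pattern together with its item. Slurped (as with -0777 or $/ = undef), the pattern sees the whole input at once.

```perl
use strict;
use warnings;

# Hypothetical wrapper around ozo's pattern from the thread.
sub items_in {
    my ($chunk) = @_;
    my @items = $chunk =~ /^\d+ [ .]* ([\s\S]*?) (?=\n\d|\Z)/gmx;
    s/\n/ /g for @items;    # join wrapped lines with a space
    return @items;
}

my $data = "1. one\n2. two that\nwraps\n";

# Line at a time: each record is a single line, so "wraps" is seen alone
# and, lacking a leading number, matches nothing.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @by_line = map { items_in($_) } <$fh>;
close $fh;

# Slurp mode: with $/ undefined, one read returns the whole input.
open $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $whole = do { local $/; <$fh> };
close $fh;
my @slurped = items_in($whole);

print "by line: [$_]\n" for @by_line;   # [one] and [two that]
print "slurped: [$_]\n" for @slurped;   # [one] and [two that wraps]
```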
 
LVL 12

Expert Comment

by:tel2
ID: 35203324
Thanks ozo,

Sorry - wasn't thinking.  Should have realised that's what was needed.

BTW, in your code, should this:
    [ .]*
be this:
    [ \.]*
?
 
LVL 84

Accepted Solution

by:ozo
ozo earned 350 total points
ID: 35203345
perl -0777 -ne 's/\n/ /g,s/\s+$//,print "line[$_]\n" for grep/./,split/^\d+[. ]*/m'
also works
\. is not necessary in []
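For readers who prefer a script to a one-liner, the same steps can be spelled out. This is only a sketch: the sub name `split_items` and the inline sample data are assumptions for illustration, and a real script would read its input from a file or STDIN instead.

```perl
use strict;
use warnings;

# Hypothetical expansion of the one-liner into a named sub.
sub split_items {
    my ($text) = @_;
    my @items;
    # /m lets ^ match at the start of every line, so only a number that
    # begins a line acts as a separator.
    for my $piece (split /^\d+[. ]*/m, $text) {
        next unless $piece =~ /./;   # skip the empty field before item 1
        $piece =~ s/\n/ /g;          # join wrapped lines with a space
        $piece =~ s/\s+$//;          # drop the trailing space that leaves
        push @items, $piece;
    }
    return @items;
}

# Sample data from the question.
my $text = <<'END';
1. This is the first item
2. This is a longer second item that wraps down
to the next line, but it's still one item.
3. This is the last time which is also long and
wraps to the next line as well
END

print "line[$_]\n" for split_items($text);
```

One limitation of this approach: a continuation line that happens to begin with a digit would be treated as the start of a new item.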
 
LVL 12

Expert Comment

by:tel2
ID: 35203406
Thanks ozo.  Sorry - I forgot about that, too!

jkfrench, pls take note of that.  If you decide to use my solution, you can also change the:
    @lines = split(/\d+[\. ]*/, <>);
to:
    @lines = split(/\d+[. ]*/, <>);
 

Author Closing Comment

by:jkfrench
ID: 35207433
Thank you all for your help. I'm using ozo's one-liner, because it also works if there is a number embedded in the text. Since tel2's solution also worked for the data I posted, I'm awarding assist points.
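The embedded-number case mentioned here can be illustrated with a short sketch. The sample text is invented for this purpose. Without the ^ anchor and /m, any run of digits splits the text; with them, only a number at the start of a line does.

```perl
use strict;
use warnings;

# Invented sample with a number inside the first item's text.
my $text = "1. Buy 3 apples\n2. Done\n";

# Unanchored, as in tel2's first version: the "3 " also acts as a separator.
my @unanchored = grep { length } split /\d+[. ]*/, $text;

# Anchored with ^ and /m, as in ozo's one-liner: only line-leading numbers split.
my @anchored = grep { length } split /^\d+[. ]*/m, $text;

print scalar(@unanchored), " pieces without the anchor\n";   # 3
print scalar(@anchored),   " pieces with the anchor\n";      # 2
```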
 
LVL 12

Expert Comment

by:tel2
ID: 35211585
Hi ozo,

Can you please explain your use of "([\s\S]*?)".  It obviously works, but it just looks a tad strange to say "match zero or more spaces or non-spaces".  I'm guessing it's something to do with avoiding newlines?


Hi jkf,

Thanks for the points.  I suggest you be careful in future about giving people lower marks for failing to meet unspecified requirements, but if you are interested in an adjusted version which caters for that new requirement, here's one at no extra charge:
$/ = undef;
@lines = split(/\d+\. */, <>);

foreach $line (@lines)
{
  next unless $started++;
  chop $line;
  $line =~ s/\n/ /g;
  print "line[$line]\n";
}


For the future, I would also suggest you provide your expected output in your original post, so that experts like farzanj don't waste your time or theirs posting solutions until they know they work.

Thanks.
 
LVL 84

Expert Comment

by:ozo
ID: 35214031
/[\s\S]/ matches any character including \n
I could have used /./s
but I also wanted to use a . that doesn't match \n elsewhere in the same regex
I could have used (?s:.) and (?-s:.) but [\s\S] and . seemed simpler
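The alternatives listed above can be compared side by side. This is a minimal sketch using an invented two-line string; each variable is 1 if the pattern matches and 0 otherwise.

```perl
use strict;
use warnings;

# Invented test string: "a" and "b" separated by a newline.
my $text = "a\nb";

my $plain    = $text =~ /a.b/         ? 1 : 0;   # 0: . does not match \n
my $class    = $text =~ /a[\s\S]b/    ? 1 : 0;   # 1: [\s\S] matches any character
my $dotall   = $text =~ /a.b/s        ? 1 : 0;   # 1: /s changes every . in the pattern
my $scoped   = $text =~ /a(?s:.)b/    ? 1 : 0;   # 1: /s applied to this one . only
my $unscoped = $text =~ /a(?-s:.)b/s  ? 1 : 0;   # 0: /s switched off for this one .

print "plain=$plain class=$class dotall=$dotall scoped=$scoped unscoped=$unscoped\n";
```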
 
LVL 12

Expert Comment

by:tel2
ID: 35218818
Thanks heaps, ozo!