?
Solved

Difficulties in Parsing a Complex String Pattorn Matching ...

Posted on 2004-10-11
26
Medium Priority
?
457 Views
Last Modified: 2008-02-01
The parsing routine (sub parse()) defined below has been working very well with data such as the one in Sample Data 1, where each alphanumeric string is contained inside "" (double quotation) and delimited by a single comma (,).

However, I discovered recently that each field can also contain "" (double quotations), such as the 3rd and 4th fields in Sample Data 2.

That's when my routine quit working properly and my array (new@) could not initialize its fields with strings that contained "".

Please let me know how I can modify my pattern to match strings that contain "" so I can initialize my array with a complete string rather than a chopped one.

Sample Data 1:
"0912091209", "8278RT.991", "", "THIS IS 1 STRING", "THIS is my 2nd string", "", "This is my last STRING"

Sample Data 2:
"0912091209", "8278RT.991", "", "THIS IS "1" STRING", "THIS is my "2nd" string", "", ""This" is my last STRING"

## Parse the flat file and extract the information.
sub parse
{
    my $text = shift;
    my @new = ();

    push(@new, $+) while $text=~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)" ,? | ([^,"]+),? |,}gx ;
   
    push(@new,undef) if substr($text,-1,1) eq ',' ;

    return @new;
}
0
Comment
Question by:TexanLonghorn
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 12
  • 11
  • 3
26 Comments
 
LVL 12

Expert Comment

by:stefan73
ID: 12277912
Hi TexanLonghorn,
Are you trying to parse Excel CSV files? Check Text::CSV
http://search.cpan.org/~alancitt/Text-CSV-0.01/CSV.pm


Cheers!

Stefan
0
 

Author Comment

by:TexanLonghorn
ID: 12278412
Hi Stefan,

No, I am not parsing an Excel CSV file but very similar. Thanks for the clue about the Text::CSV module, but I'd prefer sticking to the routine that I have with little or few modifications if all possible.
0
 
LVL 84

Expert Comment

by:ozo
ID: 12278797
@new = split/"?,\s*"?/,$text;
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:TexanLonghorn
ID: 12286111
Thanks "OSO" but the pattern you specified for "split" does not apply to the first field.

The other thing it could potentially fail on if the field between the double-quotation contained a comma such as the 4th field in the following Sample Data:

"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my last STRING"
0
 
LVL 84

Expert Comment

by:ozo
ID: 12287062
How do I know that's not just one field with
", " in double quotes like "The Palace" and "2nd" ?

@new = $text=~/"((?:.|"[^"]*")*?)"(?=, "|$)/g;
0
 

Author Comment

by:TexanLonghorn
ID: 12288589
I am not sure I understand your reference to ", "

However, depending on the data, I know which field could potentially contain a comma and which doesn't. For example, it's most likely that the 4th and the 5th fields in the sample data could have a comma.

If you are referring to the fields delimiters then I must have miswritten my data sample. There are no white spaces in the delimiters.

Sample Data Description:
It is all on one line, there are 7 fields of data, each data field is between double-quotes followed by a comma. Each fields could contain empty strings (like the 3rd and 6th field) and/or double-quotes and commas.

Corrected Data Sample:
"0912091209","8278RT.991","",""The Palace" at 1234 Main St., NY 99217","THIS is my "2nd" string","","This is my last STRING"

I'm sorry but the pattern you suggested is not picking any field now :-(
0
 
LVL 84

Expert Comment

by:ozo
ID: 12291589
@new = $text=~/"((?:.|"[^"]*")*?)"(?=,"|$)/g;
0
 

Author Comment

by:TexanLonghorn
ID: 12296681
Thanks SOS, but the "@new" array is still NOT being initialized using the suggested pattern.

Apparently the data line is comming to the "parse" routine without the double-qutation and the end of line character ("\n). That's because the lines picked from the text-file is being split by ("\n+) as follows:

@info=split(/"\n+/,$token);

while (@info)
{
  $textline=shift(@info);
  @temp=parse($textline);
...
}

Therefore, the textline looks as follows before getting parsed by the routine:
"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my last STRING

sub parse
{
    my $text = shift;
    my @new = ();

    @new = $text=~/"((?:.|"[^"]*")*?)"(?=,"|$)/g;

    return @new;
}
0
 
LVL 84

Expert Comment

by:ozo
ID: 12299278
@new = $text=~/(?:^|,\s*)"((?:.|"[^"]*")*?)(?=",\s*"|$)/g;
#I still worry about whether "","" is two empty fields, or a single field containing '","'
0
 

Author Comment

by:TexanLonghorn
ID: 12302126
Thanks OZO. The suggested pattern is working better, but it skips the first field if any of the fields contains an end-of-line character(\n). For example, if my text-line happens to have the following sample data:

"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address

Then, my "new" array looks as follows:

DEBUG: Array[0] = 8278RT.991
DEBUG: Array[1] =
DEBUG: Array[2] = "The Palace" at 1234 Main St., NY 99217
DEBUG: Array[3] = THIS is my "2nd" string
DEBUG: Array[4] =
DEBUG: Array[5] =
DEBUG: Array[6] =
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302322
@new = $text=~/(?:^|,)\s*"((?:.|"[^"]*")*?)(?=",\s*"|$)/gs;
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302377
@new = $text=~/(?:^|,)\s*"((?:.|"[^"]*")*?)(?=",\s*"|"?$)/gs;
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302397
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;
0
 

Author Comment

by:TexanLonghorn
ID: 12302699
Thanks, but I tried using the last two suggested patterns and it looks worse now. The parser is not able to initialize any field in the array :-(
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302776
What is in $text when the pattern fails?
0
 

Author Comment

by:TexanLonghorn
ID: 12306485
DEBUG: Text = "0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "Last Line

Please let me know if I should consider using Text::CSV module at this point.
0
 
LVL 84

Expert Comment

by:ozo
ID: 12315832
$text='"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "Last Line';
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;
for(0..$#new){
    print "[$_] = $new[$_]\n";
}
__DATA__
prints:
[0] = 0912091209
[1] = 8278RT.991
[2] =
[3] = "The Palace" at 1234 Main St., NY 99217
[4] = THIS is my "2nd" string
[5] =
[6] = This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address
[7] = Last Line
What does it do for you?
0
 
LVL 12

Expert Comment

by:stefan73
ID: 12316883
> ""The Palace" at 1234 Main St., NY 99217"

That's not a good way to escape double quotes. You're in deep trouble when you have something like
""The Palace", 1234 Main St., NY 99217",

Those quotes are ambigious.
If you're the one who did the quote escapes, fix it (e.g., to "" for a single quote, like in CSV).

If not, ask the creator of the file to change his bad way of quoting.
0
 

Author Comment

by:TexanLonghorn
ID: 12345475
Thanks for raising a very good point. Unfortunately, I don't have control over the way the delimiters are created. The software that creates the file is a third-party software and would require a Change Request to eliminate ambiguity. The only control I have is over the type, location and the number of fields.

I thought of an idea to work around this problem by placing a recognizable fields between the special fields:

Text = "0912091209", "8278RT.991", "", "XYZXYZ", ""The Palace" at 1234 Main St., NY 99217","XYZXYZ", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "XYZXYZ", "Last Line

        @tempArray = split/","XYZXYZ","/,$text;
        @newline = split/"?,?\s*"/,$tempArray[0];

        shift(@newline);

        ## Initialize an empty field in the array if the field is empty
        push(@newline,undef) if substr($tempArray[0],-3,3) eq '","';

        push(@newline, $tempArray[1]);
        push(@newline, $tempArray[2]);
        push(@newline, $tempArray[3]);
0
 
LVL 84

Expert Comment

by:ozo
ID: 12345632
You still haven't told me how what you want differs from the result of
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;

I know it will fail if you ever have a quoted comma in a string, just
as ","XYZXYZ"," will fail if you ever have two quoted commas separated by XYZXYZ in a string, which seems less likely.
Will there always be a known number of alphanumeric fields before the ","XYZXYZ"," delimited fields?
or should that be ",\s*"XYZXYZ",\s*" delimited fields?
0
 

Author Comment

by:TexanLonghorn
ID: 12356334
When I first tried the initialization pattern you suggested it didn't work for me. That was because the last character (") was chopped from each end-of-line. If you recall, I mentioned that the routine, which reads each line from the text-file, was splitting the lines as follows:

@info=split(/"\n+/,$token);

while (@info)
{
  $textline=shift(@info);
  @temp=parse($textline);
...
}

Adding the double-quotation back to the end of the $textline enabled your pattern to work. THANK YOU!

One last request. Please translate your pattern to me in plain English.
0
 
LVL 84

Accepted Solution

by:
ozo earned 840 total points
ID: 12356722
The pattern should have worked whether or not (") was chopped from the end of each line, since it would not be chopped from the last line.
(although that presents another problem if any field contains a quoted \n)

perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/)->explain'
The regular expression:

(?-imsx:(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$))

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    ",                       '",'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    "?                       '"' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
0
 

Author Comment

by:TexanLonghorn
ID: 12356811
Besides the obvious pattern, where there is a quoted comma (",") in a string, do you foresee other cases where this pattern will fail? And how can I trap these failed cases (by an error message) instead of parsing garbage strings?
0
 
LVL 84

Expert Comment

by:ozo
ID: 12358974
As I mentioned, a "\n in a string will break your split.
"THIS is my "2nd string"
This is my 2nd line of a multi-line address"

If you know how many elements @info and @temp should have, you should check that they have the correct length.
0
 

Author Comment

by:TexanLonghorn
ID: 12360165
You raised a very good point. There is a way to tell the number of fields I am expecting to read, but I don't think I can tell the length of each field.
0
 
LVL 12

Expert Comment

by:stefan73
ID: 12360518
> Unfortunately, I don't have control over the way the delimiters are created. The software that creates the file is a third-party software and would require a Change Request to eliminate ambiguity. The only control I have is over the type, location and the number of fields.

Hmmm, it's an obvious bug in the S/W.
Ambiguity shouldn't exist in any kind of machine-readable format.
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded (http://www.mongodb.org/downloads),  installed and started the daemon, and had a console window open. After an hour or two of playing at the command …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans
Suggested Courses

752 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question