Solved

Difficulties in Parsing a Complex String Pattorn Matching ...

Posted on 2004-10-11
26
404 Views
Last Modified: 2008-02-01
The parsing routine (sub parse()) defined below has been working very well with data such as the one in Sample Data 1, where each alphanumeric string is contained inside "" (double quotation) and delimited by a single comma (,).

However, I discovered recently that each field can also contain "" (double quotations), such as the 3rd and 4th fields in Sample Data 2.

That's when my routine quit working properly and my array (new@) could not initialize its fields with strings that contained "".

Please let me know how I can modify my pattern to match strings that contain "" so I can initialize my array with a complete string rather than a chopped one.

Sample Data 1:
"0912091209", "8278RT.991", "", "THIS IS 1 STRING", "THIS is my 2nd string", "", "This is my last STRING"

Sample Data 2:
"0912091209", "8278RT.991", "", "THIS IS "1" STRING", "THIS is my "2nd" string", "", ""This" is my last STRING"

## Parse the flat file and extract the information.
sub parse
{
    my $text = shift;
    my @new = ();

    push(@new, $+) while $text=~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)" ,? | ([^,"]+),? |,}gx ;
   
    push(@new,undef) if substr($text,-1,1) eq ',' ;

    return @new;
}
0
Comment
Question by:TexanLonghorn
  • 12
  • 11
  • 3
26 Comments
 
LVL 12

Expert Comment

by:stefan73
ID: 12277912
Hi TexanLonghorn,
Are you trying to parse Excel CSV files? Check Text::CSV
http://search.cpan.org/~alancitt/Text-CSV-0.01/CSV.pm


Cheers!

Stefan
0
 

Author Comment

by:TexanLonghorn
ID: 12278412
Hi Stefan,

No, I am not parsing an Excel CSV file but very similar. Thanks for the clue about the Text::CSV module, but I'd prefer sticking to the routine that I have with little or few modifications if all possible.
0
 
LVL 84

Expert Comment

by:ozo
ID: 12278797
@new = split/"?,\s*"?/,$text;
0
 

Author Comment

by:TexanLonghorn
ID: 12286111
Thanks "OSO" but the pattern you specified for "split" does not apply to the first field.

The other thing it could potentially fail on if the field between the double-quotation contained a comma such as the 4th field in the following Sample Data:

"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my last STRING"
0
 
LVL 84

Expert Comment

by:ozo
ID: 12287062
How do I know that's not just one field with
", " in double quotes like "The Palace" and "2nd" ?

@new = $text=~/"((?:.|"[^"]*")*?)"(?=, "|$)/g;
0
 

Author Comment

by:TexanLonghorn
ID: 12288589
I am not sure I understand your reference to ", "

However, depending on the data, I know which field could potentially contain a comma and which doesn't. For example, it's most likely that the 4th and the 5th fields in the sample data could have a comma.

If you are referring to the fields delimiters then I must have miswritten my data sample. There are no white spaces in the delimiters.

Sample Data Description:
It is all on one line, there are 7 fields of data, each data field is between double-quotes followed by a comma. Each fields could contain empty strings (like the 3rd and 6th field) and/or double-quotes and commas.

Corrected Data Sample:
"0912091209","8278RT.991","",""The Palace" at 1234 Main St., NY 99217","THIS is my "2nd" string","","This is my last STRING"

I'm sorry but the pattern you suggested is not picking any field now :-(
0
 
LVL 84

Expert Comment

by:ozo
ID: 12291589
@new = $text=~/"((?:.|"[^"]*")*?)"(?=,"|$)/g;
0
 

Author Comment

by:TexanLonghorn
ID: 12296681
Thanks SOS, but the "@new" array is still NOT being initialized using the suggested pattern.

Apparently the data line is comming to the "parse" routine without the double-qutation and the end of line character ("\n). That's because the lines picked from the text-file is being split by ("\n+) as follows:

@info=split(/"\n+/,$token);

while (@info)
{
  $textline=shift(@info);
  @temp=parse($textline);
...
}

Therefore, the textline looks as follows before getting parsed by the routine:
"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my last STRING

sub parse
{
    my $text = shift;
    my @new = ();

    @new = $text=~/"((?:.|"[^"]*")*?)"(?=,"|$)/g;

    return @new;
}
0
 
LVL 84

Expert Comment

by:ozo
ID: 12299278
@new = $text=~/(?:^|,\s*)"((?:.|"[^"]*")*?)(?=",\s*"|$)/g;
#I still worry about whether "","" is two empty fields, or a single field containing '","'
0
 

Author Comment

by:TexanLonghorn
ID: 12302126
Thanks OZO. The suggested pattern is working better, but it skips the first field if any of the fields contains an end-of-line character(\n). For example, if my text-line happens to have the following sample data:

"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address

Then, my "new" array looks as follows:

DEBUG: Array[0] = 8278RT.991
DEBUG: Array[1] =
DEBUG: Array[2] = "The Palace" at 1234 Main St., NY 99217
DEBUG: Array[3] = THIS is my "2nd" string
DEBUG: Array[4] =
DEBUG: Array[5] =
DEBUG: Array[6] =
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302322
@new = $text=~/(?:^|,)\s*"((?:.|"[^"]*")*?)(?=",\s*"|$)/gs;
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302377
@new = $text=~/(?:^|,)\s*"((?:.|"[^"]*")*?)(?=",\s*"|"?$)/gs;
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302397
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;
0
Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 

Author Comment

by:TexanLonghorn
ID: 12302699
Thanks, but I tried using the last two suggested patterns and it looks worse now. The parser is not able to initialize any field in the array :-(
0
 
LVL 84

Expert Comment

by:ozo
ID: 12302776
What is in $text when the pattern fails?
0
 

Author Comment

by:TexanLonghorn
ID: 12306485
DEBUG: Text = "0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "Last Line

Please let me know if I should consider using Text::CSV module at this point.
0
 
LVL 84

Expert Comment

by:ozo
ID: 12315832
$text='"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "Last Line';
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;
for(0..$#new){
    print "[$_] = $new[$_]\n";
}
__DATA__
prints:
[0] = 0912091209
[1] = 8278RT.991
[2] =
[3] = "The Palace" at 1234 Main St., NY 99217
[4] = THIS is my "2nd" string
[5] =
[6] = This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address
[7] = Last Line
What does it do for you?
0
 
LVL 12

Expert Comment

by:stefan73
ID: 12316883
> ""The Palace" at 1234 Main St., NY 99217"

That's not a good way to escape double quotes. You're in deep trouble when you have something like
""The Palace", 1234 Main St., NY 99217",

Those quotes are ambigious.
If you're the one who did the quote escapes, fix it (e.g., to "" for a single quote, like in CSV).

If not, ask the creator of the file to change his bad way of quoting.
0
 

Author Comment

by:TexanLonghorn
ID: 12345475
Thanks for raising a very good point. Unfortunately, I don't have control over the way the delimiters are created. The software that creates the file is a third-party software and would require a Change Request to eliminate ambiguity. The only control I have is over the type, location and the number of fields.

I thought of an idea to work around this problem by placing a recognizable fields between the special fields:

Text = "0912091209", "8278RT.991", "", "XYZXYZ", ""The Palace" at 1234 Main St., NY 99217","XYZXYZ", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "XYZXYZ", "Last Line

        @tempArray = split/","XYZXYZ","/,$text;
        @newline = split/"?,?\s*"/,$tempArray[0];

        shift(@newline);

        ## Initialize an empty field in the array if the field is empty
        push(@newline,undef) if substr($tempArray[0],-3,3) eq '","';

        push(@newline, $tempArray[1]);
        push(@newline, $tempArray[2]);
        push(@newline, $tempArray[3]);
0
 
LVL 84

Expert Comment

by:ozo
ID: 12345632
You still haven't told me how what you want differs from the result of
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;

I know it will fail if you ever have a quoted comma in a string, just
as ","XYZXYZ"," will fail if you ever have two quoted commas separated by XYZXYZ in a string, which seems less likely.
Will there always be a known number of alphanumeric fields before the ","XYZXYZ"," delimited fields?
or should that be ",\s*"XYZXYZ",\s*" delimited fields?
0
 

Author Comment

by:TexanLonghorn
ID: 12356334
When I first tried the initialization pattern you suggested it didn't work for me. That was because the last character (") was chopped from each end-of-line. If you recall, I mentioned that the routine, which reads each line from the text-file, was splitting the lines as follows:

@info=split(/"\n+/,$token);

while (@info)
{
  $textline=shift(@info);
  @temp=parse($textline);
...
}

Adding the double-quotation back to the end of the $textline enabled your pattern to work. THANK YOU!

One last request. Please translate your pattern to me in plain English.
0
 
LVL 84

Accepted Solution

by:
ozo earned 210 total points
ID: 12356722
The pattern should have worked whether or not (") was chopped from the end of each line, since it would not be chopped from the last line.
(although that presents another problem if any field contains a quoted \n)

perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/)->explain'
The regular expression:

(?-imsx:(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$))

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    ",                       '",'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    "?                       '"' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
0
 

Author Comment

by:TexanLonghorn
ID: 12356811
Besides the obvious pattern, where there is a quoted comma (",") in a string, do you foresee other cases where this pattern will fail? And how can I trap these failed cases (by an error message) instead of parsing garbage strings?
0
 
LVL 84

Expert Comment

by:ozo
ID: 12358974
As I mentioned, a "\n in a string will break your split.
"THIS is my "2nd string"
This is my 2nd line of a multi-line address"

If you know how many elements @info and @temp should have, you should check that they have the correct length.
0
 

Author Comment

by:TexanLonghorn
ID: 12360165
You raised a very good point. There is a way to tell the number of fields I am expecting to read, but I don't think I can tell the length of each field.
0
 
LVL 12

Expert Comment

by:stefan73
ID: 12360518
> Unfortunately, I don't have control over the way the delimiters are created. The software that creates the file is a third-party software and would require a Change Request to eliminate ambiguity. The only control I have is over the type, location and the number of fields.

Hmmm, it's an obvious bug in the S/W.
Ambiguity shouldn't exist in any kind of machine-readable format.
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://www…
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
This tutorial demonstrates a quick way of adding group price to multiple Magento products.

709 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now