Improve company productivity with a Business Account.Sign Up

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 465
  • Last Modified:

Difficulties in Parsing a Complex String Pattorn Matching ...

The parsing routine (sub parse()) defined below has been working very well with data such as the one in Sample Data 1, where each alphanumeric string is contained inside "" (double quotation) and delimited by a single comma (,).

However, I discovered recently that each field can also contain "" (double quotations), such as the 3rd and 4th fields in Sample Data 2.

That's when my routine quit working properly and my array (new@) could not initialize its fields with strings that contained "".

Please let me know how I can modify my pattern to match strings that contain "" so I can initialize my array with a complete string rather than a chopped one.

Sample Data 1:
"0912091209", "8278RT.991", "", "THIS IS 1 STRING", "THIS is my 2nd string", "", "This is my last STRING"

Sample Data 2:
"0912091209", "8278RT.991", "", "THIS IS "1" STRING", "THIS is my "2nd" string", "", ""This" is my last STRING"

## Parse the flat file and extract the information.
sub parse
{
    my $text = shift;
    my @new = ();

    push(@new, $+) while $text=~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)" ,? | ([^,"]+),? |,}gx ;
   
    push(@new,undef) if substr($text,-1,1) eq ',' ;

    return @new;
}
0
TexanLonghorn
Asked:
TexanLonghorn
  • 12
  • 11
  • 3
1 Solution
 
stefan73Commented:
Hi TexanLonghorn,
Are you trying to parse Excel CSV files? Check Text::CSV
http://search.cpan.org/~alancitt/Text-CSV-0.01/CSV.pm


Cheers!

Stefan
0
 
TexanLonghornAuthor Commented:
Hi Stefan,

No, I am not parsing an Excel CSV file but very similar. Thanks for the clue about the Text::CSV module, but I'd prefer sticking to the routine that I have with little or few modifications if all possible.
0
 
ozoCommented:
@new = split/"?,\s*"?/,$text;
0
Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

 
TexanLonghornAuthor Commented:
Thanks "OSO" but the pattern you specified for "split" does not apply to the first field.

The other thing it could potentially fail on if the field between the double-quotation contained a comma such as the 4th field in the following Sample Data:

"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my last STRING"
0
 
ozoCommented:
How do I know that's not just one field with
", " in double quotes like "The Palace" and "2nd" ?

@new = $text=~/"((?:.|"[^"]*")*?)"(?=, "|$)/g;
0
 
TexanLonghornAuthor Commented:
I am not sure I understand your reference to ", "

However, depending on the data, I know which field could potentially contain a comma and which doesn't. For example, it's most likely that the 4th and the 5th fields in the sample data could have a comma.

If you are referring to the fields delimiters then I must have miswritten my data sample. There are no white spaces in the delimiters.

Sample Data Description:
It is all on one line, there are 7 fields of data, each data field is between double-quotes followed by a comma. Each fields could contain empty strings (like the 3rd and 6th field) and/or double-quotes and commas.

Corrected Data Sample:
"0912091209","8278RT.991","",""The Palace" at 1234 Main St., NY 99217","THIS is my "2nd" string","","This is my last STRING"

I'm sorry but the pattern you suggested is not picking any field now :-(
0
 
ozoCommented:
@new = $text=~/"((?:.|"[^"]*")*?)"(?=,"|$)/g;
0
 
TexanLonghornAuthor Commented:
Thanks SOS, but the "@new" array is still NOT being initialized using the suggested pattern.

Apparently the data line is comming to the "parse" routine without the double-qutation and the end of line character ("\n). That's because the lines picked from the text-file is being split by ("\n+) as follows:

@info=split(/"\n+/,$token);

while (@info)
{
  $textline=shift(@info);
  @temp=parse($textline);
...
}

Therefore, the textline looks as follows before getting parsed by the routine:
"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my last STRING

sub parse
{
    my $text = shift;
    my @new = ();

    @new = $text=~/"((?:.|"[^"]*")*?)"(?=,"|$)/g;

    return @new;
}
0
 
ozoCommented:
@new = $text=~/(?:^|,\s*)"((?:.|"[^"]*")*?)(?=",\s*"|$)/g;
#I still worry about whether "","" is two empty fields, or a single field containing '","'
0
 
TexanLonghornAuthor Commented:
Thanks OZO. The suggested pattern is working better, but it skips the first field if any of the fields contains an end-of-line character(\n). For example, if my text-line happens to have the following sample data:

"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address

Then, my "new" array looks as follows:

DEBUG: Array[0] = 8278RT.991
DEBUG: Array[1] =
DEBUG: Array[2] = "The Palace" at 1234 Main St., NY 99217
DEBUG: Array[3] = THIS is my "2nd" string
DEBUG: Array[4] =
DEBUG: Array[5] =
DEBUG: Array[6] =
0
 
ozoCommented:
@new = $text=~/(?:^|,)\s*"((?:.|"[^"]*")*?)(?=",\s*"|$)/gs;
0
 
ozoCommented:
@new = $text=~/(?:^|,)\s*"((?:.|"[^"]*")*?)(?=",\s*"|"?$)/gs;
0
 
ozoCommented:
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;
0
 
TexanLonghornAuthor Commented:
Thanks, but I tried using the last two suggested patterns and it looks worse now. The parser is not able to initialize any field in the array :-(
0
 
ozoCommented:
What is in $text when the pattern fails?
0
 
TexanLonghornAuthor Commented:
DEBUG: Text = "0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "Last Line

Please let me know if I should consider using Text::CSV module at this point.
0
 
ozoCommented:
$text='"0912091209", "8278RT.991", "", ""The Palace" at 1234 Main St., NY 99217", "THIS is my "2nd" string", "", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "Last Line';
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;
for(0..$#new){
    print "[$_] = $new[$_]\n";
}
__DATA__
prints:
[0] = 0912091209
[1] = 8278RT.991
[2] =
[3] = "The Palace" at 1234 Main St., NY 99217
[4] = THIS is my "2nd" string
[5] =
[6] = This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address
[7] = Last Line
What does it do for you?
0
 
stefan73Commented:
> ""The Palace" at 1234 Main St., NY 99217"

That's not a good way to escape double quotes. You're in deep trouble when you have something like
""The Palace", 1234 Main St., NY 99217",

Those quotes are ambigious.
If you're the one who did the quote escapes, fix it (e.g., to "" for a single quote, like in CSV).

If not, ask the creator of the file to change his bad way of quoting.
0
 
TexanLonghornAuthor Commented:
Thanks for raising a very good point. Unfortunately, I don't have control over the way the delimiters are created. The software that creates the file is a third-party software and would require a Change Request to eliminate ambiguity. The only control I have is over the type, location and the number of fields.

I thought of an idea to work around this problem by placing a recognizable fields between the special fields:

Text = "0912091209", "8278RT.991", "", "XYZXYZ", ""The Palace" at 1234 Main St., NY 99217","XYZXYZ", "This is my 1st line of a multi-line address
This is my 2nd line of a multi-line address
This is my 3rd line of a multi-line address", "XYZXYZ", "Last Line

        @tempArray = split/","XYZXYZ","/,$text;
        @newline = split/"?,?\s*"/,$tempArray[0];

        shift(@newline);

        ## Initialize an empty field in the array if the field is empty
        push(@newline,undef) if substr($tempArray[0],-3,3) eq '","';

        push(@newline, $tempArray[1]);
        push(@newline, $tempArray[2]);
        push(@newline, $tempArray[3]);
0
 
ozoCommented:
You still haven't told me how what you want differs from the result of
@new = $text=~/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/g;

I know it will fail if you ever have a quoted comma in a string, just
as ","XYZXYZ"," will fail if you ever have two quoted commas separated by XYZXYZ in a string, which seems less likely.
Will there always be a known number of alphanumeric fields before the ","XYZXYZ"," delimited fields?
or should that be ",\s*"XYZXYZ",\s*" delimited fields?
0
 
TexanLonghornAuthor Commented:
When I first tried the initialization pattern you suggested it didn't work for me. That was because the last character (") was chopped from each end-of-line. If you recall, I mentioned that the routine, which reads each line from the text-file, was splitting the lines as follows:

@info=split(/"\n+/,$token);

while (@info)
{
  $textline=shift(@info);
  @temp=parse($textline);
...
}

Adding the double-quotation back to the end of the $textline enabled your pattern to work. THANK YOU!

One last request. Please translate your pattern to me in plain English.
0
 
ozoCommented:
The pattern should have worked whether or not (") was chopped from the end of each line, since it would not be chopped from the last line.
(although that presents another problem if any field contains a quoted \n)

perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$)/)->explain'
The regular expression:

(?-imsx:(?:^|,)\s*"((?:[^"]|"[^"]*")*?)(?=",\s*"|"?$))

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    ",                       '",'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    "?                       '"' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
0
 
TexanLonghornAuthor Commented:
Besides the obvious pattern, where there is a quoted comma (",") in a string, do you foresee other cases where this pattern will fail? And how can I trap these failed cases (by an error message) instead of parsing garbage strings?
0
 
ozoCommented:
As I mentioned, a "\n in a string will break your split.
"THIS is my "2nd string"
This is my 2nd line of a multi-line address"

If you know how many elements @info and @temp should have, you should check that they have the correct length.
0
 
TexanLonghornAuthor Commented:
You raised a very good point. There is a way to tell the number of fields I am expecting to read, but I don't think I can tell the length of each field.
0
 
stefan73Commented:
> Unfortunately, I don't have control over the way the delimiters are created. The software that creates the file is a third-party software and would require a Change Request to eliminate ambiguity. The only control I have is over the type, location and the number of fields.

Hmmm, it's an obvious bug in the S/W.
Ambiguity shouldn't exist in any kind of machine-readable format.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

  • 12
  • 11
  • 3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now