Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 639
  • Last Modified:

Regex to to parse csv with nested quotes

I want validate a csv row.
the format is as follows:
1,3.12,BCD,"abc 4"" ""qwe""",2011/10/10

the above data is MS excel type escaped code for:
1,3.12,BCD,abc 4" "qwe"",2011/10/10

I just need to match or validate string, thats all.
The main problem here is nested quotes.

The programming language is Java
0
Maverick_Cool
Asked:
Maverick_Cool
  • 14
  • 11
  • 9
  • +1
1 Solution
 
Terry WoodsIT GuruCommented:
This pattern validates your format:
1. allowing empty fields and
2. checking the quotes are in pairs if enclosed in double quotes

^([^"]*,){3}(?![^",][^,]*")("?)(""|[^"])*\2,[^"]*$

In Java format (backslashes and double quotes escaped), I think it would be:

"^([^\"]*,){3}(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,[^\"]*$"

Out of the following test cases, just the first 3 are accepted in my testing:
1,3.12,BCD,"abc 4"" ""qwe""",2011/10/10
1,3.12,BCD,abc 4,2011/10/10
1,3.12,BCD,"abc ",2011/10/10
1,3.12,BCD,abc 4""asd,2011/10/10
1,3.12,BCD,abc 4"asd,2011/10/10
1,3.12,BCD,"abc 4"" ""qwe"",2011/10/10
0
 
Terry WoodsIT GuruCommented:
Correction:
"^([^\",]*,){3}(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,[^\",]*$"
0
 
Terry WoodsIT GuruCommented:
That latest pattern only matches if there are 5 fields on the line.
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
Maverick_CoolAuthor Commented:
hi,
i clarification
BCD and "abc 4"" ""qwe""" should have common regex, it should start and end with double quotes only if there is a nested quote.

more over you underline regex part for nested quote in above expression(abc 4" "qwe"")
0
 
Maverick_CoolAuthor Commented:
"abc 4"" ""qwe"""
0
 
Terry WoodsIT GuruCommented:
Ok, I've altered the pattern to allow both the 3rd and 4th fields to be strings:

"^([^\",]*,){2}(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\4,[^\",]*$"
0
 
Maverick_CoolAuthor Commented:
can you pin pint part in regex that points to "abc 4"" ""qwe"""
0
 
Maverick_CoolAuthor Commented:
also can you please explain me to the code a bit.
0
 
Terry WoodsIT GuruCommented:
One more mistake in the pattern corrected:
"^([^\",]*,){2}(?![^\",][^,]*\")(\"?)(\"\"|[^\",])*\\2,(?![^\",][^,]*\")(\"?)(\"\"|[^\",])*\\4,[^\",]*$"

^  start of string
([^\",]*,){2}  2 fields containing neither a comma nor a " each terminated by a comma

The 3rd field uses pattern:
(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,
(?![^\",][^,]*\")  is a negative lookahead which causes the match to fail if there's a " character after the first character but before the comma that ends the field
(\"?)  captures the " character at the start of the field, if there is one
(\"\"|[^\",])*  is the contents of the field (either pairs of double quotes, or non-double-quotes-non-commas)
\\2  refers to the captured double quote (or nothing, if there wasn't one), being the 2nd captured sub-pattern
,  the comma that ends the 3rd field

The 4th field is the same as the 3rd, but refers back to the 4th captured sub-pattern instead of the 2nd

[^\",]*  The 5th field
$  Matches the end of the line
0
 
for_yanCommented:
what do you mean by validating this string?
Are you just concerned that between each two commas
you have the even number of double quotes?

Or you want to split the csv string with regular rules
which can be separated by enclosing double quotes
and you want to get whatever it is in between the enclosing double quotes if such quotes
are prsent. In the latter case you'll get nested quotes OK, if they are within the outer quotes
0
 
CEHJCommented:
Why would you want to reinvent a csv parser instead of just using one?
0
 
for_yanCommented:
You can certainly use the opencsv package if your goal is to pasre the CSV file

http://opencsv.sourceforge.net/
0
 
Maverick_CoolAuthor Commented:
can you tell me the significance of \\2, does it has anything to do with position.
Let me be more clear. I am trying build the regex string dynamically depending on column type. I have attached the screen shot and code. Please help fix, I am most struck in the varchar(a string which may contain a nested quote) and not nullable(empty column) check.

String regexpatern = "^";
        int rowCount = jTable1.getRowCount();

        for (int row = 0; row  <  rowCount; row++)
        {
            datatype enumval = datatype.valueOf(jTable1.getValueAt(row, 1).toString());

            switch (enumval) {
            case Integer      :  regexpatern = regexpatern+"[0-9]*,";       break;
            case Float        :  regexpatern = regexpatern+"[-+]?[0-9]*\\.?[0-9]+,";      break;//^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
            case AlphaNumeric :  regexpatern = regexpatern+"^\\w*,";         break;
            case Varchar      :  regexpatern = regexpatern+"(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*,";         break;
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9]) (?:([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;
            default           :  regexpatern = regexpatern+"(\"(?:[^\"]|\"\")*\"|[^,]*),"; break;
            }
        }
        if(regexpatern.endsWith(","))
            regexpatern = regexpatern.substring(0, regexpatern.length()-2);
        regexpatern = regexpatern + "$";

Open in new window

validatecsv.JPG
0
 
Maverick_CoolAuthor Commented:
I dont want to split the csv, My main aim to validate csv row.
0
 
for_yanCommented:
But you want to validate each column in its own way - so you fisrt have to split and then apply spacial validation
pattern to each of the columns - that would be the most starightforward way
0
 
Terry WoodsIT GuruCommented:
Each pair of round brackets like this (blah) is a capturing sub-pattern. (?:blah) is a non-capturing sub-pattern. \\2 is a back reference to match what was captured in the 2nd capturing sub-pattern.
0
 
Maverick_CoolAuthor Commented:
see there will thousand of record. If i split and then do validation, its going to take time.
0
 
for_yanCommented:
SO you should formulate - what it is in that varchar string that you sould consider offending - it has nesetd double quotes - but this is acceptable - so when you want to validate you should have rules which will say - this is wrong - this is good
0
 
for_yanCommented:
OK, you can validate at the time you read them (if that will be faster)
but still you need to have the rulse - what wll violate the string rules
0
 
for_yanCommented:
You are saying - they can have embedded quote - but dio you mean that you don't want such strings, or you mean that this is oK.
Or you want to make sure that you have even  nnumber of double quotes ?
It all ddpends on your business requirements _ after you formulate that you can start thing of the RegEx which could help you
And then there is obviouls no connection to any CSV file here?
0
 
Terry WoodsIT GuruCommented:
If you want to give feedback on what's wrong with the data when it's incorrect (eg too many fields, string not in valid format etc), it's probably best to split the data by field and validate that way. The pattern I've provided should work for determining whether a line is valid or not, but I have to admit it's not very maintainable for the average programmer.
0
 
Maverick_CoolAuthor Commented:
hi TerryAtOpus, I just need to validate thats all. See after loop the data types, nof regex part will be equal to no of columns(rite?)
I have posted a sample code that trying to build , can you help me with that.
0
 
for_yanCommented:
and this is where thwey say how to macth string  even number of quotes
http://stackoverflow.com/questions/668179/regex-to-match-a-string-with-an-even-number-of-quotes

0
 
Maverick_CoolAuthor Commented:
hi TerryAtOpus, I just need to validate thats all. See after looping the column header for data types, no. of regex sections will be equal to no of columns(rite?)
I have posted a sample code that trying to build , can you help me with that.
0
 
for_yanCommented:
Can you post an example of the string which should not satisfy your validation for varchar column?
0
 
Terry WoodsIT GuruCommented:
Made all sub-patterns non-capturing except the one that captures the double quote that we need to back-reference. Try this:
int varCharCount=1;  //use this to determine which captured sub-pattern to back-reference
        for (int row = 0; row  <  rowCount; row++)
        {
            datatype enumval = datatype.valueOf(jTable1.getValueAt(row, 1).toString());
            

            switch (enumval) {
            case Integer      :  regexpatern = regexpatern+"[0-9]*,";       break;
            case Float        :  regexpatern = regexpatern+"[-+]?[0-9]*\\.?[0-9]+,";      break;//^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
            case AlphaNumeric :  regexpatern = regexpatern+"^\\w*,";         break;
            case Varchar      :
               regexpatern = regexpatern+"(?![^\",][^,]*\")(\"?)(?:\"\"|[^\"])*\\"+varCharCount+",";  //refer back to captured sub-pattern for matching a closing double quote
               varCharCount++;
               break;
            case DateTime     :  regexpatern = regexpatern+"(?:[0-9]{2,4})-(?:[0-1][0-9])-(?:[0-3][0-9]) (?:(?:[0-2][0-9]):(?:[0-5][0-9]):(?:[0-5][0-9]))?,";           break;
            default           :  regexpatern = regexpatern+"(?:\"(?:[^\"]|\"\")*\"|[^,]*),"; break;
            }
        }

Open in new window

0
 
for_yanCommented:
Any validation implies certain rules.
If you want to validate say phone number - you know that, say it should have certain number of digits you know that
for example area codes in US should not start with zero, etc, etc. - so these are the rules

If you want to validate you string - you should say what it is about that string that you would not like to be accepted.
You only say that it can contain embedded quotes - but is this a violation? - would those quotes not be good for you.
You need to formulate more clearly what it is that you want
0
 
Terry WoodsIT GuruCommented:
Actually I think this is much neater, and also allows a comma to be contained within double quotes, whereas the previous pattern didn't.
String regexpatern = "^";
        int rowCount = jTable1.getRowCount();

        for (int row = 0; row  <  rowCount; row++)
        {
            datatype enumval = datatype.valueOf(jTable1.getValueAt(row, 1).toString());

            switch (enumval) {
            case Integer      :  regexpatern = regexpatern+"[0-9]*,";       break;
            case Float        :  regexpatern = regexpatern+"[-+]?[0-9]*\\.?[0-9]+,";      break;//^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
            case AlphaNumeric :  regexpatern = regexpatern+"^\\w*,";         break;
            case Varchar      :  regexpatern = regexpatern+"(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),";         break;
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9]) (?:([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;
            default           :  regexpatern = regexpatern+"(\"(?:[^\"]|\"\")*\"|[^,]*),"; break;
            }
        }
        if(regexpatern.endsWith(","))
            regexpatern = regexpatern.substring(0, regexpatern.length()-2);
        regexpatern = regexpatern + "$";

Open in new window

0
 
Terry WoodsIT GuruCommented:
This line is the only one changed from your code snippet:
            case Varchar      :  regexpatern = regexpatern+"(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),";         break;
0
 
CEHJCommented:
>>I dont want to split the csv, My main aim to validate csv row.

The two are effectively identical
0
 
Maverick_CoolAuthor Commented:
"TerryAtOtpus, it working great.
One more thing for date time validation (YYYY-MM-DD HH:MM:SS)
([1-3][0-9]{3,3})-(0?[1-9]|1[0-2])-(0?[1-9]|[1-2][0-9]|3[0-1])\\s([0-1][0-9]|2[0-4]):([0-5][0-9]):([0-5][0-9])
how make HH:MM:SS optional, means datetime may only have date and no time component sometimes
0
 
Terry WoodsIT GuruCommented:
It already is optional, but you might be having trouble because the space between the date and time is required. I've shifted it into the optional part of the pattern:
Space not accepted after date when time not present:
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9])(?: ([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;

Space allowed after date when time not present:
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9]) *(?: ([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;

Open in new window

0
 
Maverick_CoolAuthor Commented:
thanks dude!
0
 
Maverick_CoolAuthor Commented:
"TerryAtOpus",
its working great with small inputs, but large its giving stack overflow. Can you optimize using possessive quantifiers
http://stackoverflow.com/questions/7509905/java-lang-stackoverflowerror-while-using-a-regex-to-parse-big-strings.
0
 
Maverick_CoolAuthor Commented:
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

  • 14
  • 11
  • 9
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now