Link to home
Start Free TrialLog in
Avatar of Maverick_Cool
Maverick_CoolFlag for India

asked on

Regex to to parse csv with nested quotes

I want validate a csv row.
the format is as follows:
1,3.12,BCD,"abc 4"" ""qwe""",2011/10/10

the above data is MS excel type escaped code for:
1,3.12,BCD,abc 4" "qwe"",2011/10/10

I just need to match or validate string, thats all.
The main problem here is nested quotes.

The programming language is Java
Avatar of Terry Woods
Terry Woods
Flag of New Zealand image

This pattern validates your format:
1. allowing empty fields and
2. checking the quotes are in pairs if enclosed in double quotes

^([^"]*,){3}(?![^",][^,]*")("?)(""|[^"])*\2,[^"]*$

In Java format (backslashes and double quotes escaped), I think it would be:

"^([^\"]*,){3}(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,[^\"]*$"

Out of the following test cases, just the first 3 are accepted in my testing:
1,3.12,BCD,"abc 4"" ""qwe""",2011/10/10
1,3.12,BCD,abc 4,2011/10/10
1,3.12,BCD,"abc ",2011/10/10
1,3.12,BCD,abc 4""asd,2011/10/10
1,3.12,BCD,abc 4"asd,2011/10/10
1,3.12,BCD,"abc 4"" ""qwe"",2011/10/10
Correction:
"^([^\",]*,){3}(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,[^\",]*$"
That latest pattern only matches if there are 5 fields on the line.
Avatar of Maverick_Cool

ASKER

hi,
i clarification
BCD and "abc 4"" ""qwe""" should have common regex, it should start and end with double quotes only if there is a nested quote.

more over you underline regex part for nested quote in above expression(abc 4" "qwe"")
"abc 4"" ""qwe"""
Ok, I've altered the pattern to allow both the 3rd and 4th fields to be strings:

"^([^\",]*,){2}(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\4,[^\",]*$"
can you pin pint part in regex that points to "abc 4"" ""qwe"""
also can you please explain me to the code a bit.
One more mistake in the pattern corrected:
"^([^\",]*,){2}(?![^\",][^,]*\")(\"?)(\"\"|[^\",])*\\2,(?![^\",][^,]*\")(\"?)(\"\"|[^\",])*\\4,[^\",]*$"

^  start of string
([^\",]*,){2}  2 fields containing neither a comma nor a " each terminated by a comma

The 3rd field uses pattern:
(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*\\2,
(?![^\",][^,]*\")  is a negative lookahead which causes the match to fail if there's a " character after the first character but before the comma that ends the field
(\"?)  captures the " character at the start of the field, if there is one
(\"\"|[^\",])*  is the contents of the field (either pairs of double quotes, or non-double-quotes-non-commas)
\\2  refers to the captured double quote (or nothing, if there wasn't one), being the 2nd captured sub-pattern
,  the comma that ends the 3rd field

The 4th field is the same as the 3rd, but refers back to the 4th captured sub-pattern instead of the 2nd

[^\",]*  The 5th field
$  Matches the end of the line
what do you mean by validating this string?
Are you just concerned that between each two commas
you have the even number of double quotes?

Or you want to split the csv string with regular rules
which can be separated by enclosing double quotes
and you want to get whatever it is in between the enclosing double quotes if such quotes
are prsent. In the latter case you'll get nested quotes OK, if they are within the outer quotes
Why would you want to reinvent a csv parser instead of just using one?
You can certainly use the opencsv package if your goal is to pasre the CSV file

http://opencsv.sourceforge.net/
can you tell me the significance of \\2, does it has anything to do with position.
Let me be more clear. I am trying build the regex string dynamically depending on column type. I have attached the screen shot and code. Please help fix, I am most struck in the varchar(a string which may contain a nested quote) and not nullable(empty column) check.

String regexpatern = "^";
        int rowCount = jTable1.getRowCount();

        for (int row = 0; row  <  rowCount; row++)
        {
            datatype enumval = datatype.valueOf(jTable1.getValueAt(row, 1).toString());

            switch (enumval) {
            case Integer      :  regexpatern = regexpatern+"[0-9]*,";       break;
            case Float        :  regexpatern = regexpatern+"[-+]?[0-9]*\\.?[0-9]+,";      break;//^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
            case AlphaNumeric :  regexpatern = regexpatern+"^\\w*,";         break;
            case Varchar      :  regexpatern = regexpatern+"(?![^\",][^,]*\")(\"?)(\"\"|[^\"])*,";         break;
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9]) (?:([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;
            default           :  regexpatern = regexpatern+"(\"(?:[^\"]|\"\")*\"|[^,]*),"; break;
            }
        }
        if(regexpatern.endsWith(","))
            regexpatern = regexpatern.substring(0, regexpatern.length()-2);
        regexpatern = regexpatern + "$";

Open in new window

validatecsv.JPG
I dont want to split the csv, My main aim to validate csv row.
But you want to validate each column in its own way - so you fisrt have to split and then apply spacial validation
pattern to each of the columns - that would be the most starightforward way
Each pair of round brackets like this (blah) is a capturing sub-pattern. (?:blah) is a non-capturing sub-pattern. \\2 is a back reference to match what was captured in the 2nd capturing sub-pattern.
see there will thousand of record. If i split and then do validation, its going to take time.
SO you should formulate - what it is in that varchar string that you sould consider offending - it has nesetd double quotes - but this is acceptable - so when you want to validate you should have rules which will say - this is wrong - this is good
OK, you can validate at the time you read them (if that will be faster)
but still you need to have the rulse - what wll violate the string rules
You are saying - they can have embedded quote - but dio you mean that you don't want such strings, or you mean that this is oK.
Or you want to make sure that you have even  nnumber of double quotes ?
It all ddpends on your business requirements _ after you formulate that you can start thing of the RegEx which could help you
And then there is obviouls no connection to any CSV file here?
If you want to give feedback on what's wrong with the data when it's incorrect (eg too many fields, string not in valid format etc), it's probably best to split the data by field and validate that way. The pattern I've provided should work for determining whether a line is valid or not, but I have to admit it's not very maintainable for the average programmer.
hi TerryAtOpus, I just need to validate thats all. See after loop the data types, nof regex part will be equal to no of columns(rite?)
I have posted a sample code that trying to build , can you help me with that.
and this is where thwey say how to macth string  even number of quotes
http://stackoverflow.com/questions/668179/regex-to-match-a-string-with-an-even-number-of-quotes

hi TerryAtOpus, I just need to validate thats all. See after looping the column header for data types, no. of regex sections will be equal to no of columns(rite?)
I have posted a sample code that trying to build , can you help me with that.
Can you post an example of the string which should not satisfy your validation for varchar column?
Made all sub-patterns non-capturing except the one that captures the double quote that we need to back-reference. Try this:
int varCharCount=1;  //use this to determine which captured sub-pattern to back-reference
        for (int row = 0; row  <  rowCount; row++)
        {
            datatype enumval = datatype.valueOf(jTable1.getValueAt(row, 1).toString());
            

            switch (enumval) {
            case Integer      :  regexpatern = regexpatern+"[0-9]*,";       break;
            case Float        :  regexpatern = regexpatern+"[-+]?[0-9]*\\.?[0-9]+,";      break;//^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
            case AlphaNumeric :  regexpatern = regexpatern+"^\\w*,";         break;
            case Varchar      :
               regexpatern = regexpatern+"(?![^\",][^,]*\")(\"?)(?:\"\"|[^\"])*\\"+varCharCount+",";  //refer back to captured sub-pattern for matching a closing double quote
               varCharCount++;
               break;
            case DateTime     :  regexpatern = regexpatern+"(?:[0-9]{2,4})-(?:[0-1][0-9])-(?:[0-3][0-9]) (?:(?:[0-2][0-9]):(?:[0-5][0-9]):(?:[0-5][0-9]))?,";           break;
            default           :  regexpatern = regexpatern+"(?:\"(?:[^\"]|\"\")*\"|[^,]*),"; break;
            }
        }

Open in new window

Any validation implies certain rules.
If you want to validate say phone number - you know that, say it should have certain number of digits you know that
for example area codes in US should not start with zero, etc, etc. - so these are the rules

If you want to validate you string - you should say what it is about that string that you would not like to be accepted.
You only say that it can contain embedded quotes - but is this a violation? - would those quotes not be good for you.
You need to formulate more clearly what it is that you want
ASKER CERTIFIED SOLUTION
Avatar of Terry Woods
Terry Woods
Flag of New Zealand image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
This line is the only one changed from your code snippet:
            case Varchar      :  regexpatern = regexpatern+"(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),";         break;
>>I dont want to split the csv, My main aim to validate csv row.

The two are effectively identical
"TerryAtOtpus, it working great.
One more thing for date time validation (YYYY-MM-DD HH:MM:SS)
([1-3][0-9]{3,3})-(0?[1-9]|1[0-2])-(0?[1-9]|[1-2][0-9]|3[0-1])\\s([0-1][0-9]|2[0-4]):([0-5][0-9]):([0-5][0-9])
how make HH:MM:SS optional, means datetime may only have date and no time component sometimes
It already is optional, but you might be having trouble because the space between the date and time is required. I've shifted it into the optional part of the pattern:
Space not accepted after date when time not present:
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9])(?: ([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;

Space allowed after date when time not present:
            case DateTime     :  regexpatern = regexpatern+"([0-9]{2,4})-([0-1][0-9])-([0-3][0-9]) *(?: ([0-2][0-9]):([0-5][0-9]):([0-5][0-9]))?,";           break;

Open in new window

thanks dude!
"TerryAtOpus",
its working great with small inputs, but large its giving stack overflow. Can you optimize using possessive quantifiers
http://stackoverflow.com/questions/7509905/java-lang-stackoverflowerror-while-using-a-regex-to-parse-big-strings.