Maverick_Cool asked:

Stack overflow error for java regex

I am getting a StackOverflowError when validating a CSV line that contains a large string field.

Regex: (?![^\",][^,]\")(\"(\"\"|[^\"])\"|[^\",]),[0-9]

TargetString:

"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen TFT display is included that provides clear visual instruction, complete with &quot;Lane Assist&quot; technology that provides virtual first-person instruction on precisely what lanes to use. Comprehensive &quot;City Navigator&quot; maps are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of interest. Pedestrian navigation is also fully supported on the 1450LMT, with the &quot;CityXplorer&quot; service offering bus, rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined with the &quot;EcoRoute&quot; mode, while &quot;HotFix&quot; predictive satellite technology helps to maintain the most accurate locational information even when signal is temporarily lost. Photo navigation is supported through Garmin's &quot;Photo Connect&quot; service, and additional car marker and narration voices can be downloaded via the &quot;Garmin Garage&quot; website.&nbsp;</p> <h2>Features</h2> <ul> <li>5-inch backlit TFT color touchscreen</li> <li>Free lifetime traffic updates</li> <li>Free maps</li> <li>MicroSD card support</li> <li>Voice prompts</li> <li>Lane assist function</li> <li>Auto Re-route</li> <li>Route avoidance</li> <li>FM traffic compatibility</li> <li>EcoRoute routing</li> <li>Custom Points Of Interest</li> <li>Garmin garage car marker and voice customization",9

Can someone help to optimize it? Can it be optimized using possessive quantifiers to minimize repetitions?
for_yan

Can you explain what it is that you want to do with this string?
What does "validating a CSV" mean here?
This string is not a CSV.
Terry Woods
I think you missed the first *'s from that regex. You could try this:

(?![^\",][^,]*+\")(\"(\"\"|[^\"])*+\"|[^\",]*),

(Added possessive quantifiers as documented at http://www.regular-expressions.info/possessive.html )
Clarification for others looking at the question: "I think you missed the first *'s from that regex." meant "I think you missed the first *'s from that regex as compared with the solution here: https://www.experts-exchange.com/questions/27422744/Regex-to-to-parse-csv-with-nested-quotes.html?cid=748&anchorAnswerId=37054548#a37054548 ".
TerryAtOpus,
I'd appreciate it if you could explain what needs to be done with this string and what "validation" means in this sense.
I probably can't give a full explanation without a wordy essay, but I believe this regex (slightly different to the one I just provided):

(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),

is being used to validate that a single character field in an excel style csv file is valid, but it's failing on long values. If my suggested change using possessive quantifiers doesn't work, then using the opencsv package as you suggested in the previous question may be the easiest solution.

Note that I don't have experience in using possessive quantifiers, but it makes sense to me that they might solve the problem in this case.
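
A minimal sketch of the comparison, assuming the field is checked with `String.matches` (the thread does not show the asker's calling code). The class and method names are made up for illustration; only the two regexes come from the posts above. Whether the greedy form actually overflows depends on the JVM's thread stack size, the Java version, and the input length, so the sketch catches the error rather than assuming it.

```java
// Hypothetical demo class; names are illustrative, not from the asker's code.
final class PossessiveFieldCheck {

    // Original form: greedy '*' on the quoted-field group.
    static final String GREEDY =
            "(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),";

    // Suggested form: possessive '*+' never gives characters back,
    // so the engine has no backtracking positions to remember for the loop.
    static final String POSSESSIVE =
            "(?![^\",][^,]*+\")(\"(\"\"|[^\"])*+\"|[^\",]*),";

    // Builds one quoted field of the given length, followed by a comma.
    static String quotedField(int length) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < length; i++) {
            sb.append('x');
        }
        return sb.append("\",").toString();
    }

    // Returns "true"/"false" for a normal result, or a marker if the stack overflowed.
    static String tryMatch(String regex, String input) {
        try {
            return String.valueOf(input.matches(regex));
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        String input = quotedField(50_000);
        System.out.println("greedy:     " + tryMatch(GREEDY, input));
        System.out.println("possessive: " + tryMatch(POSSESSIVE, input));
    }
}
```

Running it with a smaller thread stack (for example the standard `-Xss` JVM option) makes any difference between the two forms easier to observe.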
Maverick_Cool (ASKER)

Thanks, "TerryAtOpus". Other EE people in the discussion, can you verify Terry's possessive quantifier solution? It would be a great help.
Can you post the code that generates the stack overflow?
No matter how I use these regexes with this string, I don't get any stack overflows.
regex:
(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*),(?![^",][^,]*")("(""|[^"])*"|[^",]*)

target string:
5,1089764,nuvi 1450LMT Automobile Portable Navigator,4.5,7.54E+11,B003ZX8B2S,,,http://images.shopsavvy.mobi/products/753759970550/image.thumbnail/19202432,http://images.shopsavvy.mobi/products/753759970550/image.full/19202432,ConsumerElectronics,,,INR 6900,Garmin,,Nuvi,The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. ,"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen TFT display is included that provides clear visual instruction, complete with &quot;Lane Assist&quot; technology that provides virtual first-person instruction on precisely what lanes to use. Comprehensive &quot;City Navigator&quot; maps are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of interest. Pedestrian navigation is also fully supported on the 1450LMT, with the &quot;CityXplorer&quot; service offering bus, rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined with the &quot;EcoRoute&quot; mode, while &quot;HotFix&quot; predictive satellite technology helps to maintain the most accurate locational information even when signal is temporarily lost. 
Photo navigation is supported through Garmin's &quot;Photo Connect&quot; service, and additional car marker and narration voices can be downloaded via the &quot;Garmin Garage&quot; website.&nbsp;</p> <h2>Features</h2> <ul> <li>5-inch backlit TFT color touchscreen</li> <li>Free lifetime traffic updates</li> <li>Free maps</li> <li>MicroSD card support</li> <li>Voice prompts</li> <li>Lane assist function</li> <li>Auto Re-route</li> <li>Route avoidance</li> <li>FM traffic compatibility</li> <li>EcoRoute routing</li> <li>Custom Points Of Interest</li> <li>Garmin garage car marker and voice customization",,INSERT
ASKER CERTIFIED SOLUTION
Terry Woods

I have, and it looks OK. The one above is the one that caused the stack overflow. After adding possessive quantifiers the stack overflow no longer occurs, but I still need to check the validation.
Using a very slightly modified version of TerryAtOpus's regex, and adding some extra CSV fields to this long paragraph for testing, I got it to match.
I don't know if this is what you mean by validation:
String ss =
        "23,\"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from " +
                "the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model " +
                "can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal " +
                "transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of " +
                "road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen " +
                "TFT display is included that provides clear visual instruction, complete with &quot;Lane Assist&quot; technology " +
                "that provides virtual first-person instruction on precisely what lanes to use. Comprehensive &quot;City Navigator&quot; maps" +
                " are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of" +
                " interest. Pedestrian navigation is also fully supported on the 1450LMT, with the &quot;CityXplorer&quot; service offering bus, " +
                "rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined " +
                "with the &quot;EcoRoute&quot; mode, while &quot;HotFix&quot; predictive satellite technology helps to" +
                " maintain the most accurate locational information even when signal is temporarily lost. Photo navigation " +
                "is supported through Garmin's &quot;Photo Connect&quot; service, and additional car marker and narration voices can be" +
                " downloaded via the &quot;Garmin Garage&quot; website.&nbsp;</p> <h2>Features</h2> <ul> <li>5-inch backlit TFT color touchscreen</li> <li>Free lifetime" +
                " traffic updates</li> <li>Free maps</li> <li>MicroSD card support</li> <li>Voice prompts</li> <li>Lane assist function</li> <li>Auto Re-route</li> <li>Route " +
                "avoidance</li> <li>FM traffic compatibility</li> <li>EcoRoute routing</li> <li>Custom Points Of Interest</li> <li>Garmin garage car marker and voice customization\",\"dfgdf\",fghfgh";
 




          // matches() must consume the whole string; each repetition is one field plus an optional comma
          if(ss.matches("((?![^\",][^,]*+\")(\"(\"\"|[^\"])*+\"|[^\",]*),?)*")) System.out.println(true);

Output:
true

However, my regex above would probably also match a string without any commas at all, so it does not make much sense.
I still don't fully understand the purpose of this exercise.
My understanding is that the purpose is to ensure csv data is in a valid format (reject values with poor escaping, too many fields, etc). In the previous question the author had code to apply this to a dynamic format (with different data types).
There is no strict general definition of what you do and don't want in a CSV file,
as one can see from these numerous discussions:

http://stackoverflow.com/questions/3002688/regex-to-match-csv-file-nested-quotes

http://stackoverflow.com/questions/778485/regex-for-csv-validation-jquery

In this link:

http://www.java2s.com/Code/Java/Development-Class/SimpledemoofCSVmatchingusingRegularExpressions.htm

they suggest this pattern:
 public static final String CSV_PATTERN = "\"([^\"]+?)\",?|([^,]+),?|,";

but none of these will cover all situations.
This should be more application-specific: look at what happens in your particular application and at examples of the strings you want to weed out in your specific environment.

After all, you probably need to validate in order to be able to extract the values eventually, and then reading with opencsv will probably be most useful.
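
To illustrate why a simple pattern such as the `CSV_PATTERN` quoted above does not cover all situations, here is a small sketch that extracts fields with it. Only the pattern comes from the linked page; the class name, the `fields` helper, and the sample lines are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only CSV_PATTERN comes from the thread.
final class SimplePatternDemo {

    static final Pattern CSV_PATTERN =
            Pattern.compile("\"([^\"]+?)\",?|([^,]+),?|,");

    // Collects one value per match: group 1 for quoted fields,
    // group 2 for unquoted fields, empty string for a bare comma.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CSV_PATTERN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add(m.group(1));
            } else if (m.group(2) != null) {
                out.add(m.group(2));
            } else {
                out.add("");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Plain quoting with an embedded comma: three fields come back.
        System.out.println(fields("\"brown , fox\",4,5"));
        // A doubled quote inside a quoted field: two fields were intended.
        System.out.println(fields("\"say \"\"hi\"\"\",x"));
    }
}
```

The first line is handled as expected, but the second one, which uses `""` to escape a quote, is broken into more than the two intended fields, because the pattern has no notion of escaped quotes.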




It is hard to believe that you can find anything truly reliable for general CSV validation with regex.

This is from the book Regular Expression Cookbook (page 473):
-----
Match a CSV record and capture the field in column 1 to backreference 1
^([^",\r\n]+|"(?:[^"]|"")*")?(?:,(?:[^",\r\n]+|"(?:[^"]|"")*")?)*
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
----

It still does not seem to work in all cases for CSV validation,
and it also produces a stack overflow when I try your long string.

You'll of course need to escape the quotes for Java use:

("^([^\",\r\n]+|\"(?:[^\"]|\"\")*\")?(?:,(?:[^\",\r\n]+|\"(?:[^\"]|\"\")*\")?)*")

"It is hard to believe that you can find anything truly reliable for general CSV validation with regex."

It's the text-qualified fields (i.e. fields surrounded by quotes) that make it a pain in the ***  ; )
I think that statement was meant to read like this?
"It is hard to believe that you can't find anything truly reliable for general CSV validation with regex."
(But that's how you interpreted it anyway kaufmed, I think!)
No, actually, I became rather pessimistic, so I meant it the way it is written: it is hard to find something that will work in 100% of cases for validating a CSV line.
That was after I found that even the regex in the book, specifically designed for that, didn't work for this long string. I think it didn't work for two quotes (which should represent one quote) within a CSV field either.
The "normal" csv format may have changed over time, so the book you referred to may have just been out of date?

The thing that really bugs me about .csv files though is that when you open them in Excel the data gets automatically altered (with no indication that it's happened)... I tried to find an easy workaround for the issue, but there isn't really one available (see https://www.experts-exchange.com/questions/26427029/Excel-corruption-of-csv-file-data.html )
No, I don't think the format has changed since 2009 (or maybe 2008, when the book was written, as it was published in 2009). Besides, no matter how the format might have changed, this long text should be allowed as a field in CSV, so the regex should have worked with it; instead it also caused a stack overflow exception.
Anyway, maybe there is still a good regex that would at least work for the domain of strings specific to the application in question.
 

I don't know about validating the whole CSV line, but this regex
split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", 99);

from http://stackoverflow.com/questions/2241758/regarding-java-split-command-parsing-csv-file

seems to split a CSV line very reliably (including a line with a huge CSV field and with embedded paired double quotes) - the only problem which I found will be if you have non-paired double quotes within the field.
Other than that, it worked as expected in all the different tests I ran (see the output below).

I don't know if it is of use for the question, but it can be used in many cases as an alternative to reading the CSV file with opencsv.
(99 is the maximum number of fields on the line; passing a limit ensures that empty strings are returned for empty fields at the end.)

   String ss20 = "23,\"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from " +
                "the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model " +
                "can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal " +
                "transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of " +
                "road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen " +
                "TFT display is included that provides clear visual instruction, complete with &quot;Lane Assist&quot; technology " +
                "that provides virtual first-person instruction on precisely what lanes to use. Comprehensive &quot;City Navigator&quot; maps" +
                " are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of" +
                " interest. Pedestrian navigation is also fully supported on the 1450LMT, with the &quot;CityXplorer&quot; service offering bus, " +
                "rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined " +
                "with the &quot;EcoRoute&quot; mode, while &quot;HotFix&quot; predictive satellite technology helps to" +
                " maintain the most accurate locational information even when signal is temporarily lost. Photo navigation " +
                "is supported through Garmin's &quot;Photo Connect&quot; service, and additional car marker and narration voices can be" +
                " downloaded via the &quot;Garmin Garage&quot; website.&nbsp;</p> <h2>Features</h2> <ul> <li>5-inch backlit TFT color touchscreen</li> <li>Free lifetime" +
                " traffic updates</li> <li>Free maps</li> <li>MicroSD card support</li> <li>Voice prompts</li> <li>Lane assist function</li> <li>Auto Re-route</li> <li>Route " +
                "avoidance</li> <li>FM traffic compatibility</li> <li>EcoRoute routing</li> <li>Custom Points Of Interest</li> <li>Garmin garage car marker and voice customization\",\"dfgdf\",fghfgh";

      // A comma is a separator only when it is followed by an even number of double quotes,
      // i.e. when it lies outside a quoted field.
      String regex = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

      // The long line above, followed by the shorter test lines.
      String[] inputs = {
              ss20,
              "1,2,3,4,5,6,7,8,9,10,11",
              "1,2,\"3\",4,5",
              "1,2,3\"\"4,4,5",
              "1,2,3\"fdg,\"4,4,5",
              "1,2,\"brown , fox\",4,5",
              "1,,2,\"brown , fox\",4,5,,,",
              "1,,2,\"br'o'wn , f\"o\"x\",4,5,,,",
              "1,,2,\"br\"o\"wn\" ju\"mp , fox\",4,5,,,"
      };

      for (int i = 0; i < inputs.length; i++) {
          System.out.println("");
          System.out.println("String: " + inputs[i]);
          // The limit of 99 keeps trailing empty fields in the result.
          for (String s : inputs[i].split(regex, 99)) {
              System.out.println("split" + (i + 1) + ": " + s);
          }
      }


String: 23,"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen TFT display is included that provides clear visual instruction, complete with &quot;Lane Assist&quot; technology that provides virtual first-person instruction on precisely what lanes to use. Comprehensive &quot;City Navigator&quot; maps are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of interest. Pedestrian navigation is also fully supported on the 1450LMT, with the &quot;CityXplorer&quot; service offering bus, rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined with the &quot;EcoRoute&quot; mode, while &quot;HotFix&quot; predictive satellite technology helps to maintain the most accurate locational information even when signal is temporarily lost. 
Photo navigation is supported through Garmin's &quot;Photo Connect&quot; service, and additional car marker and narration voices can be downloaded via the &quot;Garmin Garage&quot; website.&nbsp;</p> <h2>Features</h2> <ul> <li>5-inch backlit TFT color touchscreen</li> <li>Free lifetime traffic updates</li> <li>Free maps</li> <li>MicroSD card support</li> <li>Voice prompts</li> <li>Lane assist function</li> <li>Auto Re-route</li> <li>Route avoidance</li> <li>FM traffic compatibility</li> <li>EcoRoute routing</li> <li>Custom Points Of Interest</li> <li>Garmin garage car marker and voice customization","dfgdf",fghfgh
split1: 23
split1: "The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen TFT display is included that provides clear visual instruction, complete with &quot;Lane Assist&quot; technology that provides virtual first-person instruction on precisely what lanes to use. Comprehensive &quot;City Navigator&quot; maps are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of interest. Pedestrian navigation is also fully supported on the 1450LMT, with the &quot;CityXplorer&quot; service offering bus, rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined with the &quot;EcoRoute&quot; mode, while &quot;HotFix&quot; predictive satellite technology helps to maintain the most accurate locational information even when signal is temporarily lost. Photo navigation is supported through Garmin's &quot;Photo Connect&quot; service, and additional car marker and narration voices can be downloaded via the &quot;Garmin Garage&quot; website.&nbsp;</p> <h2>Features</h2> <ul> <li>5-inch backlit TFT color touchscreen</li> <li>Free lifetime traffic updates</li> <li>Free maps</li> <li>MicroSD card support</li> <li>Voice prompts</li> <li>Lane assist function</li> <li>Auto Re-route</li> <li>Route avoidance</li> <li>FM traffic compatibility</li> <li>EcoRoute routing</li> <li>Custom Points Of Interest</li> <li>Garmin garage car marker and voice customization"
split1: "dfgdf"
split1: fghfgh

String: 1,2,3,4,5,6,7,8,9,10,11
split2: 1
split2: 2
split2: 3
split2: 4
split2: 5
split2: 6
split2: 7
split2: 8
split2: 9
split2: 10
split2: 11

String: 1,2,"3",4,5
split3: 1
split3: 2
split3: "3"
split3: 4
split3: 5

String: 1,2,3""4,4,5
split4: 1
split4: 2
split4: 3""4
split4: 4
split4: 5

String: 1,2,3"fdg,"4,4,5
split5: 1
split5: 2
split5: 3"fdg,"4
split5: 4
split5: 5

String: 1,2,"brown , fox",4,5
split6: 1
split6: 2
split6: "brown , fox"
split6: 4
split6: 5

String: 1,,2,"brown , fox",4,5,,,
split7: 1
split7: 
split7: 2
split7: "brown , fox"
split7: 4
split7: 5
split7: 
split7: 
split7: 

String: 1,,2,"br'o'wn , f"o"x",4,5,,,
split8: 1
split8: 
split8: 2
split8: "br'o'wn , f"o"x"
split8: 4
split8: 5
split8: 
split8: 
split8: 

String: 1,,2,"br"o"wn" ju"mp , fox",4,5,,,
split9: 1
split9: 
split9: 2
split9: "br"o"wn" ju"mp , fox"
split9: 4
split9: 5
split9: 
split9: 
split9: 
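
As a rough sketch of how the split approach could be guarded against its one known weak spot (non-paired double quotes), the helper below counts quotes before splitting. The class and method names are assumptions made for this example; only the split regex comes from the post above. It also uses a negative limit instead of 99, which in `String.split` keeps trailing empty fields without capping the number of fields.

```java
import java.util.Arrays;

// Hypothetical helper; only the split regex comes from the thread.
final class QuoteAwareSplit {

    // A comma counts as a separator only if an even number of quotes follows it.
    static final String SEPARATOR = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // True when the line contains an even number of double-quote characters.
    static boolean hasBalancedQuotes(String line) {
        int quotes = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 0;
    }

    // Splits the line; rejects lines whose quotes cannot be paired.
    static String[] split(String line) {
        if (!hasBalancedQuotes(line)) {
            throw new IllegalArgumentException("non-paired double quote in: " + line);
        }
        return line.split(SEPARATOR, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("1,,2,\"brown , fox\",4,5,,,")));
        System.out.println(hasBalancedQuotes("1,2,3\"fdg,4,5"));
        // For comparison: the default split (limit 0) drops trailing empty fields.
        System.out.println("1,2,,,".split(",").length);
    }
}
```

An even quote count is only a necessary condition, not proof of a well-formed line, so this check narrows the problem rather than solving it.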

@TerryAtOpus
"I think that statement was meant to read like this?"
If you mean quoted, then yes...  I hit the wrong button.

"the only problem which I found will be if you have non-paired double quotes within the field."
...and therein lies the problem. You really need a parser for parsing CSV. This is because one can have quotes that qualify a textual field, or there can be a quote that is simply a part of the data. A CFG should be relatively simple to write for parsing CSV--but my CFG skills are not that great, so I won't offer one here.
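
For readers who want to see what such a parser might look like, here is a minimal character-by-character sketch for a single line. It is not the grammar mentioned above and not a replacement for a CSV library; the class name and the strict rules it enforces (quotes only in quoted fields, `""` as an escaped quote, nothing between a closing quote and the next comma) are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal parser for one CSV line; not a replacement for a CSV library.
final class CsvLineParser {

    // Parses one line; throws IllegalArgumentException on malformed quoting.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;   // currently inside a quoted field
        boolean wasQuoted = false;  // current field was quoted and has been closed
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                        wasQuoted = true;
                    }
                } else {
                    field.append(c);
                }
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
                wasQuoted = false;
            } else if (wasQuoted) {
                throw new IllegalArgumentException("text after closing quote at index " + i);
            } else if (c == '"') {
                if (field.length() > 0) {
                    throw new IllegalArgumentException("quote inside unquoted field at index " + i);
                }
                inQuotes = true;
            } else {
                field.append(c);
            }
            i++;
        }
        if (inQuotes) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,,2,\"brown , fox\",\"say \"\"hi\"\"\",,"));
    }
}
```

Because the loop keeps only a few flags and a buffer, its memory use does not grow with nesting or backtracking, so a very long field such as the one in the question is processed in a single pass.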
thanks