aikimark asked:

Regex parsing

I am doing a series of presentations on string searching and parsing for one of my local user groups, TriMOUG, and thought there might be an elegant regular expression solution for parsing the data from question http:Q_28133857.html in a single pass, rather than a double pass.  However, I've played with different versions of regex patterns and none of them seem to work.  Some of them actually freeze and confound the regexp object.

The snippet below is an extract of the posted file. The file was Unicode with a BOM sequence at the beginning, which I will not worry about for the purposes of this question or next month's user group demo.
9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y
9780020038719|BPOQ|August|Heckscher|30.99|0|03051993|Woodrow Wilson|229|152|38|994||Y||BC|Y
9780020087908|BPOQ|Mark|Stevens|12.99|0|01031984|Big Eight|216|140|14|330||Y||BC|Y
9780020119302|BPOQ|Phyllis W.|Schwebke|10.99|0|19021974|How to Sew Leather, Suede, Fur|235|191|8|288||Y||BC|Y
9780020130406|BPOQ|Mortimer Jerome|Adler|12.99|0|19111984|The Paideia Program|229|152|14|375||Y||BC|Y
9780020160229|BPOQ|Mortimer Jerome|Adler|12.99|0|19071991|How to Think about God|216|140|10|246||Y||BC|Y
9780020218050|BPOQ|Anatoly|Karpov|10.99|0|19041990|The Semi-Closed Openings in Action|216|140|8|192||Y||BC|Y
9780020220053|BPOQ|Peter|Knobler|16.99|0|01041995|Very Seventies|229|152|21|557||Y||BC|Y
9780020223429|BPOQ|Ring W.|Lardner|12.99|0|19091991|You Know Me Al|216|140|12|285||Y||BC|Y
9780020223436|BPOQ|Ring W.|Lardner|12.99|0|19061993|Some Champions|216|140|12|290||Y||BC|Y
9780020223443|BPOQ|Ring W.|Lardner|10.99|0|19091991|Haircut|216|140|11|251||Y||BC|Y
9780020285618|BPOQ|Gyozo|Forintos|12.99|0|19031992|The Petroff Defense|203|127|14|282||Y||BC|Y
9780020288909|BPOQ|Israel A.|Horowitz|10.99|0|19031972|How to Improve Your Chess|216|140|11|266||Y||BC|Y
9780020300656|BPOQ|Mortimer Jerome|Adler|12.99|0|19091993|The Angels and Us|216|140|12|290||Y||BC|Y
9780020321507|BPOQ|Brian|Fawcett|12.99|0|19101989|Cambodia|216|140|12|290||Y||BC|Y
9780020345152|BPOQ|John F.|Marszalek|16.99|0|19041994|Assault at West Point|229|152|19|518||Y||BC|Y
9780020408918|BPOQ|Thomas|Wolfe|26.99|0|19041989|The Complete Short Stories of Thomas Wolfe|229|152|36|949||Y||BC|Y
9780020427254|BPOQ|Nicholas|Christopher|12.99|0|19041994|Walk on the Wild Side|216|140|14|330||Y||BC|Y
9780020456001|BPOQ|Jay|Williams|8.99|0|01101984|Everyone Knows What a Dragon Looks Like|216|216|2|78||Y||BC|N
9780020641407|BPOQ|Mortimer Jerome|Adler|10.99|0|19041992|Truth in Religion|216|140|10|236||Y||BC|Y
9780020820253|BPOQ|Elaine|Showalter|18.99|0|19091993|Modern American Women Writers|235|191|22|742||Y||BC|Y
9780020930815|BPOQ|Gordon|Inkeles|14.99|0|01111994|Ergonomic Living|254|178|10|344||Y||BC|Y
9780023159909|BPOQ||Bryan|49.99|0|04041993|Fire Suppression and Detection Systems|244|170|31|959||Y||BC|Y
9780023174001|BPOQ|George|Burns|61.99|0|02121988|Science of Genetics, The|280|210|26|1136||Y||BC|Y
9780023253409|BPOQ|Frank|Copley|44.99|0|01011975|Vergil|216|140|18|411||Y||BC|Y
9780023269004|BPOQ|William|Dalton|61.99|0|26091993|Technology of Metallurgy, The|254|178|24|833||Y||BC|Y
9780023300295|BPOQ|Mary|Drake|47.99|0|01011992|Retail Fashion Promotion and Advertising|254|178|22|865||Y||BB|Y
9780023395703|BPOQ|Russell|Fraser|57.99|0|01031976|Drama of the English Renaissance|254|178|28|937||Y||BC|Y
9780023397639|BPOQ|Linda|Friedman|36.99|0|01011989|Little LISPer|254|178|11|394||Y||BC|Y
9780023589416|BPOQ|Madeline|Hunter|34.99|0|18111993|Enhancing Teaching|229|152|14|369||Y||BC|Y
9780023628429|BPOQ|Stephen|Kesler|53.99|0|18021994|Mineral Resources Economics and the Environment|280|210|21|919||Y||BC|Y


I am looking for a regex pattern that will work with the vbscript.regexp object library to present matches on a line level and submatches for the pipe-delimited data.

ozo

What do you want to parse from that data?

aikimark (ASKER)

All the pipe-delimited data.

ASKER CERTIFIED SOLUTION
Robert Schutt

This solution is only available to members.

I would have done what you did in the original question: read a line at a time and split each line into an array of fields.

field1=MyArray(0)
field2=MyArray(1)
:
field20=MyArray(19)
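
For comparison with the regex attempts, the split-per-line idea can be sketched as below. This is only an illustration in Python rather than VBScript, under my own assumptions: the two sample records are copied from the extract in the question, and `parse_by_split` is a hypothetical helper name, not something from the original code.

```python
# Two records copied from the extract in the question, with CRLF line endings.
SAMPLE = (
    "9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y\r\n"
    "9780023159909|BPOQ||Bryan|49.99|0|04041993|Fire Suppression and Detection Systems|244|170|31|959||Y||BC|Y\r\n"
)


def parse_by_split(text):
    """Return one list of fields per non-empty line (hypothetical helper)."""
    rows = []
    for line in text.splitlines():  # splitlines() strips \r\n, \n and \r
        if line:
            rows.append(line.split("|"))  # empty fields are kept as ""
    return rows


rows = parse_by_split(SAMPLE)
for fields in rows:
    # Index 0 is the ISBN and index 7 is the title in this sample layout.
    print(len(fields), fields[0], fields[7])
```

Splitting keeps empty fields (such as the missing first name in the second record), so every row comes back with the same number of columns.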

I have read that you can access .NET controls from VBScript, but the .NET version has to be prior to 4.

I wonder if it would speed things up to use AJAX to post each row of data client-side.
@Robert

Is there an advantage to using ([^|]*) instead of (.*?)?

This pattern parses the part.txt file in the original question.  I guess I was relying on repeating capture groups, which aren't provided by the RegExp object.
(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\r|$
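
A rough way to compare the two group styles is to build the 17-group pattern from either one and check that both return the same submatches. The sketch below uses Python's `re` module, not VBScript.RegExp, so behaviour may differ in detail. The names `build_pattern` and `FIELD_COUNT`, the added `\r\n` inside the negated class (a bare `[^|]` would also match line breaks), and the parenthesised `(?:\r|$)` ending are my assumptions, not part of the posted pattern. The last two searches illustrate one detail worth checking: alternation binds more loosely than concatenation, so an unparenthesised trailing `\r|$` means "the whole pattern followed by `\r`, or just `$`".

```python
import re

FIELD_COUNT = 17  # fields per record in the sample data

# Two records copied from the extract in the question, with CRLF line endings.
SAMPLE = (
    "9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y\r\n"
    "9780023159909|BPOQ||Bryan|49.99|0|04041993|Fire Suppression and Detection Systems|244|170|31|959||Y||BC|Y\r\n"
)


def build_pattern(group):
    """Join FIELD_COUNT copies of `group` with literal pipes, anchored per line."""
    body = r"\|".join([group] * FIELD_COUNT)
    return re.compile("^" + body + r"(?:\r|$)", re.MULTILINE)


negated = build_pattern(r"([^|\r\n]*)")  # cannot step over a pipe or a line break
lazy = build_pattern(r"(.*?)")           # grows one character at a time

rows_negated = [m.groups() for m in negated.finditer(SAMPLE)]
rows_lazy = [m.groups() for m in lazy.finditer(SAMPLE)]

print(rows_negated == rows_lazy)               # same submatches on well-formed lines
print(len(rows_negated), len(rows_negated[0]))

# Precedence check on a tiny pattern: without the (?:...) wrapper, "$" is a
# complete alternative of its own and matches the empty string at the end.
loose = re.search(r"(x)\|(y)\r|$", "x|y")
tight = re.search(r"(x)\|(y)(?:\r|$)", "x|y")
print(loose.groups())
print(tight.groups())
```

On well-formed lines the two styles agree; the difference only shows up when a line does not fit the pattern.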
You know, I was about to answer that it could go wrong due to backtracking, but actually it does exactly the same here. In other circumstances the negated character class is better because it simply can't go past the pipe, whereas the non-greedy * can cause backtracking when there is a mismatch later on (because it will also try matches that include the pipe). I think the 'end-of-line' condition here keeps that from going haywire.
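
To make the "can't go past the pipe" point concrete, here is a small Python sketch (Python's `re`, not VBScript.RegExp). The two-field patterns and the three-field input line are made up for illustration only.

```python
import re

line = "a|b|c"  # hypothetical record with one field more than the pattern expects

lazy = re.compile(r"^(.*?)\|(.*?)$")
negated = re.compile(r"^([^|]*)\|([^|]*)$")

lazy_match = lazy.match(line)
negated_match = negated.match(line)

# The lazy group keeps growing until "$" can match, so it swallows a pipe.
print(lazy_match.groups())

# The negated class cannot step over the delimiter, so the match simply fails.
print(negated_match)
```

The lazy version quietly returns a wrong split on a malformed line, while the negated class reports no match, which is usually easier to detect.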

I have to admit, though: I used to think I knew a lot about the inner workings of (Perl) regexps, until I read DON'T BE A FRED, which just blew me out of the water. Since you mentioned freezing when using certain regexp patterns, I thought that must have been excessive backtracking due to mismatches, so it could have been a non-greedy match going past the end-of-line for some reason. There could also be a difference between using the Global option and sticking it all in (...)* and trying to match as a single string.
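
The difference between a repeated group and repeated matching can be sketched as follows. This is Python's `re` again, so it only approximates what VBScript.RegExp does; `finditer` plays the role that setting Global to True plays for Execute. The input is the first five fields of one record from the extract, and both patterns are my own illustrations.

```python
import re

# First five fields of one record from the extract in the question.
line = "9780023253409|BPOQ|Frank|Copley|44.99"

# Approach 1: wrap the field group in (...)* and match the line as a single string.
# A repeated capture group only remembers its last repetition.
single = re.fullmatch(r"(?:([^|]*)\|)*([^|]*)", line)
print(single.groups())

# Approach 2: a one-field pattern applied repeatedly across the line.
# Each field starts either at the beginning of the string or right after a pipe.
fields = [m.group(1) for m in re.finditer(r"(?:^|\|)([^|]*)", line)]
print(fields)
```

With the repeated group, the earlier fields are matched but not retained, which is why one explicit group per field (or one match per field) is needed to get every value back.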
Thanks for the regex pattern.