Regex parsing

I am doing a series of presentations on string searching and parsing for one of my local user groups, TriMOUG, and thought there might be an elegant regular expression solution for parsing the data from question http:Q_28133857.html in a single pass, rather than a double pass.  However, I've played with different versions of regex patterns and none of them seem to work.  Some of them actually freeze and confound the regexp object.

The snippet below is an extract of the posted file. The file was Unicode with a BOM sequence at the beginning, which I will not worry about for the purposes of this question or next month's user group demo.
9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y
9780020038719|BPOQ|August|Heckscher|30.99|0|03051993|Woodrow Wilson|229|152|38|994||Y||BC|Y
9780020087908|BPOQ|Mark|Stevens|12.99|0|01031984|Big Eight|216|140|14|330||Y||BC|Y
9780020119302|BPOQ|Phyllis W.|Schwebke|10.99|0|19021974|How to Sew Leather, Suede, Fur|235|191|8|288||Y||BC|Y
9780020130406|BPOQ|Mortimer Jerome|Adler|12.99|0|19111984|The Paideia Program|229|152|14|375||Y||BC|Y
9780020160229|BPOQ|Mortimer Jerome|Adler|12.99|0|19071991|How to Think about God|216|140|10|246||Y||BC|Y
9780020218050|BPOQ|Anatoly|Karpov|10.99|0|19041990|The Semi-Closed Openings in Action|216|140|8|192||Y||BC|Y
9780020220053|BPOQ|Peter|Knobler|16.99|0|01041995|Very Seventies|229|152|21|557||Y||BC|Y
9780020223429|BPOQ|Ring W.|Lardner|12.99|0|19091991|You Know Me Al|216|140|12|285||Y||BC|Y
9780020223436|BPOQ|Ring W.|Lardner|12.99|0|19061993|Some Champions|216|140|12|290||Y||BC|Y
9780020223443|BPOQ|Ring W.|Lardner|10.99|0|19091991|Haircut|216|140|11|251||Y||BC|Y
9780020285618|BPOQ|Gyozo|Forintos|12.99|0|19031992|The Petroff Defense|203|127|14|282||Y||BC|Y
9780020288909|BPOQ|Israel A.|Horowitz|10.99|0|19031972|How to Improve Your Chess|216|140|11|266||Y||BC|Y
9780020300656|BPOQ|Mortimer Jerome|Adler|12.99|0|19091993|The Angels and Us|216|140|12|290||Y||BC|Y
9780020321507|BPOQ|Brian|Fawcett|12.99|0|19101989|Cambodia|216|140|12|290||Y||BC|Y
9780020345152|BPOQ|John F.|Marszalek|16.99|0|19041994|Assault at West Point|229|152|19|518||Y||BC|Y
9780020408918|BPOQ|Thomas|Wolfe|26.99|0|19041989|The Complete Short Stories of Thomas Wolfe|229|152|36|949||Y||BC|Y
9780020427254|BPOQ|Nicholas|Christopher|12.99|0|19041994|Walk on the Wild Side|216|140|14|330||Y||BC|Y
9780020456001|BPOQ|Jay|Williams|8.99|0|01101984|Everyone Knows What a Dragon Looks Like|216|216|2|78||Y||BC|N
9780020641407|BPOQ|Mortimer Jerome|Adler|10.99|0|19041992|Truth in Religion|216|140|10|236||Y||BC|Y
9780020820253|BPOQ|Elaine|Showalter|18.99|0|19091993|Modern American Women Writers|235|191|22|742||Y||BC|Y
9780020930815|BPOQ|Gordon|Inkeles|14.99|0|01111994|Ergonomic Living|254|178|10|344||Y||BC|Y
9780023159909|BPOQ||Bryan|49.99|0|04041993|Fire Suppression and Detection Systems|244|170|31|959||Y||BC|Y
9780023174001|BPOQ|George|Burns|61.99|0|02121988|Science of Genetics, The|280|210|26|1136||Y||BC|Y
9780023253409|BPOQ|Frank|Copley|44.99|0|01011975|Vergil|216|140|18|411||Y||BC|Y
9780023269004|BPOQ|William|Dalton|61.99|0|26091993|Technology of Metallurgy, The|254|178|24|833||Y||BC|Y
9780023300295|BPOQ|Mary|Drake|47.99|0|01011992|Retail Fashion Promotion and Advertising|254|178|22|865||Y||BB|Y
9780023395703|BPOQ|Russell|Fraser|57.99|0|01031976|Drama of the English Renaissance|254|178|28|937||Y||BC|Y
9780023397639|BPOQ|Linda|Friedman|36.99|0|01011989|Little LISPer|254|178|11|394||Y||BC|Y
9780023589416|BPOQ|Madeline|Hunter|34.99|0|18111993|Enhancing Teaching|229|152|14|369||Y||BC|Y
9780023628429|BPOQ|Stephen|Kesler|53.99|0|18021994|Mineral Resources Economics and the Environment|280|210|21|919||Y||BC|Y


I am looking for a regex pattern that will work with the vbscript.regexp object to produce one match per line, with one submatch per pipe-delimited field.
aikimark Asked:

ozo Commented:
what do you want to parse from that data?
aikimark (Author) Commented:
all the pipe-delimited data.
Robert Schutt (Software Engineer) Commented:
Elegant? Hmmm no not really I guess...

AFAIK the vbscript regexp does not have that nice feature of .NET regex: captures.

So a repeated group shows up in Submatches as the last match only, which is useless in this case.
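A minimal sketch of that behaviour, assuming a made-up four-field line `a|b|c|d` rather than the real data. The repeated group ends up holding only its last iteration, so the first three fields cannot all be recovered from Submatches:

```vbscript
Option Explicit

Dim oRE, oMatch
Set oRE = CreateObject("VBScript.RegExp")

' One capturing group repeated with *, plus a final group for the last field.
oRE.Pattern = "^(?:([^|]*)\|)*([^|]*)$"

Set oMatch = oRE.Execute("a|b|c|d")(0)

WScript.Echo oMatch.SubMatches.Count   ' 2 - one entry per group, not per repetition
WScript.Echo oMatch.SubMatches(0)      ' c - only the last repetition survives
WScript.Echo oMatch.SubMatches(1)      ' d
```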

The only thing that works for me and seems close to what you want (regarding matches/submatches) if not very elegant, is to repeat the sub-pattern for a field as many times as you have actual fields:
Option Explicit

Dim oFSO, oFile, oText, oRE, iCounter, oMatches, oMatch, iCounter2, sSubmatch

Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFile = oFSO.OpenTextFile("tst.txt", 1, False)
oText = oFile.ReadAll
oFile.Close
Set oFile = Nothing
Set oFSO = Nothing

Set oRE = CreateObject("vbscript.regexp")
oRE.Pattern = "([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\r\n"
'oRE.Pattern = "(([^|]*)\|)*([^|]*)\r\n" ' doesn't work
oRE.Global = True
oRE.MultiLine = True
iCounter = 1
Set oMatches = oRE.Execute(oText)
For Each oMatch In oMatches
  WScript.Echo "--- record #" & iCounter
  iCounter2 = 1
  For Each sSubmatch In oMatch.Submatches
    WScript.Echo "field #" & iCounter2 & " - " & sSubmatch
    iCounter2 = iCounter2 + 1
  Next
  WScript.Echo "---"
  iCounter = iCounter + 1
Next

If the number of fields is variable then of course you could parse the first line and generate the pattern dynamically. Not sure if that would make it more or actually less elegant ;-)
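A rough sketch of generating the pattern from the first line, assuming every line has the same number of fields as the first one. `BuildPattern` is a hypothetical helper name, not something from the original script:

```vbscript
Option Explicit

' Hypothetical helper: returns "([^|]*)\|([^|]*)\|...([^|]*)\r\n" for nFields fields.
Function BuildPattern(nFields)
    Dim parts(), i
    ReDim parts(nFields - 1)
    For i = 0 To nFields - 1
        parts(i) = "([^|]*)"
    Next
    ' Join the per-field groups with an escaped pipe and end at the line break.
    BuildPattern = Join(parts, "\|") & "\r\n"
End Function

Dim sFirstLine, nFields
sFirstLine = "9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y"

' Split gives one array element per field, including the empty ones.
nFields = UBound(Split(sFirstLine, "|")) + 1

WScript.Echo nFields                 ' 17 for the sample line above
WScript.Echo BuildPattern(nFields)
```

The result can be assigned to oRE.Pattern in place of the hand-written 17-group string.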

Scott Fell (EE MVE, Developer & EE Moderator) Commented:
I would have done what you did in the original question, where you read a line and split each line into an array.

field1=MyArray(0)
field2=MyArray(1)
:
field20=MyArray(19)
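
A minimal sketch of that array approach for a single line, assuming the line has already been read into a string (sLine is an illustrative name, and the sample value is copied from the posted data):

```vbscript
Option Explicit

Dim sLine, MyArray, i
sLine = "9780020223443|BPOQ|Ring W.|Lardner|10.99|0|19091991|Haircut|216|140|11|251||Y||BC|Y"

' Split returns a zero-based array with one element per pipe-delimited field.
MyArray = Split(sLine, "|")

For i = 0 To UBound(MyArray)
    WScript.Echo "field" & (i + 1) & "=" & MyArray(i)
Next
```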

I have read that you can access .NET controls from VBScript, but the .NET version has to be earlier than 4.

I wonder if it would speed things up to use AJAX to post each row of data client side.
aikimark (Author) Commented:
@Robert

Is there an advantage to using ([^|]*) instead of (.*?)?

This pattern parses the part.txt file in the original question.  I guess I was relying on repeating capture groups, which aren't provided by the RegExp object.
(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)(?:\r|$)
Robert Schutt (Software Engineer) Commented:
You know, I was about to answer that it could go wrong due to backtracking, but actually it does exactly the same. In other circumstances the negated character class is better because it simply can't go past the pipe, whereas the non-greedy * can cause backtracking when there is a mismatch later on (because it will also try matches that include the pipe). I think the 'end-of-line' condition keeps that from going haywire here.
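
A small sketch of a case where the two do differ, assuming a made-up line with one field more than the pattern expects. The three-field patterns and the sample string are illustrative only:

```vbscript
Option Explicit

Dim reLazy, reNegated, sLine
sLine = "a|b|c|d"   ' four fields, but both patterns below expect three

Set reLazy = CreateObject("VBScript.RegExp")
reLazy.Pattern = "^(.*?)\|(.*?)\|(.*?)$"

Set reNegated = CreateObject("VBScript.RegExp")
reNegated.Pattern = "^([^|]*)\|([^|]*)\|([^|]*)$"

' The lazy dot is allowed to grow across a pipe, so the line still matches
' and the extra field is swallowed into the last group.
If reLazy.Test(sLine) Then
    WScript.Echo "lazy: match, last field = " & reLazy.Execute(sLine)(0).SubMatches(2)   ' c|d
Else
    WScript.Echo "lazy: no match"
End If

' The negated class can never consume a pipe, so the malformed line is rejected.
If reNegated.Test(sLine) Then
    WScript.Echo "negated: match"
Else
    WScript.Echo "negated: no match"   ' this branch runs
End If
```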

I have to admit though: I used to think I knew a lot about the inner workings of (Perl) regexps, until I read DON'T BE A FRED; that just blew me out of the water. Since you mentioned freezing when using certain regexp patterns, I thought that must have been excessive backtracking due to mismatches, possibly a non-greedy match going past the end of the line for some reason. There could also be a difference between using the Global option and sticking it all in (...)* and trying to match as a single string.
aikimark (Author) Commented:
Thanks for the regex pattern.