Regex parsing

I am doing a series of presentations on string searching and parsing for one of my local user groups, TriMOUG, and thought there might be an elegant regular expression solution for parsing the data from question http:Q_28133857.html in a single pass, rather than a double pass.  However, I've played with different versions of regex patterns and none of them seem to work.  Some of them actually freeze and confound the regexp object.

The snippet below is an extract of the posted file. The file was Unicode with a BOM sequence at the beginning, which I will not worry about for the purposes of this question or next month's user group demo.
9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y
9780020038719|BPOQ|August|Heckscher|30.99|0|03051993|Woodrow Wilson|229|152|38|994||Y||BC|Y
9780020087908|BPOQ|Mark|Stevens|12.99|0|01031984|Big Eight|216|140|14|330||Y||BC|Y
9780020119302|BPOQ|Phyllis W.|Schwebke|10.99|0|19021974|How to Sew Leather, Suede, Fur|235|191|8|288||Y||BC|Y
9780020130406|BPOQ|Mortimer Jerome|Adler|12.99|0|19111984|The Paideia Program|229|152|14|375||Y||BC|Y
9780020160229|BPOQ|Mortimer Jerome|Adler|12.99|0|19071991|How to Think about God|216|140|10|246||Y||BC|Y
9780020218050|BPOQ|Anatoly|Karpov|10.99|0|19041990|The Semi-Closed Openings in Action|216|140|8|192||Y||BC|Y
9780020220053|BPOQ|Peter|Knobler|16.99|0|01041995|Very Seventies|229|152|21|557||Y||BC|Y
9780020223429|BPOQ|Ring W.|Lardner|12.99|0|19091991|You Know Me Al|216|140|12|285||Y||BC|Y
9780020223436|BPOQ|Ring W.|Lardner|12.99|0|19061993|Some Champions|216|140|12|290||Y||BC|Y
9780020223443|BPOQ|Ring W.|Lardner|10.99|0|19091991|Haircut|216|140|11|251||Y||BC|Y
9780020285618|BPOQ|Gyozo|Forintos|12.99|0|19031992|The Petroff Defense|203|127|14|282||Y||BC|Y
9780020288909|BPOQ|Israel A.|Horowitz|10.99|0|19031972|How to Improve Your Chess|216|140|11|266||Y||BC|Y
9780020300656|BPOQ|Mortimer Jerome|Adler|12.99|0|19091993|The Angels and Us|216|140|12|290||Y||BC|Y
9780020345152|BPOQ|John F.|Marszalek|16.99|0|19041994|Assault at West Point|229|152|19|518||Y||BC|Y
9780020408918|BPOQ|Thomas|Wolfe|26.99|0|19041989|The Complete Short Stories of Thomas Wolfe|229|152|36|949||Y||BC|Y
9780020427254|BPOQ|Nicholas|Christopher|12.99|0|19041994|Walk on the Wild Side|216|140|14|330||Y||BC|Y
9780020456001|BPOQ|Jay|Williams|8.99|0|01101984|Everyone Knows What a Dragon Looks Like|216|216|2|78||Y||BC|N
9780020641407|BPOQ|Mortimer Jerome|Adler|10.99|0|19041992|Truth in Religion|216|140|10|236||Y||BC|Y
9780020820253|BPOQ|Elaine|Showalter|18.99|0|19091993|Modern American Women Writers|235|191|22|742||Y||BC|Y
9780020930815|BPOQ|Gordon|Inkeles|14.99|0|01111994|Ergonomic Living|254|178|10|344||Y||BC|Y
9780023159909|BPOQ||Bryan|49.99|0|04041993|Fire Suppression and Detection Systems|244|170|31|959||Y||BC|Y
9780023174001|BPOQ|George|Burns|61.99|0|02121988|Science of Genetics, The|280|210|26|1136||Y||BC|Y
9780023269004|BPOQ|William|Dalton|61.99|0|26091993|Technology of Metallurgy, The|254|178|24|833||Y||BC|Y
9780023300295|BPOQ|Mary|Drake|47.99|0|01011992|Retail Fashion Promotion and Advertising|254|178|22|865||Y||BB|Y
9780023395703|BPOQ|Russell|Fraser|57.99|0|01031976|Drama of the English Renaissance|254|178|28|937||Y||BC|Y
9780023397639|BPOQ|Linda|Friedman|36.99|0|01011989|Little LISPer|254|178|11|394||Y||BC|Y
9780023589416|BPOQ|Madeline|Hunter|34.99|0|18111993|Enhancing Teaching|229|152|14|369||Y||BC|Y
9780023628429|BPOQ|Stephen|Kesler|53.99|0|18021994|Mineral Resources Economics and the Environment|280|210|21|919||Y||BC|Y


I am looking for a regex pattern that will work with the vbscript.regexp object library to present matches on a line level and submatches for the pipe-delimited data.
Robert Schutt, Software Engineer, Commented:
Elegant? Hmmm no not really I guess...

AFAIK the vbscript regexp does not have that nice feature of .NET regex: captures.

So a repeated group shows up in Submatches as the last match only, which is useless in this case.
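
A minimal sketch of that limitation, using a made-up four-field line rather than the posted data. The pattern here is my own illustration (one repeated group for the leading fields, one group for the final field), not a pattern from the question:

```vbscript
Option Explicit

Dim oRE, oMatch

Set oRE = CreateObject("VBScript.RegExp")
' One repeated capture group for the leading fields, one group for the last field
oRE.Pattern = "^(?:([^|]*)\|)*([^|]*)$"

' Illustrative input only (not from the posted file)
Set oMatch = oRE.Execute("a|b|c|d")(0)

' The pattern has two capture groups, so there are only two submatches,
' no matter how many fields the repeated group walked over
WScript.Echo "Submatch count: " & oMatch.SubMatches.Count
WScript.Echo "Repeated group: " & oMatch.SubMatches(0)
WScript.Echo "Final group:    " & oMatch.SubMatches(1)
```

Run with cscript, the repeated group should report only the value from its last iteration, so the earlier fields are not recoverable from Submatches.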

The only thing that works for me and seems close to what you want (regarding matches/submatches) if not very elegant, is to repeat the sub-pattern for a field as many times as you have actual fields:
Option Explicit

Dim oFSO, oFile, oText, oRE, iCounter, oMatches, oMatch, iCounter2, sSubmatch

Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFile = oFSO.OpenTextFile("tst.txt", 1, False)
oText = oFile.ReadAll
Set oFile = Nothing
Set oFSO = Nothing

Set oRE = CreateObject("vbscript.regexp")
oRE.Pattern = "([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\r\n"
'oRE.Pattern = "(([^|]*)\|)*([^|]*)\r\n" ' doesn't work
oRE.Global = True
oRE.MultiLine = True
iCounter = 1
Set oMatches = oRE.Execute(oText)
For Each oMatch In oMatches
  WScript.Echo "--- record #" & iCounter
  iCounter2 = 1
  ' Each capture group in the pattern is one pipe-delimited field
  For Each sSubmatch In oMatch.Submatches
    WScript.Echo "field #" & iCounter2 & " - " & sSubmatch
    iCounter2 = iCounter2 + 1
  Next
  WScript.Echo "---"
  iCounter = iCounter + 1
Next


If the number of fields is variable then of course you could parse the first line and generate the pattern dynamically. Not sure if that would make it more or actually less elegant ;-)
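
A rough sketch of generating the pattern from the first line, assuming every record has the same number of fields as that line and that records end in CR/LF as in the script above. BuildPattern is a hypothetical helper name, not part of any library:

```vbscript
Option Explicit

' Hypothetical helper: repeats the field sub-pattern once per field,
' separated by escaped pipes, and ends with the same \r\n terminator
Function BuildPattern(nFields)
  Dim i, sPattern
  sPattern = "([^|]*)"
  For i = 2 To nFields
    sPattern = sPattern & "\|([^|]*)"
  Next
  BuildPattern = sPattern & "\r\n"
End Function

Dim sFirstLine, nFields

' First record of the sample data; in practice this would be read from the file
sFirstLine = "9780003020656|BPOQ|I.|Hodder|40.00|0|05121991|Meanings of Things|234|156|15|419||Y||BC|Y"

' Number of fields = number of pipes + 1
nFields = UBound(Split(sFirstLine, "|")) + 1

WScript.Echo nFields & " fields"
WScript.Echo BuildPattern(nFields)
```

The generated string can then be assigned to oRE.Pattern in place of the hand-written one.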
What do you want to parse from that data?
aikimark, Author, Commented:
All the pipe-delimited data.
Scott Fell, EE MVE, Developer & EE Moderator, Commented:
I would have done what you did in the original question where you read a line and create  an array of each line.
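
For comparison, a minimal sketch of that line-by-line approach. The code from the original question is not shown in this thread, so this is only my guess at its general shape; the file name tst.txt is borrowed from the script above:

```vbscript
Option Explicit

Dim oFSO, oFile, sLine, aFields, i, iRecord

Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFile = oFSO.OpenTextFile("tst.txt", 1, False)   ' 1 = ForReading

iRecord = 1
Do Until oFile.AtEndOfStream
  sLine = oFile.ReadLine
  ' Split returns a zero-based array with one element per field
  aFields = Split(sLine, "|")
  WScript.Echo "--- record #" & iRecord
  For i = 0 To UBound(aFields)
    WScript.Echo "field #" & (i + 1) & " - " & aFields(i)
  Next
  iRecord = iRecord + 1
Loop

oFile.Close
```

This needs no pattern at all and copes with a varying number of fields per line, at the cost of not demonstrating any regex.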

I have read that you can access .NET controls from VBScript, but the .NET version has to be earlier than 4.

I wonder if it would speed things up to use ajax to post each row of data client side.
aikimark, Author, Commented:

Is there an advantage to using ([^|]*) instead of (.*?)?

This pattern parses the part.txt file in the original question.  I guess I was relying on repeating capture groups, which aren't provided by the RegExp object.
Robert Schutt, Software Engineer, Commented:
You know, I was about to answer that it could go wrong due to backtracking, but here it actually does exactly the same. In other circumstances the negated character class is better because it simply can't go past the pipe, whereas the non-greedy * can cause backtracking when there is a non-match later on (because it will also try matches that include the pipe). I think the 'end-of-line' condition keeps that from going haywire here.
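
A small sketch of a case where the two sub-patterns do differ, using a made-up line with one field more than the pattern expects. Both patterns and the input are illustrative assumptions, not taken from the posted file:

```vbscript
Option Explicit

Dim oRE, oMatches, sLine

sLine = "a|b|c|d"   ' four fields, but the patterns below expect three

Set oRE = CreateObject("VBScript.RegExp")

' Negated class: a field can never contain a pipe, so this should not match
oRE.Pattern = "^([^|]*)\|([^|]*)\|([^|]*)$"
Set oMatches = oRE.Execute(sLine)
WScript.Echo "[^|]* matches: " & oMatches.Count

' Non-greedy dot: backtracking lets the last group grow across a pipe
oRE.Pattern = "^(.*?)\|(.*?)\|(.*?)$"
Set oMatches = oRE.Execute(sLine)
WScript.Echo ".*? matches:   " & oMatches.Count
If oMatches.Count > 0 Then
  WScript.Echo "third group:   " & oMatches(0).SubMatches(2)
End If
```

The negated class should reject the malformed line outright, while the non-greedy version should still match by letting the third group swallow a pipe. With well-formed records and a pattern that has exactly one group per field, the two behave the same.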

I have to admit though: I used to think I knew a lot about the inner workings of (perl) regexp's, until I read DON'T BE A FRED, and that just blew me out of the water... Since you mentioned freezing when using certain regexp patterns, I thought that must have been excessive backtracking due to mismatches. So that could have been a non-greedy match going past the end-of-line for some reason. There could also be a difference between using the Global option and sticking it all in (...)* and trying to match as a single string.
aikimarkAuthor Commented:
Thanks for the regex pattern.
Question has a verified solution.
