sirbounty

VBScript regex against HTML data

I'm reading a page's source using the msxml2.xmlhttp "GET" method.
A lot of functions and extraneous data are being returned that I'd like to filter out easily - I figure a regex is easiest.

There are basically two bits of information I'd like to pull from this data:
1) user name - this is the text following an <a href...> tag, but the user's real name is found in both alt="John Smith" and title="John Smith".  Presumably if that's easier to grab, I could live with it as well...
2) the user's ID:  This is inside a link tag, as a parameter (user=xxxxxx) (repeats several times), but also in <img uid="xxxxx"...>  which might be easier to get.

I have one other piece of data I'd like to grab, but figure that deserves a new question.
Let me know if sample data is needed.  
abel

1) Matching only the parts that matter; the name (John Smith) goes into $1:
<a[^>]+title="([^"]+)"

2) Same, but the user ID goes into $1:
<img[^>]+uid="([^"]+)"
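
As a rough illustration of how such patterns can be used from VBScript, the sketch below runs both against a made-up sample string and reads the capture group through SubMatches. The sample markup and variable names are assumptions, not taken from the real page. Note that the + sits inside the parentheses, so the group holds the whole attribute value rather than only its last character.

```vbscript
Option Explicit

' Hypothetical sample data; the real response will look different.
Dim strData
strData = "<a href=""profile.asp?user=12345"" title=""John Smith"">" & _
          "<img uid=""12345"" alt=""John Smith""></a>" & vbCrLf & _
          "<a href=""profile.asp?user=67890"" title=""Jane Doe"">" & _
          "<img uid=""67890"" alt=""Jane Doe""></a>"

Dim reName, reUid, m
Set reName = CreateObject("VBScript.RegExp")
reName.Pattern = "<a[^>]+title=""([^""]+)"""
reName.Global = True        ' find every match, not just the first
reName.IgnoreCase = True    ' tolerate <A ...> as well as <a ...>

Set reUid = CreateObject("VBScript.RegExp")
reUid.Pattern = "<img[^>]+uid=""([^""]+)"""
reUid.Global = True
reUid.IgnoreCase = True

' SubMatches(0) is the first capture group, i.e. what $1 refers to.
For Each m In reName.Execute(strData)
    WScript.Echo "name: " & m.SubMatches(0)
Next
For Each m In reUid.Execute(strData)
    WScript.Echo "uid: " & m.SubMatches(0)
Next
```

Run it with cscript.exe so that WScript.Echo writes to the console instead of opening a message box for each match.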

Since the result is XML, you could also use DOM methods or XPath-style methods such as selectNodes or selectSingleNode to be more precise, but that's another chapter.
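
A minimal sketch of that DOM route, assuming the data is well-formed XML (ordinary HTML often is not, in which case loadXML returns False). The sample document and its <users> wrapper element are invented for illustration.

```vbscript
Option Explicit

' Hypothetical, well-formed sample; a real page may need cleanup first.
Dim strXml
strXml = "<users>" & _
         "<a href=""profile.asp?user=12345"" title=""John Smith"">" & _
         "<img uid=""12345"" alt=""John Smith""/></a>" & _
         "<a href=""profile.asp?user=67890"" title=""Jane Doe"">" & _
         "<img uid=""67890"" alt=""Jane Doe""/></a>" & _
         "</users>"

Dim doc, node
Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False

If doc.loadXML(strXml) Then
    ' XPath: every <img> element that carries a uid attribute.
    For Each node In doc.selectNodes("//img[@uid]")
        WScript.Echo node.getAttribute("uid") & " = " & node.getAttribute("alt")
    Next
Else
    WScript.Echo "Parse error: " & doc.parseError.reason
End If
```

Unlike the regex version, XPath names are case-sensitive, so //img will not find <IMG> elements.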

I assumed that a is always a and not A, and that img is always img and not Img, IMG or iMg.

-- Abel --
sirbounty (ASKER)

Not working for me...basically what I have for code is below...
Unfortunately, I have loads of data returned - still over 109 lines (should only be one or two if everything is filtered correctly?).

Your assumptions are correct.
Would it be easier to use the selectNode that you mentioned?  I don't mind opening a new thread if needed...
set re1=createobject("Vbscript.RegExp")
set re2=createobject("Vbscript.RegExp")
re1.Pattern= "<a[^>]+title=""(^""])+"
re2.Pattern="<img[^>]+uid=""([^""])+"
 
'.....code to extract xml data
 
  strData = objX.ResponseText
 
strNewData=re1.replace(strData,"")
strNewerData=re2.replace(strNewData,"")
 
objFile.Write strnewerdata


Bear in mind also that this is meant to run in a loop.
There will be approximately 30 users on each page - and there are around 120 pages...all I want is user id & username (any other related user data is okay - I just don't want the html/xml tags)
> strNewData=re1.replace(strData,"")

No, no... that's not what I meant. I didn't know what you were after, so I showed you a way to get a matching regex. If you want to replace everything and extract only one uid/name, then prepend and append ".*" to the expression and put $1 in the replacement string. You probably also want to set MultiLine to true (I don't know the exact syntax in VBS).

re1.Pattern = ".*<a[^>]+title=""([^""]+).*"
re2.Pattern = ".*<img[^>]+uid=""([^""]+).*"
strDataForUid = objX.ResponseText
strDataForname = objX.ResponseText
uid = re1.replace(strDataForUid,"$1")
name = re1.replace(strDataForname,"$1")

-- Abel --


Sorry, I got those last lines mixed up:

name = re1.replace(strDataForname,"$1")
uid = re2.replace(strDataForUid,"$1")
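
For what it's worth, here is a hedged sketch of the replace-with-$1 idea as a complete script. In the VBScript RegExp object, "." does not match a line feed, and the MultiLine property only changes how ^ and $ behave, so ".*" cannot swallow a whole multi-line response. This sketch therefore uses [\s\S]* as a stand-in for "any character, including line breaks". That substitution and the sample data are assumptions rather than something confirmed in this thread.

```vbscript
Option Explicit

' Hypothetical sample data spanning several lines.
Dim strData
strData = "<p>noise</p>" & vbCrLf & _
          "<a href=""profile.asp?user=12345"" title=""John Smith"">" & _
          "<img uid=""12345"" alt=""John Smith""></a>" & vbCrLf & _
          "<p>more noise</p>"

Dim re1, re2, strName, strUid
Set re1 = CreateObject("VBScript.RegExp")
Set re2 = CreateObject("VBScript.RegExp")

' [\s\S]*? lazily skips everything before the first tag of interest;
' the trailing [\s\S]* consumes the rest, so the whole input is
' replaced by the captured value.
re1.Pattern = "[\s\S]*?<a[^>]+title=""([^""]+)""[\s\S]*"
re2.Pattern = "[\s\S]*?<img[^>]+uid=""([^""]+)""[\s\S]*"

' If a pattern does not match, Replace returns the input unchanged.
strName = re1.Replace(strData, "$1")
strUid = re2.Replace(strData, "$1")

WScript.Echo strUid & " " & strName
```

This yields only one name and one ID per input string. For a page that lists many users, iterating over Execute with Global set to True is likely a better fit than Replace.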

Hmm - now the script seems to be hanging...
I'll have to play around with it a bit, perhaps just getting the id first.
Hmm - no.  Taking longer to run, but still outputting quite a bit of data..
Basically I'm trying with this...
set re1=createobject("Vbscript.RegExp")
set re2=createobject("Vbscript.RegExp")
 
re1.Pattern= ".*<a[^>]+title=""(^""])+.*"
re2.Pattern=".*<img[^>]+uid=""([^""])+.*"
re1.Multiline=True
re2.Multiline=True
 
  objX.Open "GET", strURL, False
  objX.Send
 
strData = objX.ResponseText
 
uid = re1.replace(strData,"$1")
wscript.echo uid
msgbox uid


I've given up...thanks for the help...