  • Status: Solved
  • Priority: Medium
  • Security: Public

VBScript regex against HTML data

I'm reading some HTML using the msxml2.xmlhttp "GET" method.
A lot of functions and extraneous data are being returned that I'd like to filter out easily - I figure a regex is easiest.

There are basically two bits of information I'd like to pull from this data:
1) user name - this is the text following an <a href...> tag, but the user's real name is found in both alt="John Smith" and title="John Smith".  Presumably if that's easier to grab, I could live with it as well...
2) the user's ID:  This is inside a link tag, as a parameter (user=xxxxxx) (repeats several times), but also in <img uid="xxxxx"...>  which might be easier to get.

I have one other piece of data I'd like to grab, but figure that deserves a new question.
Let me know if sample data is needed.  
Asked by: sirbounty
1 Solution
 
abel commented:
1) matching only the parts that matter, john smith goes in $1:
<a[^>]+title="([^"])+"

2) same, userid goes into $1:
<img[^>]+uid="([^"])+"

since the result is XML, you may also use DOM methods or selectNode XPath style methods to be more precise... But that's another chapter.

I assumed that a is always a and not A, and that img is always img and not Img, IMG or iMg.

-- Abel --
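
A minimal sketch of the DOM/XPath route mentioned above, assuming the response is well-formed XML (ordinary HTML often is not, in which case loadXML fails and the regex approach is the fallback). The sample markup and the <users> wrapper are hypothetical stand-ins for the real page; selectNodes and getAttribute are the MSXML methods involved.

```vbscript
Option Explicit

' Hypothetical, well-formed sample; the real page structure is an assumption.
Dim strXml
strXml = "<users>" & _
         "<a href=""profile?user=1001"" title=""John Smith""><img uid=""1001"" alt=""John Smith""/></a>" & _
         "<a href=""profile?user=1002"" title=""Jane Doe""><img uid=""1002"" alt=""Jane Doe""/></a>" & _
         "</users>"

Dim objDoc, objNode, strOut
Set objDoc = CreateObject("MSXML2.DOMDocument")
objDoc.async = False                                  ' load synchronously
objDoc.setProperty "SelectionLanguage", "XPath"       ' use XPath rather than the legacy XSLPattern syntax

strOut = ""
If objDoc.loadXML(strXml) Then
    ' Every <img> element that carries a uid attribute
    For Each objNode In objDoc.selectNodes("//img[@uid]")
        strOut = strOut & objNode.getAttribute("uid") & vbTab & _
                 objNode.getAttribute("alt") & vbCrLf
    Next
Else
    ' loadXML returns False when the text is not well-formed XML
    strOut = "Parse error: " & objDoc.parseError.reason
End If

WScript.Echo strOut
```

With a live request, the string passed to loadXML would be objX.ResponseText instead of the sample.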

sirbounty (author) commented:
Not working for me...basically what I have for code is below...
Unfortunately, I have loads of data returned - still over 109 lines (should only be one or two if everything is filtered correctly?).

Your assumptions are correct.
Would it be easier to use the selectNode that you mentioned?  I don't mind opening a new thread if needed...
set re1=createobject("Vbscript.RegExp")
set re2=createobject("Vbscript.RegExp")
re1.Pattern= "<a[^>]+title=""(^""])+"
re2.Pattern="<img[^>]+uid=""([^""])+"
 
'.....code to extract xml data
 
  strData = objX.ResponseText
 
strNewData=re1.replace(strData,"")
strNewerData=re2.replace(strNewData,"")
 
objFile.Write strnewerdata

sirbounty (author) commented:
bear in mind also - this is meant to be a loop.
There will be approximately 30 users on each page - and there are around 120 pages...all I want is user id & username (any other related user data is okay - I just don't want the html/xml tags)
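
For pulling every user out of a page rather than deleting markup, one option is Execute with Global = True, reading the capture group from SubMatches. The sketch below is only an illustration: the sample markup and the ExtractAll helper are hypothetical, and the patterns put the + inside the capture group so that the whole attribute value is captured. Note also that the re1 pattern in the code posted above reads (^""]) where ([^""]) was presumably intended.

```vbscript
Option Explicit

' Hypothetical helper: returns capture group 1 of every match, one per line.
Function ExtractAll(strPattern, strText)
    Dim re, m, strOut
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = strPattern
    re.Global = True        ' collect all matches, not just the first
    re.IgnoreCase = True    ' tolerate <A ...> or <IMG ...>
    strOut = ""
    For Each m In re.Execute(strText)
        strOut = strOut & m.SubMatches(0) & vbCrLf   ' SubMatches is zero-based
    Next
    ExtractAll = strOut
End Function

' Hypothetical sample standing in for objX.ResponseText; the real markup may differ.
Dim strData, strNames, strUids
strData = "<a href=""profile?user=1001"" title=""John Smith"">John Smith</a><img uid=""1001"" alt=""John Smith"">" & vbCrLf & _
          "<a href=""profile?user=1002"" title=""Jane Doe"">Jane Doe</a><img uid=""1002"" alt=""Jane Doe"">"

strNames = ExtractAll("<a[^>]+title=""([^""]+)""", strData)
strUids = ExtractAll("<img[^>]+uid=""([^""]+)""", strData)

WScript.Echo strNames
WScript.Echo strUids
```

In a loop over pages, the same two calls would run once per response and the results would be appended to the output file.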

abel commented:
> strNewData=re1.replace(strData,"")

no no.... that's not what I meant. I didn't know what you were after, so I showed you a way to have a matching regex. If you want to replace everything and extract only one uid/name, then prepend and append ".*" to the expression and put $1 in the replacement string. You probably also want to set MultiLine to true (I don't know the exact syntax in VBS).

re1.Pattern= ".*<a[^>]+title=""(^""])+.*"
re2.Pattern=".*<img[^>]+uid=""([^""])+.*"
strDataForUid = objX.ResponseText
strDataForname =  objX.ResponseText
uid = re1.replace(strDataForUid,"$1")
name = re1.replace(strDataForname,"$1")
-- Abel --



abel commented:
Sorry, I got those last lines mixed up:

name = re1.replace(strDataForname,"$1")
uid = re2.replace(strDataForUid,"$1")


sirbounty (author) commented:
Hmm - now the script seems to be hanging...
I'll have to play around with it a bit, perhaps just getting the id first.

sirbounty (author) commented:
Hmm - no.  Taking longer to run, but still outputting quite a bit of data..
Basically I'm trying with this...
set re1=createobject("Vbscript.RegExp")
set re2=createobject("Vbscript.RegExp")
 
re1.Pattern= ".*<a[^>]+title=""(^""])+.*"
re2.Pattern=".*<img[^>]+uid=""([^""])+.*"
re1.Multiline=True
re2.Multiline=True
 
  objX.Open "GET", strURL, False
  objX.Send
 
strData = objX.ResponseText
 
uid = re1.replace(strData,"$1")
wscript.echo uid
msgbox uid

abel commented:
hmm, something went wrong here... My mistake. The expression is not correct. Btw, can you give me a snippet of your data so that I can actually test the regex? That will be much faster than this trial and error process (and it is quite hard to come up with the right regex without a real example of the data we're facing).

re1.Pattern= ".*<a[^>]+title=""([^""]+)"".*"
re2.Pattern=".*<img[^>]+uid=""([^""]+)"".*"

While googling I found some other gotchas with VBScript regular expressions at http://www.xaprb.com/blog/2005/11/04/vbscript-regular-expression-gotchas/ and http://www.somacon.com/p138.php. You might need the Global flag too, though I think, since we are looking for one match only, it should not be needed.
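
One likely reason the ".*" wrapper still leaves most of the page in the output: in VBScript's RegExp, "." does not match a newline, and MultiLine only changes where ^ and $ match. A Replace with such a pattern therefore rewrites only the single line containing the match and leaves every other line untouched. The sketch below, using hypothetical sample data, swaps ".*" for "[\s\S]*" so the pattern can span lines; it is an illustration of that idea, not a tested fix for the real page.

```vbscript
Option Explicit

' Hypothetical multi-line sample standing in for objX.ResponseText.
Dim strData
strData = "<html>" & vbCrLf & _
          "<a href=""profile?user=1001"" title=""John Smith"">John Smith</a>" & vbCrLf & _
          "<img uid=""1001"" alt=""John Smith"">" & vbCrLf & _
          "</html>"

Dim reName, reUid, strName, strUid
Set reName = CreateObject("VBScript.RegExp")
Set reUid = CreateObject("VBScript.RegExp")

' [\s\S] matches any character including newlines; *? is lazy, so the
' first occurrence is captured. MultiLine stays False (the default), so
' ^ and $ anchor to the start and end of the whole string.
reName.Pattern = "^[\s\S]*?<a[^>]+title=""([^""]+)""[\s\S]*$"
reUid.Pattern = "^[\s\S]*?<img[^>]+uid=""([^""]+)""[\s\S]*$"

' Replace returns the input unchanged when nothing matches, so test first.
strName = ""
strUid = ""
If reName.Test(strData) Then strName = reName.Replace(strData, "$1")
If reUid.Test(strData) Then strUid = reUid.Replace(strData, "$1")

WScript.Echo strUid & vbTab & strName
```

This yields only the first user on a page; for several users per page, iterating over the matches with Execute and Global = True is probably the more natural fit.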

sirbounty (author) commented:
I've given up...thanks for the help...