russoffl
asked on
Regular Expression HELP
I am trying to get the Title and Meta data from a site.
To get the data I am using XMLHTTP.
I am then using a Regular Expression to extract the data.
function stripHTMLTags(strPattern, strText)
    dim re
    set re = new RegExp
    re.pattern = strPattern
    re.ignorecase = true
    re.global = true
    Set Matches = re.Execute(strText)
    for each match in matches
        str2 = str2 & Match.value
        response.write server.HTMLEncode(str2)
    next
end function
I can extract the title with:
response.write "<br>" & stripHTMLTags("<title>(.*)?\<\/title>", xmlHTTP.responseText)
This works on some sites, and not on others.
Meta data is also proving a little more difficult. It works on some pages and not on others.
If I use:
response.write "<br>" & stripHTMLTags("<head>(.*)?\<\/head>", xmlHTTP.responseText)
...to extract the entire header block I get nothing.
Also, the following will produce results with some pages, and nothing with others:
response.write "<br>" & stripHTMLTags("<meta(.*)?\>", xmlHTTP.responseText)
I had wanted to grab the <head>...</head> and assign it to a variable so that I don't have to check the entire page code each time I look for data - making the script quicker. Basically then I could replace the xmlhttp.responsetext with the variable.
Anyway, any idea why the <head> part produces nothing, and the rest work intermittently?
Are you sure that the pages you are trying this on have these tags?
People frequently design web pages and omit the <title> and other <meta> tags.
Tom
ASKER
Yeah - I used View Source to confirm the tags. That is where I noticed (in WordPad) that the start of the tag was on one line, and the closing tag on the next - with a crlf.
"." Matches any single character except a newline character.
I think you could use something like:
response.write "<br>" & stripHTMLTags("<title>(.|$*)?\<\/title>", xmlHTTP.responseText)
Hope this helps.
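A minimal sketch of the point about "." and newlines, written as plain VBScript so it can be tried outside ASP. The helper name PatternMatches and the sample strings are made up for illustration; the [\s\S] class is the form the asker reports using later in the thread.

```vbscript
' Returns True when strPattern matches anywhere in strText (case-insensitive).
Function PatternMatches(strPattern, strText)
    Dim re
    Set re = New RegExp
    re.Pattern = strPattern
    re.IgnoreCase = True
    PatternMatches = re.Test(strText)
End Function

Dim oneLine, twoLines
oneLine = "<title>My page</title>"
' Same tag, but the closing tag sits on the next line (CR+LF in between).
twoLines = "<title>My page" & vbCrLf & "</title>"

Dim dotWorksOnOneLine, dotWorksAcrossLines, classWorksAcrossLines
' "." can reach </title> when everything is on one line.
dotWorksOnOneLine = PatternMatches("<title>(.*)</title>", oneLine)         ' True
' "." cannot step over the line feed, so the same pattern finds nothing.
dotWorksAcrossLines = PatternMatches("<title>(.*)</title>", twoLines)      ' False
' [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
classWorksAcrossLines = PatternMatches("<title>([\s\S]*)</title>", twoLines) ' True
```

This would explain why the same pattern works on some sites and not on others: it depends on where the page author happened to put line breaks.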
I must be tired because I know this isn't the best way to do this, but it should work:
<title>((\s|.)*)?</title>
I can't believe that VBScript doesn't have a multiline setting somewhere. I think that the 5.5 script engine may. You might try:
function stripHTMLTags(strPattern, strText)
    dim re
    set re = new RegExp
    re.pattern = strPattern
    re.ignorecase = true
    re.global = true
    re.multiline = true
    Set Matches = re.Execute(strText)
    for each match in matches
        str2 = str2 & Match.value
        response.write server.HTMLEncode(str2)
    next
end function
With your <title>(.*)?</title> regexp.
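One caveat on the MultiLine idea, sketched below under the assumption that a 5.5 or later script engine is installed (the property does not exist on earlier versions). As far as I know, MultiLine only changes where "^" and "$" match; it does not make "." cross a line break, so it would not rescue the <title>(.*)?</title> pattern. The helper CountMatches and the sample string are hypothetical.

```vbscript
' Counts how many times strPattern matches in strText.
' Assumes script engine 5.5+, where RegExp has a MultiLine property.
Function CountMatches(strPattern, strText, blnMultiLine)
    Dim re
    Set re = New RegExp
    re.Pattern = strPattern
    re.IgnoreCase = True
    re.Global = True
    re.MultiLine = blnMultiLine
    CountMatches = re.Execute(strText).Count
End Function

Dim html
html = "<title>My page" & vbCrLf & "</title>"

Dim dotWithMultiLine, lineStartsOff, lineStartsOn
' "." still stops at the line break, even with MultiLine switched on.
dotWithMultiLine = CountMatches("<title>(.*)</title>", html, True)  ' 0
' What MultiLine does change: "^" matches at the start of every line,
' not just at the start of the whole string.
lineStartsOff = CountMatches("^<", html, False)                     ' 1
lineStartsOn = CountMatches("^<", html, True)                       ' 2
```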
ASKER
Ver. 5 doesn't support re.multiline = true.
I tried the script without it anyway, and it just hung - as did the 2 dozen other combinations - or it couldn't find a pattern.
Just about ready to give up on this option!
ASKER CERTIFIED SOLUTION
ASKER
Just as you said, the ">" being matched was the last one, in the </html>.
For anyone who may be working on a similar solution, I used <head>([\s\S]*)</head> to grab the <head>...</head>, and then assigned this to a variable.
I then replaced all chr(10) and chr(13) with "", and then replaced all ">" with ">" & chr(10) & chr(13) to force each tag onto a separate line. The only problem was <title>...</title>, which was now split after the opening tag, so I re-replaced <title> & chr(10) & chr(13) with <title>, forcing it all back onto one line.
Not sure what the impact would be for javascripts and css stuff in the header, but since I am only grabbing meta description, meta keywords, and title I don't have a problem (so far) with the code.
May not be tidy, but it works for my purpose for now.
Just wish people wrote "tidy" HTML.
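The steps described above could look roughly like the sketch below. It is not the asker's actual code: the helper FirstMatch and the sample page are made up, and it uses vbCrLf (CR then LF) where the post writes chr(10) & chr(13). It reads only Match.Value, so it should not depend on features that, as far as I know, only arrived with the 5.5 engine.

```vbscript
' Returns the first match of strPattern in strText, or "" when nothing matches.
Function FirstMatch(strPattern, strText)
    Dim re, matches
    Set re = New RegExp
    re.Pattern = strPattern
    re.IgnoreCase = True
    re.Global = False
    Set matches = re.Execute(strText)
    If matches.Count > 0 Then
        FirstMatch = matches(0).Value
    Else
        FirstMatch = ""
    End If
End Function

Dim page, headBlock, titleTag, metaTag
' Hypothetical page source with the <title> split across two lines.
page = "<html><head>" & vbCrLf & "<title>" & vbCrLf & "My page</title>" & vbCrLf & _
       "<meta name=""description"" content=""Demo"">" & vbCrLf & _
       "</head><body></body></html>"

' 1. Grab <head>...</head>; [\s\S] matches line breaks, "." does not.
headBlock = FirstMatch("<head>[\s\S]*</head>", page)
' 2. Remove every CR and LF so the block is one long line.
headBlock = Replace(Replace(headBlock, Chr(13), ""), Chr(10), "")
' 3. Start a new line after every ">" so each tag sits on its own line.
headBlock = Replace(headBlock, ">", ">" & vbCrLf)
' 4. Undo the break right after <title> so the title is on one line again.
headBlock = Replace(headBlock, "<title>" & vbCrLf, "<title>")

' 5. Ordinary "." patterns now work, because "." stops at each line end.
titleTag = FirstMatch("<title>.*</title>", headBlock)
metaTag = FirstMatch("<meta.*>", headBlock)
```

Removing the line breaks outright also joins words that were separated only by a line break, which may matter for titles or meta content wrapped across lines.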
Wouldn't it be an option to just replace all vbCrLf (= Chr(13) & Chr(10), by the way) with a space (" ") first, and then run normal regular expressions to get everything you want?
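A sketch of that suggestion, with two additions that are my own assumptions rather than part of the post: lone CR or LF characters are replaced as well (pages served from Unix hosts often use LF only), and the meta pattern uses [^>]* instead of .* because, once the page is a single line, a greedy .* runs on to the last ">" in the document. The helper names and the sample page are hypothetical.

```vbscript
' Replaces every line break (CR+LF, lone CR, lone LF) with a single space.
Function FlattenLineBreaks(strText)
    Dim s
    s = Replace(strText, vbCrLf, " ")
    s = Replace(s, vbCr, " ")
    s = Replace(s, vbLf, " ")
    FlattenLineBreaks = s
End Function

' Returns the first match of strPattern in strText, or "" when nothing matches.
Function FirstMatchIn(strPattern, strText)
    Dim re, matches
    Set re = New RegExp
    re.Pattern = strPattern
    re.IgnoreCase = True
    re.Global = False
    Set matches = re.Execute(strText)
    If matches.Count > 0 Then
        FirstMatchIn = matches(0).Value
    Else
        FirstMatchIn = ""
    End If
End Function

Dim page, flat, titleTag, metaTag, greedyMeta
' Hypothetical page source mixing CR+LF and LF-only line breaks.
page = "<head><title>My" & vbCrLf & "page</title>" & vbLf & _
       "<meta name=""keywords"" content=""a,b"">" & vbCrLf & _
       "</head><body><p>x</p></body>"

flat = FlattenLineBreaks(page)

' The wrapped title is now on one line, with a space where the break was.
titleTag = FirstMatchIn("<title>.*</title>", flat)
' [^>]* stops at the first ">", so only the meta tag itself is returned.
metaTag = FirstMatchIn("<meta[^>]*>", flat)
' For contrast: greedy ".*" keeps going to the last ">" in the text.
greedyMeta = FirstMatchIn("<meta.*>", flat)
```

In an ASP page the results would be written out with something like Response.Write Server.HTMLEncode(titleTag).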
ASKER