Solved

Regular Expression HELP

Posted on 2001-06-14
Last Modified: 2008-02-20
I am trying to get the Title and Meta data from a site.

To get the data I am using XMLHTTP.

I am then using a Regular Expression to extract the data.

function stripHTMLTags(strPattern, strText)

    dim re

    set re = new RegExp
    re.pattern = strPattern
    re.ignorecase = true
    re.global = true

    Set Matches = re.Execute(strText)
    for each match in matches
        str2 = str2 & Match.value
        response.write server.HTMLEncode(str2)
    next

end function

I can extract the title with:

response.write "<br>" & stripHTMLTags("<title>(.*)?\<\/title>", xmlHTTP.responseText)

This works on some sites, and not on others.

The Meta data is proving a little more difficult.  On some pages it works, and on others it does not.

If I use:

response.write "<br>" & stripHTMLTags("<head>(.*)?\<\/head>", xmlHTTP.responseText)

...to extract the entire header block I get nothing.

Also, the following will produce results with some pages, and nothing with others:

response.write "<br>" & stripHTMLTags("<meta(.*)?\>", xmlHTTP.responseText)

I had wanted to grab the <head>...</head> and assign it to a variable so that I don't have to check the entire page code each time I look for data - making the script quicker.  Basically then I could replace the xmlhttp.responsetext with the variable.

Anyway, any idea why the <head> part produces nothing, and the rest work intermittently?

Question by:russoffl
9 Comments
 
LVL 1

Author Comment

by:russoffl
ID: 6192939
I am starting to think that line breaks are causing the problem.  Any comments, or solutions to keep reading past these?
 
LVL 9

Expert Comment

by:TTom
ID: 6193003
Are you sure that the pages you are trying this on have these tags?

People frequently design web pages and omit the <title> and other <meta> tags.

Tom
 
LVL 1

Author Comment

by:russoffl
ID: 6193020
Yeah - I used View Source to confirm the tags.  That is where I noticed (in WordPad) that the start of the tag was on one line, and the closing tag on the next - with a crlf.


 
LVL 4

Expert Comment

by:aponcealbuerne
ID: 6193178
"." Matches any single character except a newline character.

I think you could use something like:


response.write "<br>" & stripHTMLTags("<title>(.|$*)?\<\/title>", xmlHTTP.responseText)

Hope this helps.
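To see the newline behaviour in isolation, here is a minimal sketch that can be run with cscript (in an ASP page, swap WScript.Echo for Response.Write). The sample string is made up for illustration; it is not taken from any of the pages in question. The second pattern uses a negated character class, which is one common way to get past line breaks.

```vbscript
' Hypothetical sample: the closing tag sits on the line after the title text.
Dim sample, re
sample = "<title>My Page" & vbCrLf & "</title>"

Set re = New RegExp
re.IgnoreCase = True

' "." cannot cross the line feed, so no match is expected here.
re.Pattern = "<title>(.*)</title>"
WScript.Echo "dot pattern matched: " & re.Test(sample)

' [^<] means "anything except <", which includes CR and LF,
' so this pattern is expected to match across the line break.
re.Pattern = "<title>([^<]*)</title>"
WScript.Echo "negated class matched: " & re.Test(sample)
```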
 
LVL 25

Expert Comment

by:clockwatcher
ID: 6193959
I must be tired because I know this isn't the best way to do this, but it should work:

  <title>((\s|.)*)?</title>


I can't believe that VBScript doesn't have a multiline setting somewhere.  I think that the 5.5 script engine may.  You might try:

function stripHTMLTags(strPattern, strText)

   dim re

   set re = new RegExp
   re.pattern = strPattern
   re.ignorecase = true
   re.global = true
   re.multiline = true
       
   Set Matches = re.Execute(strText)
   for each match in matches
      str2 = str2 & Match.value
      response.write server.HTMLEncode(str2)
   next

end function

With your <title>(.*)?</title> regexp.
 
LVL 1

Author Comment

by:russoffl
ID: 6194037
Ver. 5 doesn't support re.multiline = true.

I tried the script without it anyway, and it either just hung (as did the two dozen other combinations I tried) or couldn't find a match.

Just about ready to give up on this option!
 
LVL 11

Accepted Solution

by:
ASPGuru earned 30 total points
ID: 6194040
1. You don't need to escape "/" or ">".

2. Quantifiers like * are greedy, so take care: "<meta(.*)>" will get you everything between "<meta" and the ">" of "</html>".

first:
  re.ignorecase = true
  re.global = true
  re.multiline = true

then:
for title:
"<title>([\s\S]*)</title>"


for head:
"<head>([\s\S]*)</head>"

for what you meant with this: "<meta(.*)?\>":
"<meta([^>]*)>"


The last pattern will produce more than one match if the page has more than one meta tag, so handle that case.
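Putting the patterns above together, a rough sketch might look like the following. FirstCapture and AllMatches are hypothetical helper names, not part of the original script. The sketch assumes a script engine that provides the SubMatches collection (5.5 or later, as far as I know). The title pattern uses [^<]* instead of [\s\S]* purely to keep the greedy match from running past the first closing tag. Note that [\s\S] and [^>] match line breaks whether or not Multiline is set, because Multiline only changes how ^ and $ behave.

```vbscript
' Returns the first captured group of the first match, or "" if nothing matches.
' Assumes the engine supports SubMatches.
Function FirstCapture(strPattern, strText)
    Dim re, matches
    Set re = New RegExp
    re.Pattern = strPattern
    re.IgnoreCase = True
    re.Global = False            ' only the first match is needed

    FirstCapture = ""
    Set matches = re.Execute(strText)
    If matches.Count > 0 Then
        FirstCapture = matches(0).SubMatches(0)
    End If
End Function

' Returns every full match, one per line (for repeated tags such as <meta>).
Function AllMatches(strPattern, strText)
    Dim re, m, result
    Set re = New RegExp
    re.Pattern = strPattern
    re.IgnoreCase = True
    re.Global = True             ' collect all matches, not just the first

    result = ""
    For Each m In re.Execute(strText)
        result = result & m.Value & vbCrLf
    Next
    AllMatches = result
End Function

' Hypothetical input standing in for xmlHTTP.responseText.
Dim html, head
html = "<html><head><title>My" & vbCrLf & "Page</title>" & vbCrLf & _
       "<meta name=""keywords"" content=""a,b"">" & vbCrLf & _
       "<meta name=""description"" content=""demo""></head><body></body></html>"

' Grab the head block once, then search only inside it.
head = FirstCapture("<head[^>]*>([\s\S]*)</head>", html)
WScript.Echo FirstCapture("<title[^>]*>([^<]*)</title>", head)
WScript.Echo AllMatches("<meta[^>]*>", head)
```

In the ASP page itself, the WScript.Echo calls would become Response.Write Server.HTMLEncode(...) and html would be xmlHTTP.responseText, as in the original question.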
 
LVL 1

Author Comment

by:russoffl
ID: 6196174
Just as you said the ">" was the last in the </html>.

For anyone who may be working on a similar solution, I used <head>([\s\S]*)</head> to grab the <head>...</head>, and then assigned this to a variable.

I then replaced all chr(10) and chr(13) with "", and then replaced all ">" with ">" & chr(10) & chr(13) to force each tag onto a separate line.  The only problem was <title>
...</title>, so I re-replaced <title> & chr(10) & chr(13) with <title>, forcing it all back onto one line.

Not sure what the impact would be for javascripts and css stuff in the header, but since I am only grabbing meta description, meta keywords, and title I don't have a problem (so far) with the code.

May not be tidy, but it works for my purpose for now.

Just wish people wrote "tidy" HTML.
 
LVL 11

Expert Comment

by:ASPGuru
ID: 6196505
Wouldn't it be an option to just replace all vbCrLf (= Chr(13) & Chr(10), by the way) with a space (" ") first, and then run normal regular expressions to get everything you want?
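A minimal sketch of that idea, assuming stray lone CR or LF characters should be flattened as well. FlattenLines is a hypothetical helper name and the sample string is made up; run it with cscript, or swap WScript.Echo for Response.Write in ASP.

```vbscript
' Replaces every line break with a space so "." can span what used to be several lines.
Function FlattenLines(strText)
    Dim s
    s = Replace(strText, vbCrLf, " ")   ' Windows line breaks first
    s = Replace(s, vbCr, " ")           ' then any lone carriage returns
    s = Replace(s, vbLf, " ")           ' and any lone line feeds
    FlattenLines = s
End Function

' Hypothetical input: a title split across two lines.
Dim sample, re, matches
sample = "<title>My" & vbCrLf & "Page</title>"

Set re = New RegExp
re.IgnoreCase = True
re.Pattern = "<title>(.*)</title>"

Set matches = re.Execute(FlattenLines(sample))
If matches.Count > 0 Then
    WScript.Echo matches(0).Value       ' <title>My Page</title>
End If
```

One caveat: once the whole page is on a single line, a greedy ".*" can run all the way to the last closing tag on the page, so patterns like "[^<]*" or "[^>]*" are still the safer choice.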