Solved

Regular Expression HELP

Posted on 2001-06-14
9
268 Views
Last Modified: 2008-02-20
I am trying to get the Title and Meta data from a site.

To get the data I am using XMLHTTP.

I am then using a Regular Expression to extract the data.

function stripHTMLTags(strPattern, strText)

dim re

set re = new RegExp
      re.pattern = strPattern
      re.ignorecase = true
      re.global = true
           
      Set Matches = re.Execute(strText)
      for each match in matches
           str2 = str2 & Match.value
             response.write server.HTMLEncode(str2)
        next

end function

I can extract the title with:

response.write "<br>" & stripHTMLTags("<title>(.*)?\<\/title>", xmlHTTP.responseText)

This works on some sites, and not on others.

Also Meta data that is proving a little more difficult.  On some pages it works, and others not.

If I use:

response.write "<br>" & stripHTMLTags("<head>(.*)?\<\/head>", xmlHTTP.responseText)

...to extract the entire header block I get nothing.

Also, the following will produce results with some pages, and nothing with others:

response.write "<br>" & stripHTMLTags("<meta(.*)?\>", xmlHTTP.responseText)

I had wanted to grab the <head>...</head> and assign it to a variable so that I don't have to check the entire page code each time I look for data - making the script quicker.  Basically then I could replace the xmlhttp.responsetext with the variable.

Anyway, any idea why the <head> part produces nothing, and the rest work intermittently?

0
Comment
Question by:russoffl
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
9 Comments
 
LVL 1

Author Comment

by:russoffl
ID: 6192939
I am starting to think that line breaks are causing the problem.  Any comments, or solutions to keep reading past these?
0
 
LVL 9

Expert Comment

by:TTom
ID: 6193003
Are you sure that the pages you are trying this on have these tags?

People frequently design web pages and omit the <title> and other <meta> tags.

Tom
0
 
LVL 1

Author Comment

by:russoffl
ID: 6193020
Yeah - I used View Source to confirm the tags.  That is where I noticed (in WordPad) that the start of the tag was on one line, and the closing tag on the next - with a crlf.

0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 4

Expert Comment

by:aponcealbuerne
ID: 6193178
"." Matches any single character except a newline character.

I think you could use something like:


response.write "<br>" & stripHTMLTags("<title>(.|$*)?\<\/title>", xmlHTTP.responseText)

hope helps
0
 
LVL 25

Expert Comment

by:clockwatcher
ID: 6193959
I must be tired because I know this isn't the best way to do this, but it should work:

  <title>((\s|.)*)?</title>


I can't believe that VBScript doesn't have a multiline setting somewhere.  I think that the 5.5 script engine may.  You might try:

function stripHTMLTags(strPattern, strText)

   dim re

   set re = new RegExp
   re.pattern = strPattern
   re.ignorecase = true
   re.global = true
   re.multiline = true
       
   Set Matches = re.Execute(strText)
   for each match in matches
      str2 = str2 & Match.value
      response.write server.HTMLEncode(str2)
   next

end function

With your <title>(.*)?</title> regexp.
0
 
LVL 1

Author Comment

by:russoffl
ID: 6194037
Ver. 5 doesn't support re.multiline = true.

I tried the script without it anyway, and it just hung - as did the 2 dozen other combinations - or it could'nt find a pattern.

Just about ready to give up on this option!
0
 
LVL 11

Accepted Solution

by:
ASPGuru earned 30 total points
ID: 6194040
1. you don't need to escape "/" or ">"

2. quantifiers like * are greedy, so keep care... ("<meta(.*)>" will gwt you all bwtween "<meta" and ">" of "</html>")

first:
  re.ignorecase = true
  re.global = true
  re.multiline = true

then:
for title:
"<title>([\s\S]*)</title>"


for head:
"<head>([\s\S]*)</head>"

for what you meant with this: "<meta(.*)?\>":
"<meta([^>]*)>"


the last will produce more than one match if the page has more than one metatags, so handle this...
0
 
LVL 1

Author Comment

by:russoffl
ID: 6196174
Just as you said the ">" was the last in the </html>.

For anyone who may be working on a similar solution, I used <head>([\s\S]*)</head> to grab the <head>...</head>, and then assigned this to a variable.

I then replaced all chr(10) and chr(13) with "", and then replaced all ">" with ">" & chr(10) & chr(13) to force each tag onto a seperate line.  The only problem was <title>
...</title>, so I re-replaced <title> & chr(10) & chr(13) with <title>, forcing it all back onto one line.

Not sure what the impact would be for javascripts and css stuff in the header, but since I am only grabbing meta description, meta keywords, and title I don't have a problem (so far) with the code.

May not be tidy, but it works for my purpose for now.

Just wish people wrote "tidy" HTML.
0
 
LVL 11

Expert Comment

by:ASPGuru
ID: 6196505
wouldn't it be an option for just to replace all vbCrLf(=chr(10)&chr(13) btw...) with a space(" ") before and then run normal reg. expressions... to get everything you want...
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Need to rewrite code for checking if a file exists 3 72
Insert Button on a table 16 46
Set time on Session (ASP) 14 29
SQL to JSON 14 30
I would like to start this tip/trick by saying Thank You, to all who said that this could not be done, as it forced me to make sure that it could be accomplished. :) To start, I want to make sure everyone understands the importance of utilizing p…
Have you ever needed to get an ASP script to wait for a while? I have, just to let something else happen. Or in my case, to allow other stuff to happen while I was murdering my MySQL database with an update. The Original Issue This was written…
With Secure Portal Encryption, the recipient is sent a link to their email address directing them to the email laundry delivery page. From there, the recipient will be required to enter a user name and password to enter the page. Once the recipient …
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…

740 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question