Solved

HELP: Extract info with javascript and regular expressions from search results. For IE 5.0+ only

Posted on 2003-12-06
Last Modified: 2013-12-03
Dear Experts,

I need help extracting information from search results using JavaScript and regular expressions. The JavaScript code must be compatible with IE 5.0+ (compatibility with other browsers is not required).

So, given the following string/results:

-----START OF STRING/RESULTS-----

<HTML>
<HEAD>
<TITLE>Search Results: widgets</TITLE>
<STYLE><!--
body { margin: 3pt; }
// --></STYLE>
</HEAD>

<BODY BGCOLOR=#FFFFFF LINK=000099 ALINK=#CC0033 TEXT=#000000>
<FORM method=GET action=/ target=_self name=se>
<CENTER>
<IMG src=logos.gif width=203 height=52 alt=Search><br>
Enter your search:<br>
<INPUT type=text name=q value="widgets" size=30><font size=1><br></font>
<input type=hidden name=l value="en"><input type=hidden name=ie value="ISO-8859-1">
<INPUT TYPE="submit" NAME=btn1 VALUE="Search"><font size=1><br></font>
<INPUT TYPE="submit" NAME=btn2 VALUE="I'm Feeling with NO Lucky"><font size=1><br></font>
</CENTER>
</FORM>
<NOBR>1. <A TITLE="Gregory Seidman&amp;#39;s VRML2 Widgets. Notes. ... This is a complex widget consisting of four other widgets (Pause, Slider, Dial, UpDown) and some custom controls. ...  " TARGET=_main HREF=http://ovrt.nist.gov/gseidman/widgets.html>Gregory Seidman&#39;s VRML2 <b>Widgets</b></A></NOBR><BR>
<NOBR>2. <A TITLE="WiDGets for IE4. WiDGets are free authoring tools and accessibility add-ons for Microsoft Internet Explorer 4. x and higher for Windows 95/98/Me/NT4/2000/XP. ...  " TARGET=_main HREF=http://www.htmlhelp.com/tools/widgets/><b>WiDGets</b> for IE4</A></NOBR><BR>
<NOBR>3. <A TITLE="The Curses::Widgets modules are designed to provide rapid UI (user interface) design for console applications. Digital Mages. Life in binary. ...  (Curses::Widgets). ...  " TARGET=_main HREF=http://www.digitalmages.com/perl/CursesWidgets/>Digital Mages - Curses::<b>Widgets</b> Home</A></NOBR><BR>
<NOBR>4. <A TITLE="Click Here to continue. " TARGET=_main HREF=http://www.widgets.com/><b>widgets</b>.com</A></NOBR><BR>
<NOBR>5. <A TITLE=" ... password. We have numerous database powered &amp;quot;widgets&amp;quot; and a GUI html...[more]. ... databases. Web Widgets Ltd are based in Auckland, New Zealand. ...  " TARGET=_main HREF=http://www.web-widgets.net/>Content Management Software CMS, Web Hosting - Web <b>Widgets</b> Ltd</A></NOBR><BR>
<NOBR>6. <A TITLE="[ incr Widgets ]. Welcome to the official [incr Widgets] Web site! For those ... widgets. They look, act and feel like Tk widgets. In ...  " TARGET=_main HREF=http://incrtcl.sourceforge.net/iwidgets/>[incr <b>Widgets</b>] -&gt; home</A></NOBR><BR>

<hr>
<CENTER><A HREF=/?q=widgets&start=6>Next &raquo;</A></CENTER>
<font size=-1><p>&copy;2003 Search</font></center>
</BODY>
</HTML>
-----END OF STRING/RESULTS-----


I need first to extract the info of the search results:
 It starts after: </FORM> (or at the first <NOBR>)
 And ends before: <hr>
 -OR-
 Extract all info between each <NOBR>....</NOBR> into an array
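
A minimal sketch of the second option (collecting each <NOBR>...</NOBR> block into an array). The helper name extractNobrBlocks is hypothetical and not part of the original post. It assumes the blocks are not nested, and it sticks to String.search and substring rather than lazy quantifiers or Array.push, which older JScript engines such as the one in IE 5.0 reportedly lack.

```javascript
// Hypothetical helper: returns the markup found between each <NOBR> and </NOBR>.
function extractNobrBlocks(html) {
    var blocks = [];
    var rest = String(html);
    var start;
    // Repeatedly look for the next opening tag in what is left of the string.
    while ((start = rest.search(/<NOBR>/i)) != -1) {
        rest = rest.substring(start + 6);          // skip past "<NOBR>" (6 characters)
        var end = rest.search(/<\/NOBR>/i);
        if (end == -1) {
            break;                                 // unterminated block: stop here
        }
        blocks[blocks.length] = rest.substring(0, end);
        rest = rest.substring(end + 7);            // skip past "</NOBR>" (7 characters)
    }
    return blocks;
}
```

Each array element then holds one result line, such as '1. <A TITLE="..." ...>...</A>', ready for further parsing.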


Then I need to extract the results into a class array, per position, the following info:
1. URL from the HREF
2. Anchor Text
3. Text from TITLE on the LINK - not the page Title ;-)



The function to create the class can be something like:

function ResultsClass() {
      this.Pos = '';    // POSITION (not the number on results but a counter starting on 1)
      this.URL = '';    // URL
      this.Title = '';  // The Anchor Text
      this.Desc = '';   // Description (Text from TITLE)
}


The results class array can be named: Results_array


To write(display) the results on the page something like the following should be used:
for (var i = 0; i < Results_array.length; i++) {
      var strPos = Results_array[i].Pos;
      var strURL = Results_array[i].URL;
      var strTitle = Results_array[i].Title;
      var strDesc = Results_array[i].Desc;

      document.write (bla..bla..bla...)
}


The results on the page should be in the following format:
Pos: 1 - <a href="URL">Title</a><br>
The description (Desc)<br>
The URL<br>
<br>
Pos: 2 - <a href="URL">Title</a><br>
...and so on
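
A sketch of how the display loop could produce this format. The helper name renderResults is hypothetical; it only assumes that each element has the Pos, URL, Title and Desc properties of ResultsClass, and it returns the markup as a string instead of writing it directly.

```javascript
// Hypothetical helper: builds the markup for every result in the format shown above.
function renderResults(results) {
    var html = '';
    for (var i = 0; i < results.length; i++) {
        var r = results[i];
        html += 'Pos: ' + r.Pos + ' - <a href="' + r.URL + '">' + r.Title + '</a><br>\n' +
                r.Desc + '<br>\n' +
                r.URL + '<br>\n' +
                '<br>\n';
    }
    return html;
}

// In the page itself this would be used as: document.write(renderResults(Results_array));
```

Returning a string keeps the formatting separate from the output call, so the same function also works if more records are appended to the array before displaying.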


-----

NOTES:
1. Please do not use any method to extract the content by counting number of lines, as number of lines can be different per result.
2. If possible use Regular Expressions for extracting content.
3. Use 2 variables at the beginning to give the option to remove the bold tags (<b>...</b>) from the description and/or the title.
   For Example:
   removeBoldTitle = [1/0] - To remove or not bold tags <b>..</b> from the title
   removeBoldDesc = [1/0] - To remove or not bold tags <b>..</b> from the description.
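
One possible shape for these two switches, as a sketch only. The flag names come from the note above; the helper name stripBold is hypothetical.

```javascript
var removeBoldTitle = 1;   // 1 = strip <b>..</b> from the title, 0 = keep it
var removeBoldDesc = 0;    // 1 = strip <b>..</b> from the description, 0 = keep it

// Hypothetical helper: removes opening and closing bold tags when the flag is set.
function stripBold(text, remove) {
    if (!remove) {
        return text;
    }
    return String(text).replace(/<\/?b>/gi, '');
}

// Example: stripBold('<b>WiDGets</b> for IE4', removeBoldTitle) gives 'WiDGets for IE4'
```

The flag is passed in as an argument so the same function serves both the title and the description.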

-----

Important: Please do not post any code if not tested before AND for this situation (use the above string results to make tests). I need a solution that works ;-)

Thanks for your help,
CarMar
Question by:CarMar
 
LVL 3

Expert Comment

by:etain
ID: 9888224
     
obj = document.getElementsByTagName("nobr");
      for (i =0 ; i < obj.length; i++)
      {
            alert(obj[i].innerText)
      }
 
LVL 3

Expert Comment

by:etain
ID: 9888230
Can you tell me from where to where you want the info taken, and which part of the array it should go into?
 
LVL 3

Expert Comment

by:etain
ID: 9888284
Is this what you want? Add a span around the results to avoid looping.

<script language="javascript1.2">
onload= function()
{
      obj = document.getElementById("Result");
      obj = obj.getElementsByTagName("a");
      for (i =0 ; i < obj.length; i++)
      {
            alert(obj[i].href +"\n" + obj[i].innerText+"\n"+ obj[i].title);
      }
}
</script>

<span id="Result">
<nobr>1. <a title="Gregory Seidman&amp;#39;s VRML2 Widgets. Notes. ... This is a complex widget consisting of four other widgets (Pause, Slider, Dial, UpDown) and some custom controls. ...  " target=_main href=http://ovrt.nist.gov/gseidman/widgets.html>Gregory Seidman&#39;s VRML2 <b>Widgets</b></a></nobr><br>
<nobr>2. <a title="WiDGets for IE4. WiDGets are free authoring tools and accessibility add-ons for Microsoft Internet Explorer 4. x and higher for Windows 95/98/Me/NT4/2000/XP. ...  " target=_main href=http://www.htmlhelp.com/tools/widgets/><b>WiDGets</b> for IE4</A></nobr><br>
<nobr>3. <a title="The Curses::Widgets modules are designed to provide rapid UI (user interface) design for console applications. Digital Mages. Life in binary. ...  (Curses::Widgets). ...  " target=_main href=http://www.digitalmages.com/perl/CursesWidgets/>Digital Mages - Curses::<b>Widgets</b> Home</A></nobr><br>
<nobr>4. <a title="Click Here to continue. " target=_main href=http://www.widgets.com/><b>widgets</b>.com</A></nobr><br>
<nobr>5. <a title=" ... password. We have numerous database powered &amp;quot;widgets&amp;quot; and a GUI html...[more]. ... databases. Web Widgets Ltd are based in Auckland, New Zealand. ...  " target=_main href=http://www.web-widgets.net/>Content Management Software CMS, Web Hosting - Web <b>Widgets</b> Ltd</A></nobr><br>
<nobr>6. <a title="[ incr Widgets ]. Welcome to the official [incr Widgets] Web site! For those ... widgets. They look, act and feel like Tk widgets. In ...  " target=_main href=http://incrtcl.sourceforge.net/iwidgets/>[incr <b>Widgets</b>] -&gt; home</A></nobr><br>
</span>
<hr>
 
LVL 1

Author Comment

by:CarMar
ID: 9888368
etain,

Thanks for your postings so far, but I need you to understand that the results will be available in a variable, as they are obtained via a query to a search engine using ActiveXObject("Microsoft.XMLHTTP").

So, I can't just place those results in a span or in a div to extract the info. I need to extract them directly from a variable. That's why I thought of using RegExp.

Also, I need the info to be stored on a class because there might be situations where I need to extract more records from the search engine and add them to the class object and only display them at end.

Hope I did explain well what is needed.

Thanks
Carlos
 
LVL 3

Expert Comment

by:etain
ID: 9888467
I don't think RegExp is of use here, because there is no standard for how the title and URL will be formatted.

How do you call the function to store the info? From another window?
 
LVL 3

Expert Comment

by:etain
ID: 9888490
If this search always has the same format, then you can use this,
since the "Next" link doesn't have a title.

      obj = document.getElementsByTagName("a");
      for (i =0 ; i < obj.length; i++)
      {
            if(obj[i].href != "" && obj[i].innerText != ""&& obj[i].title != "")
            alert(obj[i].href +"\n" + obj[i].innerText+"\n"+ obj[i].title);
      }
 
LVL 3

Expert Comment

by:etain
ID: 9888505
You can write the result into a frame or div and then do the splitting.
 
LVL 1

Author Comment

by:CarMar
ID: 9888513
etain,

> I don't think RegExp is of use here, because there is no standard for how the title and URL will be formatted.
No problem with that... if they change, I'll change it.

> How do you call the function to store the info? From another window?
Same document.

Let me explain. Before this code (on the same page) I will use Microsoft.XMLHTTP to get the info from the search engine and store it in a variable, let's call it "SEResults". Then I want to extract the info above from the "SEResults" variable and store it in a class array. Once the whole process is done, I then use a for loop to get all elements from the class array and display them on the page.

My problem is how to get the info I need from the variable. I don't know if using getElementsByTagName on a variable will work. That's why I suggested RegExp, as I think it is a better solution for this situation.
 
LVL 3

Expert Comment

by:etain
ID: 9888676
Something like this.
Are the values I get the ones you wanted?

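function getResult(SEResults)
{
   // Parse the markup by loading it into a hidden, temporary DIV
   var temp = document.createElement("div");
   temp.style.visibility = 'hidden'
   document.body.appendChild(temp)
   temp.innerHTML = SEResults

   var obj = temp.getElementsByTagName("a");
   for (var i = 0; i < obj.length; i++)
   {
        // Only the result links carry a TITLE, so this skips the "Next" link
        if (obj[i].href != "" && obj[i].innerText != "" && obj[i].title != "")
        {
             // Append at the end so that skipped links leave no gaps in the array
             var n = Results_array.length
             Results_array[n] = new ResultsClass()
             Results_array[n].Pos = n + 1
             Results_array[n].URL = obj[i].href
             Results_array[n].Title = obj[i].innerText
             Results_array[n].Desc = obj[i].title
        }
   }
}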
 
LVL 1

Author Comment

by:CarMar
ID: 9888747
etain,

I really appreciate the time you are taking to help me, but the problem here is that I prefer not to use any div, frame or span for that, even if created dynamically. I just want to extract the values directly from the variable and I think the best solution is by using RegExp.

So, if possible, are you able to provide me a solution using RegExp or anything else (that doesn't use divs, frames or spans) to extract the info directly from the variable? If yes, here's a variable I created with the results to simulate a real situation that you can use for testing purposes:

<script>

// Variable with the search results

SEResults = ' \n' +
'<HTML>\n' +
'<HEAD>\n' +
'<TITLE>Search Results: widgets</TITLE>\n' +
'<STYLE><!--\n' +
'body { margin: 3pt; }\n' +
'// --></STYLE>\n' +
'</HEAD>\n' +
'\n' +
'<BODY BGCOLOR=#FFFFFF LINK=000099 ALINK=#CC0033 TEXT=#000000>\n' +
'<FORM method=GET action=/ target=_self name=se>\n' +
'<CENTER>\n' +
'<IMG src=logos.gif width=203 height=52 alt=Search><br>\n' +
'Enter your search:<br>\n' +
'<INPUT type=text name=q value="widgets" size=30><font size=1><br></font>\n' +
'<input type=hidden name=l value="en"><input type=hidden name=ie value="ISO-8859-1">\n' +
'<INPUT TYPE="submit" NAME=btn1 VALUE="Search"><font size=1><br></font>\n' +
'<INPUT TYPE="submit" NAME=btn2 VALUE="I\'m Feeling with NO Lucky"><font size=1><br></font>\n' +
'</CENTER>\n' +
'</FORM>\n' +
'<NOBR>1. <A TITLE="Gregory Seidman&amp;#39;s VRML2 Widgets. Notes. ... This is a complex widget consisting of four other widgets (Pause, Slider, Dial, UpDown) and some custom controls. ...  " TARGET=_main HREF=http://ovrt.nist.gov/gseidman/widgets.html>Gregory Seidman&#39;s VRML2 <b>Widgets</b></A></NOBR><BR>\n' +
'<NOBR>2. <A TITLE="WiDGets for IE4. WiDGets are free authoring tools and accessibility add-ons for Microsoft Internet Explorer 4. x and higher for Windows 95/98/Me/NT4/2000/XP. ...  " TARGET=_main HREF=http://www.htmlhelp.com/tools/widgets/><b>WiDGets</b> for IE4</A></NOBR><BR>\n' +
'<NOBR>3. <A TITLE="The Curses::Widgets modules are designed to provide rapid UI (user interface) design for console applications. Digital Mages. Life in binary. ...  (Curses::Widgets). ...  " TARGET=_main HREF=http://www.digitalmages.com/perl/CursesWidgets/>Digital Mages - Curses::<b>Widgets</b> Home</A></NOBR><BR>\n' +
'<NOBR>4. <A TITLE="Click Here to continue. " TARGET=_main HREF=http://www.widgets.com/><b>widgets</b>.com</A></NOBR><BR>\n' +
'<NOBR>5. <A TITLE=" ... password. We have numerous database powered &amp;quot;widgets&amp;quot; and a GUI html...[more]. ... databases. Web Widgets Ltd are based in Auckland, New Zealand. ...  " TARGET=_main HREF=http://www.web-widgets.net/>Content Management Software CMS, Web Hosting - Web <b>Widgets</b> Ltd</A></NOBR><BR>\n' +
'<NOBR>6. <A TITLE="[ incr Widgets ]. Welcome to the official [incr Widgets] Web site! For those ... widgets. They look, act and feel like Tk widgets. In ...  " TARGET=_main HREF=http://incrtcl.sourceforge.net/iwidgets/>[incr <b>Widgets</b>] -&gt; home</A></NOBR><BR>\n' +
'\n' +
'<hr>\n' +
'<CENTER><A HREF=/?q=widgets&start=6>Next &raquo;</A></CENTER>\n' +
'<font size=-1><p>&copy;2003 Search</font></center>\n' +
'</BODY>\n' +
'</HTML>\n' +
' ';

alert(SEResults); // Shows the SEResults value
</script>


Thanks a lot for your help and cooperation on this.
 
LVL 3

Accepted Solution

by:
etain earned 500 total points
ID: 9889159
This will do the splitting, but it will be slow if there are many records.

function getResult(SEResults)
{
      // Keep only the span from the first <NOBR> to the last </NOBR>, with entities decoded
      var src = String(SEResults)
      var temp = replacetext(src.substring(src.indexOf("<NOBR>"), src.lastIndexOf("</NOBR>") + 7))
      var i = 0
      var sPos, ePos
      do
      {
            // The TITLE value ends two characters before TARGET= (closing quote plus a space)
            var Desc = temp.substring(temp.indexOf('TITLE=') + 7, temp.indexOf('TARGET=') - 2)
            // temp2 is "url>anchor text": everything between HREF= and </A>
            var temp2 = temp.substring(temp.indexOf('HREF=') + 5, temp.indexOf('</A>'))
            var HREF = temp2.substring(0, temp2.indexOf('>'))
            var Title = temp2.substring(temp2.indexOf('>') + 1, temp2.length)

            Results_array[i] = new ResultsClass()
            Results_array[i].Pos = i + 1
            Results_array[i].URL = HREF
            Results_array[i].Title = Title
            Results_array[i].Desc = Desc
            i++   // move to the next slot (the i++ CarMar mentions below)

            // Drop the record just handled; stop once nothing is left after its </NOBR>
            sPos = temp.indexOf("</NOBR>") + 7
            ePos = temp.length
            temp = temp.substr(sPos, ePos)
      } while (sPos < ePos)
}

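function replacetext(str)
{
      // Add any other replacement
      str = String(str).replace(/&amp;/gi, "&");   // first pass: "&amp;quot;" becomes "&quot;"
      str = str.replace(/&quot;/gi, "\"");         // match the whole entity so no stray "&" is left behind
      str = str.replace(/&#39;/gi, "'");
      str = str.replace(/<b>|<\/b>/gi, "");        // strips bold tags from titles and descriptions
      return str
}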
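
For comparison, a regular-expression variant of the same extraction, as a sketch rather than a tested IE 5.0 solution. It assumes every result anchor lists its attributes in the order shown in the sample (a quoted TITLE, then TARGET, then an unquoted HREF). The names parseResults and anchorOpen are hypothetical; ResultsClass is the constructor from the question. Entities and <b> tags are left untouched.

```javascript
function ResultsClass() {
    this.Pos = '';    // counter starting at 1
    this.URL = '';    // value of HREF
    this.Title = '';  // anchor text
    this.Desc = '';   // value of TITLE
}

// Opening tag of a result link: captures the TITLE value (1) and the HREF value (2).
var anchorOpen = /<A\s+TITLE="([^"]*)"\s+TARGET=[^\s>]+\s+HREF=([^\s>]+)>/i;

function parseResults(html) {
    var results = [];
    // Cutting at each </A> leaves at most one opening anchor tag per chunk.
    var chunks = String(html).split(/<\/A>/i);
    for (var i = 0; i < chunks.length; i++) {
        var m = chunks[i].match(anchorOpen);
        if (m) {
            // The anchor text is whatever follows the opening tag in this chunk.
            var textStart = chunks[i].indexOf(m[0]) + m[0].length;
            var item = new ResultsClass();
            item.Pos = results.length + 1;
            item.URL = m[2];
            item.Title = chunks[i].substring(textStart);
            item.Desc = m[1];
            results[results.length] = item;
        }
    }
    return results;
}
```

Links without a TITLE attribute, such as the "Next" link, do not match the pattern and are skipped. If the search engine changes the attribute order, anchorOpen has to be adjusted.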
 
LVL 11

Expert Comment

by:Zontar
ID: 9889905
(WTF is <NOBR> supposed to be? There's no such tag in any W3C spec that I'm aware of.)

Why the requirement for regexps -- you should be able to do this using nothing but DOM, and more neatly. (Well, okay, I cheated and used split() and replace() a couple of times...)

function searchResult(link)
{
  this.href = link.getAttribute("HREF");
  this.title = link.getAttribute("TITLE");
  this.withBold = link.firstChild.xml;
  this.withoutBold = link.text;
}

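var searchString = "";  // the string of HTML returned from the search; should be available from the Microsoft.XMLHTTP object
// Keep the markup between </FORM> and <hr>, drop every <BR> and newline (a string pattern would only replace the first match), and wrap it in a root element
searchString = "<search>" + searchString.split("</FORM>")[1].split("<hr>")[0].replace(/<BR>/gi, "").replace(/\n/g, "") + "</search>";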

var DomDoc = new ActiveXObject("Microsoft.XMLDOM");
DomDoc.resolveExternals = false;
DomDoc.async = false;
DomDoc.loadXML(searchString);

var searchResults = new Array();
var found = DomDoc.getElementsByTagName("A");
for(i = 0; i < found.length - 1; i++)
  searchResults[i] = new searchResult(found[i]);

I can't live-test this since I don't have the URL you're retrieving the search results from, but I believe the methodology should be basically sound.
 
LVL 1

Author Comment

by:CarMar
ID: 9892121
Quick notes:

etain: I used a mix of your solutions, and you were right. The last one provided is not the fastest, and it was missing the i++ before the while. Anyway, thanks for providing more than one working solution.

Zontar: Your solution didn't work because the returned string could not be loaded as XML - DomDoc.loadXML was returning false when I checked its status. If I removed all the attributes inside the A tag (title, href and target) it worked... but then the info I needed could not be retrieved. Anyway, thanks for your time.
 
LVL 1

Author Comment

by:CarMar
ID: 9892125
Zontar:

> (WTF is <NOBR> supposed to be? There's no such tag in any W3C spec that I'm aware of.)

NOBR Element | noBR Object - Renders text without line breaks.
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/objects/nobr.asp

According to Microsoft, this object is an extension to HTML (http://www.w3.org/TR/REC-html32.html), but I couldn't find anything about it at w3.org.
Question has a verified solution.
