Solved

HELP: Extract info with javascript and regular expressions from search results. For IE 5.0+ only

Posted on 2003-12-06
14
640 Views
Last Modified: 2013-12-03
Dear Experts,

I need help extracting info from search results with JavaScript and regular expressions. The JavaScript code must be compatible with IE 5.0+ (compatibility with other browsers is not required).

So, given the following string/results:

-----START OF STRING/RESULTS-----

<HTML>
<HEAD>
<TITLE>Search Results: widgets</TITLE>
<STYLE><!--
body { margin: 3pt; }
// --></STYLE>
</HEAD>

<BODY BGCOLOR=#FFFFFF LINK=000099 ALINK=#CC0033 TEXT=#000000>
<FORM method=GET action=/ target=_self name=se>
<CENTER>
<IMG src=logos.gif width=203 height=52 alt=Search><br>
Enter your search:<br>
<INPUT type=text name=q value="widgets" size=30><font size=1><br></font>
<input type=hidden name=l value="en"><input type=hidden name=ie value="ISO-8859-1">
<INPUT TYPE="submit" NAME=btn1 VALUE="Search"><font size=1><br></font>
<INPUT TYPE="submit" NAME=btn2 VALUE="I'm Feeling with NO Lucky"><font size=1><br></font>
</CENTER>
</FORM>
<NOBR>1. <A TITLE="Gregory Seidman&amp;#39;s VRML2 Widgets. Notes. ... This is a complex widget consisting of four other widgets (Pause, Slider, Dial, UpDown) and some custom controls. ...  " TARGET=_main HREF=http://ovrt.nist.gov/gseidman/widgets.html>Gregory Seidman&#39;s VRML2 <b>Widgets</b></A></NOBR><BR>
<NOBR>2. <A TITLE="WiDGets for IE4. WiDGets are free authoring tools and accessibility add-ons for Microsoft Internet Explorer 4. x and higher for Windows 95/98/Me/NT4/2000/XP. ...  " TARGET=_main HREF=http://www.htmlhelp.com/tools/widgets/><b>WiDGets</b> for IE4</A></NOBR><BR>
<NOBR>3. <A TITLE="The Curses::Widgets modules are designed to provide rapid UI (user interface) design for console applications. Digital Mages. Life in binary. ...  (Curses::Widgets). ...  " TARGET=_main HREF=http://www.digitalmages.com/perl/CursesWidgets/>Digital Mages - Curses::<b>Widgets</b> Home</A></NOBR><BR>
<NOBR>4. <A TITLE="Click Here to continue. " TARGET=_main HREF=http://www.widgets.com/><b>widgets</b>.com</A></NOBR><BR>
<NOBR>5. <A TITLE=" ... password. We have numerous database powered &amp;quot;widgets&amp;quot; and a GUI html...[more]. ... databases. Web Widgets Ltd are based in Auckland, New Zealand. ...  " TARGET=_main HREF=http://www.web-widgets.net/>Content Management Software CMS, Web Hosting - Web <b>Widgets</b> Ltd</A></NOBR><BR>
<NOBR>6. <A TITLE="[ incr Widgets ]. Welcome to the official [incr Widgets] Web site! For those ... widgets. They look, act and feel like Tk widgets. In ...  " TARGET=_main HREF=http://incrtcl.sourceforge.net/iwidgets/>[incr <b>Widgets</b>] -&gt; home</A></NOBR><BR>

<hr>
<CENTER><A HREF=/?q=widgets&start=6>Next &raquo;</A></CENTER>
<font size=-1><p>&copy;2003 Search</font></center>
</BODY>
</HTML>
-----END OF STRING/RESULTS-----


I need first to extract the info of the search results:
 It starts after: </FORM> (or at the first <NOBR>)
 And ends before: <hr>
 -OR-
 Extract all info between each <NOBR>....</NOBR> into an array
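
A minimal sketch of the second option (one array entry per <NOBR>...</NOBR> block), assuming the tag case shown in the sample string. The function name extractNobrBlocks is hypothetical. It uses only indexOf, substring and split, so it does not rely on non-greedy quantifiers or Array.push, which older JScript versions may lack.

```javascript
// Hypothetical helper: returns the contents of every <NOBR>...</NOBR> block.
// Assumes the tags appear exactly as in the sample (</FORM>, <hr>, <NOBR>).
function extractNobrBlocks(html) {
    var blocks = new Array();

    // Narrow the string to the part between </FORM> and <hr>, when both exist.
    var start = html.indexOf('</FORM>');
    var end = html.indexOf('<hr>');
    if (start != -1 && end > start) {
        html = html.substring(start + 7, end);
    }

    // Each piece before a </NOBR> holds at most one result.
    var parts = html.split('</NOBR>');
    for (var i = 0; i < parts.length; i++) {
        var open = parts[i].indexOf('<NOBR>');
        if (open != -1) {
            // Append without Array.push; 6 is the length of "<NOBR>".
            blocks[blocks.length] = parts[i].substring(open + 6);
        }
    }
    return blocks;
}
```

Because nothing here counts lines, a result that wraps over several lines is still returned as one entry.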


Then, for each result, I need to extract the following info into a class array, in order of position:
1. URL from the HREF
2. Anchor Text
3. Text from TITLE on the LINK - not the page Title ;-)



The function to create the class can be something like:

function ResultsClass() {
      this.Pos = '';    // POSITION (not the number on results but a counter starting on 1)
      this.URL = '';    // URL
      this.Title = '';  // The Anchor Text
      this.Desc = '';   // Description (Text from TITLE)
}


The results class array can be named: Results_array


To write(display) the results on the page something like the following should be used:
for (var i = 0; i < Results_array.length; i++) {
      var strPos = Results_array[i].Pos;
      var strURL = Results_array[i].URL;
      var strTitle = Results_array[i].Title;
      var strDesc = Results_array[i].Desc;

      document.write (bla..bla..bla...)
}


The results on the page should be in the following format:
Pos: 1 - <a href="URL">Title</a><br>
The description (Desc)<br>
The URL<br>
<br>
Pos: 2 - <a href="URL">Title</a><br>
...and so on
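
The output format above could be produced along these lines. This is only a sketch: formatResults is a hypothetical name, and it builds one string rather than calling document.write inside the loop, so the caller decides where the markup goes.

```javascript
// Hypothetical helper: builds the display markup for an array of result objects.
// Each object is expected to carry Pos, URL, Title and Desc, as in ResultsClass.
function formatResults(results) {
    var html = '';
    for (var i = 0; i < results.length; i++) {
        var r = results[i];
        html += 'Pos: ' + r.Pos + ' - <a href="' + r.URL + '">' + r.Title + '</a><br>\n' +
                r.Desc + '<br>\n' +
                r.URL + '<br>\n' +
                '<br>\n';
    }
    return html;
}
```

On the page, the string could then be written in one call, for example document.write(formatResults(Results_array)).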


-----

NOTES:
1. Please do not use any method to extract the content by counting number of lines, as number of lines can be different per result.
2. If possible use Regular Expressions for extracting content.
3. Use 2 variables at the beginning to give the option to remove the bold tags (<b>...</b>) from the description and/or the title.
   For Example:
   removeBoldTitle = [1/0] - To remove or not bold tags <b>..</b> from the title
   removeBoldDesc = [1/0] - To remove or not bold tags <b>..</b> from the description.
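
A sketch of how the two switches from note 3 might be applied to one result object. The helper names stripBold and applyBoldOptions are hypothetical; the flag names are the ones proposed above.

```javascript
var removeBoldTitle = 1;  // 1 = strip <b>..</b> from the title, 0 = keep
var removeBoldDesc = 0;   // 1 = strip <b>..</b> from the description, 0 = keep

// Removes opening and closing bold tags, in either letter case.
function stripBold(text) {
    return text.replace(/<\/?b>/gi, '');
}

// Applies the two switches to an object that has Title and Desc properties.
function applyBoldOptions(result) {
    if (removeBoldTitle) {
        result.Title = stripBold(result.Title);
    }
    if (removeBoldDesc) {
        result.Desc = stripBold(result.Desc);
    }
    return result;
}
```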

-----

Important: Please do not post any code if not tested before AND for this situation (use the above string results to make tests). I need a solution that works ;-)

Thanks for your help,
CarMar
Question by:CarMar
14 Comments
 
LVL 3

Expert Comment

by:etain
Comment Utility
     
obj = document.getElementsByTagName("nobr");
      for (i =0 ; i < obj.length; i++)
      {
            alert(obj[i].innerText)
      }
0
 
LVL 3

Expert Comment

by:etain
Comment Utility
Can you tell me from where to where you want the info extracted, and which part of the array it should go into?
0
 
LVL 3

Expert Comment

by:etain
Comment Utility
Is this what you want? Add a span around the results to avoid looping over the whole page.

<script language="javascript1.2">
onload= function()
{
      obj = document.getElementById("Result");
      obj = obj.getElementsByTagName("a");
      for (i =0 ; i < obj.length; i++)
      {
            alert(obj[i].href +"\n" + obj[i].innerText+"\n"+ obj[i].title);
      }
}
</script>

<span id="Result">
<nobr>1. <a title="Gregory Seidman&amp;#39;s VRML2 Widgets. Notes. ... This is a complex widget consisting of four other widgets (Pause, Slider, Dial, UpDown) and some custom controls. ...  " target=_main href=http://ovrt.nist.gov/gseidman/widgets.html>Gregory Seidman&#39;s VRML2 <b>Widgets</b></a></nobr><br>
<nobr>2. <a title="WiDGets for IE4. WiDGets are free authoring tools and accessibility add-ons for Microsoft Internet Explorer 4. x and higher for Windows 95/98/Me/NT4/2000/XP. ...  " target=_main href=http://www.htmlhelp.com/tools/widgets/><b>WiDGets</b> for IE4</A></nobr><br>
<nobr>3. <a title="The Curses::Widgets modules are designed to provide rapid UI (user interface) design for console applications. Digital Mages. Life in binary. ...  (Curses::Widgets). ...  " target=_main href=http://www.digitalmages.com/perl/CursesWidgets/>Digital Mages - Curses::<b>Widgets</b> Home</A></nobr><br>
<nobr>4. <a title="Click Here to continue. " target=_main href=http://www.widgets.com/><b>widgets</b>.com</A></nobr><br>
<nobr>5. <a title=" ... password. We have numerous database powered &amp;quot;widgets&amp;quot; and a GUI html...[more]. ... databases. Web Widgets Ltd are based in Auckland, New Zealand. ...  " target=_main href=http://www.web-widgets.net/>Content Management Software CMS, Web Hosting - Web <b>Widgets</b> Ltd</A></nobr><br>
<nobr>6. <a title="[ incr Widgets ]. Welcome to the official [incr Widgets] Web site! For those ... widgets. They look, act and feel like Tk widgets. In ...  " target=_main href=http://incrtcl.sourceforge.net/iwidgets/>[incr <b>Widgets</b>] -&gt; home</A></nobr><br>
</span>
<hr>
0
 
LVL 1

Author Comment

by:CarMar
Comment Utility
etain,

Thanks for your postings so far, but I need you to understand that the results will be available in a variable, as they are obtained via a query to a search engine using ActiveXObject("Microsoft.XMLHTTP").

So, I can't just place those results in a span or in a div to extract the info. I need to extract them directly from a variable. That's why I thought of using RegExp.

Also, I need the info to be stored in a class because there might be situations where I need to extract more records from the search engine, add them to the class object, and only display them at the end.

I hope I explained well what is needed.

Thanks
Carlos
0
 
LVL 3

Expert Comment

by:etain
Comment Utility
I don't think RegExp is of much use here, because there is no standard for how the title and URL will appear.

How do you call the function to store the info? From another window?
0
 
LVL 3

Expert Comment

by:etain
Comment Utility
If this search always has the same format, then you can use this,
since the "Next" link doesn't have a title.

      obj = document.getElementsByTagName("a");
      for (i =0 ; i < obj.length; i++)
      {
            if(obj[i].href != "" && obj[i].innerText != ""&& obj[i].title != "")
            alert(obj[i].href +"\n" + obj[i].innerText+"\n"+ obj[i].title);
      }
0
 
LVL 3

Expert Comment

by:etain
Comment Utility
You can write the result into a frame or div and then do the splitting.
0
 
LVL 1

Author Comment

by:CarMar
Comment Utility
etain,

> Dont think it is use RegExp cause there wasn't a standard on how the title and url will be.
No problem with that... if they change, I'll change it.

> How do u call the function to store the info, other window ??
Same document.

Let me explain. Before this code (on the same page) I will use Microsoft.XMLHTTP to get the info from the search engine and store it in a variable, let's call it "SEResults". Then I want to extract the info above from the "SEResults" variable and store it in a class array. Once the whole process is done, I then use a for loop to get all elements from the class array and display them on the page.

My problem is how to get the info I need from the variable. I don't know if using getElementsByTagName on a variable will work. That's why I suggested RegExp, as I think it is a better solution for this situation.
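
A minimal sketch of the RegExp route described here, working directly on the string with no div or span. parseResults is a hypothetical name, and the pattern assumes the attribute order of the sample (TITLE in double quotes first, then an unquoted HREF). The string is split on </NOBR> first so that each piece holds a single </A>, which makes a plain greedy (.*) safe and avoids non-greedy quantifiers.

```javascript
function ResultsClass() {
    this.Pos = '';    // counter starting at 1
    this.URL = '';    // URL from the HREF
    this.Title = '';  // the anchor text
    this.Desc = '';   // text from the TITLE attribute
}

var Results_array = new Array();

// Hypothetical function: appends one ResultsClass per result found in html.
// Appending keeps Pos continuous when several result pages are parsed in turn.
function parseResults(html, results) {
    // Groups: 1 = TITLE text, 2 = HREF value, 3 = anchor text (may contain <b>).
    var re = /<NOBR>[^<]*<A\s+TITLE="([^"]*)"[^>]*\sHREF=([^\s>]+)[^>]*>(.*)<\/A>/i;
    var chunks = html.split('</NOBR>');
    for (var i = 0; i < chunks.length; i++) {
        var m = re.exec(chunks[i]);
        if (m) {
            var r = new ResultsClass();
            r.Pos = results.length + 1;
            r.Desc = m[1];
            r.URL = m[2];
            r.Title = m[3];
            results[results.length] = r;
        }
    }
    return results;
}
```

With the variable from this thread, the call would be parseResults(SEResults, Results_array). The "Next" link is skipped because it sits outside any <NOBR> block.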
0
 
LVL 3

Expert Comment

by:etain
Comment Utility
Something like this..
is the values that i get the one u wanted?

function getResult(SEResults)
{
   // Parse the markup in a hidden, temporary container
   var temp = document.createElement("div");
   temp.style.visibility = 'hidden';
   document.body.appendChild(temp);
   temp.innerHTML = SEResults;

   var obj = temp.getElementsByTagName("a");
   for (var i = 0; i < obj.length; i++)
   {
      // The "Next" link has no title, so it is skipped
      if (obj[i].href != "" && obj[i].innerText != "" && obj[i].title != "")
      {
         // Append at the end so skipped links leave no gaps in the array
         var n = Results_array.length;
         Results_array[n] = new ResultsClass();
         Results_array[n].Pos = n + 1;
         Results_array[n].URL = obj[i].href;
         Results_array[n].Title = obj[i].innerText;
         Results_array[n].Desc = obj[i].title;
      }
   }
}
0
 
LVL 1

Author Comment

by:CarMar
Comment Utility
etain,

I really appreciate the time you are taking to help me, but the problem here is that I prefer not to use any div, frame or span for that, even if created dynamically. I just want to extract the values directly from the variable, and I think the best solution is to use RegExp.

So, if possible, are you able to provide me with a solution using RegExp or anything else (that doesn't use divs, frames or spans) to extract the info directly from the variable? If yes, here's a variable I created with the results to simulate a real situation that you can use for testing purposes:

<script>

// Variable with the search results

SEResults = ' \n' +
'<HTML>\n' +
'<HEAD>\n' +
'<TITLE>Search Results: widgets</TITLE>\n' +
'<STYLE><!--\n' +
'body { margin: 3pt; }\n' +
'// --></STYLE>\n' +
'</HEAD>\n' +
'\n' +
'<BODY BGCOLOR=#FFFFFF LINK=000099 ALINK=#CC0033 TEXT=#000000>\n' +
'<FORM method=GET action=/ target=_self name=se>\n' +
'<CENTER>\n' +
'<IMG src=logos.gif width=203 height=52 alt=Search><br>\n' +
'Enter your search:<br>\n' +
'<INPUT type=text name=q value="widgets" size=30><font size=1><br></font>\n' +
'<input type=hidden name=l value="en"><input type=hidden name=ie value="ISO-8859-1">\n' +
'<INPUT TYPE="submit" NAME=btn1 VALUE="Search"><font size=1><br></font>\n' +
'<INPUT TYPE="submit" NAME=btn2 VALUE="I\'m Feeling with NO Lucky"><font size=1><br></font>\n' +
'</CENTER>\n' +
'</FORM>\n' +
'<NOBR>1. <A TITLE="Gregory Seidman&amp;#39;s VRML2 Widgets. Notes. ... This is a complex widget consisting of four other widgets (Pause, Slider, Dial, UpDown) and some custom controls. ...  " TARGET=_main HREF=http://ovrt.nist.gov/gseidman/widgets.html>Gregory Seidman&#39;s VRML2 <b>Widgets</b></A></NOBR><BR>\n' +
'<NOBR>2. <A TITLE="WiDGets for IE4. WiDGets are free authoring tools and accessibility add-ons for Microsoft Internet Explorer 4. x and higher for Windows 95/98/Me/NT4/2000/XP. ...  " TARGET=_main HREF=http://www.htmlhelp.com/tools/widgets/><b>WiDGets</b> for IE4</A></NOBR><BR>\n' +
'<NOBR>3. <A TITLE="The Curses::Widgets modules are designed to provide rapid UI (user interface) design for console applications. Digital Mages. Life in binary. ...  (Curses::Widgets). ...  " TARGET=_main HREF=http://www.digitalmages.com/perl/CursesWidgets/>Digital Mages - Curses::<b>Widgets</b> Home</A></NOBR><BR>\n' +
'<NOBR>4. <A TITLE="Click Here to continue. " TARGET=_main HREF=http://www.widgets.com/><b>widgets</b>.com</A></NOBR><BR>\n' +
'<NOBR>5. <A TITLE=" ... password. We have numerous database powered &amp;quot;widgets&amp;quot; and a GUI html...[more]. ... databases. Web Widgets Ltd are based in Auckland, New Zealand. ...  " TARGET=_main HREF=http://www.web-widgets.net/>Content Management Software CMS, Web Hosting - Web <b>Widgets</b> Ltd</A></NOBR><BR>\n' +
'<NOBR>6. <A TITLE="[ incr Widgets ]. Welcome to the official [incr Widgets] Web site! For those ... widgets. They look, act and feel like Tk widgets. In ...  " TARGET=_main HREF=http://incrtcl.sourceforge.net/iwidgets/>[incr <b>Widgets</b>] -&gt; home</A></NOBR><BR>\n' +
'\n' +
'<hr>\n' +
'<CENTER><A HREF=/?q=widgets&start=6>Next &raquo;</A></CENTER>\n' +
'<font size=-1><p>&copy;2003 Search</font></center>\n' +
'</BODY>\n' +
'</HTML>\n' +
' ';

alert(SEResults); // Shows the SEResults value
</script>


Thanks a lot for your help and cooperation on this.
0
 
LVL 3

Accepted Solution

by:
etain earned 500 total points
Comment Utility
This will do the splitting, but it will be slow if there are many records.

function getResult(SEResults)
{
      SEResults = String(SEResults)
      // Keep only the span from the first <NOBR> to the last </NOBR>
      var temp = replacetext(SEResults.substring(SEResults.indexOf("<NOBR>"), SEResults.lastIndexOf("</NOBR>") + 7))
      var i = 0, sPos, ePos
      do
      {
            // TITLE="..." ends two characters before TARGET=
            var Desc = temp.substring(temp.indexOf('TITLE=') + 7, temp.indexOf('TARGET=') - 2)
            // temp2 holds: url>anchor text
            var temp2 = temp.substring(temp.indexOf('HREF=') + 5, temp.indexOf('</A>'))
            var HREF = temp2.substring(0, temp2.indexOf('>'))
            var Title = temp2.substring(temp2.indexOf('>') + 1, temp2.length)

            Results_array[i] = new ResultsClass()
            Results_array[i].Pos = i + 1
            Results_array[i].URL = HREF
            Results_array[i].Title = Title
            Results_array[i].Desc = Desc

            // Debug output
            alert(Desc)
            alert(HREF)
            alert(Title)

            // Drop the record just processed and continue with the rest
            sPos = temp.indexOf("</NOBR>") + 7
            ePos = temp.length
            temp = temp.substr(sPos, ePos)
            i++   // next slot (this increment was missing from the code as first posted)
      } while (sPos < ePos)
}

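function replacetext(str)
{
      // Add any other replacements here
      // The titles are double-encoded (&amp;quot;), so decode &amp; first
      str = String(str).replace(/&amp;/gi, "&");
      // Match the leading "&" as well, otherwise a stray "&" is left behind
      str = str.replace(/&quot;/gi, "\"");
      str = str.replace(/&#39;/gi, "'");
      // Strip bold tags
      str = str.replace(/<b>|<\/b>/gi, "");
      return str
}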
0
 
LVL 11

Expert Comment

by:Zontar
Comment Utility
(WTF is <NOBR> supposed to be? There's no such tag in any W3C spec that I'm aware of.)

Why the requirement for regexps -- you should be able to do this using nothing but DOM, and more neatly. (Well, okay, I cheated and used split() and replace() a couple of times...)

function searchResult(link)
{
  this.href = link.getAttribute("HREF");
  this.title = link.getAttribute("TITLE");
  this.withBold = link.firstChild.xml;
  this.withoutBold = link.text;
}

var searchString = "";  // the string of HTML returned from the search; should be available from the Microsoft.XMLHTTP object
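// A string pattern replaces only the first match, so regexes with the g flag are used to remove every <BR> and newline
searchString = "<search>" + searchString.split("</FORM>")[1].split("<hr>")[0].replace(/<BR>/gi, "").replace(/\n/g, "") + "</search>";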

var DomDoc = new ActiveXObject("Microsoft.XMLDOM");
DomDoc.resolveExternals = false;
DomDoc.async = false;
DomDoc.loadXML(searchString);

var searchResults = new Array();
var found = DomDoc.getElementsByTagName("A");
// The "Next" link was already cut off at <hr>, so every remaining link is a result
for (var i = 0; i < found.length; i++)
  searchResults[i] = new searchResult(found[i]);

I can't live-test this since I don't have the URL you're retrieving the search results from, but I believe the methodology should be basically sound.
0
 
LVL 1

Author Comment

by:CarMar
Comment Utility
Quick notes:

etain: I used a mix of your solutions, and you were right: the last one provided is not fast, and it was missing the i++ before the while. Anyway, thanks for providing more than one working solution.

Zontar: Your solution didn't work because the returned string could not be loaded as XML - DomDoc.loadXML returned false when I checked its status. If I removed all the attributes inside the A tag (title, href and target) it worked... but then the info I needed couldn't be retrieved. Anyway, thanks for your time.
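
A likely reason for the loadXML failure: XML requires every attribute value to be quoted, while the search results use unquoted values such as TARGET=_main and HREF=http://... A rough sketch of a pre-processing step that quotes them is shown below. quoteAttributes is a hypothetical helper, and it assumes that no quoted value (for example a TITLE text) itself contains something shaped like name=value. It has not been tried against Microsoft.XMLDOM.

```javascript
// Hypothetical helper: wraps unquoted attribute values in double quotes.
// Values that already start with a quote are left alone, because the value
// part of the pattern must begin with a non-quote character.
function quoteAttributes(markup) {
    return markup.replace(/(\s\w+)=([^\s"'>]+)/g, '$1="$2"');
}
```

Other things can still stop the string from being well-formed XML, such as unclosed <BR> tags or a bare & inside a URL, so checking the parse result after loadXML remains worthwhile.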
0
 
LVL 1

Author Comment

by:CarMar
Comment Utility
Zontar:

> (WTF is <NOBR> supposed to be? There's no such tag in any W3C spec that I'm aware of.)

NOBR Element | noBR Object - Renders text without line breaks.
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/objects/nobr.asp

According to Microsoft, this object is an extension to HTML (http://www.w3.org/TR/REC-html32.html), but I couldn't find anything about it at the W3C.
0
