
Reading a Series of Webpages With JavaScript

DanRollins
This article shows how to read a single webpage's HTML into a string variable, and it also shows how to automate a sequence so you can read and process a list of webpages.  I was tasked with reading and parsing a series of webpages to collect some statistics, and I thought that the techniques I used would be helpful to other Windows programmers.

XMLHttpRequest

The tool of interest is the XMLHttpRequest object that is a member of the window object of your browser's DOM (Document Object Model).  This object is at the core of Ajax operations and is often used for "delayed load" or "asynchronous include" operations in which additional information is obtained from a web server after a page is loaded.  It is commonly used to obtain data in response to a user action, without needing to do a full "submit" and loading a new page.

The basic page-reading operation is simple, as illustrated here:
var gfPgDone;

function ProcessPage(sPgTxt) {
    alert( sPgTxt );
} 
function CollectPageInfo( sURL ) {
    gfPgDone= false;
    var req = new XMLHttpRequest();
    req.onreadystatechange = function() {
        if (req.readyState == 4) {
            ProcessPage( req.responseText );
            gfPgDone= true;
        }
    }
    req.open( "GET", sURL, true ); // true for async operation
    req.send();
}


The critical (and possibly confusing) factor is that the operation is asynchronous; that is, when you call CollectPageInfo, control returns immediately, but the new data is not yet available.  So the code sets a globally visible Boolean flag, gfPgDone, to false before the request starts (line 7) and to true once the page has been processed (line 12), so that your program can know when the page has been loaded.  That whole asynchronous issue is covered in detail below.
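
To make the "control returns immediately" point concrete, here is a minimal sketch that runs without a network connection.  It assumes a stand-in function, FakeAsyncRead, which uses setTimeout to imitate a slow download; that function and the gaLog array are hypothetical names used only for this illustration.

```javascript
var gaLog = [];          // hypothetical: records the order in which things happen
var gfPgDone = false;

// Hypothetical stand-in for an asynchronous page read (not a real API).
// It delivers some text to fnDone a little later, as a real request would.
function FakeAsyncRead(sURL, fnDone) {
    setTimeout(function () {
        fnDone("<html>(pretend page text for " + sURL + ")</html>");
    }, 10);
}

function CollectPageInfo(sURL) {
    gfPgDone = false;
    FakeAsyncRead(sURL, function (sPgTxt) {
        gaLog.push("page arrived");
        gfPgDone = true;
    });
    gaLog.push("CollectPageInfo returned");
}

CollectPageInfo("http://www.example.com/");
gaLog.push("caller continues, gfPgDone=" + gfPgDone);
```

At this point the caller has already moved on while gfPgDone is still false; the "page arrived" entry is added only later, when the simulated download completes.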

For my purposes, I decided to write an HTA to contain the processing logic.  An HTA is flexible and easy to develop.  See my article, HTA - Hypertext Application tutorial, for lots of related information and reference material.

Let's get started!  Create a text file on your desktop.  Edit it to contain, for instance,
<html>
<script>
//---------------------------------------------------
function DoTest() // on button click
{
    // instantiate and exercise ActiveX objects, etc.
    oDivDisplayArea.innerHTML="<b>test complete!</b>";
}
</script>

<!-- I like to put the U/I stuff at the bottom -->
<!-- ***************************************** -->
<body onload="window.resizeTo(400,300);">
<div align=right><input type=button value="restart" 
    onclick= "document.location.reload();"></div>

<input type=button value='DoTest' onclick='DoTest();'>
<div id=oDivDisplayArea>
   <font color=gray>(stuff will be displayed here)</font>
</div>
</body>
</html>


Rename the file to give it an extension of .HTA ... and that is your starting "HTA skeleton."   Double click the HTA file's icon to run it:
[Screenshot: the skeleton HTA window]

The [restart] button saves time during development.  Use your text editor to make changes and save the HTA file.  Then click [restart] to activate and test those changes; that is, you don't need to exit and restart the program between edits.

To that simple HTA skeleton, let's add the logic to read a webpage.  Here's the entire functional HTA page that grabs the Google home page and displays its HTML:
<html>
<script>
var gfPgDone;  // global variable
function ProcessPage( sPgTxt ){
    oDivDisplayArea.innerText= sPgTxt;
}
function CollectPageInfo( sURL ) {
    gfPgDone= false;
    var req = new XMLHttpRequest();
    req.onreadystatechange = function() {
        if (req.readyState == 4) {
            ProcessPage( req.responseText );
            gfPgDone= true;
        }
    }
    req.open( "GET", sURL, true ); // true for async operation
    req.send();
}
//---------------------------------------------------
function DoTest() // on button click
{
    CollectPageInfo( "http://www.google.com/" );
}
</script>
<!-- ***************************************** -->
<body onload="window.resizeTo(400,300);">
<div align=right><input type=button value="restart" 
    onclick= "document.location.reload();"></div>

<input type=button value='DoTest' onclick='DoTest();'>
<div id=oDivDisplayArea>
   <font color=gray>(stuff will be displayed here)</font>
</div>
</body>
</html>


When you click the [DoTest] button, the program starts the process that reads the Google home page.  A few milliseconds later, the raw HTML of that page is displayed:
[Screenshot: after clicking DoTest, the downloaded results are shown]

Now you can use all manner of procedural logic and/or regular-expression (GREP-style) functions to "screen scrape" data from the HTML that you have just pulled from the web.
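
As a rough illustration of that kind of screen scraping, the sketch below pulls the page title and the link targets out of an HTML string.  The function names ExtractTitle and ExtractLinks and the sample markup are made up for this example, and the patterns are deliberately simple: they assume double-quoted href attributes and will miss other forms.

```javascript
// Return the text of the first <title> element, or null if there is none.
function ExtractTitle(sPgTxt) {
    var m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(sPgTxt);
    return m ? m[1].replace(/^\s+|\s+$/g, "") : null;   // trim whitespace
}

// Return an array holding every double-quoted href value found in the page.
function ExtractLinks(sPgTxt) {
    var re = /href\s*=\s*"([^"]*)"/gi;
    var asLinks = [];
    var m;
    while ((m = re.exec(sPgTxt)) !== null) {
        asLinks.push(m[1]);
    }
    return asLinks;
}

// Made-up sample text standing in for req.responseText
var sSample = '<html><head><title> Test Page </title></head>' +
              '<body><a href="http://a.example/">A</a> ' +
              '<a href="/b.html">B</a></body></html>';
```

Either function could be called from inside ProcessPage() on the text that was just downloaded.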

Reading a Series of Pages

Things get just a little bit trickier now.  You might first try a synchronous request, which you get by passing false as the third argument to XMLHttpRequest.open().  In that mode, send() does not return control until it has completely read the webpage document.  Just so you can see how that might look:
var gasListOfURLs= 
  new Array("http://www.google.com/", "http://www.yahoo.com/", "http://www.msn.com/");

function DoTest() 
{
    var req = new XMLHttpRequest();
    for (var j=0; j<gasListOfURLs.length; j++ ) {
        var sURL= gasListOfURLs[j]; 
        req.open( "GET", sURL, false ); // false for ***synchronous*** operation
        req.send();                     // <----- control is stuck here until done
        ProcessPage( req.responseText );
        // alert( req.responseText );
    }
}


The problem with this is that the application seems to hang.  The U/I goes dead.  The window does not get updated and buttons are unresponsive until the entire sequence is finished.  So we need to work out a way to request the pages one-at-a-time without that happening.

My solution, and the one I'm describing here, is to use a timer.  Start the first page and use setInterval() so that the TimerProc can check every so often to see if the page is finished.  When it is, you can start the next page, and so forth (rinse and repeat...).

So, replace the DoTest() function in the HTA source file with this code:
var gasListOfURLs=
  new Array("http://www.google.com/", "http://www.yahoo.com/", "http://www.msn.com/");
var gnIdxStart=0;
var gnIdxEnd=  gasListOfURLs.length-1;  // index of the last URL in the list
var gnIdxCurr= gnIdxStart;
var gnTimerID;

function DoTest() {
    gnIdxCurr= gnIdxStart= 0;
    StartNextPage();
}
function StartNextPage() {
    oDivDisplayArea.innerText= "reading...";
    CollectPageInfo( gasListOfURLs[gnIdxCurr] ); 
    gnTimerID= setInterval( TimerProc, 500 );  // twice per second
}
function TimerProc() {
    if ( ! gfPgDone ) {
        oDivDisplayArea.innerText += ".";  // visual feedback
        return;
    }
    //----------- else page is done and has been processed... start next page
    clearInterval( gnTimerID );  // stop this timer; StartNextPage() starts a new one
    gnIdxCurr++;
    if ( gnIdxCurr > gnIdxEnd ) {
        return;  // all pages have been read
    }
    StartNextPage();  
}


With this mechanism in place, all you need to do is define the list of pages and provide your own ProcessPage() function.  Even pages that take a long time to load will be handled correctly.
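
For comparison, here is a sketch of a different approach from the timer polling described above: the completion callback itself starts the next page, so no timer or global flag is needed.  ReadPagesInSequence and FetchWithXhr are hypothetical names for this illustration; the fetch function is passed in as a parameter so that the sequencing logic can be tried out with a stand-in.

```javascript
// Read each URL in turn.  fnFetch( sURL, fnDone ) must call fnDone( sPgTxt )
// exactly once when the page text is available.
function ReadPagesInSequence(asURLs, fnFetch, fnProcess, fnAllDone) {
    var nIdx = 0;
    function StartNext() {
        if (nIdx >= asURLs.length) {
            if (fnAllDone) { fnAllDone(); }
            return;
        }
        var sURL = asURLs[nIdx];
        fnFetch(sURL, function (sPgTxt) {
            fnProcess(sPgTxt, sURL);
            nIdx++;
            StartNext();          // the finished page triggers the next one
        });
    }
    StartNext();
}

// One possible fnFetch for use inside the HTA (assumes XMLHttpRequest exists).
function FetchWithXhr(sURL, fnDone) {
    var req = new XMLHttpRequest();
    req.onreadystatechange = function () {
        if (req.readyState == 4) {
            fnDone(req.responseText);
        }
    };
    req.open("GET", sURL, true);  // true for async operation
    req.send();
}
```

In the HTA you might call ReadPagesInSequence( gasListOfURLs, FetchWithXhr, ProcessPage ).  The timer version has the advantage of giving you a natural place to show progress dots and to check a failsafe deadline.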

What happens if the page never finishes loading?
Because we are using a timer rather than a synchronous download, the HTA always remains responsive.  You can close it to exit or just click the [restart] button.

However, you'll want some way to handle a timeout programmatically.  One option is to add a failsafe in the TimerProc function; for instance use:
var gdtFailSafe= new Date().valueOf()+30000;   // max wait= 30 seconds
...
function TimerProc() {
    var n= new Date().valueOf();
    if ( n > gdtFailSafe ) {
       clearInterval( gnTimerID );  // stop polling so the alert is not repeated
       alert("timed out");
       return;
    }
    ...

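When reading a series of pages, each page probably needs its own deadline rather than one deadline set when the script first loads.  The following is a small sketch of that idea; ArmFailSafe, IsTimedOut and gnMaxWaitMs are hypothetical names, and the current time is passed in as a parameter so that the logic is easy to check.

```javascript
var gnMaxWaitMs = 30000;   // max wait per page = 30 seconds
var gdtFailSafe = 0;       // deadline (in ms) for the page currently loading

// Call this each time a new page is started.
function ArmFailSafe(nNowMs) {
    gdtFailSafe = nNowMs + gnMaxWaitMs;
}

// Call this from the timer procedure; true means the page took too long.
function IsTimedOut(nNowMs) {
    return nNowMs > gdtFailSafe;
}

// Intended use (shown as comments because it depends on the HTA code):
//   in StartNextPage():  ArmFailSafe( new Date().valueOf() );
//   in TimerProc():      if ( IsTimedOut( new Date().valueOf() ) ) {
//                            clearInterval( gnTimerID );
//                            ... report the failure or skip to the next page
//                        }
```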

If you have IE8 or later installed, you can set a timeout in the XMLHttpRequest object itself.  Set the timeout property after calling open() and before calling send().  In that case, when there is a timeout, no usable response text is available (the value is null or empty).
var req = new XMLHttpRequest();
...
    req.open( "GET", sURL, true ); // true for async operation
    req.timeout=30000; // 30 seconds, then call it bad (set after open(), before send())
...
function ProcessPage( sPgTxt ){
    if ( !sPgTxt ) {   // null or empty: nothing usable was received
        // handle the timeout
    }
...

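Here is one way the built-in timeout might be wired up so that the caller always gets exactly one result: the page text on success, or null on a timeout or other failure.  StartRequestWithTimeout is a hypothetical helper, not part of any library.  It takes the request object as a parameter (in the HTA you would pass new XMLHttpRequest()), and it assumes HTTP URLs for which a status of 200 means success.

```javascript
function StartRequestWithTimeout(req, sURL, nTimeoutMs, fnDone) {
    var fReported = false;
    function Report(sPgTxt) {        // make sure fnDone is called only once
        if (fReported) { return; }
        fReported = true;
        fnDone(sPgTxt);
    }
    req.onreadystatechange = function () {
        if (req.readyState !== 4) { return; }
        var fOk = false;
        try { fOk = (req.status === 200); } catch (e) { fOk = false; }
        Report(fOk ? req.responseText : null);
    };
    req.ontimeout = function () { Report(null); };
    req.open("GET", sURL, true);     // true for async operation
    req.timeout = nTimeoutMs;        // set after open() and before send()
    req.send();
}
```

With this in place, ProcessPage( sPgTxt ) can treat a null argument as "this page failed" and move on.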


Notes:

Not a sandbox.  Some of you may think to try this in a <script> block on some HTML page that you serve up from a web host.  Just be aware that in that setting, it will generally only work to access pages in your own domain.  That's a browser security feature (the same-origin policy).
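
To illustrate what "your own domain" means to the browser, the sketch below compares two URLs the way the same-origin rule does: by scheme, host, and port.  GetOrigin and IsSameOrigin are hypothetical helpers written for this example; the parsing is simplified and assumes plain http or https URLs without embedded user names.

```javascript
// Return "scheme://host:port" for an http(s) URL, or null if it cannot be parsed.
function GetOrigin(sURL) {
    var m = /^(https?):\/\/([^\/:?#]+)(?::(\d+))?/i.exec(sURL);
    if (!m) { return null; }
    var sScheme = m[1].toLowerCase();
    var sPort = m[3] || (sScheme === "https" ? "443" : "80");  // default ports
    return sScheme + "://" + m[2].toLowerCase() + ":" + sPort;
}

// True when both URLs share the same scheme, host, and port.
function IsSameOrigin(sPageURL, sTargetURL) {
    var sA = GetOrigin(sPageURL);
    var sB = GetOrigin(sTargetURL);
    return sA !== null && sA === sB;
}
```

A page served from http://www.example.com/ would pass this check for http://www.example.com/data.html, but not for the https version of the same host or for a different host name.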

When used in an HTA or in a Win7/Vista Desktop Gadget, the technique works to access any domain.   That's because such programs are not constrained to stay in a "Play nice, children!" sandbox.

More about the HTA.  Since the HTA is considered a "trusted application" you can instantiate and use common (and often needed) ActiveX objects, such as the FileSystemObject object.  For instance, to save the page's HTML to a disk file:
//------------------------------------ save the HTML to your hard disk
function ProcessPage( sPgTxt ){
    var oFSO= new ActiveXObject("Scripting.FileSystemObject");
    var oTS= oFSO.CreateTextFile("C:\\temp\\File" +gnIdxCurr+ ".html", true );
    oTS.Write( sPgTxt );
    oTS.Close();
}

In my case, I needed to use the ADODB ActiveX object so that I could insert data into a database.  An example of ADODB usage is shown in my HTA article.

Processing the page.  Once you have an HTML page in a string variable, what can you do with it?  See my Browser Bot series for some ideas.  But it really helps to get to know how to use regular expressions (GREP-style pattern matching) in JavaScript.  Here's a snippet as an example:
CollectPageInfo("http://www.experts-exchange.com/A_3432.html" );
...
function ProcessPage( sPgTxt ) {
    var pat= new RegExp('Posted on (.*?)<div' );
    var aMatch= pat.exec(sPgTxt);   // null if the pattern is not found
    if ( aMatch ) {
        var sArticleDate= aMatch[1];
        alert( "Posted on: " + sArticleDate );
    }
}

Summary

Pounding out a few lines of JavaScript can be a heck of a lot easier than writing a full-featured application program, especially to create a small utility for your own needs.  

I needed to download and parse about 3000 webpages and I wanted to be able to try out different options.  An HTA or even a desktop gadget is a good way to go with something like that.  Along the way, I had to puzzle out how to process a sequence of pages, one after another, without locking up the JavaScript processor.

The basic trick is to set up a timer so that you can do the task incrementally.  You start a page loading asynchronously and then have the timer check to see when it is finished.  When it is done, the timer starts the next page load.  It sounds easy, but it certainly is puzzling unless you already know the trick.

Having worked out the technique, I decided to describe what I did in this Experts-Exchange Article.  I hope it helps you!

Comments (2)

Author commented:

Addendum to the article:

The XMLHttpRequest object might not be available if you have an old enough system (if your IE is still in the 6.x range).  It is still possible to accomplish all of the above by using the related ActiveX object that does the same thing.  Check out this great EE article that shows how to instantiate that object on older systems:

   Reading Files Into Your Web Page With JavaScript
   https://www.experts-exchange.com/Programming/Languages/Scripting/JavaScript/A_3327.html

See the getRequestObject()  function in the first code snippet.
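
For readers who just want the general shape of such a fallback, here is a rough sketch.  It is not the code from the article linked above; CreateRequestObject is a hypothetical name, and the global object is passed in as a parameter (use window in an HTA or browser) so that the selection logic can be exercised anywhere.

```javascript
// Return a request object: the native XMLHttpRequest when available,
// otherwise the ActiveX equivalent found on older IE, otherwise null.
function CreateRequestObject(oGlobal) {
    if (typeof oGlobal.XMLHttpRequest !== "undefined") {
        return new oGlobal.XMLHttpRequest();
    }
    if (typeof oGlobal.ActiveXObject !== "undefined") {
        var asProgIDs = ["Msxml2.XMLHTTP", "Microsoft.XMLHTTP"];
        for (var j = 0; j < asProgIDs.length; j++) {
            try {
                return new oGlobal.ActiveXObject(asProgIDs[j]);
            } catch (e) {
                // this ProgID is not installed; try the next one
            }
        }
    }
    return null;   // no way to make HTTP requests in this environment
}
```

In CollectPageInfo(), the line var req = new XMLHttpRequest(); could then become var req = CreateRequestObject( window ); followed by a check for null.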

iGottZd.3 (Administrator) commented:
Luckily, nowadays there are frameworks like jQuery.
http://api.jquery.com/category/ajax/
This makes writing Ajax code easier and even cross-browser compatible.
