Link to home
Start Free TrialLog in
Avatar of c7c4c7
c7c4c7Flag for United States of America

asked on

How do I use VBA to loop through linked HTML pages

I have a database that accesses web pages, using VBA, not mine and I have no control over the web pages.  One of the web page queries has returned a page that has several more linked to it using href's.  One of the hrefs will return the next page until there is no more.  Problem is I have no idea how to use the href to access the next page using VBA.

Please provide specific example on how to do this.

Thanks
Avatar of Rick_Rickards
Rick_Rickards
Flag of United States of America image

There's a great many ways this could be done.  Without knowing exactly what you're starting with I've prepared an example that makes some assumptions.  Hopefully it's close enough to illustrate how to do what you want to do by merely altering the code below to fit your situation.

In this case it is assumed that you've loaded each web page's HTML code into a memo field located in a table named "tblWeb" with a field named "HTML".  Simply pushing the data into a string is enough as long as you have built in a way to loop thru all the pages you're interested in.

First you start off with the entire HTML text found on the web page.  This example I'm going to assume you've loaded the entire HTML text from each page into a single memo field inside an Access Table.  Where the data comes from isn't partifcularly important, I just used the table/recordset sample to illustrate the idea.  

From there we can easily extract the link for every "<a href=" tag found within the the HTML code.  We can do this by using the Split() function to create a string array containing the tag "<a href=".  Naturally there will be a bunch of garbage at the end of each string that is why we clean up each element within the array within a loop.  Whether to start counting at 0 or 1 depends on whether your HTML code started with "<a href=" which is why you see the use of...

Abs(Left(rst(0).Value, 9) <> "<a href=""")

... within the for Next loop.  If it starts that way it will start counting at 0 since the very first thing it finds is the tag you're looking for.  Otherwise it starts at 1 since the first element will contain nothing but garbage data that you're not seeking out.

As we loop thru the array we look for the lclose of the "<a href=" tag which appears as ">

Once found we strip everything from the 2nd character to the location of "> less 2 characters so we don't end up with "> at the end of the string.

When done, the entire array is nothing but a cleaned up list of every link found within the HTML document.

Good luck!!!



.  We drop the first character and then we drop everything that extends beyond the characters ">.

The line that reads...

Debug.Print str(lngCount)

... will spit out each of the web links found withing the HTML document.  

Of course you can get fancy and then pull HTML documents from each web link found and repeat the process until you've reached a page w/o any links and/or one that contains links already examined or otherwise deemed of no importance to you.

Hope this helps and proves to be easier than you might have anticipated.  It is probably worth noting that href links are not always the only type of data people find valuable when mining for data in an HTML document but once you've mastered this technique it isn't hard to see how you could use the same technique to extract any other part you were interested in.

Rick


Sub ExtractingHrefs()
    Dim db As DAO.Database
    Dim rst As DAO.Recordset
    Dim str() As String
    Dim lngCount As Long
    Set db = CurrentDb
    Set rst = db.OpenRecordset("SELECT tblWeb.HTML FROM tblWeb;")
    If rst.RecordCount Then
        Do Until rst.EOF
            str() = Split(rst(0), "<a href=")
            For lngCount = Abs(Left(rst(0).Value, 9) <> "<a href=""") To UBound(str)
                str(lngCount) = (Mid(str(lngCount), 2, (InStr(2, str(lngCount), """>") - 2)))
                Debug.Print str(lngCount)
            Next lngCount
            rst.MoveNext
        Loop
    End If
End Sub

Open in new window

Sorry about the extra text that follows Good Luck!!!.  Seems I lost track of where I was in the message and left text I rewrote in by accident.  In any case just look to everything above the line that reads Good Luck!!! and you should find your answer.  Code supplied to illustrate it step by step.

Rick
Avatar of c7c4c7

ASKER

I think the title of this question has misled you.  Your statement here is what i am trying to solve.

Of course you can get fancy and then pull HTML documents from each web link found and repeat the process until you've reached a page w/o any links and/or one that contains links already examined or otherwise deemed of no importance to you.

The real meat of the question is this portion

I have no idea how to use the href to access the next page using VBA.

Thanks
How are you viewing the web page now?  There is an Internet Exploer Active X control that works really well for displaying and navigating thru web pages, (I know as I use it within my own apps), but before I assume to much, perhaps I should be clear on how you are viewing the web pages now that you wish to then click and/or execute a given href on.

Rick
Avatar of c7c4c7

ASKER

The actual view is through the existing VBA program, that does screen scraping.  The problem I'm trying to get past is how to access the rest of the pages that are referenced via href's.

To answer your question though both IE and Firefox are on the machine though

Thanks for the patience
To be honest I'm not sure I'm following the question exactly.  My current guess is you have a hyperlink you'd like your web browser to open.  Below is a very simple example of how to do that, (obviously it can be modified any number of ways to work in a more automated scenario but I thought a very basic example would be best to illustrate the basic idea.

This example assumes you have a form with a command button on it named cmdHyperlink.  Other than that you just place the following code behind the form's Load Event, (obviously it doesn't have to be a form's load event, it's merely something I arbitrarily used for the example.

The result is, when the form loads it will open your web browser to the web address identified as the .HyperlinkAddress of the command button.  In this case it is http://www.microsoft.com/

Is this what you were looking for or am I still missing something?
Private Sub Form_Load()
    Dim ctl As CommandButton
    Set ctl = Me.cmdHyperlink
    With ctl
        .Visible = False
        .HyperlinkAddress = "http://www.microsoft.com/"
        .Hyperlink.Follow
    End With
End Sub

Open in new window

Avatar of c7c4c7

ASKER

Here is a sample of a href I want to follow until there are no more

<a href="list.asp?catid=3&man=MyValue&PG=2&guid=10d56e80563040ab900a0d2669163072"

My question is what is the VBA code to get the  "<a href" value to return the next page so I can screen scrape it?

Thanks
So is that value one of many values in a table of many rows or is it one long contiguous string of <a href values that need to be parsed out and followed?
Avatar of c7c4c7

ASKER

The href's are not in the database.  
The hrefs are in the html pages that I am pulling the information for the database from
I just need to understand how to code VBA, with an href,  so when I submit it to the internet I get the html page it references
What exactly do you mean when you say... "I just need to understand how to code VBA, with an href,  so when I submit it to the internet I get the html page it references"

Specifically, are you opening the web page in IE, FireFox or some other such browser or for example, opening the web page within an ActiveX Control like the one Windows Provides for IE?

Rick
ASKER CERTIFIED SOLUTION
Avatar of c7c4c7
c7c4c7
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of modus_operandi
modus_operandi

Closed, 500 points refunded.
modus_operandi
EE Moderator