Extracting Links from a Web page

rperies
Hi,

I'm trying to read all the HTML links in a single web page, capturing for each one the URL, the link text displayed (e.g. "Click Here"), and a description.

The page I want to extract links from is:

http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Programming&btnG=Google+Search

Note that the string "url?q" appears in every search result URL.

I would like these to be recorded in a user-defined type, as follows:

Private Type ResultType
    SiteURL As String 'The URL of the search result
    LinkDescription As String 'ie CLICK HERE
    Description As String 'ie Programming help...
End Type

Dim lstResults(50) As ResultType 'An Array of Results

So that I can call results in the following ways:

lstResults(0).SiteURL etc...

How do I get this information into my user-defined type? There are 200 points in it for the answer, and an extra 100 points if I get an answer within 24 hours.

Thanks very much
Private Type ResultType
   SiteURL As String 'The URL of the search result
   LinkDescription As String 'ie CLICK HERE
   Description As String 'ie Programming help...
End Type

Function Extract(ByVal HTML As String) As ResultType()
    Dim lstReturn() As ResultType
    Dim lUBReturn As Long
    Dim lPos1 As Long
    Dim lPos2 As Long
    Dim lNext As Long
   
    lUBReturn = -1
    'Strip line breaks so that no marker is split across two lines
    HTML = Replace(HTML, vbCrLf, vbNullString)
   
    lPos1 = InStr(1, HTML, "<p class=g>")
    Do While lPos1 > 0
        'Get SiteURL; stop as soon as the markup no longer matches
        lPos1 = InStr(lPos1 + 11, HTML, "<a href=") '11 = Len("<p class=g>")
        If lPos1 = 0 Then Exit Do
        lPos2 = InStr(lPos1 + 8, HTML, " target=nw>") '8 = Len("<a href=")
        If lPos2 = 0 Then Exit Do
        'New list element
        lUBReturn = lUBReturn + 1
        ReDim Preserve lstReturn(lUBReturn)
        lstReturn(lUBReturn).SiteURL = Mid$(HTML, lPos1 + 8, lPos2 - lPos1 - 8)
        'Get LinkDescription
        lPos1 = lPos2 + 11 '11 = Len(" target=nw>")
        lPos2 = InStr(lPos1, HTML, "</a>")
        If lPos2 = 0 Then Exit Do
        lstReturn(lUBReturn).LinkDescription = TextOnly(Mid$(HTML, lPos1, lPos2 - lPos1))
        'Locate the next result first, so a description is never taken from a later result
        lNext = InStr(lPos2, HTML, "<p class=g>")
        'Get Description (left empty when this result has none)
        lPos1 = InStr(lPos2, HTML, "<span class=f><font size=-1>Description:")
        If lPos1 > 0 And (lNext = 0 Or lPos1 < lNext) Then
            lPos1 = lPos1 + 40 '40 = Len("<span class=f><font size=-1>Description:")
            lPos2 = InStr(lPos1, HTML, "<span class=f>")
            If lPos2 > 0 Then
                lstReturn(lUBReturn).Description = TextOnly(Mid$(HTML, lPos1, lPos2 - lPos1))
            End If
        End If
       
        'Search Next
        lPos1 = lNext
    Loop

    If lUBReturn > -1 Then
        Extract = lstReturn
    End If
End Function

'Returns HTML with everything between "<" and ">" removed
Function TextOnly(ByVal HTML As String) As String
    Dim lPos1 As Long
    Dim lPos2 As Long
   
    lPos2 = 0
   
    lPos1 = InStr(lPos2 + 1, HTML, "<")
    Do While lPos1 > 0
        'Keep the text between the previous ">" and this "<"
        TextOnly = TextOnly & Mid$(HTML, lPos2 + 1, lPos1 - lPos2 - 1)
        lPos2 = InStr(lPos1 + 1, HTML, ">")
        'An unclosed tag would otherwise loop forever; drop the remainder
        If lPos2 = 0 Then Exit Function
       
        lPos1 = InStr(lPos2 + 1, HTML, "<")
    Loop
    TextOnly = TextOnly & Mid$(HTML, lPos2 + 1)
End Function

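'Test using page source copied to the clipboard
Sub Main()
    Dim lstResults() As ResultType 'An Array of Results
    Dim I As Long
   
    lstResults = Extract(Clipboard.GetText())
    'UBound raises an error when Extract found nothing and returned an unallocated array
    On Error GoTo NoResults
    For I = 0 To UBound(lstResults)
        Debug.Print "#### " & I & " ####"
        Debug.Print lstResults(I).SiteURL
        Debug.Print lstResults(I).LinkDescription
        Debug.Print lstResults(I).Description
    Next
    Exit Sub
NoResults:
    Debug.Print "No results found"
End Sub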

Commented:
Here is a counteroffer ;)
This will print all URLs to the Debug window, and it's a bit shorter.

For it to work you need either a reference to a running Internet Explorer instance or a WebBrowser control (Microsoft Internet Controls) on your form.


---snipp---
Dim myObj As Variant
For Each myObj In WebBrowser1.Document.All
    If myObj.tagName = "A" Then
        Debug.Print myObj.href
    End If
Next
---end snipp---

Author

Commented:
TigerZhao, I appreciate what you've done. However, the code I am looking for should not be specific to Google, but should work on any type of results page. Is this possible?

Commented:
Well, you posted Google as a link ;)

To catch "all links" on a page you would need to use my code snippet...

But since you don't know the structure of the page in advance, it will not be possible to capture the description for each link.
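
What can be captured without knowing the page structure is the URL and the visible link text. Here is a rough sketch of that, assuming a WebBrowser control named WebBrowser1 that has already finished loading the page. ExtractAllLinks and Command1 are made-up names for illustration, and Description is deliberately left empty for the reason given above.

```vb
'Same type as in the question; omit this if it is already declared in the module
Private Type ResultType
    SiteURL As String
    LinkDescription As String
    Description As String
End Type

'Hypothetical helper: fills lstReturn with one entry per link in Doc
'and returns the number of links found (0 leaves lstReturn untouched).
Private Function ExtractAllLinks(ByVal Doc As Object, ByRef lstReturn() As ResultType) As Long
    Dim oLink As Object
    Dim lCount As Long
    Dim I As Long

    'document.links holds every <a href=...> and <area> element of the page
    lCount = Doc.links.length
    If lCount = 0 Then Exit Function

    ReDim lstReturn(lCount - 1)
    For Each oLink In Doc.links
        lstReturn(I).SiteURL = oLink.href
        lstReturn(I).LinkDescription = oLink.innerText 'visible text, tags stripped
        I = I + 1
    Next
    ExtractAllLinks = lCount
End Function

'Example call from a button on the same form (Command1 is an assumed control name)
Private Sub Command1_Click()
    Dim lstResults() As ResultType
    Dim lCount As Long
    Dim I As Long

    lCount = ExtractAllLinks(WebBrowser1.Document, lstResults)
    For I = 0 To lCount - 1
        Debug.Print lstResults(I).SiteURL & " -> " & lstResults(I).LinkDescription
    Next
End Sub
```

Returning the count separately avoids calling UBound on an array that was never dimensioned when the page has no links. Image-only links will come back with an empty LinkDescription, since innerText only returns text.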

Author

Commented:
Thanks Zhao. Badly phrased question on my part. I won't begrudge you the points for your effort.
