Best approaches for reading and parsing from a webpage

Posted on 2006-05-16
Last Modified: 2010-04-07

I've never done this before and don't have any idea where to begin.  There are some websites that have information on them that I want to read and parse and populate a database with.  I am familiar with VB6.

What are the concepts / approaches I should explore given my VB6 bias?
Question by: SAbboushi

    Accepted Solution

    Set a reference to Microsoft Internet Controls (Project > References).

    Option Explicit
    Dim comp2 As Boolean
    Dim sHtml As String
    Dim WithEvents Web1 As InternetExplorer

    Private Sub Form_Load()
      Set Web1 = New InternetExplorer
      Web1.Visible = True
      Web1.Navigate "

      comp2 = False
      ' wait until the page is fully loaded
      Do Until comp2 = True
        DoEvents ' yield so the DocumentComplete event can fire
      Loop
      sHtml = Web1.Document.documentElement.innerHTML
    'sHtml now contains the contents of the web page
    ' then just parse it to find what you want
    End Sub
    Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
      comp2 = True
    End Sub
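
    As a rough sketch of the "just parse it" step, the plain string functions InStr and Mid$ are often enough to pull out the text between two known markers. ExtractBetween is a hypothetical helper name, not part of any library, and the markers you pass in depend entirely on the page you are reading.

    ' Hypothetical helper: returns the text found between sStart and sEnd,
    ' or an empty string if either marker is missing.
    Function ExtractBetween(ByVal sText As String, ByVal sStart As String, _
     ByVal sEnd As String, Optional ByVal lFrom As Long = 1) As String
      Dim lPosStart As Long, lPosEnd As Long
      ' vbTextCompare makes the search case-insensitive, since IE may
      ' return tag names in upper case
      lPosStart = InStr(lFrom, sText, sStart, vbTextCompare)
      If lPosStart = 0 Then Exit Function
      lPosStart = lPosStart + Len(sStart)
      lPosEnd = InStr(lPosStart, sText, sEnd, vbTextCompare)
      If lPosEnd = 0 Then Exit Function
      ExtractBetween = Mid$(sText, lPosStart, lPosEnd - lPosStart)
    End Function

    For example, ExtractBetween(sHtml, "<title>", "</title>") would return the page title if the page has one. This only works well when the markers are unique and stable; for heavily nested markup you would probably want to walk Web1.Document instead.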

    Assisted Solution

    Just to give you a couple of alternatives to using Internet Controls (though there really isn't anything wrong with it), I usually use MSXML2 or the WinINet API.  You could also use the IE object through late binding, though that is just about the same as using Internet Controls.  The three functions below show each approach in that order.


    Function GetWebPage(ByVal vWebSite As String) As String
     Dim oXMLHTTP As Object, vWebText As String, i As Long
     Set oXMLHTTP = CreateObject("msxml2.xmlhttp")
     oXMLHTTP.Open "GET", vWebSite, False
     oXMLHTTP.Send ' synchronous request: returns once the response has arrived
     If (oXMLHTTP.readyState = 4) And (oXMLHTTP.Status = 200) Then
      vWebText = oXMLHTTP.ResponseText
      ' decode the common HTML entities
      vWebText = Replace(vWebText, "&quot;", Chr(34))
      vWebText = Replace(vWebText, "&lt;", Chr(60))
      vWebText = Replace(vWebText, "&gt;", Chr(62))
      vWebText = Replace(vWebText, "&nbsp;", Chr(32))
      For i = 1 To 255
       vWebText = Replace(vWebText, "&#" & i & ";", Chr(i))
      Next
      ' decode &amp; last so text such as "&amp;nbsp;" is not decoded twice
      vWebText = Replace(vWebText, "&amp;", Chr(38))
     End If
     GetWebPage = vWebText
     Set oXMLHTTP = Nothing
    End Function


    Private Const INTERNET_OPEN_TYPE_PRECONFIG = 0 ' use the proxy settings from the registry
    Private Const INTERNET_OPEN_TYPE_PROXY = 3
    Private Const scUserAgent = "VB Project"
    Private Const INTERNET_FLAG_RELOAD = &H80000000
    Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" (ByVal _
     sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, ByVal _
     sProxyBypass As String, ByVal lFlags As Long) As Long
    Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" _
     (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, ByVal lLength _
     As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
    Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal _
     sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long
    Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) _
     As Long
    Private Declare Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" ( _
     ByVal pCaller As Long, ByVal szURL As String, ByVal szFilename As String, ByVal _
     dwReserved As Long, ByVal lpfnCB As Long) As Long
    Function OpenURL(ByVal sUrl As String) As String
     Dim hOpen As Long, hOpenUrl As Long, lNumberOfBytesRead As Long, i As Long
     Dim bDoLoop As Boolean, bRet As Boolean
     Dim sReadBuffer As String * 2048, sBuffer As String
     hOpen = InternetOpen(scUserAgent, INTERNET_OPEN_TYPE_PRECONFIG, vbNullString, _
      vbNullString, 0)
     hOpenUrl = InternetOpenUrl(hOpen, sUrl, vbNullString, 0, INTERNET_FLAG_RELOAD, 0)
     bDoLoop = True
     ' read the response in 2048-byte chunks until no more bytes come back
     While bDoLoop
      sReadBuffer = vbNullString
      bRet = InternetReadFile(hOpenUrl, sReadBuffer, Len(sReadBuffer), lNumberOfBytesRead)
      sBuffer = sBuffer & Left$(sReadBuffer, lNumberOfBytesRead)
      If Not CBool(lNumberOfBytesRead) Then bDoLoop = False
     Wend
     If hOpenUrl <> 0 Then InternetCloseHandle (hOpenUrl)
     If hOpen <> 0 Then InternetCloseHandle (hOpen)
     ' decode the common HTML entities
     sBuffer = Replace(sBuffer, "&quot;", Chr(34))
     sBuffer = Replace(sBuffer, "&lt;", Chr(60))
     sBuffer = Replace(sBuffer, "&gt;", Chr(62))
     sBuffer = Replace(sBuffer, "&nbsp;", Chr(32))
     For i = 1 To 255
      sBuffer = Replace(sBuffer, "&#" & i & ";", Chr(i))
     Next
     ' decode &amp; last so text such as "&amp;nbsp;" is not decoded twice
     sBuffer = Replace(sBuffer, "&amp;", Chr(38))
     OpenURL = sBuffer

    End Function


    Function GetWebIE(ByVal vWebSite As String) As String
     Dim IE As Object
     Set IE = CreateObject("internetexplorer.application")
     IE.Navigate2 vWebSite
     Do While IE.readyState <> 4 'READYSTATE_COMPLETE
      DoEvents ' keep the application responsive while the page loads
     Loop
     GetWebIE = IE.Document.Body.InnerHTML 'could also be .InnerText
     IE.Quit ' close the hidden browser instance so it does not linger
     Set IE = Nothing
    End Function

    I suggest running some speed tests to see which works best for you and the site(s) you'll be parsing.
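
    One possible way to run such a speed test is sketched here. It assumes the three functions above (GetWebPage, OpenURL and GetWebIE) live in the same project; CompareSpeed is a hypothetical name, and sUrl is whatever page you plan to parse. GetTickCount is the standard kernel32 call and its Declare line belongs in the declarations section of the module.

    Private Declare Function GetTickCount Lib "kernel32" () As Long

    ' Hypothetical test routine: times each download method once and
    ' prints the elapsed milliseconds to the Immediate window.
    Sub CompareSpeed(ByVal sUrl As String)
     Dim lStart As Long, s As String
     lStart = GetTickCount
     s = GetWebPage(sUrl)
     Debug.Print "msxml2:  " & (GetTickCount - lStart) & " ms, " & Len(s) & " chars"
     lStart = GetTickCount
     s = OpenURL(sUrl)
     Debug.Print "wininet: " & (GetTickCount - lStart) & " ms, " & Len(s) & " chars"
     lStart = GetTickCount
     s = GetWebIE(sUrl)
     Debug.Print "IE:      " & (GetTickCount - lStart) & " ms, " & Len(s) & " chars"
    End Sub

    A single run can be skewed by caching and network conditions, so it is probably worth calling this several times and comparing typical values rather than one reading.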

    Expert Comment

    You have two problems to solve: (a) how to get the data, and (b) how to decode the response.

    You already have some examples for problem "a", but here is a simple way.  You need to add the Microsoft Internet Transfer Control to the project (Project > Components) and then place it on a form, where it gets the default name Inet1.

    Function GetWebPage(psURL As String) As String
    On Error Resume Next
    GetWebPage = Inet1.OpenURL(psURL)
    End Function

    For problem "b" I would decode the data as a byte array rather than with string functions.  Comparing numeric byte values is generally much faster than repeated string operations.

    Your decode function needs to be a little clever, as tags can be nested.  Here is an extract from a class I created to decode XML files into a sort of recordset object.

    To convert from a string to a byte array:

    bytData = StrConv(psXML, vbFromUnicode)

    Then just loop through the array looking for control characters like < > & etc.  But set them up as numerics first:

    mlLT = Asc("<")
    mlGT = Asc(">")

    Select Case bytData(lCount)
        Case mlLT
              ' handle less-than
        Case mlGT
              ' handle greater-than
    End Select
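
    To make the byte-array idea concrete, here is a minimal sketch that strips everything between < and > and returns the remaining text. StripTags is a hypothetical helper, not the class mentioned above, and it assumes a single-byte ANSI code page; it does not handle entities, comments or script blocks.

    ' Hypothetical helper: copies every byte that is outside a tag into
    ' an output array, then converts that array back to a string.
    Function StripTags(ByVal sHtml As String) As String
      Dim bytData() As Byte, bytOut() As Byte
      Dim lCount As Long, lOut As Long, bInTag As Boolean
      Dim mlLT As Long, mlGT As Long
      If Len(sHtml) = 0 Then Exit Function
      mlLT = Asc("<")
      mlGT = Asc(">")
      bytData = StrConv(sHtml, vbFromUnicode)
      ReDim bytOut(UBound(bytData))
      For lCount = 0 To UBound(bytData)
        Select Case bytData(lCount)
          Case mlLT
            bInTag = True
          Case mlGT
            bInTag = False
          Case Else
            If Not bInTag Then
              bytOut(lOut) = bytData(lCount)
              lOut = lOut + 1
            End If
        End Select
      Next
      If lOut > 0 Then
        ReDim Preserve bytOut(lOut - 1)
        StripTags = StrConv(bytOut, vbUnicode)
      End If
    End Function

    Writing into a pre-sized byte array avoids building the result through repeated string concatenation, which is where much of the time usually goes on large pages.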


    Author Comment

    Hi folks - thanks for the posts.  I will review and get back to you.
