Best approaches for reading and parsing from a webpage


I've never done this before and don't have any idea where to begin.  There are some websites that have information on them that I want to read and parse and populate a database with.  I am familiar with VB6.

What are the concepts / approaches I should explore given my VB6 bias?
Set a reference to Microsoft Internet Controls.

Option Explicit
Dim comp2 As Boolean
Dim sHtml As String
Dim WithEvents Web1 As InternetExplorer

Private Sub Form_Load()
  Set Web1 = New InternetExplorer
  Web1.Visible = True
  Web1.Navigate "

  comp2 = False
  ' wait until the page is fully loaded
  Do Until comp2 = True
    DoEvents ' let the DocumentComplete event fire
  Loop

  sHtml = Web1.Document.documentElement.innerHTML
'sHtml now contains the contents of the web page
' then just parse it to find what you want
End Sub
Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
  comp2 = True
End Sub
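
For the "just parse it" step, one simple approach is to search for the text on either side of the value you want, using InStr and Mid$. This is only a minimal sketch: the function name GetBetween and the marker strings are made up for illustration, and it assumes the markers appear literally in the page source.

```vb
' Hypothetical helper: returns the text found between sStart and sEnd
' in sSource, or an empty string if either marker is missing.
Public Function GetBetween(ByVal sSource As String, ByVal sStart As String, _
  ByVal sEnd As String) As String
  Dim lFrom As Long, lTo As Long

  lFrom = InStr(1, sSource, sStart, vbTextCompare)
  If lFrom = 0 Then Exit Function
  lFrom = lFrom + Len(sStart)              ' first character after the start marker

  lTo = InStr(lFrom, sSource, sEnd, vbTextCompare)
  If lTo = 0 Then Exit Function

  GetBetween = Mid$(sSource, lFrom, lTo - lFrom)
End Function
```

For example, GetBetween("<title>My Page</title>", "<title>", "</title>") returns "My Page". On a real page you would call it with sHtml and whatever markers surround the data you are after.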
Just to give you a couple of alternatives to using Internet Controls (though there really isn't anything wrong with it), I usually use MSXML2 or the WinINet API.  You could also use the IE object, though it is just about the same as using Internet Controls.


Function GetWebPage(ByVal vWebSite As String) As String
 Dim oXMLHTTP As Object, vWebText As String, i As Long
 Set oXMLHTTP = CreateObject("msxml2.xmlhttp")
 oXMLHTTP.Open "GET", vWebSite, False
 oXMLHTTP.Send 'synchronous request: returns once the response has arrived
 If (oXMLHTTP.readyState = 4) And (oXMLHTTP.Status = 200) Then
  vWebText = oXMLHTTP.ResponseText
  'decode the common HTML entities
  vWebText = Replace(vWebText, "&quot;", Chr(34))
  vWebText = Replace(vWebText, "&lt;", Chr(60))
  vWebText = Replace(vWebText, "&gt;", Chr(62))
  vWebText = Replace(vWebText, "&amp;", Chr(38))
  vWebText = Replace(vWebText, "&nbsp;", Chr(32))
  For i = 1 To 255
   vWebText = Replace(vWebText, "&#" & i & ";", Chr(i))
  Next
 End If
 GetWebPage = vWebText
 Set oXMLHTTP = Nothing
End Function


Private Const scUserAgent = "VB Project"
Private Const INTERNET_OPEN_TYPE_PRECONFIG = 0
Private Const INTERNET_FLAG_RELOAD = &H80000000
Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" (ByVal _
 sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, ByVal _
 sProxyBypass As String, ByVal lFlags As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" _
 (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, ByVal lLength _
 As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" (ByVal hFile As Long, ByVal _
 sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) _
 As Long
Private Declare Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" ( _
 ByVal pCaller As Long, ByVal szURL As String, ByVal szFilename As String, ByVal _
 dwReserved As Long, ByVal lpfnCB As Long) As Long
Function OpenURL(ByVal sUrl As String) As String
 Dim hOpen As Long, hOpenUrl As Long, lNumberOfBytesRead As Long, i As Long
 Dim bDoLoop As Boolean, bRet As Boolean
 Dim sReadBuffer As String * 2048, sBuffer As String
 hOpen = InternetOpen(scUserAgent, INTERNET_OPEN_TYPE_PRECONFIG, vbNullString, _
  vbNullString, 0)
 hOpenUrl = InternetOpenUrl(hOpen, sUrl, vbNullString, 0, INTERNET_FLAG_RELOAD, 0)
 bDoLoop = True
 While bDoLoop
  sReadBuffer = vbNullString
  bRet = InternetReadFile(hOpenUrl, sReadBuffer, Len(sReadBuffer), lNumberOfBytesRead)
  sBuffer = sBuffer & Left$(sReadBuffer, lNumberOfBytesRead)
  If Not CBool(lNumberOfBytesRead) Then bDoLoop = False
 Wend
 If hOpenUrl <> 0 Then InternetCloseHandle (hOpenUrl)
 If hOpen <> 0 Then InternetCloseHandle (hOpen)
 'decode the common HTML entities
 sBuffer = Replace(sBuffer, "&quot;", Chr(34))
 sBuffer = Replace(sBuffer, "&lt;", Chr(60))
 sBuffer = Replace(sBuffer, "&gt;", Chr(62))
 sBuffer = Replace(sBuffer, "&amp;", Chr(38))
 sBuffer = Replace(sBuffer, "&nbsp;", Chr(32))
 For i = 1 To 255
  sBuffer = Replace(sBuffer, "&#" & i & ";", Chr(i))
 Next
 OpenURL = sBuffer

End Function


Function GetWebIE(ByVal vWebSite As String) As String
 Dim IE As Object
 Set IE = CreateObject("internetexplorer.application")
 IE.Navigate2 vWebSite
 Do While IE.readyState <> 4 'READYSTATE_COMPLETE
  DoEvents
 Loop
 GetWebIE = IE.Document.Body.InnerHTML 'could also be .InnerText
 IE.Quit 'close the hidden browser instance
 Set IE = Nothing
End Function

I would say to run some speed tests to see which works best for you and the site(s) you'll be parsing.
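
A rough way to run such a speed test is to wrap each call with the built-in Timer function. This is only a sketch: the sub name CompareFetchSpeed is made up, it assumes the three functions above (GetWebPage, OpenURL and GetWebIE) are in the same project, and a single run per method is not a reliable benchmark, since network latency and caching will vary from call to call.

```vb
' Rough timing sketch. Timer returns the seconds elapsed since midnight
' as a Single, so a run that spans midnight would give a negative result.
Public Sub CompareFetchSpeed(ByVal sUrl As String)
  Dim sngStart As Single, sPage As String

  sngStart = Timer
  sPage = GetWebPage(sUrl)
  Debug.Print "msxml2:  "; Format$(Timer - sngStart, "0.000"); " s,"; Len(sPage); "chars"

  sngStart = Timer
  sPage = OpenURL(sUrl)
  Debug.Print "wininet: "; Format$(Timer - sngStart, "0.000"); " s,"; Len(sPage); "chars"

  sngStart = Timer
  sPage = GetWebIE(sUrl)
  Debug.Print "IE:      "; Format$(Timer - sngStart, "0.000"); " s,"; Len(sPage); "chars"
End Sub
```

Calling it several times for each site and comparing the figures in the Immediate window should give a fairer picture than one pass.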
You have two problems to solve: (a) how to get the data, and (b) how to decode the response.

You already have some examples for problem (a), but here is a simple way. Add the Microsoft Internet Transfer Control component to the project, then place it on a form (the code below uses the control name Inet1).

Function GetWebPage(psURL As String) As String
On Error Resume Next
GetWebPage = Inet1.OpenURL(psURL)
End Function

For problem (b) I would decode the data as a byte array rather than with string functions. This can be very much faster, because numeric comparisons run many times faster than string operations.

Your decode function needs to be a little clever, as tags can be nested.  Here is an extract from a class I created to decode XML files into a sort of recordset object.

To convert from a string to a byte array:

bytData = StrConv(psXML, vbFromUnicode)

Then just loop through the array looking for control characters like <, >, & etc., but set them up as numerics first:

mlLT = Asc("<")
mlGT = Asc(">")

Select Case bytData(lCount)
    Case mlLT
          ' handle less-than
End Select
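
To show how those pieces might fit together, here is a minimal sketch that strips tags from a string by scanning a byte array. It is not the class mentioned above: the function name StripTags and the local variable names are made up, it does not decode entities, and it treats every "<" as the start of a tag and every ">" as the end of one.

```vb
' Hypothetical example: removes everything between "<" and ">" by
' scanning a byte array instead of using string functions.
Public Function StripTags(ByVal psHtml As String) As String
  Dim bytData() As Byte, bytOut() As Byte
  Dim lCount As Long, lOut As Long
  Dim lLT As Long, lGT As Long
  Dim bInTag As Boolean

  If Len(psHtml) = 0 Then Exit Function
  lLT = Asc("<")
  lGT = Asc(">")
  bytData = StrConv(psHtml, vbFromUnicode)    ' convert to ANSI bytes
  ReDim bytOut(0 To UBound(bytData))          ' output is never longer than input

  For lCount = 0 To UBound(bytData)
    Select Case bytData(lCount)
      Case lLT
        bInTag = True                         ' entering a tag: stop copying
      Case lGT
        bInTag = False                        ' leaving a tag: ">" itself is dropped
      Case Else
        If Not bInTag Then
          bytOut(lOut) = bytData(lCount)      ' copy text that is outside tags
          lOut = lOut + 1
        End If
    End Select
  Next

  If lOut = 0 Then Exit Function
  ReDim Preserve bytOut(0 To lOut - 1)
  StripTags = StrConv(bytOut, vbUnicode)      ' back to a normal VB string
End Function
```

For example, StripTags("<b>Hello</b> world") returns "Hello world". A real decoder would replace the bInTag flag with whatever state it needs to track nested elements.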

SAbboushi (Author) commented:
Hi folks - thanks for the posts.  I will review and get back to you-
Question has a verified solution.
