Question

Using VB and the MSHTML Object Model to parse data from an HTML document

Asked by: pique_tech

I receive periodic HTML emails detailing real estate property listings that have recently been added or changed.  These emails include several relevant data elements, such as address and price, and are system-generated, so they are standardized.  They take the form of
<HTML>
<STYLE>
...
</STYLE>
<BODY>
<TABLE>                                       A table used primarily for formatting appearance, nothing important in it
...
</TABLE>
<TABLE>                                       A table used primarily for formatting appearance, nothing important in it
...
</TABLE>
<TABLE>                                         These are the headings for the data table columns
<TR>
     <TD>Listing #</TD>
     <TD>Status</TD>
     <TD>Price</TD>
     <TD>Address</TD>
     <TD>Cross Street</TD>
     <TD>Area</TD>
     <TD>Type</TD>
     <TD>BD</TD>
     <TD>BA</TD>
     <TD>Sq Ft</TD>
</TR>
<TR>                                               This is the data; there may be one or many <TR> elements like this, one for each listing
     <TD>123456</TD>
     <TD>Active</TD>
     <TD>100000</TD>
     <TD>123 Main St</TD>
     <TD>Market</TD>
     <TD>Downtown</TD>
     <TD>MFM2-4</TD>
     <TD>2</TD>
     <TD>1.5</TD>
     <TD>2560</TD>
</TR>
</TABLE>
<TABLE>                                       A table used primarily for formatting appearance, nothing important in it
...
</TABLE>
<TABLE>                                       A table used primarily for formatting appearance, nothing important in it
...
</TABLE>
</BODY>
</HTML>

So, the issue is: how can I use the MSHTML Object Model to extract the data from the table that actually holds the listings?  I cannot figure out how to identify and work with a particular table through the object library.

I've come up with a pretty kludgey approach:  since these emails are automatically generated, the first N lines before the first data table are always the same, and the last M lines after the last data table are also always the same.  So using the FileSystemObject TextStream object, I can skip the first N lines and drop the last M lines, which leaves exactly the HTML data tables I care about.  I can then use that shortened HTML to instantiate a much smaller object which I think I know how to manipulate.  But I'm accustomed to the MSXML object model, where you can actually walk the nodes and find what you want, and I'm hoping that MSHTML offers something similar that I just haven't found yet.

Any advice anyone has about how to do this using only the object model would be greatly appreciated.  If it is not possible, confirmation would be great.  If anyone has an alternate approach to either the object model or the parsing approach, I'm all (virtual) ears.

The goal is to create a searchable database for this data.  I have over 10 months' worth of emails and finding a particular property I know I saw listed sometime in the first quarter is currently a pretty tedious (and not always successful) process.    

This question has been solved and verified by the asker.


Asked on: 2005-06-15 at 14:18:32
ID: 21459594
Tags: mshtml, vb
Topic: Visual Basic Programming
Participating experts: 2
Points: 500
Comments: 3


Answers

 

by: wesbird, posted on 2005-06-17 at 08:49:01 (ID: 14242172)

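Here's an example from something similar in Access VBA, which should help you figure out the DOM:

You must remember to add both shdocvw.dll (Microsoft Internet Controls, for the WebBrowser control) and mshtml.tlb (Microsoft HTML Object Library, for HTMLDocument and the IHTML* interfaces) to your project references/components.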

****************************

Option Compare Database

Dim WithEvents doc As HTMLDocument
Dim WithEvents win As HTMLWindow2

Dim cmd As ADODB.Command
Dim rs As ADODB.Recordset

Dim id As Long
Dim fn As Long

Function StartsLike(strRef, strTest) As Boolean
    If Left(strTest, Len(strRef)) = strRef Then
        StartsLike = True
    Else
        StartsLike = False
    End If
End Function

Private Sub Command1_Click()
   

    WebBrowser1.Navigate2 "http://www.yoursite.com/page.htm"
   
End Sub

Private Sub WebBrowser1_DownloadBegin()
    On Error Resume Next
    ' Hook the document and its window so that their events reach this module
    Set doc = WebBrowser1.Document
    Set win = doc.parentWindow   ' was htm.parentWindow, but htm was never assigned
End Sub


Private Sub WebBrowser1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim hi As Object        ' late-bound: nodeName and nextSibling are not members of IHTMLElement
    Dim bKeep As Boolean    ' True while the following TD cells belong to an address block

    On Error Resume Next

    Set doc = WebBrowser1.Document
    ' The original created cmd2 here but executed the module-level cmd,
    ' which was never instantiated
    Set cmd = New ADODB.Command
    Set cmd.ActiveConnection = CurrentProject.Connection

    ' Walk every element under BODY in document order
    For Each hi In doc.body.all
        ' While bKeep is set, store each non-empty cell as Addr1, Addr2, ...
        If bKeep Then
            If hi.nodeName = "TD" Then
                If Trim(hi.innerText) <> "" Then
                    fn = fn + 1
                    If fn = 1 Then
                        cmd.CommandText = "INSERT INTO Addr (ID, Addr1 ) " & _
                                "VALUES( " & CStr(id) & ", " & Chr(34) & Trim(hi.innerText) & Chr(34) & ")"
                    Else
                        cmd.CommandText = "UPDATE Addr SET Addr" & CStr(fn) & " = " & Chr(34) & Trim(hi.innerText) & Chr(34) & " WHERE ID = " & CStr(id)
                    End If
                    cmd.Execute
                End If
            End If
        End If

        ' A cell starting with "Name" marks the start of an address block
        If hi.nodeName = "TD" Then
            If StartsLike("Name", hi.innerText) Then
                Debug.Print hi.nextSibling.innerText
                bKeep = True
                fn = 0
            End If
        End If

        ' A paragraph starting with "Organization" marks the end of the block
        If hi.nodeName = "P" Then
            If StartsLike("Organization", hi.innerText) Then
                Debug.Print
                bKeep = False
            End If
        End If
    Next hi

    ' Move on to the next page; rs is assumed to be opened elsewhere (not shown here)
    If Not rs.EOF Then
        rs.MoveNext
        WebBrowser1.Navigate2 "http://nextsite/nextpage.htm"
        id = id + 1
    End If

    Set cmd = Nothing

End Sub
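
A side note on the SQL in the code above: it wraps cell text in Chr(34) and concatenates it into the statement, so a cell that itself contains a double quote would break the statement. A minimal sketch of the same kind of insert using ADO parameters follows. It is untested against the asker's data; the table and column names (Listings, ListingNo, Status, Address) and the routine name InsertListing are made up for illustration, and it assumes the same ADO reference the ADODB.Command declarations above already need.

```vb
' Inserts one listing using ADO parameters instead of string concatenation.
' Table and column names (Listings, ListingNo, Status, Address) are hypothetical.
Public Sub InsertListing(ByVal cn As ADODB.Connection, _
                         ByVal sListingNo As String, _
                         ByVal sStatus As String, _
                         ByVal sAddress As String)
    Dim cmd As ADODB.Command

    Set cmd = New ADODB.Command
    Set cmd.ActiveConnection = cn
    cmd.CommandType = adCmdText
    cmd.CommandText = "INSERT INTO Listings (ListingNo, Status, Address) VALUES (?, ?, ?)"

    ' Parameters are matched to the ? placeholders by position, not by name.
    ' The sizes (20, 20, 255) are assumed column widths.
    cmd.Parameters.Append cmd.CreateParameter("ListingNo", adVarWChar, adParamInput, 20, sListingNo)
    cmd.Parameters.Append cmd.CreateParameter("Status", adVarWChar, adParamInput, 20, sStatus)
    cmd.Parameters.Append cmd.CreateParameter("Address", adVarWChar, adParamInput, 255, sAddress)

    ' No recordset is needed for an INSERT
    cmd.Execute , , adExecuteNoRecords
End Sub
```

In Access it could be called as InsertListing CurrentProject.Connection, "123456", "Active", "123 Main St" with the values taken from the sample row in the question.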

 

by: Preece, posted on 2005-06-20 at 15:23:39 (ID: 14261275)

With some tweaking, this may help:

    sResult = gfGetHTMLTableCellVal(WebBrowser1, 1, 1, 1, 1, True, sSearchString)


Public Function gfGetHTMLTableCellVal(webX As WebBrowser, lRow As Long, lCol As Long, lMidStart As Long, lMidLen As Long, bInStr As Boolean, sSearchString As String) As String
    Dim Tbl As HTMLTable
    Dim trTD As HTMLTableCell

    ' Iterate only the TABLE elements. Assigning every element of Document.All
    ' to an HTMLTable variable raises a type mismatch on non-table elements.
    For Each Tbl In webX.Document.getElementsByTagName("TABLE")
        If Tbl.rows.length > 0 Then
            If bInStr Then
                ' Substring mode: look for sSearchString anywhere in the table
                If InStr(1, UCase(Tbl.innerText), UCase(sSearchString)) > 0 Then
                    ' Visit only TD cells (Tbl.All also contains TBODY and TR elements);
                    ' the last matching cell wins and is scrolled into view
                    For Each trTD In Tbl.getElementsByTagName("TD")
                        If InStr(1, UCase(trTD.innerText), UCase(sSearchString)) > 0 Then
                            gfGetHTMLTableCellVal = trTD.innerText
                            trTD.scrollIntoView (trTD.scrollTop)
                        End If
                    Next
                    Exit For
                End If
            Else
                ' Positional mode: compare part of the cell at (lRow, lCol), both zero-based
                If UCase(Mid(Tbl.rows(lRow).cells(lCol).innerText, lMidStart, lMidLen)) = UCase(sSearchString) Then
                    gfGetHTMLTableCellVal = Tbl.rows(lRow).cells(lCol).innerText
                    Exit For
                End If
            End If
        End If
    Next

End Function
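
To get closer to the original question of identifying one particular table, a table can also be picked out by its header row rather than by position. The following is a minimal sketch under stated assumptions, not a tested solution: it assumes the data table is the one whose first cell reads "Listing #" (taken from the sample in the question), and the names FindListingTable and DumpListings are made up for illustration. It is late-bound, so it should accept any MSHTML document object, such as WebBrowser1.Document.

```vb
' Returns the first TABLE whose first cell reads "Listing #", or Nothing.
' doc is any MSHTML document object, e.g. WebBrowser1.Document.
Public Function FindListingTable(ByVal doc As Object) As Object
    Dim tbl As Object

    For Each tbl In doc.getElementsByTagName("TABLE")
        If tbl.rows.length > 0 Then
            If tbl.rows.Item(0).cells.length > 0 Then
                If Trim$(tbl.rows.Item(0).cells.Item(0).innerText) = "Listing #" Then
                    Set FindListingTable = tbl
                    Exit Function
                End If
            End If
        End If
    Next tbl

    Set FindListingTable = Nothing
End Function

' Prints every data row (skipping the header row) as a "|"-separated line.
Public Sub DumpListings(ByVal doc As Object)
    Dim tbl As Object
    Dim r As Long
    Dim c As Long
    Dim sLine As String

    Set tbl = FindListingTable(doc)
    If tbl Is Nothing Then Exit Sub

    For r = 1 To tbl.rows.length - 1              ' row 0 holds the headings
        sLine = ""
        For c = 0 To tbl.rows.Item(r).cells.length - 1
            If c > 0 Then sLine = sLine & "|"
            sLine = sLine & Trim$(tbl.rows.Item(r).cells.Item(c).innerText)
        Next c
        Debug.Print sLine
    Next r
End Sub
```

Matching on the header text rather than on a table index means the layout-only tables before and after the data table can change in number without affecting the lookup.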

 

by: pique_tech, posted on 2005-09-15 at 13:29:27 (ID: 14893180)

Thanks for your input.  I couldn't find a direct answer to the question "is it possible to walk the document, and if so, how?" in either response, but I got lots of good pointers about how to approach the MSHTML object model through code.
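
For emails that have already been saved to disk, an MSHTML document can also be built without a WebBrowser control and then walked with calls such as getElementsByTagName. The sketch below is one possible way to do that, under the assumption that each email body is stored as a plain HTML file; the function name LoadHtmlFile and the sample path are hypothetical. It is late-bound, so it needs no project references.

```vb
' Reads a saved HTML file and parses it into an in-memory MSHTML document.
Public Function LoadHtmlFile(ByVal sPath As String) As Object
    Dim fso As Object
    Dim ts As Object
    Dim sHtml As String
    Dim doc As Object

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(sPath, 1)          ' 1 = ForReading
    If Not ts.AtEndOfStream Then sHtml = ts.ReadAll
    ts.Close

    Set doc = CreateObject("htmlfile")           ' ProgID of the MSHTML document class
    doc.write sHtml
    doc.Close

    Set LoadHtmlFile = doc
End Function

' Example: count the tables in one saved email (the path is hypothetical).
Public Sub ShowTableCount()
    Dim doc As Object

    Set doc = LoadHtmlFile("C:\Listings\sample.htm")
    Debug.Print doc.getElementsByTagName("TABLE").length
End Sub
```

Because the whole file is parsed, there is no need to cut away the first N and last M lines before looking for the data table.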
