?
Solved

Using MSXML2 to import Web Table Data into Excel with VBA

Posted on 2014-12-02
15
Medium Priority
?
2,335 Views
Last Modified: 2014-12-04
I need help to use the MSXML2 object to load xml data into Excel.

Here is the website with data preloaded (the Account number - "1234" is the only variable):

http://ec.europa.eu/environment/ets/transaction.do;?languageCode=en&startDate=&endDate=&transactionStatus=4&fromCompletionDate=&toCompletionDate=&transactionID=&transactionType=-1&suppTransactionType=-1&originatingRegistry=-1&destinationRegistry=-1&originatingAccountType=-1&destinationAccountType=-1&originatingAccountNumber=1234&destinationAccountNumber=&originatingAccountIdentifier=&destinationAccountIdentifier=&originatingAccountHolder=&destinationAccountHolder=&search=Search¤tSortSettings=&resultList.currentPageNumber=1

I just need to be able to scrape the very top row of the table into the first row of Sheet1 in excel using VBA.
Capture.JPG
 

With the help of experts exchange ( see Excel VBA Web Data Scraping from a Table ). I can get the table data into excel using Internet Explorer object or TableObject but its very slow. I need to run the query many time with the range of account number 1-15,000 which will take weeks with the current methods.

How do I get the top row of the table data on the web into Excel row 1 on Sheet1 using the MSXML2 object? I am hoping this will be faster,
0
Comment
Question by:mikecvba
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 5
  • 4
15 Comments
 
LVL 46

Accepted Solution

by:
aikimark earned 2000 total points
ID: 40476396
Add this code to a module in your workbook.
Option Explicit

Function GetPage(ByVal parmURL As String) As String
    Static oXMLHTTP As Object
    
    If oXMLHTTP Is Nothing Then
        Set oXMLHTTP = CreateObject("MSXML2.XMLHTTP")
    End If
    
    oXMLHTTP.Open "GET", parmURL, False
    oXMLHTTP.Send
    
    Do Until oXMLHTTP.ReadyState = 4
        DoEvents
    Loop
    
    If oXMLHTTP.Status = 200 Then
        GetPage = oXMLHTTP.responsetext
    Else
        GetPage = vbNullString
    End If
End Function

Function Q_28572779(ByVal parmAccount As String, Optional parmIncludeHeaderRow As Boolean = False) As Variant
    'Returns array of table values
    Dim strHTML As String
    Dim oRE As Object
    Dim oMatches As Object
    Dim oM As Object
    Dim oSM As Object
    Dim lngSM As Long

    Dim lngRow As Long
    Dim lngPage As Long
    Dim lngPageCount As Long
    Dim vData As Variant
    Dim vHeader As Variant
    Dim colData As New Collection
    Dim vItem As Variant
    
    Set oRE = CreateObject("vbscript.regexp")
    oRE.Global = True

    strHTML = GetPage("http://ec.europa.eu/environment/ets/transaction.do?languageCode=en&startDate=&endDate=&transactionStatus=4&fromCompletionDate=&toCompletionDate=&transactionID=&transactionType=-1&suppTransactionType=-1&originatingRegistry=-1&destinationRegistry=-1&originatingAccountType=-1&destinationAccountType=-1&originatingAccountNumber=" & parmAccount & "&destinationAccountNumber=&originatingAccountIdentifier=&destinationAccountIdentifier=&originatingAccountHolder=&destinationAccountHolder=&currentSortSettings=&resultList.currentPageNumber=0&nextList=Next>")
    'parse the result
    oRE.Pattern = """resultList.lastPageNumber""[^>]*value=""(\d+)"""
    If oRE.test(strHTML) Then
        Set oMatches = oRE.Execute(strHTML)
        lngPageCount = oMatches(0).submatches(0)
    End If
    If parmIncludeHeaderRow Then
        oRE.Pattern = "<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*"
        If oRE.test(strHTML) Then
            Set oMatches = oRE.Execute(strHTML)
            colData.Add oMatches(0)
        End If
    End If
    
    lngRow = 1
    'now for the data
    oRE.Pattern = "<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*"
    For lngPage = 1 To lngPageCount
        If oRE.test(strHTML) Then
            Set oMatches = oRE.Execute(strHTML)
            For Each oM In oMatches
                With oM
                    colData.Add oM
                End With
                lngRow = lngRow + 1
            Next
        End If
        
        strHTML = GetPage("http://ec.europa.eu/environment/ets/transaction.do?languageCode=en&startDate=&endDate=&transactionStatus=4&fromCompletionDate=&toCompletionDate=&transactionID=&transactionType=-1&suppTransactionType=-1&originatingRegistry=-1&destinationRegistry=-1&originatingAccountType=-1&destinationAccountType=-1&originatingAccountNumber=" & parmAccount & "&destinationAccountNumber=&originatingAccountIdentifier=&destinationAccountIdentifier=&originatingAccountHolder=&destinationAccountHolder=&currentSortSettings=&resultList.currentPageNumber=" & lngPage & "&nextList=Next>")

    Next

    ReDim vData(1 To colData.Count, 0 To 12)
    lngRow = 1
    For Each vItem In colData
        Set oM = vItem
        For lngSM = 0 To 12
            vData(lngRow, lngSM) = oM.submatches(lngSM)
        Next
        lngRow = lngRow + 1
    Next

    Q_28572779 = vData
End Function

Open in new window

Invocation Example:

vPageData = Q_28572779("1234")

When you get the function results, you can push the entire array into your worksheet in one statement.  For more information on that, please read my Fast Data Push to Excel article.
http://www.experts-exchange.com/members/A_2253.html

Notes:

You can optionally include or (default) exclude the headers.
This code goes through all the pages of any multi-page output.
The code in the prior related question only got the first page of output.
Compare the output from this code against the output of the code you currently have.
I tested with account 1231, which has 20 pages of results = 191 rows
0
 

Author Comment

by:mikecvba
ID: 40477105
Staggeringly brilliant.  Just when i thought i might one day get to grips with the html/xml world of objects, you pull another show out of the bag. I like arrays. They are normally fast.  I will test and report thanks
0
 
LVL 46

Expert Comment

by:aikimark
ID: 40477120
You should compare the time it takes to Next through all 20 pages of account 1231 in your browser against the posted routine.

Note: I would have answered this sooner, but you didn't post a comment in the prior thread letting the participating experts know about this new question.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 70

Expert Comment

by:Qlemo
ID: 40480596
Reading raw HTML and do your own parsing is probably the fastest way. I'm not happy about having a need for such a complex parsing expression. However, with the ResponseText you loose all properties. Only ResponseXML would be really useful, and that requires to get a XML response. So there is nothing I could provide superior to this solution.
0
 

Author Comment

by:mikecvba
ID: 40480618
It is a great solution and I understand the principle of downloading the full hmtl response text and then sifting it for the relevant content using the VBScript libraries. I do not profess to understand the very complex parsing expression you created but it works. The grammar in that looks like a PhD on its own!! I would be happy to apply the getpage with a simpler string and keep the properties intact but I don't know how to keep the xml request happy.

A further question for you please. When a webpage send an xml with all the parameters, it gets the html response back as we have seen which contains the results embedded in the overall page result. On this website it seems there are two basic tables that can be returned, one basic on the transaction screen and, if you click on the detailed button, one with a lot more details.

Is it possible to query the xml database directly in a manner that only returns specific fields such as 'AccountNo' or 'TranferringAccountNumber' only? Or is the XML server hard coded to only churn out these two set responses?

There are two full page html responses - main screen and detailed screen and each of those has an export button which generates the data in an xml file.

Put another way, I am asking if it is possible to force the remote xml data server to give up discrete fields only? We can clearly see the field names. Can it be done from this end or are the server responses predetermined? I don't know any xml and just wondered whether, like in access or SQL, you can demand what you want in terms of fields via the query, rather than just having something fixed format thrown at you?

Either way, your code is excellent and does the job fast.
0
 

Author Comment

by:mikecvba
ID: 40480647
As I attempt to learn from you two giants, perhaps you can explain 'oXMLHTTP.Status = 200'  . What is the status of an object and where does the number 200 comes from? Olemo, why couldn't you use response XML which would allow you to jump quickly to the relevant part of the table as it would keep the formatting? Alternatively can you not bring it in as a HMTL string - response text and the convert that into some kind of XML object?
0
 
LVL 70

Expert Comment

by:Qlemo
ID: 40480649
Again, this is HTML, no XML. XML needs to get provided directly as response by the Web server. MSXML2 is just a means to download HTML without any processing - you would have the same means with a WebClient object.

The export is indeed a XML, but it is a button on a form. No way I can see in the source to invoke that directly via an URL - that would be great. However, we can go to the export page directly (http://ec.europa.eu/environment/ets/export.do), fill out the hidden form field "exportURL" with something like
destinationAccountHolder=&startDate=&destinationRegistry=-1&originatingAccountType=-1&form=transaction&endDate=&transactionID=&originatingAccountHolder=&suppTransactionType=-1&transactionType=-1&languageCode=en&destinationAccountNumber=&destinationAccountType=-1&toCompletionDate=&originatingRegistry=-1&destinationAccountIdentifier=&fromCompletionDate=&originatingAccountIdentifier=&transactionStatus=4&originatingAccountNumber=1234&currentSortSettings=

Open in new window

which probably only would need the AccountNumber set, btw, and then emulate pressing the "Ok" button. The response is an export.xml file.
0
 
LVL 70

Expert Comment

by:Qlemo
ID: 40480654
oXMLHTTP.Status is a property of the MSXML2 object, containing the resulting status of the page - completed, redirected, error aso., coded as an integer. Details about the object: http://msdn.microsoft.com/en-us/library/ie/ms535874%28v=vs.85%29.aspx
0
 
LVL 70

Expert Comment

by:Qlemo
ID: 40480660
HTML needs to be translated into XML, unless the remote web page directly submits XML. HTML->XML recoding is available with the free HtmlAgilityPack, but the usage is not well documented, and so it is suited only for the adventurous, or those who are used to XML Path syntax with all its details.
0
 

Author Comment

by:mikecvba
ID: 40480673
Thanks Olemo. I understand the limitations. You two already provided the code to access the export page directly and emulate the Ok button press under the aforementioned earlier article. That works a treat.
0
 
LVL 46

Expert Comment

by:aikimark
ID: 40480723
Alternatively, this could be parsed with a DOM object, also available in the MSXML2 name space (type library).  As in the prior related question, the table can be isolated by its name attribute.

Lots of discussion in this thread since yesterday.

200 is a good http return code
404 is not found
405 is forbidden
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

==================
It might be possible to request the XML output and capture the response directly into the MSXML2 object.  I hadn't noticed that this was an option.  If the resulting XML contains all the data and not just the first 10 items, then it might be a faster solution.
0
 

Author Comment

by:mikecvba
ID: 40480756
So we could bring back the page as a MSXLM2 DOM object and jump to the table (by name) or even field in the table and that would avoid the need for such complex VBScript patterns? Can we give that a try? I am sure these two pages of solutions will serve many people, who, like me needed a better understanding of the IE and MSXML2 objects. It seems to be condensed nicely here.
0
 
LVL 46

Expert Comment

by:aikimark
ID: 40480769
I just checked and the XML export does include all the data, not just the first page.  Unfortunately, the XML doesn't directly include information about the account number, so the code would either need to rename the recently exported XML file with a name that includes the account number or process each XML file, including file deletion, before making the next call -- if the XML couldn't be captured directly by the code.

It would have been nice if one of the URL parameters allowed more than 10 rows per page.
0
 
LVL 46

Expert Comment

by:aikimark
ID: 40480791
You assign the returned string (or byte stream) to a DOM object.
0
 
LVL 46

Expert Comment

by:aikimark
ID: 40480799
Can we give that a try?
You are welcome to post another question.  Please be sure to come back to this thread and post a link to your new question.  Otherwise, the participating experts do no know you have posted a new question.

If you are familiar with DOM processing, you should first attempt to do this yourself before posting a question.  At this point, you have a working solution.
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

If you need to start windows update installation remotely or as a scheduled task you will find this very helpful.
This article describes how you can use Custom Document Properties to store settings and other information in your workbook so that they will be available the next time you open the workbook.
This Micro Tutorial demonstrate the bugs in Microsoft Excel for Mac with Pivot Charts.
This Micro Tutorial will demonstrate the scrolling table in Microsoft Excel using the INDEX function.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question