Solved

Using MSXML2 to import Web Table Data into Excel with VBA

Posted on 2014-12-02
15
1,493 Views
Last Modified: 2014-12-04
I need help to use the MSXML2 object to load xml data into Excel.

Here is the website with data preloaded (the Account number - "1234" is the only variable):

http://ec.europa.eu/environment/ets/transaction.do;?languageCode=en&startDate=&endDate=&transactionStatus=4&fromCompletionDate=&toCompletionDate=&transactionID=&transactionType=-1&suppTransactionType=-1&originatingRegistry=-1&destinationRegistry=-1&originatingAccountType=-1&destinationAccountType=-1&originatingAccountNumber=1234&destinationAccountNumber=&originatingAccountIdentifier=&destinationAccountIdentifier=&originatingAccountHolder=&destinationAccountHolder=&search=Search&currentSortSettings=&resultList.currentPageNumber=1

I just need to be able to scrape the very top row of the table into the first row of Sheet1 in excel using VBA.
Capture.JPG
 

With the help of experts exchange ( see Excel VBA Web Data Scraping from a Table ). I can get the table data into excel using Internet Explorer object or TableObject but its very slow. I need to run the query many time with the range of account number 1-15,000 which will take weeks with the current methods.

How do I get the top row of the table data on the web into Excel row 1 on Sheet1 using the MSXML2 object? I am hoping this will be faster,
0
Comment
Question by:mikecvba
  • 6
  • 5
  • 4
15 Comments
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40476396
Add this code to a module in your workbook.
Option Explicit

Function GetPage(ByVal parmURL As String) As String
    Static oXMLHTTP As Object
    
    If oXMLHTTP Is Nothing Then
        Set oXMLHTTP = CreateObject("MSXML2.XMLHTTP")
    End If
    
    oXMLHTTP.Open "GET", parmURL, False
    oXMLHTTP.Send
    
    Do Until oXMLHTTP.ReadyState = 4
        DoEvents
    Loop
    
    If oXMLHTTP.Status = 200 Then
        GetPage = oXMLHTTP.responsetext
    Else
        GetPage = vbNullString
    End If
End Function

Function Q_28572779(ByVal parmAccount As String, Optional parmIncludeHeaderRow As Boolean = False) As Variant
    'Returns array of table values
    Dim strHTML As String
    Dim oRE As Object
    Dim oMatches As Object
    Dim oM As Object
    Dim oSM As Object
    Dim lngSM As Long

    Dim lngRow As Long
    Dim lngPage As Long
    Dim lngPageCount As Long
    Dim vData As Variant
    Dim vHeader As Variant
    Dim colData As New Collection
    Dim vItem As Variant
    
    Set oRE = CreateObject("vbscript.regexp")
    oRE.Global = True

    strHTML = GetPage("http://ec.europa.eu/environment/ets/transaction.do?languageCode=en&startDate=&endDate=&transactionStatus=4&fromCompletionDate=&toCompletionDate=&transactionID=&transactionType=-1&suppTransactionType=-1&originatingRegistry=-1&destinationRegistry=-1&originatingAccountType=-1&destinationAccountType=-1&originatingAccountNumber=" & parmAccount & "&destinationAccountNumber=&originatingAccountIdentifier=&destinationAccountIdentifier=&originatingAccountHolder=&destinationAccountHolder=&currentSortSettings=&resultList.currentPageNumber=0&nextList=Next>")
    'parse the result
    oRE.Pattern = """resultList.lastPageNumber""[^>]*value=""(\d+)"""
    If oRE.test(strHTML) Then
        Set oMatches = oRE.Execute(strHTML)
        lngPageCount = oMatches(0).submatches(0)
    End If
    If parmIncludeHeaderRow Then
        oRE.Pattern = "<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*<td [^>]*class=""bgtitlelist"">(?:\s|\S)*?>(\w[^<]*)<(?:\s|\S)*?</td>\s*"
        If oRE.test(strHTML) Then
            Set oMatches = oRE.Execute(strHTML)
            colData.Add oMatches(0)
        End If
    End If
    
    lngRow = 1
    'now for the data
    oRE.Pattern = "<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*<td [^>]*class=""bgcelllist"">(?:\s|\S)*?&nbsp;(.*?)&nbsp;(?:\s|\S)*?</td>\s*"
    For lngPage = 1 To lngPageCount
        If oRE.test(strHTML) Then
            Set oMatches = oRE.Execute(strHTML)
            For Each oM In oMatches
                With oM
                    colData.Add oM
                End With
                lngRow = lngRow + 1
            Next
        End If
        
        strHTML = GetPage("http://ec.europa.eu/environment/ets/transaction.do?languageCode=en&startDate=&endDate=&transactionStatus=4&fromCompletionDate=&toCompletionDate=&transactionID=&transactionType=-1&suppTransactionType=-1&originatingRegistry=-1&destinationRegistry=-1&originatingAccountType=-1&destinationAccountType=-1&originatingAccountNumber=" & parmAccount & "&destinationAccountNumber=&originatingAccountIdentifier=&destinationAccountIdentifier=&originatingAccountHolder=&destinationAccountHolder=&currentSortSettings=&resultList.currentPageNumber=" & lngPage & "&nextList=Next>")

    Next

    ReDim vData(1 To colData.Count, 0 To 12)
    lngRow = 1
    For Each vItem In colData
        Set oM = vItem
        For lngSM = 0 To 12
            vData(lngRow, lngSM) = oM.submatches(lngSM)
        Next
        lngRow = lngRow + 1
    Next

    Q_28572779 = vData
End Function

Open in new window

Invocation Example:

vPageData = Q_28572779("1234")

When you get the function results, you can push the entire array into your worksheet in one statement.  For more information on that, please read my Fast Data Push to Excel article.
http://www.experts-exchange.com/members/A_2253.html

Notes:

You can optionally include or (default) exclude the headers.
This code goes through all the pages of any multi-page output.
The code in the prior related question only got the first page of output.
Compare the output from this code against the output of the code you currently have.
I tested with account 1231, which has 20 pages of results = 191 rows
0
 

Author Comment

by:mikecvba
ID: 40477105
Staggeringly brilliant.  Just when i thought i might one day get to grips with the html/xml world of objects, you pull another show out of the bag. I like arrays. They are normally fast.  I will test and report thanks
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40477120
You should compare the time it takes to Next through all 20 pages of account 1231 in your browser against the posted routine.

Note: I would have answered this sooner, but you didn't post a comment in the prior thread letting the participating experts know about this new question.
0
 
LVL 68

Expert Comment

by:Qlemo
ID: 40480596
Reading raw HTML and do your own parsing is probably the fastest way. I'm not happy about having a need for such a complex parsing expression. However, with the ResponseText you loose all properties. Only ResponseXML would be really useful, and that requires to get a XML response. So there is nothing I could provide superior to this solution.
0
 

Author Comment

by:mikecvba
ID: 40480618
It is a great solution and I understand the principle of downloading the full hmtl response text and then sifting it for the relevant content using the VBScript libraries. I do not profess to understand the very complex parsing expression you created but it works. The grammar in that looks like a PhD on its own!! I would be happy to apply the getpage with a simpler string and keep the properties intact but I don't know how to keep the xml request happy.

A further question for you please. When a webpage send an xml with all the parameters, it gets the html response back as we have seen which contains the results embedded in the overall page result. On this website it seems there are two basic tables that can be returned, one basic on the transaction screen and, if you click on the detailed button, one with a lot more details.

Is it possible to query the xml database directly in a manner that only returns specific fields such as 'AccountNo' or 'TranferringAccountNumber' only? Or is the XML server hard coded to only churn out these two set responses?

There are two full page html responses - main screen and detailed screen and each of those has an export button which generates the data in an xml file.

Put another way, I am asking if it is possible to force the remote xml data server to give up discrete fields only? We can clearly see the field names. Can it be done from this end or are the server responses predetermined? I don't know any xml and just wondered whether, like in access or SQL, you can demand what you want in terms of fields via the query, rather than just having something fixed format thrown at you?

Either way, your code is excellent and does the job fast.
0
 

Author Comment

by:mikecvba
ID: 40480647
As I attempt to learn from you two giants, perhaps you can explain 'oXMLHTTP.Status = 200'  . What is the status of an object and where does the number 200 comes from? Olemo, why couldn't you use response XML which would allow you to jump quickly to the relevant part of the table as it would keep the formatting? Alternatively can you not bring it in as a HMTL string - response text and the convert that into some kind of XML object?
0
 
LVL 68

Expert Comment

by:Qlemo
ID: 40480649
Again, this is HTML, no XML. XML needs to get provided directly as response by the Web server. MSXML2 is just a means to download HTML without any processing - you would have the same means with a WebClient object.

The export is indeed a XML, but it is a button on a form. No way I can see in the source to invoke that directly via an URL - that would be great. However, we can go to the export page directly (http://ec.europa.eu/environment/ets/export.do), fill out the hidden form field "exportURL" with something like
destinationAccountHolder=&startDate=&destinationRegistry=-1&originatingAccountType=-1&form=transaction&endDate=&transactionID=&originatingAccountHolder=&suppTransactionType=-1&transactionType=-1&languageCode=en&destinationAccountNumber=&destinationAccountType=-1&toCompletionDate=&originatingRegistry=-1&destinationAccountIdentifier=&fromCompletionDate=&originatingAccountIdentifier=&transactionStatus=4&originatingAccountNumber=1234&currentSortSettings=

Open in new window

which probably only would need the AccountNumber set, btw, and then emulate pressing the "Ok" button. The response is an export.xml file.
0
Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

 
LVL 68

Expert Comment

by:Qlemo
ID: 40480654
oXMLHTTP.Status is a property of the MSXML2 object, containing the resulting status of the page - completed, redirected, error aso., coded as an integer. Details about the object: http://msdn.microsoft.com/en-us/library/ie/ms535874%28v=vs.85%29.aspx
0
 
LVL 68

Expert Comment

by:Qlemo
ID: 40480660
HTML needs to be translated into XML, unless the remote web page directly submits XML. HTML->XML recoding is available with the free HtmlAgilityPack, but the usage is not well documented, and so it is suited only for the adventurous, or those who are used to XML Path syntax with all its details.
0
 

Author Comment

by:mikecvba
ID: 40480673
Thanks Olemo. I understand the limitations. You two already provided the code to access the export page directly and emulate the Ok button press under the aforementioned earlier article. That works a treat.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40480723
Alternatively, this could be parsed with a DOM object, also available in the MSXML2 name space (type library).  As in the prior related question, the table can be isolated by its name attribute.

Lots of discussion in this thread since yesterday.

200 is a good http return code
404 is not found
405 is forbidden
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

==================
It might be possible to request the XML output and capture the response directly into the MSXML2 object.  I hadn't noticed that this was an option.  If the resulting XML contains all the data and not just the first 10 items, then it might be a faster solution.
0
 

Author Comment

by:mikecvba
ID: 40480756
So we could bring back the page as a MSXLM2 DOM object and jump to the table (by name) or even field in the table and that would avoid the need for such complex VBScript patterns? Can we give that a try? I am sure these two pages of solutions will serve many people, who, like me needed a better understanding of the IE and MSXML2 objects. It seems to be condensed nicely here.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40480769
I just checked and the XML export does include all the data, not just the first page.  Unfortunately, the XML doesn't directly include information about the account number, so the code would either need to rename the recently exported XML file with a name that includes the account number or process each XML file, including file deletion, before making the next call -- if the XML couldn't be captured directly by the code.

It would have been nice if one of the URL parameters allowed more than 10 rows per page.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40480791
You assign the returned string (or byte stream) to a DOM object.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40480799
Can we give that a try?
You are welcome to post another question.  Please be sure to come back to this thread and post a link to your new question.  Otherwise, the participating experts do no know you have posted a new question.

If you are familiar with DOM processing, you should first attempt to do this yourself before posting a question.  At this point, you have a working solution.
0

Featured Post

Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

Join & Write a Comment

Workbook link problems after copying tabs to a new workbook? David Miller (dlmille) Intro Have you either copied sheets to a new workbook, and after having saved and opened that workbook, you find that there are links back to the original sou…
This code takes an Excel list of URL’s and adds a header titled “URL List”. It then searches through all URL’s in column “A”, looking for duplicates. When a duplicate is found, it is moved to the top of the list. The duplicate URL’s are then highlig…
This Micro Tutorial will demonstrate how to use longer labels with horizontal bar charts instead of the vertical column chart.
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now