?
Solved

automate data extraction from webpage using vba

Posted on 2011-05-12
21
Medium Priority
?
1,057 Views
Last Modified: 2012-05-11
I need to extract Class III Milk prices daily and want to automate the process.  I want the data imported saved in either access, excel or a word document.  I can get the first page imported into a word document following someone's example but then am lost after that.  The webpage is www.cmegroup.com.  Then select "Agriculture", select "Class III Milk", Select "Time & Sales" tab and want to enter the "Date" and for the time Period "Full Trade Day".

Hope someone can help me......
0
Comment
Question by:MTMonday
  • 10
  • 10
21 Comments
 
LVL 39

Expert Comment

by:BillDL
ID: 35752394
I take it that it is the "Globex Futures" you want to filter here?:
http://www.cmegroup.com/trading/agricultural/dairy/class-iii-milk_quotes_timeSales_globex_futures.html

0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35752865
you could use this code as a starting point.
Although the code works when you step through it, there are a number of thing to improve before it will be usefull :

- find out when the ie page has fully loaded (to get rid of the 10 seconds delay)
- update the date selection to adapt to your needs
- update the 'do something with this line' data processing to adapt to your needs

Sub import_milk()
Dim ie As Object

    Application.StatusBar = "Initialising"
    Set ie = CreateObject("internetexplorer.application")
        
    Application.StatusBar = "Loading CME group website"
    ie.navigate "http://www.cmegroup.com/trading/agricultural/dairy/class-iii-milk_quotes_timeSales_globex_futures.html"
    ie.Visible = True
    
    '-- make sure that page is fully loaded
    Application.StatusBar = "Waiting for page to load"
    Do While ie.Busy: DoEvents: Loop
    Do While ie.ReadyState <> 4: DoEvents: Loop
    Application.Wait DateAdd("s", 10, Now)
        
    '-- select expiration date (3rd option)
    Set expiration = ie.document.getelementsbyname("ExpMonths")(0)
    expiration.selectedindex = 2
        
    '-- select full trade day (full day is 8th option)
    Set timeperiod = ie.document.getelementsbyname("tsfromtime")(0)
    timeperiod.selectedindex = 7
    
    '-- perform update
    Set update_button = expiration.ParentNode.ParentNode.ParentNode.ParentNode.LastChild.FirstChild.FirstChild.FirstChild
    update_button.Click
    
    '-- process results
    Set milk_prices = ie.document.getelementsbyname("loaderholder")(0).FirstChild.NextSibling
    For Each Line In milk_prices.ChildNodes
        '-- do something with this line
        Debug.Print Line.NodeValue
    Next Line

    ie.Quit
    Set ie = Nothing
    
End Sub

Open in new window

0
 

Author Comment

by:MTMonday
ID: 35753406
Yes, that is the page that I would like to import specifically when I select the date and the full current day: http://www.cmegroup.com/trading/agricultural/dairy/class-iii-milk_quotes_timeSales_globex_futures.html

0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 

Author Comment

by:MTMonday
ID: 35753774
When I run the code, I get an run-time error 462:  The remote server machine does not exist or is unavailable.  This happens right after the IE is made visible (after Line 9 on the above code).
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35753869
That particular message is prone to happen when a VBA error occured somewhere in between executing line 5 and line 9. When this happens, the object IE no longer relates to the internet explorer application (which is still running, visible and can be used normally). Or when the internet explorer application has been closed.

you could try to swap lines 8 & 9, that way internet should be visible first and then you can load the webpage. If this still throws the same error, something is wrong with your internet explorer setup.
0
 

Author Comment

by:MTMonday
ID: 35753971
tried the swapping of the lines, I can see the page loading, loads fully, but now I get stuck in the Do loop on line 13.  I tried commenting out lines 12-15, made sure the page was fully loaded before running the code from line 18, but with that I get an run time error '-214767259 (80004005) Automation error Unspecified error.
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35754001
which version of internet explorer du you currently use ?

older versions of IE might not allow access to the automation structures used.
0
 

Author Comment

by:MTMonday
ID: 35754051
IE 7.0, with Vista as my operating system.  I will upgrade to IE 8 and try my luck with the code again.
0
 

Author Comment

by:MTMonday
ID: 35754332
Upgraded to IE 8, the code worked all the way thru, expect I do not get any data.  

Debug.Print Line.NodeValue prints only "Null" in the immediate window.

I would like to take the titles and the data and save it to an Excel worksheet names "Sheet_Class3".
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35754418
That's at least a step in the good direction !

When you arrive at the line.nodevalue, i suppose the data lines are visible on the screen ?

At the time I made the original post the data results were formatted as clear data, when I try it now somehow it is formatted in a table.
Probably the lines 30 ~ 35 need to be adapted, give me a couple of minutes and i'll give you an update.

0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35754620
can you post some of the results that you get when you replace lines 30 ~ 34 with

    Set milk_prices = ie.document.getelementsbyname("loaderholder")(0)
    For Each Line In milk_prices.ChildNodes
        '-- do something with this line
        Debug.Print Line.nodename & " = " & Line.NodeValue
        If Line.ChildNodes.Length > 0 Then
        For Each line2 In Line.ChildNodes
            Debug.Print Line.nodename & "." & line2.nodename & " = " & line2.NodeValue
            If line2.ChildNodes.Length > 0 Then
            For Each line3 In line2.ChildNodes
                Debug.Print Line.nodename & "." & line2.nodename & "." & line3.nodename & " = " & line3.NodeValue
                If line3.ChildNodes.Length > 0 Then
                For Each line4 In line3.ChildNodes
                    Debug.Print Line.nodename & "." & line2.nodename & "." & line3.nodename & "." & line4.nodename & " = " & line4.NodeValue
                    If line4.ChildNodes.Length > 0 Then
                    For Each line5 In line4.ChildNodes
                        Debug.Print Line.nodename & "." & line2.nodename & "." & line3.nodename & "." & line4.nodename & "." & line5.nodename & " = " & line5.NodeValue
                        If line5.ChildNodes.Length > 0 Then
                        For Each line6 In line5.ChildNodes
                            Debug.Print Line.nodename & "." & line2.nodename & "." & line3.nodename & "." & line4.nodename & "." & line5.nodename & "." & line6.nodename & " = " & line6.NodeValue
                        Next line6
                        End If
                    Next line5
                    End If
                Next line4
                End If
            Next line3
            End If
        Next line2
        End If
    Next Line

Open in new window

0
 

Author Comment

by:MTMonday
ID: 35754738
When I step into the code (F8) the following is the result that I obtain:
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:13:54 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1714
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:09:21 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1710
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 2
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:09:21 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1710
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 2
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:09:20 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1710
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 2
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:09:20 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1710
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 2
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:08:58 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1710
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:08:49 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1710
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:08:23 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1715
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:08:21 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1715
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 2
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:05:00 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1716
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:05:00 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1715
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:04:59 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1716B
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 0
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:04:57 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1715
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 2
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:04:37 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1715
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:04:32 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1715
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.DIV =
TABLE.TBODY.TR.TD.DIV.#text = 5/12/2011
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 09:01:16 AM
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 1720
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = 4
TABLE.TBODY.TR.TD =
TABLE.TBODY.TR.TD.#text = -
TABLE =
TABLE.TBODY =
TABLE.TBODY.TR =
TABLE.TBODY.TR.TD =
 


But When I try to run the sub (F5) I get only the following results:  
DIV =
DIV.#text = Processing..
DIV.IMG =
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35754821
that is the problem with the website implementation, to my knowledge it is not possible to know when the 'update' action has completed.
When you run the sub with F5 the website is still querying for data, so it might be helpfull to include a statement such as

Application.Wait DateAdd("s", 7, Now)

which waits for an arbitrarily chosen number of seconds (in this case 7 seconds) before continuing. This is not an elegant solution but at least it works.

looking at the result structure you get, it should not be that difficult to export the values to a usable form, eg. add this line

if line6.nodename = "#text" then worksheets("Sheet1").range("A" & worksheets("Sheet1").usedrange.rows.count+1) = line6.NodeValue

Open in new window

just below the line
debug.print [....] & line6.nodevalue

Open in new window


you can of course do this also for line up to line5

0
 

Author Comment

by:MTMonday
ID: 35755017
debug.print [....] & line6.nodevalue this line gives "run-time error "13:  type mismatch".  Also it only put "5/12/2011" and nothing else on the excel worksheet.  

You mentioned something about website implementation, is there any other method which will let me extract data....
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35761838
The date line indicates that the first item has been processed correctly, the type 13 error indicates that excel tries to process the next text line (apperently the one containing the time of day)  as a date instead of a textual value, and the date & time settings in your windows user profile do not match. At my workplace this error does not occur.

This can be circumvented by specifying the cell contents to be pure textual. To do this, add the line

Range("A:A").NumberFormat = "@"

before the 'for each line' loop.



0
 

Author Comment

by:MTMonday
ID: 35777469
I did add the "@" to format the date & time settings, but it still does not work when I run "F8".  The other thing I notice is using "F5" in debug mode, I can only get one value written in cell 1 and it gets replaced with the next.  I want to be able to see the whole line from the webpage and keep adding it to the worksheet rather than overwriting it.  I am not good at Excel vba, more familiar with access
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35777837
Hmm, that is strange.
normally when you place a text in say cell A5, the usedrange is enlarged to at least contain this cell => usedrange.rows.count = 5
the next time the information is to be placed in cell A(usedrange.rows.count+1) = A(5+1) = A6.
This way you will only add and not overwrite information.

I have no clue what is going on or wrong at your workplace, but could you post the file you are working on so that I can check if the same behavior can be seen at my working place ?
0
 

Author Comment

by:MTMonday
ID: 35778616
See the attached Excel file (excel 2007)
CMI-Data.xls
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 35784274
MTMonday,

a couple of remarks :

the line application.wait 10 instructs excel to wait until 10 january 1900, which is useless as this date and time has already passed.
the [...] in the debug.print line was not to be taken literally ;-)

I now see the problem :
the usedrange.rows.count delivers the number of used rows, which will be 1 after the first insert.
What is doesn't take into account however is that the first used row starts at A2. So the next item will be placed in A[#rows used+1] = A2 !
So it should have been placed in A[#rows used + # of first used row + 1] = A3...

at the moment the website seems to be unable to supply data, so i cannot really test, but when this has been resolved i will look into a new version which fixes these problems.




0
 
LVL 19

Accepted Solution

by:
Arno Koster earned 2000 total points
ID: 35786831
Well, it again took me a while but I got the following results
 CMI-Data.xls
0
 

Author Closing Comment

by:MTMonday
ID: 35794119
Thanks
0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

You need to know the location of the Office templates folder, so that when you create new templates, they are saved to that location, and thus are available for selection when creating new documents.  The steps to find the Templates folder path are …
Windows Explorer lets you open cabinet (cab) files like any other folder. In VBA you can easily handle normal files and folders, but opening and indeed creating cabinet files takes a lot more - and that's you'll find here.
The viewer will learn how to create two correlated normally distributed random variables in Excel, use a normal distribution to simulate the return on different levels of investment in each of the two funds over a period of ten years, and, create a …
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…

840 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question