Solved

Access VBA scrape website for summary information based on hyperlink in table

Posted on 2013-05-13
26
816 Views
Last Modified: 2016-08-31
Hi,

I've attached a representative sample of the Access 2010 database that I am working with.  There are two tables in the database: thomasdata and billinfo.  In the thomasdata table there are three columns but of interst is a unique id (ID) and a hyperlink (BillInfo).  in the billinfo table the columns of interest are the thomasdataID and the billsummary columns.

What I would like to do is develop VBA to read the thomas data and pull the ID and the hyperlink from the thomasdata table.  Then I would like access to follow the link to the webpage and scrape the information between the tags <a name="summary">SUMMARY AS OF:</a> and <a name='major actions'><b>MAJOR ACTIONS:</b></a>.  Once that information is pulled, I would like the unique ID from the thomasdata table stored in the thomasdataID column in billinfo table and the summary of the webpage to be stored in the billsummary column of the billinfo table.

I know that this must have been done somewhere before, so I am hoping one of the experts can help find a solution.
getsummaryrequest.accdb
0
Comment
Question by:atljarman
  • 14
  • 11
26 Comments
 
LVL 33

Expert Comment

by:Norie
Comment Utility
The thomasdataid in table billinfo is a number field but the billuid in table thomasdata is text.

thomasdataid will need to be text if it's to take the value from billuid.

I've changed that in the attached database and added a form that will run code to get the data you want.
getsummaryrequest---updated.accdb
0
 

Author Comment

by:atljarman
Comment Utility
imnorie,

Thanks for the quick reply.  The value that I would like to store in the thomasbillid coulumn is the unique identifier ID from the thomasdata table (an auto number).  How would you pull the data from the same page (during the same call) between the text: "ALL ACTIONS:" and "TITLE(S):" into a column called "actions" in the billinfo table?

It is probably easier to pull it all at once, then to make a second call then update the record.  Thanks again for your help.
0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
The code can easily be altered to store a the ID rather than the bill ID.

I'd need to have another look at how to pull ALL ACTIONS and TITLES.

Unfortunately the page isn't well organised so it would probably just use the same way as for SUMMARY, ie searching for certain headings to parse the text.

Not hard just a bit of a pain.:)
0
 

Author Comment

by:atljarman
Comment Utility
This fixed the ID issue:

            .Fields("thomasdataid").Value = rstThomas.Fields("ID").Value

Open in new window

0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
Here's updated code.
Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

     '   Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
      '  Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    'Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "All ACTIONS")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStrRev(strContent(2), "TITLES")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
Ignore that last code, it had some mistakes - didn't read the page close enough.

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStrRev(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 

Author Comment

by:atljarman
Comment Utility
Imnorie,

I have this error: "Method or data member not found" on this line:

Me.lblStatus.Visible = True

Looks like the form you are using has a label on it that was not on the oringal accdb that you posted.  I'll comment it out and test the code.  When I commented out the code and made a couple changes to the previous script post (id: 39161973) it seemed to work.
0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
Forgot to mention I added a label, it was for my own sanity.

It showed the code was actually running, as it was executing so fast I thought it wasn't.
0
 

Author Comment

by:atljarman
Comment Utility
Yep.  I put a blank label with the name lblStatus on the form and it worked.  The challenge that I am having though, is that some of the Action data is not stopping at the end of the actions.  Seems not to be correctly identifying the placement of Titles, and I'm wondering if we need to subtract the position of the Actions possibly.  I thought maybe the lenght was carrying over from the summary, but this code made no difference:

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos, pos2


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos2 = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos2)

    pos2 = InStrRev(strContent(2), "TITLE(S):")

    strContent(2) = Left(strContent(2), pos2 - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window


I'll check to see if this is in summary too.
0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
No lengths or positions are carried over from the code to find the summary.

I'm not sure what you mean about subtracting the position of ALL ACTIONS.

The code looks for ALL ACTIONS: and then returns the contents from that point on and puts it in a variable.

Then it looks at that variable for TITLE(S) and returns the contents up to the point where TITLE(S) is found.

That should give us the contents between ALL ACTIONS and TITLES.

If TITLES wasn't found the code would error.
0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
Actually I think I figured out the problem, TITLE(S) is repeated.

Change this,

 
InStrRev(strContents(2), "TITLE(S)") 

Open in new window


to this.

InStr(strContents(2), "TITLE(S)") 

Open in new window


I don't have time to check that but it should work.
0
 

Author Comment

by:atljarman
Comment Utility
I was think that maybe the length was relative and that when it was looking for titles that the code was starting from the beginning of the HTML file and not from the location of Actions making it search a length much longer.  So for example, if All Actions: was a character 2000 and Title(s): was at character 4000, then I was thinking that it was taking 4000 characters starting at character 2000.  This is probably way wrong, but this is what the very first one was for the Actions (Notice the Title(s): was included):

ALL ACTIONS:
1/22/2013:
Referred to the Committee on the Judiciary, and in addition to the Committees on Energy and Commerce, Education and the Workforce, Financial Services, and Natural Resources, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee concerned.
1/22/2013:
Referred to House Judiciary
2/28/2013:
Referred to the Subcommittee on Crime, Terrorism, Homeland Security, And Investigations.
1/22/2013:
Referred to House Energy and Commerce
1/25/2013:
Referred to the Subcommittee on Health.
1/22/2013:
Referred to House Education and the Workforce
4/23/2013:
Referred to the Subcommittee on Early Childhood, Elementary, and Secondary Education.
4/23/2013:
Referred to the Subcommittee on Higher Education and Workforce Training.
1/22/2013:
Referred to House Financial Services
1/22/2013:
Referred to House Natural Resources
1/29/2013:
Referred to the Subcommittee Indian and Alaska Native Affairs.


TITLE(S):  (italics indicate a title for a portion of a bill)
POPULAR TITLE(S):
VAWA bill (identified by CRS)
SHORT
0
 
LVL 33

Assisted Solution

by:Norie
Norie earned 500 total points
Comment Utility
See my last post and attached file.

PS Just remembered, the billinfo table has some old data in it.
getsummaryrequest---updated-V2.accdb
0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 

Author Comment

by:atljarman
Comment Utility
Any way to easliy remove the first line from each of the two columns before inserting them?  For example, the "ALL ACTIONS: " and "" are in each cell and duplicative.  It appears that there would be a chr(10) or chr(13) on the first row that one could break off at after the record is processed.  Not sure if you have encountered this before.  The InStr rather than the InStrRev worked.
0
 

Accepted Solution

by:
atljarman earned 0 total points
Comment Utility
This works like a charm.

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            '.Fields("billsummary").Value = varContent(1)  'Mid([billsummary],InStr(1,[BillSummary],Chr(13))+2,10000)
            .Fields("billsummary").Value = Mid(varContent(1), InStr(1, varContent(1), Chr(13)) + 2, 10000)
            '.Fields("actions").Value = varContent(2)
            .Fields("actions").Value = Mid(varContent(2), InStr(1, varContent(2), Chr(13)) + 2, 10000)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStr(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
I was just about to post code to exclude SUMMARY etc, it can actually go in the function.

PS You don't need the 10000 in Mid, it will return to the end of the string no matter the  length.
0
 

Author Comment

by:atljarman
Comment Utility
awesome.  Thank you again.
0
 

Author Comment

by:atljarman
Comment Utility
I'm running into an error with this code. Not sure if you see where the error is:

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = Mid(varContent(1), InStr(1, varContent(1), Chr(13)) + 2, 10000)
            .Fields("actions").Value = Mid(varContent(2), InStr(1, varContent(2), Chr(13)) + 2, 10000)
            .Fields("relatedbills").Value = Mid(varContent(3), InStr(1, varContent(3), Chr(13)) + 2, 10000)
            .Fields("billtitles").Value = Mid(varContent(4), InStr(1, varContent(4), Chr(13)) + 2, 10000)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 4)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    'Extract the Summary

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)
    
    'Extract the All Actions

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStr(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)
    
    'Extract the Title(s)
    
    strContent(3) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(3), "TITLE(S)")

    strContent(3) = Mid(strContent(3), pos)

    pos = InStr(strContent(3), "COSPONSOR(S):")

    strContent(3) = Left(strContent(3), pos - 1)

    'Extract the Related Bills

    strContent(4) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(4), "RELATED BILL DETAILS")

    strContent(4) = Mid(strContent(1), pos)

    pos = InStr(strContent(4), "AMENDMENT(S)")

    strContent(4) = Left(strContent(4), pos - 1)



    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 

Author Comment

by:atljarman
Comment Utility
I got it. It was an issue with cosponsor.  Thanks a million.
0
 

Author Comment

by:atljarman
Comment Utility
Here is the final version of the database for anyone trying to retrieve BillInfo from Thomas.gov
getsummaryrequest---updated-FINA.accdb
0
 

Author Closing Comment

by:atljarman
Comment Utility
Excellent, thanks for your help on this.  Grade A work.
0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
Assist?
0
 

Author Comment

by:atljarman
Comment Utility
No offense implied; you did the work here.  I wanted to clearly identify the final code to do the action and gave you all the points with the db you created,  this is an issue with experts exchange. I can sign my answer.
0
 
LVL 33

Expert Comment

by:Norie
Comment Utility
It's not a problem.:)
0
 

Author Comment

by:atljarman
Comment Utility
Thanks again.
0
 

Expert Comment

by:Ettore Cicinelli
Comment Utility
I try your file but gives runtime error 91
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Creating and Managing Databases with phpMyAdmin in cPanel.
If you’re thinking to yourself “That description sounds a lot like two people doing the work that one could accomplish,” you’re not alone.
An introduction to basic programming syntax in Java by creating a simple program. Viewers can follow the tutorial as they create their first class in Java. Definitions and explanations about each element are given to help prepare viewers for future …
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now