• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1061
  • Last Modified:

Access VBA scrape website for summary information based on hyperlink in table

Hi,

I've attached a representative sample of the Access 2010 database that I am working with.  There are two tables in the database: thomasdata and billinfo.  In the thomasdata table there are three columns but of interst is a unique id (ID) and a hyperlink (BillInfo).  in the billinfo table the columns of interest are the thomasdataID and the billsummary columns.

What I would like to do is develop VBA to read the thomas data and pull the ID and the hyperlink from the thomasdata table.  Then I would like access to follow the link to the webpage and scrape the information between the tags <a name="summary">SUMMARY AS OF:</a> and <a name='major actions'><b>MAJOR ACTIONS:</b></a>.  Once that information is pulled, I would like the unique ID from the thomasdata table stored in the thomasdataID column in billinfo table and the summary of the webpage to be stored in the billsummary column of the billinfo table.

I know that this must have been done somewhere before, so I am hoping one of the experts can help find a solution.
getsummaryrequest.accdb
0
atljarman
Asked:
atljarman
  • 14
  • 11
2 Solutions
 
NorieCommented:
The thomasdataid in table billinfo is a number field but the billuid in table thomasdata is text.

thomasdataid will need to be text if it's to take the value from billuid.

I've changed that in the attached database and added a form that will run code to get the data you want.
getsummaryrequest---updated.accdb
0
 
atljarmanAuthor Commented:
imnorie,

Thanks for the quick reply.  The value that I would like to store in the thomasbillid coulumn is the unique identifier ID from the thomasdata table (an auto number).  How would you pull the data from the same page (during the same call) between the text: "ALL ACTIONS:" and "TITLE(S):" into a column called "actions" in the billinfo table?

It is probably easier to pull it all at once, then to make a second call then update the record.  Thanks again for your help.
0
 
NorieCommented:
The code can easily be altered to store a the ID rather than the bill ID.

I'd need to have another look at how to pull ALL ACTIONS and TITLES.

Unfortunately the page isn't well organised so it would probably just use the same way as for SUMMARY, ie searching for certain headings to parse the text.

Not hard just a bit of a pain.:)
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
atljarmanAuthor Commented:
This fixed the ID issue:

            .Fields("thomasdataid").Value = rstThomas.Fields("ID").Value

Open in new window

0
 
NorieCommented:
Here's updated code.
Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

     '   Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
      '  Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    'Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "All ACTIONS")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStrRev(strContent(2), "TITLES")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
NorieCommented:
Ignore that last code, it had some mistakes - didn't read the page close enough.

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStrRev(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
atljarmanAuthor Commented:
Imnorie,

I have this error: "Method or data member not found" on this line:

Me.lblStatus.Visible = True

Looks like the form you are using has a label on it that was not on the oringal accdb that you posted.  I'll comment it out and test the code.  When I commented out the code and made a couple changes to the previous script post (id: 39161973) it seemed to work.
0
 
NorieCommented:
Forgot to mention I added a label, it was for my own sanity.

It showed the code was actually running, as it was executing so fast I thought it wasn't.
0
 
atljarmanAuthor Commented:
Yep.  I put a blank label with the name lblStatus on the form and it worked.  The challenge that I am having though, is that some of the Action data is not stopping at the end of the actions.  Seems not to be correctly identifying the placement of Titles, and I'm wondering if we need to subtract the position of the Actions possibly.  I thought maybe the lenght was carrying over from the summary, but this code made no difference:

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos, pos2


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos2 = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos2)

    pos2 = InStrRev(strContent(2), "TITLE(S):")

    strContent(2) = Left(strContent(2), pos2 - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window


I'll check to see if this is in summary too.
0
 
NorieCommented:
No lengths or positions are carried over from the code to find the summary.

I'm not sure what you mean about subtracting the position of ALL ACTIONS.

The code looks for ALL ACTIONS: and then returns the contents from that point on and puts it in a variable.

Then it looks at that variable for TITLE(S) and returns the contents up to the point where TITLE(S) is found.

That should give us the contents between ALL ACTIONS and TITLES.

If TITLES wasn't found the code would error.
0
 
NorieCommented:
Actually I think I figured out the problem, TITLE(S) is repeated.

Change this,

 
InStrRev(strContents(2), "TITLE(S)") 

Open in new window


to this.

InStr(strContents(2), "TITLE(S)") 

Open in new window


I don't have time to check that but it should work.
0
 
atljarmanAuthor Commented:
I was think that maybe the length was relative and that when it was looking for titles that the code was starting from the beginning of the HTML file and not from the location of Actions making it search a length much longer.  So for example, if All Actions: was a character 2000 and Title(s): was at character 4000, then I was thinking that it was taking 4000 characters starting at character 2000.  This is probably way wrong, but this is what the very first one was for the Actions (Notice the Title(s): was included):

ALL ACTIONS:
1/22/2013:
Referred to the Committee on the Judiciary, and in addition to the Committees on Energy and Commerce, Education and the Workforce, Financial Services, and Natural Resources, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee concerned.
1/22/2013:
Referred to House Judiciary
2/28/2013:
Referred to the Subcommittee on Crime, Terrorism, Homeland Security, And Investigations.
1/22/2013:
Referred to House Energy and Commerce
1/25/2013:
Referred to the Subcommittee on Health.
1/22/2013:
Referred to House Education and the Workforce
4/23/2013:
Referred to the Subcommittee on Early Childhood, Elementary, and Secondary Education.
4/23/2013:
Referred to the Subcommittee on Higher Education and Workforce Training.
1/22/2013:
Referred to House Financial Services
1/22/2013:
Referred to House Natural Resources
1/29/2013:
Referred to the Subcommittee Indian and Alaska Native Affairs.


TITLE(S):  (italics indicate a title for a portion of a bill)
POPULAR TITLE(S):
VAWA bill (identified by CRS)
SHORT
0
 
NorieCommented:
See my last post and attached file.

PS Just remembered, the billinfo table has some old data in it.
getsummaryrequest---updated-V2.accdb
0
 
atljarmanAuthor Commented:
Any way to easliy remove the first line from each of the two columns before inserting them?  For example, the "ALL ACTIONS: " and "" are in each cell and duplicative.  It appears that there would be a chr(10) or chr(13) on the first row that one could break off at after the record is processed.  Not sure if you have encountered this before.  The InStr rather than the InStrRev worked.
0
 
atljarmanAuthor Commented:
This works like a charm.

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            '.Fields("billsummary").Value = varContent(1)  'Mid([billsummary],InStr(1,[BillSummary],Chr(13))+2,10000)
            .Fields("billsummary").Value = Mid(varContent(1), InStr(1, varContent(1), Chr(13)) + 2, 10000)
            '.Fields("actions").Value = varContent(2)
            .Fields("actions").Value = Mid(varContent(2), InStr(1, varContent(2), Chr(13)) + 2, 10000)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStr(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
NorieCommented:
I was just about to post code to exclude SUMMARY etc, it can actually go in the function.

PS You don't need the 10000 in Mid, it will return to the end of the string no matter the  length.
0
 
atljarmanAuthor Commented:
awesome.  Thank you again.
0
 
atljarmanAuthor Commented:
I'm running into an error with this code. Not sure if you see where the error is:

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = Mid(varContent(1), InStr(1, varContent(1), Chr(13)) + 2, 10000)
            .Fields("actions").Value = Mid(varContent(2), InStr(1, varContent(2), Chr(13)) + 2, 10000)
            .Fields("relatedbills").Value = Mid(varContent(3), InStr(1, varContent(3), Chr(13)) + 2, 10000)
            .Fields("billtitles").Value = Mid(varContent(4), InStr(1, varContent(4), Chr(13)) + 2, 10000)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 4)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    'Extract the Summary

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)
    
    'Extract the All Actions

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStr(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)
    
    'Extract the Title(s)
    
    strContent(3) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(3), "TITLE(S)")

    strContent(3) = Mid(strContent(3), pos)

    pos = InStr(strContent(3), "COSPONSOR(S):")

    strContent(3) = Left(strContent(3), pos - 1)

    'Extract the Related Bills

    strContent(4) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(4), "RELATED BILL DETAILS")

    strContent(4) = Mid(strContent(1), pos)

    pos = InStr(strContent(4), "AMENDMENT(S)")

    strContent(4) = Left(strContent(4), pos - 1)



    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
atljarmanAuthor Commented:
I got it. It was an issue with cosponsor.  Thanks a million.
0
 
atljarmanAuthor Commented:
Here is the final version of the database for anyone trying to retrieve BillInfo from Thomas.gov
getsummaryrequest---updated-FINA.accdb
0
 
atljarmanAuthor Commented:
Excellent, thanks for your help on this.  Grade A work.
0
 
NorieCommented:
Assist?
0
 
atljarmanAuthor Commented:
No offense implied; you did the work here.  I wanted to clearly identify the final code to do the action and gave you all the points with the db you created,  this is an issue with experts exchange. I can sign my answer.
0
 
NorieCommented:
It's not a problem.:)
0
 
atljarmanAuthor Commented:
Thanks again.
0
 
Ettore CicinelliCommented:
I try your file but gives runtime error 91
0

Featured Post

 [eBook] Windows Nano Server

Download this FREE eBook and learn all you need to get started with Windows Nano Server, including deployment options, remote management
and troubleshooting tips and tricks

  • 14
  • 11
Tackle projects and never again get stuck behind a technical roadblock.
Join Now