?
Solved

Access VBA scrape website for summary information based on hyperlink in table

Posted on 2013-05-13
26
Medium Priority
?
992 Views
Last Modified: 2016-08-31
Hi,

I've attached a representative sample of the Access 2010 database that I am working with.  There are two tables in the database: thomasdata and billinfo.  In the thomasdata table there are three columns but of interst is a unique id (ID) and a hyperlink (BillInfo).  in the billinfo table the columns of interest are the thomasdataID and the billsummary columns.

What I would like to do is develop VBA to read the thomas data and pull the ID and the hyperlink from the thomasdata table.  Then I would like access to follow the link to the webpage and scrape the information between the tags <a name="summary">SUMMARY AS OF:</a> and <a name='major actions'><b>MAJOR ACTIONS:</b></a>.  Once that information is pulled, I would like the unique ID from the thomasdata table stored in the thomasdataID column in billinfo table and the summary of the webpage to be stored in the billsummary column of the billinfo table.

I know that this must have been done somewhere before, so I am hoping one of the experts can help find a solution.
getsummaryrequest.accdb
0
Comment
Question by:atljarman
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 14
  • 11
26 Comments
 
LVL 34

Expert Comment

by:Norie
ID: 39161708
The thomasdataid in table billinfo is a number field but the billuid in table thomasdata is text.

thomasdataid will need to be text if it's to take the value from billuid.

I've changed that in the attached database and added a form that will run code to get the data you want.
getsummaryrequest---updated.accdb
0
 

Author Comment

by:atljarman
ID: 39161792
imnorie,

Thanks for the quick reply.  The value that I would like to store in the thomasbillid coulumn is the unique identifier ID from the thomasdata table (an auto number).  How would you pull the data from the same page (during the same call) between the text: "ALL ACTIONS:" and "TITLE(S):" into a column called "actions" in the billinfo table?

It is probably easier to pull it all at once, then to make a second call then update the record.  Thanks again for your help.
0
 
LVL 34

Expert Comment

by:Norie
ID: 39161843
The code can easily be altered to store a the ID rather than the bill ID.

I'd need to have another look at how to pull ALL ACTIONS and TITLES.

Unfortunately the page isn't well organised so it would probably just use the same way as for SUMMARY, ie searching for certain headings to parse the text.

Not hard just a bit of a pain.:)
0
Optimize your web performance

What's in the eBook?
- Full list of reasons for poor performance
- Ultimate measures to speed things up
- Primary web monitoring types
- KPIs you should be monitoring in order to increase your ROI

 

Author Comment

by:atljarman
ID: 39161896
This fixed the ID issue:

            .Fields("thomasdataid").Value = rstThomas.Fields("ID").Value

Open in new window

0
 
LVL 34

Expert Comment

by:Norie
ID: 39161973
Here's updated code.
Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

     '   Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
      '  Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    'Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "All ACTIONS")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStrRev(strContent(2), "TITLES")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
LVL 34

Expert Comment

by:Norie
ID: 39162027
Ignore that last code, it had some mistakes - didn't read the page close enough.

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStrRev(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 

Author Comment

by:atljarman
ID: 39162055
Imnorie,

I have this error: "Method or data member not found" on this line:

Me.lblStatus.Visible = True

Looks like the form you are using has a label on it that was not on the oringal accdb that you posted.  I'll comment it out and test the code.  When I commented out the code and made a couple changes to the previous script post (id: 39161973) it seemed to work.
0
 
LVL 34

Expert Comment

by:Norie
ID: 39162084
Forgot to mention I added a label, it was for my own sanity.

It showed the code was actually running, as it was executing so fast I thought it wasn't.
0
 

Author Comment

by:atljarman
ID: 39162122
Yep.  I put a blank label with the name lblStatus on the form and it worked.  The challenge that I am having though, is that some of the Action data is not stopping at the end of the actions.  Seems not to be correctly identifying the placement of Titles, and I'm wondering if we need to subtract the position of the Actions possibly.  I thought maybe the lenght was carrying over from the summary, but this code made no difference:

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = varContent(1)
            .Fields("actions").Value = varContent(2)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos, pos2


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos2 = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos2)

    pos2 = InStrRev(strContent(2), "TITLE(S):")

    strContent(2) = Left(strContent(2), pos2 - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window


I'll check to see if this is in summary too.
0
 
LVL 34

Expert Comment

by:Norie
ID: 39162161
No lengths or positions are carried over from the code to find the summary.

I'm not sure what you mean about subtracting the position of ALL ACTIONS.

The code looks for ALL ACTIONS: and then returns the contents from that point on and puts it in a variable.

Then it looks at that variable for TITLE(S) and returns the contents up to the point where TITLE(S) is found.

That should give us the contents between ALL ACTIONS and TITLES.

If TITLES wasn't found the code would error.
0
 
LVL 34

Expert Comment

by:Norie
ID: 39162193
Actually I think I figured out the problem, TITLE(S) is repeated.

Change this,

 
InStrRev(strContents(2), "TITLE(S)") 

Open in new window


to this.

InStr(strContents(2), "TITLE(S)") 

Open in new window


I don't have time to check that but it should work.
0
 

Author Comment

by:atljarman
ID: 39162195
I was think that maybe the length was relative and that when it was looking for titles that the code was starting from the beginning of the HTML file and not from the location of Actions making it search a length much longer.  So for example, if All Actions: was a character 2000 and Title(s): was at character 4000, then I was thinking that it was taking 4000 characters starting at character 2000.  This is probably way wrong, but this is what the very first one was for the Actions (Notice the Title(s): was included):

ALL ACTIONS:
1/22/2013:
Referred to the Committee on the Judiciary, and in addition to the Committees on Energy and Commerce, Education and the Workforce, Financial Services, and Natural Resources, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee concerned.
1/22/2013:
Referred to House Judiciary
2/28/2013:
Referred to the Subcommittee on Crime, Terrorism, Homeland Security, And Investigations.
1/22/2013:
Referred to House Energy and Commerce
1/25/2013:
Referred to the Subcommittee on Health.
1/22/2013:
Referred to House Education and the Workforce
4/23/2013:
Referred to the Subcommittee on Early Childhood, Elementary, and Secondary Education.
4/23/2013:
Referred to the Subcommittee on Higher Education and Workforce Training.
1/22/2013:
Referred to House Financial Services
1/22/2013:
Referred to House Natural Resources
1/29/2013:
Referred to the Subcommittee Indian and Alaska Native Affairs.


TITLE(S):  (italics indicate a title for a portion of a bill)
POPULAR TITLE(S):
VAWA bill (identified by CRS)
SHORT
0
 
LVL 34

Assisted Solution

by:Norie
Norie earned 2000 total points
ID: 39162235
See my last post and attached file.

PS Just remembered, the billinfo table has some old data in it.
getsummaryrequest---updated-V2.accdb
0
 

Author Comment

by:atljarman
ID: 39162258
Any way to easliy remove the first line from each of the two columns before inserting them?  For example, the "ALL ACTIONS: " and "" are in each cell and duplicative.  It appears that there would be a chr(10) or chr(13) on the first row that one could break off at after the record is processed.  Not sure if you have encountered this before.  The InStr rather than the InStrRev worked.
0
 

Accepted Solution

by:
atljarman earned 0 total points
ID: 39162413
This works like a charm.

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            '.Fields("billsummary").Value = varContent(1)  'Mid([billsummary],InStr(1,[BillSummary],Chr(13))+2,10000)
            .Fields("billsummary").Value = Mid(varContent(1), InStr(1, varContent(1), Chr(13)) + 2, 10000)
            '.Fields("actions").Value = varContent(2)
            .Fields("actions").Value = Mid(varContent(2), InStr(1, varContent(2), Chr(13)) + 2, 10000)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 2)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStr(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)

    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 
LVL 34

Expert Comment

by:Norie
ID: 39162772
I was just about to post code to exclude SUMMARY etc, it can actually go in the function.

PS You don't need the 10000 in Mid, it will return to the end of the string no matter the  length.
0
 

Author Comment

by:atljarman
ID: 39162809
awesome.  Thank you again.
0
 

Author Comment

by:atljarman
ID: 39162863
I'm running into an error with this code. Not sure if you see where the error is:

Option Compare Database
Option Explicit

Private Sub Command0_Click()
Dim rstThomas As DAO.Recordset
Dim rstBillInfo As DAO.Recordset
Dim strBillURL As String
Dim varContent

    Set rstThomas = CurrentDb.OpenRecordset("thomasdata")
    Set rstBillInfo = CurrentDb.OpenRecordset("billinfo")

    Me.lblStatus.Visible = True
    rstThomas.MoveFirst

    While Not (rstThomas.EOF)

        strBillURL = rstThomas("billinfo").Value

        Me.lblStatus.Caption = "Status: Retreiving data for Bill ID " & rstThomas("billuid").Value & "..."
        Me.Repaint

        With rstBillInfo
            .AddNew
            .Fields("thomasdataid").Value = rstThomas.Fields("id").Value
            varContent = GetBillData(strBillURL)
            .Fields("billsummary").Value = Mid(varContent(1), InStr(1, varContent(1), Chr(13)) + 2, 10000)
            .Fields("actions").Value = Mid(varContent(2), InStr(1, varContent(2), Chr(13)) + 2, 10000)
            .Fields("relatedbills").Value = Mid(varContent(3), InStr(1, varContent(3), Chr(13)) + 2, 10000)
            .Fields("billtitles").Value = Mid(varContent(4), InStr(1, varContent(4), Chr(13)) + 2, 10000)
            .Update
        End With

        rstThomas.MoveNext
    Wend

    rstThomas.Close
    rstBillInfo.Close

    Set rstThomas = Nothing
    Set rstBillInfo = Nothing

    Me.lblStatus.Caption = "Status: Completed"

End Sub

Function GetBillData(strURL As String)
Dim xml As Object
Dim doc As Object
Dim strContent(1 To 4)
Dim varData
Dim pos


    Set xml = CreateObject("MSXML2.XMLHTTP")

    xml.Open "GET", strURL, False

    xml.send

    varData = xml.responseText

    Set doc = CreateObject("htmlfile")

    doc.body.innerhtml = varData

    'Extract the Summary

    strContent(1) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(1), "SUMMARY AS OF")

    strContent(1) = Mid(strContent(1), pos)

    pos = InStrRev(strContent(1), "MAJOR ACTIONS")

    strContent(1) = Left(strContent(1), pos - 1)
    
    'Extract the All Actions

    strContent(2) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(2), "ALL ACTIONS:")

    strContent(2) = Mid(strContent(2), pos)

    pos = InStr(strContent(2), "TITLE(S)")

    strContent(2) = Left(strContent(2), pos - 1)
    
    'Extract the Title(s)
    
    strContent(3) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(3), "TITLE(S)")

    strContent(3) = Mid(strContent(3), pos)

    pos = InStr(strContent(3), "COSPONSOR(S):")

    strContent(3) = Left(strContent(3), pos - 1)

    'Extract the Related Bills

    strContent(4) = doc.getelementbyid("content").innerText

    pos = InStr(strContent(4), "RELATED BILL DETAILS")

    strContent(4) = Mid(strContent(1), pos)

    pos = InStr(strContent(4), "AMENDMENT(S)")

    strContent(4) = Left(strContent(4), pos - 1)



    Set xml = Nothing

    GetBillData = strContent

End Function

Open in new window

0
 

Author Comment

by:atljarman
ID: 39162896
I got it. It was an issue with cosponsor.  Thanks a million.
0
 

Author Comment

by:atljarman
ID: 39162946
Here is the final version of the database for anyone trying to retrieve BillInfo from Thomas.gov
getsummaryrequest---updated-FINA.accdb
0
 

Author Closing Comment

by:atljarman
ID: 39176901
Excellent, thanks for your help on this.  Grade A work.
0
 
LVL 34

Expert Comment

by:Norie
ID: 39177633
Assist?
0
 

Author Comment

by:atljarman
ID: 39177645
No offense implied; you did the work here.  I wanted to clearly identify the final code to do the action and gave you all the points with the db you created,  this is an issue with experts exchange. I can sign my answer.
0
 
LVL 34

Expert Comment

by:Norie
ID: 39177711
It's not a problem.:)
0
 

Author Comment

by:atljarman
ID: 39177950
Thanks again.
0
 

Expert Comment

by:Ettore Cicinelli
ID: 41778131
I try your file but gives runtime error 91
0

Featured Post

[Webinar] How Hackers Steal Your Credentials

Do You Know How Hackers Steal Your Credentials? Join us and Skyport Systems to learn how hackers steal your credentials and why Active Directory must be secure to stop them.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This post looks at MongoDB and MySQL, and covers high-level MongoDB strengths, weaknesses, features, and uses from the perspective of an SQL user.
Code that checks the QuickBooks schema table for non-updateable fields and then disables those controls on a form so users don't try to update them.
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …
Have you created a query with information for a calendar? ... and then, abra-cadabra, the calendar is done?! I am going to show you how to make that happen. Visualize your data!  ... really see it To use the code to create a calendar from a q…

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question