Link to home
Start Free TrialLog in
Avatar of geselaw
geselaw

asked on

XML web scraping a website with dynamic key

Hi,
I am new to this site and new to XML. I apologies in advance if I break any rule.
I have been scraping this site from excel using IE but recently using IE has been inconsistent and slow. My list is usually around 500 to 1000 so I have to run the macro over night. Recently the macro started to hangup. That is why I decided to explorer scraping with MSXML2 for the first time.

The site needs no authentication but it has hidden input that changes dynamically.

What I have done.. I used GET to pull the site and extracted the dynamic key then tried to use POST to send the input data to the site. I kept on getting server error/run-time error. I have tried using different header request option but I am still not getting the result page.I have also tried to use MSXML2.ServerXMLHTTP. Am I in the right track?

Thanks


Sub test_66()
  Dim oXML_get
  'Dim oXML_post
  Dim sendText As String, s2 As String, n1 As Integer, postUrl As String,      sHTML As String, s1 As String

  ' Instantiate MSXML2
  Set oXML_get = New MSXML2.XMLHTTP

  oXML_get.Open "GET", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  oXML_get.setRequestHeader "Connection", "keep -alive"

  oXML_get.send

  sHTML = oXML_get.responseText
  'Debug.Print sHTML
  Dim hDOC As MSHTML.HTMLDocument
  Set hDOC = New MSHTML.HTMLDocument
          hDOC.body.innerHTML = sHTML
  s1 = Replace(hDOC.getElementsByTagName("input").Item(2).Value, "/", "%2F")
  s2 = Replace(hDOC.getElementsByTagName("input").Item(3).Value, "/", "%2F")

  sendText = "?__VIEWSTATE=" & s1 & "&__EVENTVALIDATION=" & s2 & "&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=043185500&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=%20>>"
  Debug.Print sendText '"__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=" & s1 & "__EVENTVALIDATION=" & s2 &

  oXML_get.Open "POST", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  'oXML_get.setRequestHeader "Connection", "keep -alive"
  oXML_get.send (sendText)

  Dim objIE As Object: Set objIE = CreateObject("InternetExplorer.Application")
  objIE.navigate "about:blank"
  objIE.Visible = True
  objIE.document.Write oXML_get.responseText
End Sub

Open in new window

This is the Runtime Error message that I am getting....
Server Error in '/revenue/RealEstateTax' Application.
<!-- Web.Config Configuration File -->

<configuration>
    <system.web>
        <customErrors mode="Off"/>
    </system.web>
</configuration>

Open in new window

Avatar of ste5an
ste5an
Flag of Germany image

Just a comment: Why not asking them, whether they offer an easier way like an REST API?
There is a header setting that makes the server think you are an actual browser.
Example:
oXML_get.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Open in new window

You ought to try using Selenium for your automation tasks.
Avatar of geselaw
geselaw

ASKER

Thank you aikimark,
I will try the User-Agent setRequestHeader .I will update soon .I tried Selenium but found it to be processes intensive and slower than using IE.
Avatar of geselaw

ASKER

ste5an,
This is .gov ,they don't even respond when there is an outage. I read somewhere that there is a way to find out the api from analyzing the page. Do you know a way to do that?
Avatar of geselaw

ASKER

Aikimark,
I tried your suggestion but I am still getting the same error message. Do you have any more suggestions?
This is a problem I've encountered many times.  Web sites are becoming more and more javascript in their content rendering as well as a heavy reliance on CSS.  Both of these technologies can perform operations that expect to be working inside a browser.

One thing you might want to try is to disable javascript in your internet settings and try to do the search and navigation through your regular browser.  See what happens.  It might be possible that the code will detect the lack of JS support and fall back to a form that reveals an API (URI parameters).
Avatar of geselaw

ASKER

aikimark
I disabled javascript in firefox and tested the page. the page performed a little bit slow but did not reveal any url or api. And it is working .
So, no difference in what you normally see in the address box?
Avatar of geselaw

ASKER

nope. Everything is as it was before.
btw, use Fiddler to see what's being transmitted in your browser (e.g. cookies).
Avatar of geselaw

ASKER

ste5an,
I have never used Fiddler before but I am on it. Give me a little time to figure out how Fiddler work and I will report what I see. If there is any specific thing you want me to do on fiddler, please let me know.
It shows you the entire HTTP traffic. So you can dig into the details, whether you need a cookie for the session managment, cause the site is an ASP page.
Google Chrome also has these diagnostic features if that is your preferred browser.
But Chrome cannot hook into the MSXML2 code.. for comparision.
Chrome cannot hook into the MSXML2
That is true.  I had used the Chrome feature to look at what happened when a query was run and whether there might be a work-around to the page actions and restrictions.
Avatar of geselaw

ASKER

ste5an
I am still trying to figure out how to use fiddler. fiddler is detecting all session from all browsers but it is not detecting the xml code from excel. It is not capturing the http request from vba. I might have to reconfigure some network settings. Please let me know if you have any suggestion.
Caveat: XMLHttp caches internally. E.g. using this snippet

Option Explicit

Public Sub test_66()

  Dim oXML_get
  
  Set oXML_get = New MSXML2.XMLHTTP60
  oXML_get.Open "GET", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  oXML_get.setRequestHeader "Connection", "keep -alive"
  oXML_get.send
  Set oXML_get = Nothing
  
 End Sub

Open in new window


produces this Fiddler output once, as long as you don't restart Excel:

User generated image
4. First request
5.-8. Excel restart
9. New request
Avatar of geselaw

ASKER

ste5an,
Thank you for teaching me how to fish. After a lot of digging and learning, I make it work. But I am not sure if it is the right approach. The two keys(__VIEWSTATE and __EVENTVALIDATION) that I extracted using the GET method are not working when I insert them later with the rest of the inputs. That is when I found out that those two keys are not dynamic at all. So I did this .... and it worked. I still have one last question...
Public Sub test_68()

  Dim oXML_get As MSXML2.XMLHTTP60
  Dim sendText As String, s2 As String, n1 As Integer, postUrl As String, sHTML As String, s1 As String
  Set oXML_get = New MSXML2.XMLHTTP60
      s1 = "%2FwEPDwULLTEyNDQ4MDU4OTkPZBYCZg9kFgICAw9kFgICDQ9kFgYCAQ9kFgICAw9kFgICAQ8QZGQWAGQCBQ8PFgIeBFRleHRlZGQCDQ9kFgYCAQ88KwAKAGQCBQ8UKwACZBAWABYAFgBkAgcPPCsAEQEBEBYAFgAWAGQYAgVBY3RsMDAkQm9keUNvbnRlbnRQbGFjZUhvbGRlciRHZXRUYXhJbmZvQ29udHJvbCRncmRQYXltZW50c0hpc3RvcnkPZ2QFMmN0bDAwJEJvZHlDb250ZW50UGxhY2VIb2xkZXIkR2V0VGF4SW5mb0NvbnRyb2wkZnJtD2dkfeKn02F5%2FUVWA1ziT8O%2BTiqI1khz8dbn3xHnh76pFpg%3D"             '
      s2 = "%2FwEWBQLYwKbjCwLRzsWTBwLlpIbACAKV6q2KDQKIvdHyCQKihnn4%2FOrmyoTE2SX2X%2FqFnyExidQjjGtDkY91sD0P"
      sendText = "__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=" & s1 & "&__EVENTVALIDATION=" & s2 & "&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=043185500&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=+>>&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo%3Dby%20BRT%20Number"
      
      oXML_get.Open "POST", "http://www.phila.gov/revenue/realestatetax/default.aspx?", False
      oXML_get.setRequestHeader "Request", "POST /revenue/RealEstateTax/ HTTP/1.1"
      oXML_get.setRequestHeader "Host", "www.phila.gov"
      oXML_get.setRequestHeader "Referer", "http://www.phila.gov/revenue/RealEstateTax/"
      oXML_get.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
      oXML_get.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
      oXML_get.setRequestHeader "Accept", "text/html; charset=UTF-8"
      oXML_get.send (sendText)
      
      Do Until oXML_get.readyState = 4
            DoEvents
      Loop
      Dim objIE As Object: Set objIE = CreateObject("InternetExplorer.Application")
      If oXML_get.Status = 200 Then
          objIE.navigate "about:blank"
            objIE.Visible = True
            objIE.document.Write oXML_get.responseText
      End If
      Debug.Print oXML_get.responseText
      Set oXML_get = Nothing
 End Sub

Open in new window


My question is ..do you think my s1 and s2 in the code might change if I run this code in another PC or if I clear my cache??

Thanks again
Viewstate and eventvalidation are ASP specific session variables. Just using recorded ones for replay will not work. You normally need to use the actual ones.

On the other hand, as you already us IE, why not doing the entire thing in IE?
What happens if you don't include the __EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=?  To test, just include the __EVENTVALIDATION= and later part of the string.
SOLUTION
Avatar of ste5an
ste5an
Flag of Germany image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of geselaw

ASKER

aikimark,
I tested by excluding the hidden keys, the status was 200 but it did not pull the data. it just gave me the original page.

ste5an
I don't want to use IE because the code became inconsistent. Working for few minutes then hangs. I used IE in this code just for debugging.
Avatar of geselaw

ASKER

Because I solved the issue myself.