XML web scraping a website with dynamic key

Hi,
I am new to this site and new to XML. I apologies in advance if I break any rule.
I have been scraping this site from excel using IE but recently using IE has been inconsistent and slow. My list is usually around 500 to 1000 so I have to run the macro over night. Recently the macro started to hangup. That is why I decided to explorer scraping with MSXML2 for the first time.

The site needs no authentication but it has hidden input that changes dynamically.

What I have done.. I used GET to pull the site and extracted the dynamic key then tried to use POST to send the input data to the site. I kept on getting server error/run-time error. I have tried using different header request option but I am still not getting the result page.I have also tried to use MSXML2.ServerXMLHTTP. Am I in the right track?

Thanks


Sub test_66()
  Dim oXML_get
  'Dim oXML_post
  Dim sendText As String, s2 As String, n1 As Integer, postUrl As String,      sHTML As String, s1 As String

  ' Instantiate MSXML2
  Set oXML_get = New MSXML2.XMLHTTP

  oXML_get.Open "GET", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  oXML_get.setRequestHeader "Connection", "keep -alive"

  oXML_get.send

  sHTML = oXML_get.responseText
  'Debug.Print sHTML
  Dim hDOC As MSHTML.HTMLDocument
  Set hDOC = New MSHTML.HTMLDocument
          hDOC.body.innerHTML = sHTML
  s1 = Replace(hDOC.getElementsByTagName("input").Item(2).Value, "/", "%2F")
  s2 = Replace(hDOC.getElementsByTagName("input").Item(3).Value, "/", "%2F")

  sendText = "?__VIEWSTATE=" & s1 & "&__EVENTVALIDATION=" & s2 & "&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=043185500&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=%20>>"
  Debug.Print sendText '"__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=" & s1 & "__EVENTVALIDATION=" & s2 &

  oXML_get.Open "POST", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  'oXML_get.setRequestHeader "Connection", "keep -alive"
  oXML_get.send (sendText)

  Dim objIE As Object: Set objIE = CreateObject("InternetExplorer.Application")
  objIE.navigate "about:blank"
  objIE.Visible = True
  objIE.document.Write oXML_get.responseText
End Sub

Open in new window

This is the Runtime Error message that I am getting....
Server Error in '/revenue/RealEstateTax' Application.
<!-- Web.Config Configuration File -->

<configuration>
    <system.web>
        <customErrors mode="Off"/>
    </system.web>
</configuration>

Open in new window

geselawAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

ste5anSenior DeveloperCommented:
Just a comment: Why not asking them, whether they offer an easier way like an REST API?
aikimarkCommented:
There is a header setting that makes the server think you are an actual browser.
Example:
oXML_get.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Open in new window

aikimarkCommented:
You ought to try using Selenium for your automation tasks.
Price Your IT Services for Profit

Managed service contracts are great - when they're making you money. Yes, you’re getting paid monthly, but is it actually profitable? Learn to calculate your hourly overhead burden so you can master your IT services pricing strategy.

geselawAuthor Commented:
Thank you aikimark,
I will try the User-Agent setRequestHeader .I will update soon .I tried Selenium but found it to be processes intensive and slower than using IE.
geselawAuthor Commented:
ste5an,
This is .gov ,they don't even respond when there is an outage. I read somewhere that there is a way to find out the api from analyzing the page. Do you know a way to do that?
geselawAuthor Commented:
Aikimark,
I tried your suggestion but I am still getting the same error message. Do you have any more suggestions?
aikimarkCommented:
This is a problem I've encountered many times.  Web sites are becoming more and more javascript in their content rendering as well as a heavy reliance on CSS.  Both of these technologies can perform operations that expect to be working inside a browser.

One thing you might want to try is to disable javascript in your internet settings and try to do the search and navigation through your regular browser.  See what happens.  It might be possible that the code will detect the lack of JS support and fall back to a form that reveals an API (URI parameters).
geselawAuthor Commented:
aikimark
I disabled javascript in firefox and tested the page. the page performed a little bit slow but did not reveal any url or api. And it is working .
aikimarkCommented:
So, no difference in what you normally see in the address box?
geselawAuthor Commented:
nope. Everything is as it was before.
ste5anSenior DeveloperCommented:
btw, use Fiddler to see what's being transmitted in your browser (e.g. cookies).
geselawAuthor Commented:
ste5an,
I have never used Fiddler before but I am on it. Give me a little time to figure out how Fiddler work and I will report what I see. If there is any specific thing you want me to do on fiddler, please let me know.
ste5anSenior DeveloperCommented:
It shows you the entire HTTP traffic. So you can dig into the details, whether you need a cookie for the session managment, cause the site is an ASP page.
aikimarkCommented:
Google Chrome also has these diagnostic features if that is your preferred browser.
ste5anSenior DeveloperCommented:
But Chrome cannot hook into the MSXML2 code.. for comparision.
aikimarkCommented:
Chrome cannot hook into the MSXML2
That is true.  I had used the Chrome feature to look at what happened when a query was run and whether there might be a work-around to the page actions and restrictions.
geselawAuthor Commented:
ste5an
I am still trying to figure out how to use fiddler. fiddler is detecting all session from all browsers but it is not detecting the xml code from excel. It is not capturing the http request from vba. I might have to reconfigure some network settings. Please let me know if you have any suggestion.
ste5anSenior DeveloperCommented:
Caveat: XMLHttp caches internally. E.g. using this snippet

Option Explicit

Public Sub test_66()

  Dim oXML_get
  
  Set oXML_get = New MSXML2.XMLHTTP60
  oXML_get.Open "GET", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  oXML_get.setRequestHeader "Connection", "keep -alive"
  oXML_get.send
  Set oXML_get = Nothing
  
 End Sub

Open in new window


produces this Fiddler output once, as long as you don't restart Excel:

Capture.PNG
4. First request
5.-8. Excel restart
9. New request
geselawAuthor Commented:
ste5an,
Thank you for teaching me how to fish. After a lot of digging and learning, I make it work. But I am not sure if it is the right approach. The two keys(__VIEWSTATE and __EVENTVALIDATION) that I extracted using the GET method are not working when I insert them later with the rest of the inputs. That is when I found out that those two keys are not dynamic at all. So I did this .... and it worked. I still have one last question...
Public Sub test_68()

  Dim oXML_get As MSXML2.XMLHTTP60
  Dim sendText As String, s2 As String, n1 As Integer, postUrl As String, sHTML As String, s1 As String
  Set oXML_get = New MSXML2.XMLHTTP60
      s1 = "%2FwEPDwULLTEyNDQ4MDU4OTkPZBYCZg9kFgICAw9kFgICDQ9kFgYCAQ9kFgICAw9kFgICAQ8QZGQWAGQCBQ8PFgIeBFRleHRlZGQCDQ9kFgYCAQ88KwAKAGQCBQ8UKwACZBAWABYAFgBkAgcPPCsAEQEBEBYAFgAWAGQYAgVBY3RsMDAkQm9keUNvbnRlbnRQbGFjZUhvbGRlciRHZXRUYXhJbmZvQ29udHJvbCRncmRQYXltZW50c0hpc3RvcnkPZ2QFMmN0bDAwJEJvZHlDb250ZW50UGxhY2VIb2xkZXIkR2V0VGF4SW5mb0NvbnRyb2wkZnJtD2dkfeKn02F5%2FUVWA1ziT8O%2BTiqI1khz8dbn3xHnh76pFpg%3D"             '
      s2 = "%2FwEWBQLYwKbjCwLRzsWTBwLlpIbACAKV6q2KDQKIvdHyCQKihnn4%2FOrmyoTE2SX2X%2FqFnyExidQjjGtDkY91sD0P"
      sendText = "__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=" & s1 & "&__EVENTVALIDATION=" & s2 & "&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=043185500&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=+>>&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo%3Dby%20BRT%20Number"
      
      oXML_get.Open "POST", "http://www.phila.gov/revenue/realestatetax/default.aspx?", False
      oXML_get.setRequestHeader "Request", "POST /revenue/RealEstateTax/ HTTP/1.1"
      oXML_get.setRequestHeader "Host", "www.phila.gov"
      oXML_get.setRequestHeader "Referer", "http://www.phila.gov/revenue/RealEstateTax/"
      oXML_get.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
      oXML_get.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
      oXML_get.setRequestHeader "Accept", "text/html; charset=UTF-8"
      oXML_get.send (sendText)
      
      Do Until oXML_get.readyState = 4
            DoEvents
      Loop
      Dim objIE As Object: Set objIE = CreateObject("InternetExplorer.Application")
      If oXML_get.Status = 200 Then
          objIE.navigate "about:blank"
            objIE.Visible = True
            objIE.document.Write oXML_get.responseText
      End If
      Debug.Print oXML_get.responseText
      Set oXML_get = Nothing
 End Sub

Open in new window


My question is ..do you think my s1 and s2 in the code might change if I run this code in another PC or if I clear my cache??

Thanks again
ste5anSenior DeveloperCommented:
Viewstate and eventvalidation are ASP specific session variables. Just using recorded ones for replay will not work. You normally need to use the actual ones.

On the other hand, as you already us IE, why not doing the entire thing in IE?
aikimarkCommented:
What happens if you don't include the __EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=?  To test, just include the __EVENTVALIDATION= and later part of the string.
ste5anSenior DeveloperCommented:
This depends on the ASP application. Test it. But as I already said, these variables are used to persist the application's state. So it may not work without them.
geselawAuthor Commented:
ste5an,
You are absolutely right. I used another machine and the code didn't work but i fixed it. the problem in the first code was a url encoding. this is what i did..
I changed   s1=Replace(hDOC.getElementsByTagName("input").Item(2).Value, "/", "%2F") to
s1 = WorksheetFunction.EncodeURL(hDOC.getElementsByTagName("input").Item(2).Value)
I also changed s2=Replace(hDOC.getElementsByTagName("input").Item(3).Value, "/", "%2F") to
s2 = WorksheetFunction.EncodeURL(hDOC.getElementsByTagName("input").Item(3).Value)

Now everything is working . WorksheetFunction.EncodeURL is excel 2013 and above function.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
geselawAuthor Commented:
aikimark,
I tested by excluding the hidden keys, the status was 200 but it did not pull the data. it just gave me the original page.

ste5an
I don't want to use IE because the code became inconsistent. Working for few minutes then hangs. I used IE in this code just for debugging.
geselawAuthor Commented:
Because I solved the issue myself.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
XML

From novice to tech pro — start learning today.