Unable to scrape website anymore

Hello Experts!
I used to be able to scrape information from this website using PHP cURL, but it has stopped working.

https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141

It seems they put a page in front of this page, and now you have to click the "agree" button on this new page before you are allowed to reach the page I need to scrape.

In other words, the page I need to access is only reachable from the new "agree" page.
Example:
If you go here:
https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141
You will get an error.
But if you go here first and click "agree":
https://officialrecords.broward.org/oncorev2/
and then go to:
https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141

You get the data I need.

Is there any way to programmatically accomplish this flow (click the "agree" button) so I can get to the final page using PHP?

Thanks!
Asked by: MHPC

Ray Paseur commented:
Executive summary: They do not want you to be able to scrape the records from the web site, and have changed the site behavior to keep you from doing that.

I can tell you what is happening.  The "agree" page sets a cookie on your browser.  You can probably fire this programmatically and get the cookie, but (of course) your PHP script will have to act like a well-behaved web browser, reading and filling in forms, accepting and returning cookies and following redirects.
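
A minimal sketch of the "well-behaved browser" settings described above, assuming PHP's cURL extension is installed. The function name `newBrowserLikeHandle`, the cookie file path, and the user-agent string are illustrative assumptions, not values the site is known to require.

```php
<?php
// Hypothetical helper: returns a cURL handle that stores and returns
// cookies and follows redirects, roughly as a browser would.
function newBrowserLikeHandle(string $url, string $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,           // but not indefinitely
        CURLOPT_COOKIEFILE     => $cookieFile, // enable the cookie engine, load saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is released
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-client)', // assumed value
        CURLOPT_TIMEOUT        => 30,
    ]);
    return $ch;
}
```

Reusing the same handle (or the same cookie file) for every request in the sequence is what lets the cookie set by the "agree" page travel with later requests.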

Stripped of all the fluff, here is the form that you must submit to get the cookie.
<form name="Form1" method="post" action="https://officialrecords.broward.org/oncorev2/default.aspx" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="/wEPDwUKLTM1MzE5NzMwNQ9kFgJmD2QWAgIBDxYCHgRUZXh0BaYIPGNlbnRlcj48QSBIUkVGPSAgImh0dHA6Ly93d3cuYnJvd2FyZC5vcmcvUmVjb3Jkc1RheGVzVHJlYXN1cnkvUmVjb3Jkcy9QYWdlcy9SZW1vdmVGcm9tUHVibGljUmVjb3JkLmFzcHgiID5DTElDSyBIRVJFIFRPIFJFTU9WRSBPUiBCTE9DSyBJTkZPUk1BVElPTiBGUk9NIFBVQkxJQyBSRUNPUkQ8L0E+ICA8L2NlbnRlcj4KPHA+VGhlIEJyb3dhcmQgQ291bnR5IFJlY29yZHMgRGl2aXNpb24gcHJlc2VudHMgdGhlIGluZm9ybWF0aW9uIG9uIHRoaXMgd2ViIHNpdGUgYXMgYSBzZXJ2aWNlIHRvIHRoZSBwdWJsaWMuIFdlIGhhdmUgdHJpZWQgdG8gZW5zdXJlIHRoYXQgdGhlIGluZm9ybWF0aW9uIGNvbnRhaW5lZCBpbiB0aGlzIGVsZWN0cm9uaWMgc2VhcmNoIHN5c3RlbSBpcyBhY2N1cmF0ZS4gQnJvd2FyZCBDb3VudHkgUmVjb3JkcyBEaXZpc2lvbiBtYWtlcyBubyB3YXJyYW50eSBvciBndWFyYW50ZWUgY29uY2VybmluZyB0aGUgYWNjdXJhY3kgb3IgcmVsaWFiaWxpdHkgb2YgdGhlIGNvbnRlbnQgYXQgdGhpcyBzaXRlIG9yIGF0IG90aGVyIHNpdGVzIHRvIHdoaWNoIHdlIGxpbmsuIEFzc2Vzc2luZyBhY2N1cmFjeSBhbmQgcmVsaWFiaWxpdHkgb2YgaW5mb3JtYXRpb24gaXMgdGhlIHJlc3BvbnNpYmlsaXR5IG9mIHRoZSB1c2VyLiBUaGUgdXNlciBpcyBhZHZpc2VkIHRvIHNlYXJjaCBvbiBhbGwgcG9zc2libGUgc3BlbGxpbmcgdmFyaWF0aW9ucyBvZiBwcm9wZXIgbmFtZXMsIGluIG9yZGVyIHRvIG1heGltaXplIHNlYXJjaCByZXN1bHRzLiA8L3A+IAo8cD5UaGUgQnJvd2FyZCBDb3VudHkgUmVjb3JkcyBEaXZpc2lvbiBzaGFsbCBub3QgYmUgbGlhYmxlIGZvciBlcnJvcnMgY29udGFpbmVkIGhlcmVpbiBvciBmb3IgYW55IGRhbWFnZXMgaW4gY29ubmVjdGlvbiB3aXRoIHRoZSB1c2Ugb2YgdGhlIGluZm9ybWF0aW9uIGNvbnRhaW5lZCBoZXJlaW4uPC9wPiAKPHA+SWYgeW91IGNob29zZSBub3QgdG8gYWNjZXB0IHRoZSBjb25kaXRpb25zIHN0YXRlZCBhYm92ZSBwbGVhc2UgY2xpY2sgb24gSG9tZSB0byBleGl0IHRoaXMgc2VhcmNoIGFwcGxpY2F0aW9uLiA8L3A+ZGRnTPCHBcrvj/YhNNJuhjTyhYdqPA==" />
<input type="hidden" name="__VIEWSTATEGENERATOR" value="597CA4B2" />
<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgLgy+mvDQLeyeihA9iWQjUHxdXKDffFwPcbvM9MacmN" />
<input type="submit" name="cmdAccept" value="I accept the conditions above" id="cmdAccept" />
</form>


The three hidden inputs are generated on a per-request basis and are checked when the form is posted back, so your first cURL request must be a GET to the oncorev2 page. Parse the returned document, isolate these input fields, and send their values back in the POST request.
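
The GET, parse, POST sequence might be sketched as follows. This assumes the `dom` and `curl` extensions are available; `extractFormFields` and `acceptAndFetch` are hypothetical names, and the sketch also sends the submit button's name and value, on the assumption that the server expects them. It has not been verified against the live site.

```php
<?php
// Collect name => value pairs for the hidden and submit inputs in $html.
function extractFormFields(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $type = strtolower($input->getAttribute('type'));
        $name = $input->getAttribute('name');
        if ($name !== '' && in_array($type, ['hidden', 'submit'], true)) {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical flow: GET the agreement page, POST its form back,
// then GET the target page on the same handle so cookies are reused.
function acceptAndFetch(string $agreeUrl, string $postUrl, string $targetUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ]);

    // 1. GET the page that carries the form.
    curl_setopt($ch, CURLOPT_URL, $agreeUrl);
    $html = curl_exec($ch);
    if ($html === false || $html === '') {
        return false;
    }

    // 2. POST the freshly issued field values back.
    curl_setopt_array($ch, [
        CURLOPT_URL        => $postUrl,
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query(extractFormFields($html)),
    ]);
    if (curl_exec($ch) === false) {
        return false;
    }

    // 3. Switch back to GET and request the page that was blocked before.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $targetUrl,
        CURLOPT_HTTPGET => true,
    ]);
    return curl_exec($ch); // page body, or false on failure
}
```

With the URLs from this thread, the call would be `acceptAndFetch('https://officialrecords.broward.org/oncorev2/', 'https://officialrecords.broward.org/oncorev2/default.aspx', 'https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141', '/tmp/cookies.txt')`, where the cookie file path is an arbitrary choice.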

That may not be the end of the troubles, however.  It looks like part of the information is in an iframe, and part of it is in a PDF.  How to get both of these?  For me that would be a research project.

Since this appears to be a matter of public records, you might want to file a freedom of information request, specifying that the remedy sought is an API that will accept automated requests and return the records.  It might already exist, and the government has not published it yet.
Mark Brady (Principal Data Engineer) commented:
Interesting you say you get an error when you go to
https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141

I was able to click on it and I got the correct page results (No error). I am certain I have never been there before on this computer so I would not have the required cookie. Weird!

That said, there is a very good tool for web scraping that will show you every request and response, including all cookies and data sent and received. Download Fiddler4 and play around with it. I use it regularly to scrape sites. It will show you more than the browser developer tools can, and it lets you replay a request and edit what you send.

For example, if you have Fiddler open and attached to your browser when you visit a page, you will see the raw request, headers, cookies, and so on, as well as the response. You can then open a compose window and drag the request into it. In the header section, I remove headers one at a time and resubmit the request, and I keep doing this until the response contains an error. This way you can cut down what you need to send, which keeps your code clearer.
MHPC (author) commented:
Thanks!
Ray Paseur commented:
Cautionary note here, found in the Terms of Use.  I think you may want to write to Broward County and explain your plans to use software to scrape their web site.  If they want you to do that, they will give you permission.

1.11 Intellectual Property Rights
Broward County owns all rights, title and interest in the trademarks, logos, and other protected intellectual property of Broward County.  You may not use the Broward County trademarks, logos, or other protected intellectual property without Broward County's prior written permission.  You acknowledge and agree that nothing on the Site grants, expressly or implicitly, by estoppel or otherwise, any right or license to use any of the Broward County intellectual property rights or may be construed to mean that Broward County has authority to grant any right or license on behalf of any third-party trademark owner.  You may not reproduce, distribute, display, transmit, modify, perform, adapt, generate derivative works of, or otherwise use the content of the Site without the prior written permission of Broward County unless your use qualifies as "Fair Use" under applicable law.
MHPC (author) commented:
OK, thanks!