MHPC asked:

Unable to scrape website anymore

Hello Experts!
I used to be able to scrape information from this website using PHP cURL, but it no longer works.

https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141

It seems they put a page in front of this one, and now you have to click the "agree" button on the new page before you are allowed to reach the page I need to scrape.

In other words, the page I need to access is only reachable from the new "agree" page.
Example:
If you go here:
https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141
You will get an error.
But if you go here first and click "agree":
https://officialrecords.broward.org/oncorev2/
and then go again to:
https://officialrecords.broward.org/oncorev2/showdetails.aspx?cfn=113063141

You get the data I need.

Is there any way to programmatically accomplish this flow (clicking the "agree" button) so I can get to the final page using PHP?

Thanks!
Ray Paseur

Executive summary: They do not want you to be able to scrape the records from the web site, and have changed the site behavior to keep you from doing that.

I can tell you what is happening.  The "agree" page sets a cookie on your browser.  You can probably fire this programmatically and get the cookie, but (of course) your PHP script will have to act like a well-behaved web browser, reading and filling in forms, accepting and returning cookies and following redirects.
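
As a rough illustration of "acting like a browser", here is a minimal sketch of a cURL handle that stores and returns cookies and follows redirects. The function name, the cookie file path, and the User-Agent string are assumptions made for this example; nothing here is specific to the Broward site.

```php
<?php
// Hypothetical helper: build a cURL handle that keeps cookies and follows redirects.
function make_browser_like_handle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the response body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,           // but not endlessly
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from this file; also turns on the cookie engine
        CURLOPT_COOKIEJAR      => $cookieFile, // write received cookies to this file when the handle is closed
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-script)', // assumed value
        CURLOPT_TIMEOUT        => 30,
    ]);
    return $ch;
}
```

Reusing the same handle for every request in the sequence keeps the cookies in memory between requests, so the cookie set by the "agree" page is sent back automatically on the next request.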

Stripped of all the fluff, here is the form that you must submit to get the cookie.
<form name="Form1" method="post" action="https://officialrecords.broward.org/oncorev2/default.aspx" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="/wEPDwUKLTM1MzE5NzMwNQ9kFgJmD2QWAgIBDxYCHgRUZXh0BaYIPGNlbnRlcj48QSBIUkVGPSAgImh0dHA6Ly93d3cuYnJvd2FyZC5vcmcvUmVjb3Jkc1RheGVzVHJlYXN1cnkvUmVjb3Jkcy9QYWdlcy9SZW1vdmVGcm9tUHVibGljUmVjb3JkLmFzcHgiID5DTElDSyBIRVJFIFRPIFJFTU9WRSBPUiBCTE9DSyBJTkZPUk1BVElPTiBGUk9NIFBVQkxJQyBSRUNPUkQ8L0E+ICA8L2NlbnRlcj4KPHA+VGhlIEJyb3dhcmQgQ291bnR5IFJlY29yZHMgRGl2aXNpb24gcHJlc2VudHMgdGhlIGluZm9ybWF0aW9uIG9uIHRoaXMgd2ViIHNpdGUgYXMgYSBzZXJ2aWNlIHRvIHRoZSBwdWJsaWMuIFdlIGhhdmUgdHJpZWQgdG8gZW5zdXJlIHRoYXQgdGhlIGluZm9ybWF0aW9uIGNvbnRhaW5lZCBpbiB0aGlzIGVsZWN0cm9uaWMgc2VhcmNoIHN5c3RlbSBpcyBhY2N1cmF0ZS4gQnJvd2FyZCBDb3VudHkgUmVjb3JkcyBEaXZpc2lvbiBtYWtlcyBubyB3YXJyYW50eSBvciBndWFyYW50ZWUgY29uY2VybmluZyB0aGUgYWNjdXJhY3kgb3IgcmVsaWFiaWxpdHkgb2YgdGhlIGNvbnRlbnQgYXQgdGhpcyBzaXRlIG9yIGF0IG90aGVyIHNpdGVzIHRvIHdoaWNoIHdlIGxpbmsuIEFzc2Vzc2luZyBhY2N1cmFjeSBhbmQgcmVsaWFiaWxpdHkgb2YgaW5mb3JtYXRpb24gaXMgdGhlIHJlc3BvbnNpYmlsaXR5IG9mIHRoZSB1c2VyLiBUaGUgdXNlciBpcyBhZHZpc2VkIHRvIHNlYXJjaCBvbiBhbGwgcG9zc2libGUgc3BlbGxpbmcgdmFyaWF0aW9ucyBvZiBwcm9wZXIgbmFtZXMsIGluIG9yZGVyIHRvIG1heGltaXplIHNlYXJjaCByZXN1bHRzLiA8L3A+IAo8cD5UaGUgQnJvd2FyZCBDb3VudHkgUmVjb3JkcyBEaXZpc2lvbiBzaGFsbCBub3QgYmUgbGlhYmxlIGZvciBlcnJvcnMgY29udGFpbmVkIGhlcmVpbiBvciBmb3IgYW55IGRhbWFnZXMgaW4gY29ubmVjdGlvbiB3aXRoIHRoZSB1c2Ugb2YgdGhlIGluZm9ybWF0aW9uIGNvbnRhaW5lZCBoZXJlaW4uPC9wPiAKPHA+SWYgeW91IGNob29zZSBub3QgdG8gYWNjZXB0IHRoZSBjb25kaXRpb25zIHN0YXRlZCBhYm92ZSBwbGVhc2UgY2xpY2sgb24gSG9tZSB0byBleGl0IHRoaXMgc2VhcmNoIGFwcGxpY2F0aW9uLiA8L3A+ZGRnTPCHBcrvj/YhNNJuhjTyhYdqPA==" />
<input type="hidden" name="__VIEWSTATEGENERATOR" value="597CA4B2" />
<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgLgy+mvDQLeyeihA9iWQjUHxdXKDffFwPcbvM9MacmN" />
<input type="submit" name="cmdAccept" value="I accept the conditions above" id="cmdAccept" />
</form>


The three hidden inputs are generated on a per-request basis and are validated when the server processes the POST, so your first cURL request must be a GET to the oncorev2 page. Parse the returned document, isolate these input fields, and send them back in the POST request.
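
The GET, parse, POST sequence described above could be sketched roughly as follows. This is untested against the live site and rests on several assumptions: the function names are made up for this example, every named `<input>` on the page (including the `cmdAccept` submit button) is sent back, and the server accepts an ordinary URL-encoded POST.

```php
<?php
// Collect the name/value pairs of every named <input> element in an HTML document.
function extract_form_fields(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical driver: GET the agree page, POST the form back, then GET the details page.
// Returns the details page body, or false if any request fails.
function accept_and_fetch(string $agreeUrl, string $postUrl, string $detailsUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // enables cookie handling on this handle
        CURLOPT_COOKIEJAR      => $cookieFile,
    ]);

    // Step 1: GET the agree page to obtain fresh hidden field values.
    curl_setopt($ch, CURLOPT_URL, $agreeUrl);
    $html = curl_exec($ch);
    if ($html === false) {
        curl_close($ch);
        return false;
    }

    // Step 2: POST the fields back, as a browser would when the button is clicked.
    curl_setopt($ch, CURLOPT_URL, $postUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(extract_form_fields($html)));
    if (curl_exec($ch) === false) {
        curl_close($ch);
        return false;
    }

    // Step 3: switch back to GET and request the page that holds the data.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $detailsUrl);
    $details = curl_exec($ch);
    curl_close($ch);
    return $details;
}
```

With the URLs from this thread, `$agreeUrl` would be https://officialrecords.broward.org/oncorev2/, `$postUrl` the form's action (https://officialrecords.broward.org/oncorev2/default.aspx), and `$detailsUrl` the showdetails.aspx address from the question. Whether the server then returns the record page is something only a live test can confirm.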

That may not be the end of the troubles, however.  It looks like part of the information is in an iframe, and part of it is in a PDF.  How to get both of these?  For me that would be a research project.

Since this appears to be a matter of public records, you might want to file a freedom of information request, specifying that the remedy sought is an API that will accept automated requests and return the records.  It might already exist, and the government has not published it yet.
ASKER CERTIFIED SOLUTION
Mark Brady
This solution is only available to members.
MHPC (ASKER)

Thanks!
Cautionary note here, found in the Terms of Use.  I think you may want to write to Broward County and explain your plans to use software to scrape their web site.  If they want you to do that, they will give you permission.

1.11 Intellectual Property Rights
Broward County owns all rights, title and interest in the trademarks, logos, and other protected intellectual property of Broward County.  You may not use the Broward County trademarks, logos, or other protected intellectual property without Broward County's prior written permission.  You acknowledge and agree that nothing on the Site grants, expressly or implicitly, by estoppel or otherwise, any right or license to use any of the Broward County intellectual property rights or may be construed to mean that Broward County has authority to grant any right or license on behalf of any third-party trademark owner.  You may not reproduce, distribute, display, transmit, modify, perform, adapt, generate derivative works of, or otherwise use the content of the Site without the prior written permission of Broward County unless your use qualifies as "Fair Use" under applicable law.
MHPC (ASKER)

OK, thanks!