ChilliSauce asked:

PHP - How to scrape a remote webpage that requires cookies and javascript

Hi,

I am trying to scrape a remote webpage that requires cookies and JavaScript to be enabled to view the page.

Can anyone give me some PHP code which would scrape the webpage as if it were a normal browser with JavaScript and cookies enabled?

Thanks
Tags: PHP, Scripting Languages, JavaScript

leakim971

what is your final goal?
ChilliSauce (ASKER)

To be able to scrape the web page
Ray Paseur

I think the answer is going to be "no" because while cookies are easy enough to deal with, writing a PHP script that acts like a JavaScript engine is a rather large task, one that would probably be worthy of a paid engagement over several months.

What is the exact URL you want to scrape?  What is the information you want to retrieve?
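
To illustrate the "cookies are easy enough" half of that answer, here is a minimal sketch, assuming the target site only needs cookies replayed between plain HTTP requests. It does not execute any JavaScript. The three function names are hypothetical helpers made up for this example; only `preg_match`, `stream_context_create`, `file_get_contents` and `$http_response_header` are PHP built-ins (newer PHP versions also offer `http_get_last_response_headers()` as a replacement for the variable).

```php
<?php
// Sketch only: all three function names are hypothetical helpers.

// Collect name => value pairs from raw "Set-Cookie:" response header lines.
function cookiesFromHeaders(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Keep only the name=value part; attributes such as Path are ignored.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

// Build the "Cookie:" request header that sends those pairs back.
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Fetch a URL, replaying any cookies from an earlier response.
// Returns [body or false, cookies set by this response].
function fetchWithCookies(string $url, array $cookies = []): array
{
    $http = ['method' => 'GET', 'ignore_errors' => true];
    if ($cookies) {
        $http['header'] = cookieHeader($cookies);
    }
    $body = file_get_contents($url, false, stream_context_create(['http' => $http]));
    // The HTTP wrapper fills $http_response_header in this function's scope.
    return [$body, cookiesFromHeaders($http_response_header ?? [])];
}
```

A typical use would be to call `fetchWithCookies($url)` once, keep the returned cookie array, and pass it as the second argument on the next request. The JavaScript half of the question is untouched by this: anything the page builds client-side will still be missing from the body.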
ChilliSauce (ASKER)

Ray Paseur

On this page:
http://www.similarsites.com/site/yahoo.com

The "view source" shows 1670 lines.  What information do you want to extract?
ChilliSauce (ASKER)

Some of the "Similar Sites" URLs.

I've got the code to extract the URLs, which is easy, but when I try to fetch the page via a PHP call, the site throws an error saying JavaScript and cookies are required.

Obviously it's not just that page; I want to be able to change the site appended to the URL, e.g. http://www.similarsites.com/site/somewebsite.com
Ray Paseur

Interesting... Have you contacted the publisher about getting access via an API?  I ask because when web publishers go out of their way to require JavaScript and Cookies for a GET-method request, it's usually intended to prevent scraping their sites.  After all, as the publisher, they get to set the rules about that sort of thing.
ChilliSauce (ASKER)

Ahh - just noticed they had an API - will try that.
Ray Paseur

That will probably be much more promising than scraping!  Best of luck with it, ~Ray
skullnobrains

i don't know how you attempt to read the file in php, but i tried using url wrappers and it works fine

echo '<?php print(file_get_contents("http://www.similarsites.com/site/somewebsite.com"));' | php | sed -ne 's/.*"\(http:\/\/[^"]*\)".*/\1/p'
http://www.google.com/chromeframe/?redirect=true
http://images2.similargroup.com/image?url=somewebsite.com&amp;t=2&amp;s=10&amp;h=11361262101460171923
http://www.similarsites.com/site/somewebsite.com
http://www.similarsites.com/site/somewebsite.com
http://www.similarsites.com/site/centurylink.net
http://www.similarsites.com/site/google.com
http://www.similarsites.com/site/tumblrplayer.com
http://pl.similarsites.com/site/znajdz.interia.pl
http://www.similarsites.com/site/compiledk.blogspot.com
http://www.similarsites.com/site/linkblip.com
http://www.similarsites.com/site/citebite.com
http://www.similarsites.com/site/dead-links.com
http://www.similarsites.com/site/linkdiagnosis.com
http://www.similarsites.com/site/wholinks2me.com
http://www.similarsites.com/site/artofthetitle.com
http://www.similarsites.com/site/nmvtis.gov
http://www.similarsites.com/site/alta.org
http://www.similarsites.com/site/firstam.com
http://www.similarsites.com/site/stewart.com
http://www.similarsites.com/site/alistapart.com
http://www.similarsites.com/site/w3schools.com
http://www.similarsites.com/site/validator.w3.org
http://www.similarsites.com/site/smashingmagazine.com
http://www.similarsites.com/site/browsershots.org
http://www.similarsites.com/site/fr.jimdo.com
http://www.similarsites.com/site/sitereloader.appspot.com
http://www.similarsites.com/site/knowthesigns.com
http://www.similarsites.com/site/kiubi.com
http://fr.similarsites.com/site/joomla.fr
http://disqus.com/?ref_noscript
http://disqus.com
http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
http://code.jquery.com/jquery-1.10.1.min.js
http://www.statcounter.com/counter/counter.js

the sed only extracts double-quoted urls starting with http:// (and just one per line: the last one, since the leading .* is greedy), but that's more than enough to demonstrate that it does work

you can easily extract the portion after /site/ using a similar regex in php (preg_match rather than ereg, which was removed in PHP 7) if that is your final goal
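
A minimal sketch of that extraction, assuming the links keep the double-quoted `http://<subdomain>.similarsites.com/site/<domain>` shape seen in the output above. `extractSimilarSites` is a hypothetical helper name and the sample HTML is made up for illustration; in practice `$html` would come from `file_get_contents()` as in the one-liner.

```php
<?php
// Hypothetical helper: returns the domain after /site/ for every matching link.
function extractSimilarSites(string $html): array
{
    // Double-quoted URL on any similarsites.com subdomain, capturing the
    // path segment that follows /site/.
    $pattern = '#"https?://[a-z0-9.-]*similarsites\.com/site/([^"/?]+)"#i';
    preg_match_all($pattern, $html, $matches);

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($matches[1]));
}

// Made-up sample input standing in for the fetched page.
$sample = '<a href="http://www.similarsites.com/site/google.com">g</a>'
        . '<a href="http://pl.similarsites.com/site/znajdz.interia.pl">z</a>'
        . '<a href="http://www.similarsites.com/site/google.com">dup</a>'
        . '<script src="http://code.jquery.com/jquery-1.10.1.min.js"></script>';

print_r(extractSimilarSites($sample));
```

Unlike the sed one-liner, `preg_match_all` picks up every match on a line, not just one. Note that the list will also include the page's own site (somewebsite.com in the output above), so that entry may need filtering out.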
Ray Paseur

If we can read it with file_get_contents() the webpage does not require cookies and javascript to be enabled.  Would have been nice to omit that bogus requirement from the question, so we didn't have to waste your time as we chase a red herring!