ChilliSauce asked:

PHP - How to scrape a remote webpage that requires cookies and javascript

Hi,

I am trying to scrape a remote webpage that requires cookies and JavaScript to be enabled to view the page.

Can anyone give me some PHP code that would scrape the webpage as if it were a normal browser with JavaScript and cookies enabled?

Thanks
PHP, Scripting Languages, JavaScript

Last comment: Ray Paseur, 8/22/2022 (Mon)

leakim971

what is your final goal?
ASKER
ChilliSauce

To be able to scrape the web page
Ray Paseur

I think the answer is going to be "no" because while cookies are easy enough to deal with, writing a PHP script that acts like a JavaScript engine is a rather large task, one that would probably be worthy of a paid engagement over several months.

What is the exact URL you want to scrape?  What is the information you want to retrieve?
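
To illustrate the "cookies are easy enough" half of that: a minimal sketch, assuming the target only needs its cookies sent back on later requests and that allow_url_fopen is enabled. The helper names (cookie_header, parse_set_cookie, fetch_with_cookies) and the User-Agent string are made up for this example. It does nothing about JavaScript.

```php
<?php
// Hypothetical helper: turn ['name' => 'value'] into a Cookie request header value.
function cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: collect name/value pairs from raw response header lines
// (the format PHP exposes in $http_response_header after an HTTP request).
// Attributes such as Path or Expires are ignored in this sketch.
function parse_set_cookie(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

// Fetch a page, sending the cookies we already hold and returning any new ones.
function fetch_with_cookies(string $url, array $cookies = []): array
{
    $headers = "Accept: text/html\r\n";
    if ($cookies) {
        $headers .= 'Cookie: ' . cookie_header($cookies) . "\r\n";
    }
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'header'     => $headers,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)', // assumed value
            'timeout'    => 10,
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    // PHP fills $http_response_header in the local scope after an HTTP request;
    // it stays unset if the request never got a response.
    $received = parse_set_cookie($http_response_header ?? []);
    return ['body' => $body, 'cookies' => array_replace($cookies, $received)];
}
```

Calling fetch_with_cookies() once and passing the returned 'cookies' array into the next call gives a crude cookie jar. The cURL extension's CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options do the same job with less code if cURL is available.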
ASKER
ChilliSauce

Ray Paseur

On this page:
http://www.similarsites.com/site/yahoo.com

The "view source" shows 1670 lines.  What information do you want to extract?
ASKER
ChilliSauce

Some of the "Similar Sites" URLs.

I've got the code to extract the URLs, which is easy, but when I try to get the page via a PHP call, the site throws an error saying JavaScript and cookies are required.

Obviously it's not just that one page: I want to be able to change the site appended to the URL, e.g. http://www.similarsites.com/site/somewebsite.com
Ray Paseur

Interesting... Have you contacted the publisher about getting access via an API?  I ask because when web publishers go out of their way to require JavaScript and Cookies for a GET-method request, it's usually intended to prevent scraping their sites.  After all, as the publisher, they get to set the rules about that sort of thing.
ASKER
ChilliSauce

Ahh - just noticed they had an API - will try that.
Ray Paseur

That will probably be much more promising than scraping!  Best of luck with it, ~Ray
skullnobrains

i don't know how you attempt to read the file in php, but i tried using url wrappers and it works pretty fine

echo '<?php print(file_get_contents("http://www.similarsites.com/site/somewebsite.com"));' | php | sed -ne 's/.*"\(http:\/\/[^"]*\)".*/\1/p'
http://www.google.com/chromeframe/?redirect=true
http://images2.similargroup.com/image?url=somewebsite.com&amp;t=2&amp;s=10&amp;h=11361262101460171923
http://www.similarsites.com/site/somewebsite.com
http://www.similarsites.com/site/somewebsite.com
http://www.similarsites.com/site/centurylink.net
http://www.similarsites.com/site/google.com
http://www.similarsites.com/site/tumblrplayer.com
http://pl.similarsites.com/site/znajdz.interia.pl
http://www.similarsites.com/site/compiledk.blogspot.com
http://www.similarsites.com/site/linkblip.com
http://www.similarsites.com/site/citebite.com
http://www.similarsites.com/site/dead-links.com
http://www.similarsites.com/site/linkdiagnosis.com
http://www.similarsites.com/site/wholinks2me.com
http://www.similarsites.com/site/artofthetitle.com
http://www.similarsites.com/site/nmvtis.gov
http://www.similarsites.com/site/alta.org
http://www.similarsites.com/site/firstam.com
http://www.similarsites.com/site/stewart.com
http://www.similarsites.com/site/alistapart.com
http://www.similarsites.com/site/w3schools.com
http://www.similarsites.com/site/validator.w3.org
http://www.similarsites.com/site/smashingmagazine.com
http://www.similarsites.com/site/browsershots.org
http://www.similarsites.com/site/fr.jimdo.com
http://www.similarsites.com/site/sitereloader.appspot.com
http://www.similarsites.com/site/knowthesigns.com
http://www.similarsites.com/site/kiubi.com
http://fr.similarsites.com/site/joomla.fr
http://disqus.com/?ref_noscript
http://disqus.com
http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
http://code.jquery.com/jquery-1.10.1.min.js
http://www.statcounter.com/counter/counter.js

the sed merely extracts double-quoted urls starting with http:// (only one per line, and since the leading .* is greedy it is the last one on the line, but that's more than enough to demonstrate it does work)

you can easily extract the portion after /site/ using a similar regex in php if that is your final goal (use the preg functions; the old ereg functions were removed in PHP 7)
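
A rough PHP version of that idea, as a sketch only: it assumes the page still quotes its links as "http://<something>similarsites.com/site/<domain>" the way the output above shows. The function names similar_sites_url and extract_site_domains are invented for this example.

```php
<?php
// Build the page URL for a given domain (URL pattern taken from this thread).
function similar_sites_url(string $domain): string
{
    return 'http://www.similarsites.com/site/' . rawurlencode($domain);
}

// Return the unique domains that follow /site/ in double-quoted similarsites.com URLs.
// Unlike the sed one-liner, preg_match_all picks up every match on a line.
function extract_site_domains(string $html): array
{
    preg_match_all('~"https?://[^"/]*similarsites\.com/site/([^"/?#]+)"~i', $html, $matches);
    return array_values(array_unique($matches[1]));
}
```

With allow_url_fopen enabled, something like print_r(extract_site_domains(file_get_contents(similar_sites_url('somewebsite.com')))); should list the domains. Note that the page's own domain shows up too, since the page links to itself.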
ASKER CERTIFIED SOLUTION
skullnobrains

Ray Paseur

If we can read it with file_get_contents() the webpage does not require cookies and javascript to be enabled.  Would have been nice to omit that bogus requirement from the question, so we didn't have to waste your time as we chase a red herring!