PHP - How to scrape a remote webpage that requires cookies and javascript


I am trying to scrape a remote webpage that requires cookies and JavaScript to be enabled to view the page.

Can anyone give me some PHP code that would scrape the webpage as if it were a normal browser with JavaScript and cookies enabled?

What is your final goal?
ChilliSauce (Author) Commented:
To be able to scrape the web page.
Ray Paseur Commented:
I think the answer is going to be "no" because while cookies are easy enough to deal with, writing a PHP script that acts like a JavaScript engine is a rather large task, one that would probably be worthy of a paid engagement over several months.
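To illustrate the "cookies are easy enough" part, here is a minimal sketch, not a definitive implementation. It assumes the site sets plain `Set-Cookie` headers on a first request and expects them back on the next one. The helper names `cookieHeaderFromResponse` and `fetchWithCookies` and the `example-scraper/0.1` User-Agent are made up for this example. It does nothing about JavaScript.

```php
<?php
// Hypothetical helper: collect name=value pairs from Set-Cookie response
// headers and build the value of a Cookie request header from them.
function cookieHeaderFromResponse(array $responseHeaders): string
{
    $pairs = [];
    foreach ($responseHeaders as $line) {
        // Match "Set-Cookie: name=value; attributes..." case-insensitively.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $pairs[$m[1]] = $m[2]; // a later cookie with the same name wins
        }
    }
    $parts = [];
    foreach ($pairs as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('; ', $parts);
}

// Hypothetical helper: GET a URL, optionally sending a Cookie header.
// Returns [body, response header lines]; body is false on failure.
function fetchWithCookies(string $url, string $cookieHeader = ''): array
{
    $headers = "User-Agent: example-scraper/0.1\r\n"; // assumed value
    if ($cookieHeader !== '') {
        $headers .= "Cookie: $cookieHeader\r\n";
    }
    $context = stream_context_create([
        'http' => ['method' => 'GET', 'header' => $headers],
    ]);
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the local scope.
    return [$body, $http_response_header ?? []];
}
```

A possible usage pattern is to call `fetchWithCookies($url)` once, pass the returned header lines to `cookieHeaderFromResponse()`, and then call `fetchWithCookies($url, $cookies)` again with the result.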

What is the exact URL you want to scrape?  What is the information you want to retrieve?
ChilliSauce (Author) Commented:
Ray Paseur Commented:
On this page:

The "view source" shows 1670 lines.  What information do you want to extract?
ChilliSauce (Author) Commented:
Some of the "Similar Sites" URLs.

I've got the code to extract the URLs, which is easy, but when I try to fetch the page via a PHP call, the site throws an error saying JavaScript and cookies are required.

Obviously it's not just that page, but any page where I can change the appended part of the URL, e.g.
Ray Paseur Commented:
Interesting... Have you contacted the publisher about getting access via an API?  I ask because when web publishers go out of their way to require JavaScript and Cookies for a GET-method request, it's usually intended to prevent scraping their sites.  After all, as the publisher, they get to set the rules about that sort of thing.
ChilliSauce (Author) Commented:
Ahh - just noticed they had an API - will try that.
Ray Paseur Commented:
That will probably be much more promising than scraping!  Best of luck with it, ~Ray
I don't know how you attempted to read the file in PHP, but I tried using URL wrappers and it works fine:

echo '<?php print(file_get_contents(""));' | php | sed -ne 's/.*"\(http:\/\/[^"]*\)".*/\1/p'


The sed command merely extracts double-quoted URLs starting with http:// (only the first one on each line, but that's more than enough to demonstrate that it works).

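You can easily extract the portion after /site/ using a similar regular expression in PHP if that is your final goal. Use the preg_* functions for this; the old ereg functions are deprecated and were removed in PHP 7.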
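As a rough sketch of that extraction, the snippet below pulls the part after /site/ out of an HTML string with a capture group. The function name `extractSiteNames`, the sample markup, and the choice to also allow hyphens in the name are assumptions made for illustration, not something taken from the target site.

```php
<?php
// Hypothetical helper: return the unique names that follow "/site/" in $html.
function extractSiteNames(string $html): array
{
    // [\w.-]+ covers letters, digits, underscores, dots and hyphens.
    preg_match_all('~/site/([\w.-]+)~', $html, $matches);
    // $matches[1] holds the captured names; drop duplicates and reindex.
    return array_values(array_unique($matches[1]));
}

// Made-up sample markup standing in for the fetched page.
$sample = '<a href="/site/example.com">a</a> '
        . '<a href="/site/example.org">b</a> '
        . '<a href="/site/example.com">c</a>';

print_r(extractSiteNames($sample));
```

Using `+` instead of `*` in the pattern skips bare "/site/" links that have no name after them.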
A late update:

$ echo '<?php $matches=array();print(preg_match_all(",/site/[\w.]*,",file_get_contents(""),$matches));print_r($matches);' | php
    [0] => Array
            [0] => /site/
            [1] => /site/
            [2] => /site/
            [3] => /site/
            [4] => /site/
            [5] => /site/
            [6] => /site/
            [7] => /site/
            [8] => /site/
            [9] => /site/dead
            [10] => /site/
            [11] => /site/
            [12] => /site/
            [13] => /site/
            [14] => /site/
            [15] => /site/
            [16] => /site/
            [17] => /site/
            [18] => /site/
            [19] => /site/
            [20] => /site/
            [21] => /site/
            [22] => /site/
            [23] => /site/
            [24] => /site/
            [25] => /site/
            [26] => /site/



But if they do not want you to scrape, you might get blocked soon. It's polite to ask the site admin beforehand if you expect to run this often.

Ray Paseur Commented:
If we can read it with file_get_contents(), the webpage does not require cookies and JavaScript to be enabled.  It would have been nice to omit that bogus requirement from the question, so we didn't have to waste your time chasing a red herring!
