PHP - How to scrape a remote webpage that requires cookies and javascript

Hi,

I am trying to scrape a remote webpage that requires cookies and JavaScript to be enabled to view the page.

Can anyone give me some PHP code that would scrape the webpage as if it were a normal browser with JavaScript and cookies enabled?

Thanks
Asked by ChilliSauce
leakim971 (Pluritechnician) Commented:
What is your final goal?
ChilliSauce (Author) Commented:
To be able to scrape the web page.
Ray Paseur Commented:
I think the answer is going to be "no" because while cookies are easy enough to deal with, writing a PHP script that acts like a JavaScript engine is a rather large task, one that would probably be worthy of a paid engagement over several months.

What is the exact URL you want to scrape?  What is the information you want to retrieve?
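
On the cookie half of that, here is a minimal sketch, assuming the cURL extension is installed. The function names cookie_curl_options and fetch_with_cookies, the cookie-file path, and the User-Agent string are illustrative assumptions and do not come from this thread. It stores and resends cookies between requests the way a browser would, but it does not run any JavaScript.

```php
<?php
// Sketch only: persist cookies across requests with cURL.
// This does NOT execute JavaScript; it only covers the cookie part.

/**
 * Build the cURL options for a cookie-aware GET request.
 * $cookieFile is a hypothetical path to a writable file.
 */
function cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // previously stored cookies are sent from here
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
        CURLOPT_TIMEOUT        => 15,          // seconds
    ];
}

/**
 * Fetch $url and return the body, or null if the request failed.
 */
function fetch_with_cookies(string $url, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_curl_options($url, $cookieFile));
    $body = curl_exec($ch);
    // $ch is released when the function returns, which flushes the cookie jar.
    return $body === false ? null : $body;
}
```

Calling fetch_with_cookies() twice with the same cookie file lets the second request send whatever cookies the first one received. If the page builds its content with JavaScript, this alone will not be enough.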
ChilliSauce (Author) Commented:
Ray Paseur Commented:
On this page:
http://www.similarsites.com/site/yahoo.com

The "view source" shows 1670 lines.  What information do you want to extract?
ChilliSauce (Author) Commented:
Some of the "Similar Sites" URLs.

I've got the code to extract the URLs, which is easy, but when I try to get the page via a PHP call, the site throws an error saying JavaScript and cookies are required.

Obviously it's not just that page: I want to be able to change the site name appended to the URL, e.g. http://www.similarsites.com/site/somewebsite.com
Ray Paseur Commented:
Interesting... Have you contacted the publisher about getting access via an API?  I ask because when web publishers go out of their way to require JavaScript and Cookies for a GET-method request, it's usually intended to prevent scraping their sites.  After all, as the publisher, they get to set the rules about that sort of thing.
ChilliSauce (Author) Commented:
Ahh - just noticed they had an API - will try that.
Ray Paseur Commented:
That will probably be much more promising than scraping!  Best of luck with it, ~Ray
skullnobrains Commented:
I don't know how you are attempting to read the page in PHP, but I tried using URL wrappers and it works fine:

echo '<?php print(file_get_contents("http://www.similarsites.com/site/somewebsite.com"));' | php | sed -ne 's/.*"\(http:\/\/[^"]*\)".*/\1/p'
http://www.google.com/chromeframe/?redirect=true
http://images2.similargroup.com/image?url=somewebsite.com&amp;t=2&amp;s=10&amp;h=11361262101460171923
http://www.similarsites.com/site/somewebsite.com
http://www.similarsites.com/site/somewebsite.com
http://www.similarsites.com/site/centurylink.net
http://www.similarsites.com/site/google.com
http://www.similarsites.com/site/tumblrplayer.com
http://pl.similarsites.com/site/znajdz.interia.pl
http://www.similarsites.com/site/compiledk.blogspot.com
http://www.similarsites.com/site/linkblip.com
http://www.similarsites.com/site/citebite.com
http://www.similarsites.com/site/dead-links.com
http://www.similarsites.com/site/linkdiagnosis.com
http://www.similarsites.com/site/wholinks2me.com
http://www.similarsites.com/site/artofthetitle.com
http://www.similarsites.com/site/nmvtis.gov
http://www.similarsites.com/site/alta.org
http://www.similarsites.com/site/firstam.com
http://www.similarsites.com/site/stewart.com
http://www.similarsites.com/site/alistapart.com
http://www.similarsites.com/site/w3schools.com
http://www.similarsites.com/site/validator.w3.org
http://www.similarsites.com/site/smashingmagazine.com
http://www.similarsites.com/site/browsershots.org
http://www.similarsites.com/site/fr.jimdo.com
http://www.similarsites.com/site/sitereloader.appspot.com
http://www.similarsites.com/site/knowthesigns.com
http://www.similarsites.com/site/kiubi.com
http://fr.similarsites.com/site/joomla.fr
http://disqus.com/?ref_noscript
http://disqus.com
http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
http://code.jquery.com/jquery-1.10.1.min.js
http://www.statcounter.com/counter/counter.js

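The sed merely extracts double-quoted URLs starting with http:// (only one per line, and because the leading .* is greedy it is the last one on the line, but that's more than enough to demonstrate that it works).

You can easily extract the portion after /site/ using a similar regular expression in PHP (with the preg functions; ereg was removed in PHP 7) if that is your final goal.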
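
For reference, the same fetch can be written as a standalone PHP function rather than a shell one-liner. This is a minimal sketch, assuming allow_url_fopen is enabled (the URL wrappers need it). The function name fetch_page, the User-Agent string, and the timeout are illustrative assumptions.

```php
<?php
// Sketch only: fetch a page through PHP's URL wrappers with basic error handling.

/**
 * Return the contents of $url, or null if it could not be read.
 */
function fetch_page(string $url): ?string
{
    // Stream context used for http:// URLs; the values below are assumptions.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
            'timeout'    => 15, // seconds
        ],
    ]);

    // file_get_contents() returns false on failure; @ suppresses the warning
    // so that the caller only has to check for null.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

Calling fetch_page('http://www.similarsites.com/site/somewebsite.com') would return the HTML as a string, or null if the request failed, which makes it easier to tell a network failure apart from a page that simply has no matches.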
skullnobrains Commented:
A late update:

$ echo '<?php $matches=array();print(preg_match_all(",/site/[\w.]*,",file_get_contents("http://www.similarsites.com/site/somewebsite.com"),$matches));print_r($matches);' | php
27Array
(
    [0] => Array
        (
            [0] => /site/somewebsite.com
            [1] => /site/somewebsite.com
            [2] => /site/centurylink.net
            [3] => /site/google.com
            [4] => /site/tumblrplayer.com
            [5] => /site/znajdz.interia.pl
            [6] => /site/compiledk.blogspot.com
            [7] => /site/linkblip.com
            [8] => /site/citebite.com
            [9] => /site/dead
            [10] => /site/linkdiagnosis.com
            [11] => /site/wholinks2me.com
            [12] => /site/artofthetitle.com
            [13] => /site/nmvtis.gov
            [14] => /site/alta.org
            [15] => /site/firstam.com
            [16] => /site/stewart.com
            [17] => /site/alistapart.com
            [18] => /site/w3schools.com
            [19] => /site/validator.w3.org
            [20] => /site/smashingmagazine.com
            [21] => /site/browsershots.org
            [22] => /site/fr.jimdo.com
            [23] => /site/sitereloader.appspot.com
            [24] => /site/knowthesigns.com
            [25] => /site/kiubi.com
            [26] => /site/joomla.fr
        )

)

But if they do not want you to scrape, you might get blocked soon. It's polite to ask the site admin beforehand if you expect to run this often.
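
One thing to watch in the preg_match_all output: entry [9] is /site/dead, while the sed listing shows dead-links.com. The character class [\w.] does not include the hyphen, so the match stops there. Below is a small sketch of a pattern that also accepts hyphens, run against a made-up sample string rather than the live page. The function name extract_sites is hypothetical.

```php
<?php
// Sketch only: extract the host names that follow /site/, hyphens included.

/**
 * Return the unique names found after "/site/" in $html.
 */
function extract_sites(string $html): array
{
    // [\w.-]+ matches letters, digits, underscore, dot and hyphen.
    // The parentheses capture only the part after /site/.
    preg_match_all(',/site/([\w.-]+),', $html, $matches);

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($matches[1]));
}

// Made-up sample input, not the real page.
$sample = '<a href="http://www.similarsites.com/site/dead-links.com">x</a>'
        . '<a href="http://www.similarsites.com/site/w3schools.com">y</a>'
        . '<a href="http://www.similarsites.com/site/w3schools.com">y</a>';

print_r(extract_sites($sample)); // dead-links.com and w3schools.com, once each
```

Capturing only the part after /site/ and removing duplicates also avoids the repeated /site/somewebsite.com entries seen in the raw match list.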
Ray Paseur Commented:
If we can read it with file_get_contents(), the webpage does not require cookies and JavaScript to be enabled.  It would have been nice to omit that bogus requirement from the question, so we didn't have to waste your time chasing a red herring!