Thread7

Problem getting page with CURL and PHP

I wrote some PHP code to periodically scrape a URL, and it was working fine. Then the site must have changed something, and now it doesn't work. The page loads fine in Firefox, but I get a 400 Bad Request through CURL. It seems like I've tried every CURLOPT setting with no success. I'm thinking that if I can send exactly the same request headers as Firefox, I should be fine. But how do I do that?
CURL seems to add a few extra items without my telling it to.
Lately I've been setting my own header with pretty much the same items as Firefox like this:
---------
$header = array("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language: en-us,en;q=0.5",
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Keep-Alive: 300", "Connection: keep-alive", "Cache-Control: max-age=0", "Accept-Encoding: gzip,deflate");
curl_setopt($ch,CURLOPT_HTTPHEADER,$header);
------
*** The working Firefox header is basically this:
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: noscript=1; userid=1550521915; xsession=d9c73c024e99af04581a30521d3558ba; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
Cache-Control: max-age=0
--------

There is a login that I run through CURL before requesting the page I want to scrape, and some of those cookies come from it. But I'm pretty confident that the user and session cookies are not the problem. When I look at the request header returned by curl_getinfo, I see a few differences and figure one of them is the problem.
*** The non-working CURL header I am sending is this:
POST /datadirectory/viewinfo.php HTTP/1.0
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)
Host: www.example.com
Cookie: xsession=d9c73c024e99af04581a30521d3558ba; userid=1550521915; noscript=1; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cache-Control: max-age=0
Accept-Encoding: gzip,deflate
Content-Length: 0
Content-Type: application/x-www-form-urlencoded

The differences that I think may be it are:
*** POST /datadirectory/viewinfo.php -- Huh? Why does CURL send this as a POST? The host plus this path is the URL I want: http://www.example.com/datadirectory/viewinfo.php

*** Content-Length: 0 -- Why am I sending it Content-Length: 0? I'd like to just leave this out since Firefox doesn't send it. But CURL is automatically adding it. Maybe that is saying the POST data length is 0?

*** Accept-Encoding: gzip,deflate. I set this manually in the CURLOPT_HTTPHEADER but if I leave it out I still have the problem.

*** I tried setting CURL to HTTP 1.0 and to HTTP 1.1; neither made a difference.
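
The comparison above can also be done mechanically. What follows is only a rough sketch, not part of the original script: parse_raw_request and diff_requests are hypothetical helper names, and the input is assumed to be a raw request dump like the two pasted above (the CURL one can be read with curl_getinfo and CURLINFO_HEADER_OUT). It reports the request method and the header names that appear on only one side.

```php
<?php
// Hypothetical helper: split a raw request dump into its method (if a
// request line such as "POST /path HTTP/1.0" is present) and its headers.
function parse_raw_request(string $raw): array
{
    $lines = preg_split('/\r\n|\n|\r/', trim($raw));
    $method = null;

    // A browser header dump may start directly with "Host: ...", so the
    // request line is optional here.
    if (isset($lines[0]) && preg_match('/^([A-Z]+) \S+ HTTP\/\d/', $lines[0], $m)) {
        $method = $m[1];
        array_shift($lines);
    }

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip blank or malformed lines
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so normalise them.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return ['method' => $method, 'headers' => $headers];
}

// Hypothetical helper: list what differs between the CURL request and the
// browser request (method plus header names present on one side only).
function diff_requests(string $curlRaw, string $browserRaw): array
{
    $curl    = parse_raw_request($curlRaw);
    $browser = parse_raw_request($browserRaw);

    return [
        'curl_method'     => $curl['method'],
        'browser_method'  => $browser['method'],
        'only_in_curl'    => array_keys(array_diff_key($curl['headers'], $browser['headers'])),
        'only_in_browser' => array_keys(array_diff_key($browser['headers'], $curl['headers'])),
    ];
}
```

Fed the two dumps from this post, a check like this would be expected to flag the POST method and the Content-Length and Content-Type headers on the CURL side.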

Any ideas??
Thread7

ASKER

Actually I figured it out.  It was the POST setting.  Changing it to a GET did the trick.
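
For anyone hitting the same thing, here is a minimal sketch of that change. It assumes the handle was left in POST mode by the earlier login request, which is a guess, since the original code is not shown in this thread. build_get_options and fetch_page are hypothetical names, and the cookie file path is whatever your login step already uses.

```php
<?php
// Hypothetical helper: options for a plain GET on a handle that may have
// been used for a POST (for example a login) earlier.
function build_get_options(string $url, array $headers, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPGET        => true,        // switch the request method back to GET
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_HTTPHEADER     => $headers,    // leave Accept-Encoding out of this list
        CURLOPT_ENCODING       => '',          // let CURL send Accept-Encoding and decode the response
        CURLOPT_COOKIEFILE     => $cookieFile, // read the cookies saved by the login
        CURLOPT_COOKIEJAR      => $cookieFile, // keep saving any updated cookies
        CURLINFO_HEADER_OUT    => true,        // record the outgoing request for curl_getinfo
    ];
}

// Hypothetical helper: perform the GET and return the status code, the
// request headers CURL actually sent, and the response body.
function fetch_page($ch, string $url, array $headers, string $cookieFile): array
{
    curl_setopt_array($ch, build_get_options($url, $headers, $cookieFile));
    $body = curl_exec($ch);

    return [
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'sent'   => curl_getinfo($ch, CURLINFO_HEADER_OUT),
        'body'   => $body,
    ];
}
```

With the method back to GET, the 'sent' value should begin with a GET request line rather than POST, which is an easy way to confirm the change took effect.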
Please don't scrape content from other sites without talking to the owner of the site.

If you get agreement from the owner of the site, he or she will be able to tell you what changes have been made so that you can update your code with less hassle.

Further, if a webmaster does find their content on your site without permission, they can have your pages delisted from search engines.
Thread7

ASKER

Yes, I do have permission. I don't think you need to be the Experts Exchange cop.