Problem getting page with CURL and PHP

Thread7 asked:
I had written some PHP code to periodically scrape a URL, and it was working fine. Then the site must have changed something, and now it doesn't work. The page loads fine through Firefox, but I get a 400 Bad Request through cURL. I feel like I've tried every curl_setopt() option with no success. I'm thinking that if I can just send the exact same request headers as Firefox, I should be fine. But how do I do that?
cURL seems to add a few extra items without my telling it to.
Lately I've been setting my own headers with pretty much the same items as Firefox, like this:
---------
$header = array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language: en-us,en;q=0.5",
    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive: 300",
    "Connection: keep-alive",
    "Cache-Control: max-age=0",
    "Accept-Encoding: gzip,deflate"
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
------
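For context, here is a trimmed-down sketch of the surrounding setup. The user agent and cookie string are abbreviated, and the RETURNTRANSFER/ENCODING lines are just typical settings for this kind of fetch, not necessarily exactly what I have:
---------
<?php
// Minimal sketch of the request setup; values abbreviated where marked.
$ch = curl_init('http://www.example.com/datadirectory/viewinfo.php');

curl_setopt($ch, CURLOPT_HTTPHEADER, $header);  // the header array shown above
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) ...'); // abbreviated
curl_setopt($ch, CURLOPT_COOKIE, 'noscript=1; userid=...; xsession=...'); // abbreviated
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_ENCODING, '');         // decode gzip/deflate responses

$page = curl_exec($ch);
---------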
*** The working Firefox header is basically this:
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: noscript=1; userid=1550521915; xsession=d9c73c024e99af04581a30521d3558ba; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
Cache-Control: max-age=0
--------

There is a login that I run through cURL before requesting the page I want to scrape, and that's where some of those cookies come from. But I'm pretty confident that the user and session cookies are not the problem. When I look at the outgoing header reported by curl_getinfo() I see a few differences, and I figure one of these is the problem. (The snippet below shows how I'm capturing it.)
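For reference, this is the standard way to capture the header cURL actually sends: CURLINFO_HEADER_OUT is set before the request and read back afterward.
---------
// Ask cURL to record the outgoing request header.
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
$page = curl_exec($ch);
// Read back the header that was actually sent.
echo curl_getinfo($ch, CURLINFO_HEADER_OUT);
---------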
*** The non-working cURL header I am sending is this:
POST /datadirectory/viewinfo.php HTTP/1.0
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)
Host: www.example.com
Cookie: xsession=d9c73c024e99af04581a30521d3558ba; userid=1550521915; noscript=1; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cache-Control: max-age=0
Accept-Encoding: gzip,deflate
Content-Length: 0
Content-Type: application/x-www-form-urlencoded

The differences that I think may matter are:
*** POST /datadirectory/viewinfo.php -- Huh? Why does cURL send this as a POST? The URL I'm requesting is just http://www.example.com/datadirectory/viewinfo.php

*** Content-Length: 0 -- Why am I sending Content-Length: 0? I'd like to just leave it out, since Firefox doesn't send it, but cURL adds it automatically. Maybe it's saying the POST data length is 0?

*** Accept-Encoding: gzip,deflate -- I set this manually in CURLOPT_HTTPHEADER, but if I leave it out I still have the problem.

*** I tried setting cURL to both HTTP 1.0 and HTTP 1.1; neither made a difference (snippet below).
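For the record, the version toggle was just this (CURLOPT_HTTP_VERSION and these constants are standard cURL options):
---------
// Force HTTP/1.0 or HTTP/1.1 on the request; neither changed the 400.
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
// curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
---------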

Any ideas??

Author

Commented:
Actually, the more troubleshooting I do, the more I realize it must have something to do with the session cookies. If I simply don't send any cookies, I get a valid non-authenticated version of the page with a 200 response code. When I add the cookies in, it gets mad, gives me a 303 response code, and redirects me to an error page. But there are 4 cookies, and I'm sending the same thing whether using Firefox or cURL:
----
Cookie: noscript=1; userid=1550521915; xsession=d9c73c024e99af04581a30521d3558ba; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
----
So what could the problem be?
Thought #1: There is something subtle about this cookie string that differs between Firefox and cURL.
Thought #2: Some other header value changes the cookies in some way -- changes their encoding or invalidates them. But what you see above is what I am sending. (The snippet below shows exactly how I'm passing them.)
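For completeness, here is how a single cookie string like the one above gets attached. I'm assuming CURLOPT_COOKIE rather than a cookie-jar file, since the request shows one combined Cookie: line:
---------
// Send all four cookies as one Cookie: header value.
curl_setopt($ch, CURLOPT_COOKIE,
    'noscript=1; userid=1550521915; xsession=d9c73c024e99af04581a30521d3558ba; ' .
    'datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8');
---------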

Author

Commented:
Actually, I figured it out. It was the POST setting. Changing it to a GET did the trick; a minimal version of the fix is below.
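For anyone who hits this later, the fix boils down to one line. My guess is the handle was being reused from the login request, so the POST setting stuck around:
---------
// The login step set CURLOPT_POST on this handle; on a reused handle that
// sticks, which is why the page request went out as "POST ... Content-Length: 0".
// Reset the method to GET before fetching the page:
curl_setopt($ch, CURLOPT_HTTPGET, true);
---------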
Commented:
Please don't scrape content from other sites without talking to the owner of the site.

If you get agreement from the owner of the site, he or she will be able to tell you what changes have been made so that you can update your code with less hassle.

Further, if a webmaster does find their content on your site without permission, they can have your pages delisted from search engines.

Author

Commented:
Yes, I do have permission. I don't think you need to be the Experts Exchange cop.