Thread7 asked:
Problem getting page with CURL and PHP
I had written some PHP code to periodically scrape a URL and it was working fine. Then the site must have changed something, and now it doesn't work. It works fine through Firefox, but I get a 400 Bad Request through cURL. It seems like I've tried every curl_setopt setting with no success. I'm thinking if I can just send the exact same request headers as Firefox, I should be fine. But how do I do that?
cURL seems to add a few extra items without my telling it to.
Lately I've been setting my own headers with pretty much the same items as Firefox, like this:
---------
$header = array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language: en-us,en;q=0.5",
    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive: 300",
    "Connection: keep-alive",
    "Cache-Control: max-age=0",
    "Accept-Encoding: gzip,deflate"
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
------
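For what it's worth, the rest of my request setup is roughly like this sketch (the URL here stands in for the real one; CURLOPT_ENCODING tells cURL to decode the gzip/deflate response body itself):
---------
$ch = curl_init("http://www.example.com/datadirectory/viewinfo.php"); // placeholder URL
// Send the same User-Agent string Firefox sends
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)");
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);      // the array built above
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate"); // let cURL decompress the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the page instead of printing it
$html = curl_exec($ch);
curl_close($ch);
------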
*** The working Firefox header is basically this:
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: noscript=1; userid=1550521915; xsession=d9c73c024e99af04581a30521d3558ba; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
Cache-Control: max-age=0
--------
There is a login that I run through cURL before my request for the page I want to scrape, and some of those cookies get set there. But I'm pretty confident that the user and session cookies are not the problem. When I look at the outgoing request header reported by curl_getinfo, I see a few differences, and I figure one of them is the problem.
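The login step itself is roughly like this sketch, assuming a cookie-jar file (the path, URL, and form fields here are placeholders, not my real ones):
---------
$cookieFile = "/tmp/scrape-cookies.txt";             // placeholder path
$ch = curl_init("http://www.example.com/login.php"); // placeholder login URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, "user=me&pass=secret"); // placeholder fields
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save cookies when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send saved cookies on later requests
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
------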
*** The non-working cURL header I am sending is this:
POST /datadirectory/viewinfo.php HTTP/1.0
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 GTB7.0 (.NET CLR 3.5.30729)
Host: www.example.com
Cookie: xsession=d9c73c024e99af04581a30521d3558ba; userid=1550521915; noscript=1; datrval=1276442132-05e4a9265e4ac217a93748a73720f4becd56decd0c7d576d04eb8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cache-Control: max-age=0
Accept-Encoding: gzip,deflate
Content-Length: 0
Content-Type: application/x-www-form-urlencoded
The differences that I think may be the problem are:
*** POST /datadirectory/viewinfo.php -- Huh? Why does cURL send this as a POST? The host plus this path is exactly the URL I want: http://www.example.com/datadirectory/viewinfo.php (see the sketch after this list)
*** Content-Length: 0 -- Why am I sending Content-Length: 0? I'd like to just leave this out, since Firefox doesn't send it, but cURL adds it automatically. Maybe that is saying the POST data length is 0?
*** Accept-Encoding: gzip,deflate -- I set this manually in CURLOPT_HTTPHEADER, but if I leave it out I still have the problem.
*** I tried setting cURL to HTTP 1.0 and HTTP 1.1; neither made a difference.
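One thing I wonder: if I reuse the same cURL handle after the login POST, does the handle stay in POST mode? That would explain both the POST request line and the Content-Length: 0. If so, something like this sketch should force it back to a plain GET:
---------
// Reusing $ch after the login POST leaves it in POST mode;
// CURLOPT_HTTPGET switches the handle back to a plain GET
// (which also drops the Content-Length and Content-Type headers).
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/datadirectory/viewinfo.php");
$html = curl_exec($ch);
------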
Any ideas??
ASKER CERTIFIED SOLUTION
Please don't scrape content from other sites without talking to the owner of the site.
If you get agreement from the owner of the site, he or she will be able to tell you what changes have been made so that you can update your code with less hassle.
Further, if a webmaster does find their content on your site without permission, they can have your pages delisted from search engines.
ASKER
Yes, I do have permission. I don't think you need to be the Experts Exchange cop.