  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 975

PHP Curl Multiple Cookies

I'm trying to scrape a site that uses multiple cookies. There are 3 cookies in use on the page I want to scrape: 2 session cookies and 1 persistent cookie.

Using a single session cookie, I can get one level down into the site's pages, but then I get stopped. I suspect it's because a specific cookie (the second session cookie) is missing. How can I accept all the cookies in a session so that I can "authenticate" properly?
Asked by: kjenney
1 Solution
 
Ray Paseur commented:
Please post the actual URL of the site you want to scrape.  There may be more to it than cookies, and we would need to see the HTML documents, as well as the JavaScript, to give you a good answer.

This script will handle the cookies correctly, accepting and returning them.

<?php // RAY_curl_get_cookies.php
error_reporting(E_ALL);


// DEMONSTRATE THE BASICS OF CURL
// SOMETHING LIKE RAY_curl_get_cookies.php?url=http://twitter.com


// YOU COULD HAVE SOMETHING LIKE THIS
$url = isset($_GET["url"]) ? $_GET["url"] : 'http://twitter.com';

// BUT SINCE IT IS ON MY SERVER, I HAVE HARD-CODED THIS
$url = 'http://twitter.com';

// TRY THE REMOTE WEB SERVICE
$htm = my_curl($url);
$dat = file_get_contents('cookie.txt');

// SHOW THE WORK PRODUCT OR BARK OUT ERROR MESSAGES
echo "<pre>";
echo PHP_EOL . "<b>$url</b>";
echo PHP_EOL . "<i>" . htmlentities((string) $dat) . "</i>"; // ESCAPE THE COOKIE FILE CONTENTS FOR HTML OUTPUT
echo PHP_EOL;

// ACTIVATE THIS TO SEE THE HTML STRING
// echo PHP_EOL . htmlentities($htm);


// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl
( $url
, $timeout=3
, $error_report=TRUE
)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header = array();
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_COOKIEFILE,     'cookie.txt' );
    curl_setopt( $curl, CURLOPT_COOKIEJAR,      'cookie.txt' );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $msg = curl_error($curl); // HUMAN-READABLE ERROR TEXT
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err, CURL_ERROR=$msg";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}
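
If the scrape still stops one level down, it may help to confirm which cookies actually landed in cookie.txt. The sketch below is not part of the script above; read_cookie_jar() is a hypothetical helper name, and it assumes the jar is in the tab-separated Netscape format that libcurl writes (domain, subdomain flag, path, secure flag, expiry, name, value), where an expiry of 0 marks a session cookie and a "#HttpOnly_" prefix marks an HttpOnly cookie.

```php
<?php
// HYPOTHETICAL HELPER, NOT PART OF THE ORIGINAL SCRIPT:
// LIST THE COOKIES STORED IN A NETSCAPE-FORMAT COOKIE JAR
function read_cookie_jar($path)
{
    $cookies = array();
    if (!is_readable($path)) {
        return $cookies; // NO JAR YET - NOTHING TO REPORT
    }

    foreach (file($path, FILE_IGNORE_NEW_LINES) as $line) {
        $line = rtrim($line, "\r\n");

        // LIBCURL MARKS HTTPONLY COOKIES WITH THIS PREFIX
        $http_only = (strpos($line, '#HttpOnly_') === 0);
        if ($http_only) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // BLANK LINE OR ORDINARY COMMENT
        }

        // SEVEN TAB-SEPARATED FIELDS PER COOKIE
        $f = explode("\t", $line);
        if (count($f) !== 7) {
            continue;
        }

        $cookies[] = array(
            'domain'    => $f[0],
            'path'      => $f[2],
            'secure'    => ($f[3] === 'TRUE'),
            'session'   => ((int) $f[4] === 0), // EXPIRY 0 MEANS SESSION COOKIE
            'http_only' => $http_only,
            'name'      => $f[5],
            'value'     => $f[6],
        );
    }
    return $cookies;
}
```

Calling read_cookie_jar('cookie.txt') after my_curl() returns should show one entry per cookie the server set. If one of the expected session cookies is absent, a likely cause is that it is set by JavaScript rather than by a Set-Cookie header, in which case cURL alone will never see it.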

kjenney (author) commented:
Worked perfectly!