Stephen Forlance

How difficult would it be to build a cookie scanner in PHP?

Hi all,
How difficult would it be to build a PHP script that would scan a website and work out which cookies it uses? Something similar to

http://cookiepedia.co.uk/
PHP

Ray Paseur

OK, let's start with an understanding of cookies.  When a client comes to your web site for the first time, you can put a cookie (or cookies) on the browser using PHP setcookie().  This is how the PHP session works -- it stores a cookie that allows the server to link the client browser to a data store on the server.  When the client comes back to your web site via another HTTP request, the client browser sends back the cookie.  Whatever data you stored in the cookie on the initial visit will be returned to the server.  Each subsequent "N + 1" visit can add, change, or remove cookies.

From this design, it is obvious that HTTP cookies are potentially variable in a 1:1 ratio to the number of client HTTP requests.  In practice that is rarely the case, but if you want to account for all the cookies, you have to account for all of the possible paths that a client might take as they browse the web site.  In a shopping site, this might be quite a lot of paths.

It's also worth knowing that the cookie itself often carries very little information - in the case of a PHP session, just an identifier that links to an information set stored on the server.

Writing a cURL script that will follow all of the links in a web site is computationally trivial, but running the script may be resource-intensive.  I tried several PHP solutions for web site search about a decade ago and never found any PHP-based spider that could run very fast.  That aside, as each of the links is followed (via cURL), you can tell cURL to collect the cookies and put them into a "cookie jar."  If you use a new cookie jar for each page load, you can keep track of which page set which cookie.
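The "new cookie jar for each page load" idea can be sketched roughly as follows. This is a minimal illustration, not Ray's code: the function names `jar_path_for()` and `fetch_with_own_jar()` and the jar file naming scheme are assumptions made up for the example. Nothing runs until `fetch_with_own_jar()` is called.

```php
<?php
// Sketch only: one cookie jar file per URL, so that each cookie can be
// traced back to the page load that produced it.
// jar_path_for() and fetch_with_own_jar() are hypothetical helper names.

function jar_path_for($url, $dir)
{
    // Name the jar after a hash of the URL so every page gets its own file.
    return rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . 'jar_' . md5($url) . '.txt';
}

function fetch_with_own_jar($url, $dir)
{
    $jar = jar_path_for($url, $dir);
    $ch  = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);        // cookies are saved here
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // cookies set on redirects are kept too
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    // libcurl writes the jar when the handle is closed
    // (on PHP 8+, when the handle object is released).
    curl_close($ch);
    unset($ch);
    return array('html' => $html, 'jar' => $jar);
}
```

A crawler would call `fetch_with_own_jar()` once per discovered link and then read each jar file to see what that page set. Because redirects are followed, cookies set along a redirect chain end up attributed to the starting URL.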

The browser will return cookies to the server on the basis of a domain and/or a subdomain.  So cookies from Google.com are not returned to Facebook.com, and it's possible that cookies from example.com are not returned to www.example.com.  But this is the tip of the iceberg.  The cookies and associated information may be shared in a variety of ways that are not evident in the HTTP protocol.  To see this in action, visit EBay.com and make a search for "copper stock pot."  Then visit Amazon.com and see what comes up in the suggestions feed.
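The domain rule described above can be illustrated with a small sketch. It is a simplification of the matching that browsers perform: it ignores paths, the Secure flag, IP addresses, and public-suffix rules, and the function name `cookie_is_sent_to()` is hypothetical.

```php
<?php
// Simplified sketch of cookie domain matching; not a full implementation.
// $hostOnly = true models a cookie that was set without a Domain attribute,
// which is returned only to the exact host that set it.

function cookie_is_sent_to($requestHost, $cookieDomain, $hostOnly = false)
{
    $host   = strtolower($requestHost);
    $domain = strtolower(ltrim($cookieDomain, '.'));

    if ($host === $domain) {
        return true;                 // exact match always qualifies
    }
    if ($hostOnly) {
        return false;                // host-only cookies never go to subdomains
    }

    // Otherwise the request host must end with "." followed by the cookie domain.
    $suffix = '.' . $domain;
    return strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;
}
```

Under these assumptions, a cookie with domain `.example.com` is sent to `www.example.com`, while a host-only cookie set by `example.com` is not, and a cookie for `.google.com` is never sent to `facebook.com`.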

I don't think cookiepedia is scanning web sites, so much as it's trying to aggregate information from a human client base about the cookies they have found on their browsers.  I'm not sure this will be very fruitful, because (at least for me) I don't care what cookies get put on my browser; I delete the cookies from time to time; I'm not aware of any value proposition that would encourage me to give my information to cookiepedia.

Coming at the cookies from another angle, I can see all of the cookies on my browser (most browsers have such a feature).  But getting to these cookies without personal human intervention is impossible.  For the most part, servers can only see the cookies that are applicable to the individual server.  I would have no way of knowing your browser history, and a server cannot initiate communication with a client browser.
Dave Baldwin

Cookiepedia says near the bottom of the page that it is collecting info from people and not sites.
Stephen Forlance

ASKER

Just putting aside the reason for building something like this, I was using a script built with cURL, but I couldn't view the cookies in cookie.txt, as it was blank... any ideas where this is going wrong?

$get_cookie_page = 'http://www.google.com';
echo curl_download($get_cookie_page);

function curl_download($Url){
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $Url);
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
  $http_headers = array(
                    'Host: www.google.ca',
                    'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2',
                    'Accept: */*',
                    'Accept-Language: en-us,en;q=0.5',
                    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
                    'Connection: keep-alive'
                  );
  curl_setopt($ch, CURLOPT_HEADER, true);
  curl_setopt($ch, CURLOPT_HTTPHEADER, $http_headers);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 10);
  $output = curl_exec($ch);
  curl_close($ch);
  return $output;
}

Dave Baldwin

I ran your code and this is what I got; the cookie is in 'cookie.txt'.
HTTP/1.1 200 OK
Date: Wed, 12 Apr 2017 20:49:05 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=UTF-8
P3P: CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=101=vVz6x8dP2GHQPhu6gJsceLoGnOUKiznAiLkZg0E2BzlCf2fBxKUhGd0E9S1i6CouwdhZM4ZiL5LXEWSPwqPO56OGJumvTh9ZQRPYNUhwI7mNciKDytpDffNV_EassdYd; expires=Thu, 12-Oct-2017 20:49:05 GMT; path=/; domain=.google.ca; HttpOnly
Transfer-Encoding: chunked
Accept-Ranges: none
Vary: Accept-Encoding

Stephen Forlance

ASKER

Hmm, my file is empty.
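One thing worth ruling out when the same code fills the jar on one machine and not on another is the jar location. This is a guess, not something the thread settles: a relative name like 'cookie.txt' is resolved against the current working directory, and if the PHP process cannot write there, libcurl does not stop the request, so the problem is easy to miss. The helper name `writable_jar_path()` below is hypothetical.

```php
<?php
// Sketch: build an absolute cookie jar path and fail loudly if the
// directory cannot be written. writable_jar_path() is a hypothetical helper.

function writable_jar_path($filename, $dir = null)
{
    if ($dir === null) {
        $dir = sys_get_temp_dir();   // assumption: the temp directory is acceptable
    }
    if (!is_dir($dir) || !is_writable($dir)) {
        throw new RuntimeException('Cookie jar directory is not writable: ' . $dir);
    }
    return rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $filename;
}

// Possible use inside curl_download():
//   curl_setopt($ch, CURLOPT_COOKIEJAR, writable_jar_path('cookie.txt'));
```

If the exception is thrown, the empty file is a permissions or path problem rather than a cookie problem.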
ASKER CERTIFIED SOLUTION
Ray Paseur

[The text of this solution is not available.]
Stephen Forlance

ASKER

Thanks, I'll give it a try.

The site I'm now testing it on is https://www.cedexis.com/products/radar/

When I view the cookies set in Chrome, it shows 57 cookies in use. These should all be saved to the cookie file, correct?
Stephen Forlance

ASKER

I think that sample code is beyond my skill level; even running it gave me an error:
Parse error: syntax error, unexpected '[' in index.php on line 40

But am I understanding correctly how cURL cookies work? Should I be able to view them all in the cookie.txt file?
Dave Baldwin

No, you should only see the cookies that are set during that visit.  And only on that domain.  I have a lot of cookies from Google that are on other Google sub-domains but I only got them because I visited those domains and sub-domains at some time.  On 'cedexis.com' I got about 20 cookies on my first visit in Firefox.
Ray Paseur

What script did you run?  Line 40 in the sample I posted is a comment -- it cannot cause a parse error.
Dave Baldwin

If I use https://www.cedexis.com/products/radar/ in your code, I get nothing.  No response, no cookie.  That could be because it is being rejected by the site or because you didn't include the curl SSL/TLS options.
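When a request comes back with "nothing", asking cURL why is usually the quickest way to tell a rejected request from a TLS problem. The sketch below sets the common SSL/TLS verification options and reports the cURL error when the transfer fails. The function names and the optional CA bundle path are assumptions for illustration, and nothing runs until `fetch_or_explain()` is called.

```php
<?php
// Sketch: HTTPS-oriented cURL options plus error reporting.
// https_curl_options() and fetch_or_explain() are hypothetical names.

function https_curl_options($caBundle = null)
{
    $opts = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_SSL_VERIFYPEER => true,  // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,     // certificate must match the host name
        CURLOPT_TIMEOUT        => 10,
    );
    if ($caBundle !== null) {
        // Path to a CA certificate bundle, if the system default is not usable.
        $opts[CURLOPT_CAINFO] = $caBundle;
    }
    return $opts;
}

function fetch_or_explain($url, array $opts)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, $opts);
    $body  = curl_exec($ch);
    // curl_exec() returns false on failure; curl_error() says why.
    $error = ($body === false) ? curl_errno($ch) . ': ' . curl_error($ch) : null;
    curl_close($ch);
    return array('body' => $body, 'error' => $error);
}
```

A non-null `error` that mentions the certificate or the handshake points at the SSL/TLS setup, whereas an empty body with no error suggests the site itself returned nothing useful.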
Stephen Forlance

ASKER

This line seems to be causing the trouble:

  public function __construct($href, $user=NULL, $pass=NULL, $get_array=[], $title=NULL)
Ray Paseur

When I cURL Twitter, I get this in the cookie jar.
# Netscape HTTP Cookie File
# http://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_.twitter.com	TRUE	/	TRUE	0	_twitter_sess	BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCEpWL2RbAToMY3NyZl9p%250AZCIlNDczNjk0OGQxYjM1OUQ1NjljODQwNjUxNWRjMTAmYTU6B2lkIiU2N2Y3%250AZjQxMTdkY2FmYTg0ZGQwZmY5N2Q0YTY0NTk4Zg%253D%253D--daf4db39fd5de201758349f3bebc6cc804db09aa
twitter.com	FALSE	/	FALSE	1492639275	external_referer	padhuUp36zjgzgv1mFWxJ8aHbAM%2FyKh7|0|8e8t2xd8A2w%3D
.twitter.com	TRUE	/	TRUE	1492056075	ct0	5474a0a90005bd1a1361874852350b34
.twitter.com	TRUE	/	FALSE	1555106475	guest_id	v1%3A149203446559252039


However, when I cURL https://www.cedexis.com/products/radar/ I do not get any cookies.  I noticed that with cookies turned off in Firefox, the page is reported to be insecure, even though it's HTTPS.  It may be that cURL does not like whatever Cedexis is doing.  Here's a screenshot of the cookie names they set.
[Image: screenshot of the cookie names set by Cedexis]
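The Twitter jar shown above is in libcurl's Netscape cookie file format: seven tab-separated fields per cookie (domain, include-subdomains flag, path, secure flag, expiry timestamp, name, value), with a `#HttpOnly_` prefix marking HttpOnly cookies. A scanner would need to read that file back; the following is a minimal sketch of a parser, with `parse_cookie_jar()` being a hypothetical name.

```php
<?php
// Sketch: parse the text of a libcurl cookie jar into an array of cookies.
// parse_cookie_jar() is a hypothetical helper, not part of PHP or cURL.

function parse_cookie_jar($text)
{
    $cookies = array();
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $httpOnly = false;
        if (strpos($line, '#HttpOnly_') === 0) {
            // HttpOnly cookies are stored behind this prefix, not as comments.
            $httpOnly = true;
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue;                       // blank line or real comment
        }
        $f = explode("\t", $line);
        if (count($f) !== 7) {
            continue;                       // not a cookie line
        }
        $cookies[] = array(
            'domain'             => $f[0],
            'include_subdomains' => ($f[1] === 'TRUE'),
            'path'               => $f[2],
            'secure'             => ($f[3] === 'TRUE'),
            'expires'            => (int) $f[4],   // 0 means a session cookie
            'name'               => $f[5],
            'value'              => $f[6],
            'http_only'          => $httpOnly,
        );
    }
    return $cookies;
}
```

Calling `parse_cookie_jar(file_get_contents('cookie.txt'))` would then give one array entry per stored cookie.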
Ray Paseur

public function __construct($href, $user=NULL, $pass=NULL, $get_array=[], $title=NULL)
If this is causing a parse error, it means you're running a dangerously out-of-date version of PHP.  PHP 7 versions are the only ones in active support today.  The short array notation using square brackets was introduced in PHP 5.4.
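For anyone stuck on an older interpreter, the same signature can be written with the long `array()` form, which parses on PHP 5.3 as well as on current versions. The class below is only a stand-in to show the syntax: the name `LinkRequest` and its properties are hypothetical, since the original class is not shown in the thread.

```php
<?php
// Sketch: the constructor signature with array() instead of [].
// LinkRequest is a hypothetical class name used only for illustration.

class LinkRequest
{
    public $href;
    public $user;
    public $pass;
    public $get_array;
    public $title;

    // array() is accepted by every PHP 5 and PHP 7 release;
    // the [] shorthand needs PHP 5.4 or later.
    public function __construct($href, $user = NULL, $pass = NULL, $get_array = array(), $title = NULL)
    {
        $this->href      = $href;
        $this->user      = $user;
        $this->pass      = $pass;
        $this->get_array = $get_array;
        $this->title     = $title;
    }
}
```

This only removes the parse error; any other newer syntax in the rest of the script would still need the same treatment, so upgrading PHP remains the better fix.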
Stephen Forlance

ASKER

The server is running PHP Version 5.3.29
Ray Paseur

Yeah, PHP 5.3 dates from June, 2009.  PHP 5.3.29 goes back to August, 2014.  There were other branches in there, along the way: PHP5.4, PHP5.5, PHP5.6.  But all of those are in the "steady state" with respect to active development.  IIRC some of the PHP5.6 branch will still get security fixes, but all of the others are obsolete today.  I would recommend you consider upgrading to the current release at PHP 7.1.3.  It's always given on the upper right-hand corner of the PHP home page.

The PHP5 changelog is here:
http://php.net/ChangeLog-5.php
Dave Baldwin

A couple of points.  For many 'https' sites, you will need at least PHP 5.6 to access them.  The 'curl' code that comes with earlier versions does not support the ciphers and certificates that are necessary to access sites with recent SSL/TLS support.  I had to write a test program that I could run on all my sites to check the SSL/TLS support.  It is important for connecting to sites like PayPal.

With respect to your original question, this is just a sample of how difficult it would be to try to create a 'cookie scanner'.  On each site, you will only get the cookies that are set by that page.  There is no way to get 'all' the cookies without visiting ALL the pages.  In addition, the next cookie value sometimes depends on the previous value of the cookie.  The values are not necessarily static.
Stephen Forlance

ASKER

So, I got myself a server with PHP 7 and, Ray, your code seemed to work fine, except that cookie.txt is still blank...
Ray Paseur

Yes, that was my experience, too (assuming we are still talking about cedexis.com).  There are some weird things about that site.  Look at the cookie display in the image I posted above.  There are cookies with the same names.  This is kind of illogical because if you're going to try to parse the $_COOKIE array in PHP, the identical names will cause the cookies to be overwritten.  It may be an error on their part, or it may be some kind of convoluted redirect scheme.
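Cookies with the same name are not necessarily an error. A client identifies a cookie by its name together with its domain and path, so two cookies can share a name as long as the domain or path differs, while PHP's $_COOKIE array is keyed by name alone. The sketch below pulls the Set-Cookie lines out of a raw header dump (such as the output produced with CURLOPT_HEADER) so that same-named cookies can be compared side by side. The function name `parse_set_cookie_headers()` is hypothetical, and only the domain and path attributes are extracted.

```php
<?php
// Sketch: list every Set-Cookie header in a raw response header block.
// parse_set_cookie_headers() is a hypothetical helper for illustration.

function parse_set_cookie_headers($rawHeaders)
{
    $cookies = array();
    foreach (preg_split('/\r\n|\r|\n/', $rawHeaders) as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;                              // not a cookie header
        }
        $parts = explode(';', substr($line, strlen('Set-Cookie:')));
        // The first segment is name=value; the value may itself contain "=".
        $pair = explode('=', trim(array_shift($parts)), 2);
        if (count($pair) !== 2 || $pair[0] === '') {
            continue;
        }
        $cookie = array('name' => $pair[0], 'value' => $pair[1], 'domain' => null, 'path' => null);
        foreach ($parts as $attr) {
            $kv  = explode('=', trim($attr), 2);
            $key = strtolower($kv[0]);
            if (($key === 'domain' || $key === 'path') && isset($kv[1])) {
                $cookie[$key] = $kv[1];
            }
        }
        $cookies[] = $cookie;                      // duplicates by name are kept
    }
    return $cookies;
}
```

Every header is kept as its own entry, so two cookies named alike but scoped to different paths both show up in the result instead of one overwriting the other.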
Stephen Forlance

ASKER

Hmm, I wonder what's going on with them. I'll change the URL and use a simpler site and see if that works (maybe experts-exchange.com!)
Ray Paseur

I've never tried Experts-Exchange before!  

Their HTML markup is pretty lame.
https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.experts-exchange.com%2F

Here's what I get in the cookie file.
# Netscape HTTP Cookie File
# http://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_.experts-exchange.com	TRUE	/	FALSE	1523648872	__cfduid	dd77e5ff755d33da9ee06ae7fdd18ca7c1492112872
#HttpOnly_www.Experts-Exchange.com	FALSE	/	FALSE	0	JSESSIONID	0D238DEC084E78ACD02F616DF9E3D0D9
.www.experts-exchange.com	TRUE	/	TRUE	3639596520	CC_0	"OE=j1gth8r1:j1gth8r1&VTCT=1&exp_rs_1161=LO_REG&exp_r_162=exp_rs_1161"
#HttpOnly_www.Experts-Exchange.com	FALSE	/	FALSE	0	AWSELB	C92B9F45167AF63E26ED5CB181FE9F92011716F33629D2F6AAA18745709E160BDA07DB75A75A98E2393A9BE60320C6672EE57A66871DB88E007B56E04C85085E76F34CB5AF921BC6A737753A0A3F8887530A7B3DBB
