SOakley54

Scrape Google SERPs

Hi,

I've been at this for a bit today: I had the bright idea to build a script that crawls Google SERPs to see how well we're ranking for certain keywords. It will do a thing or two extra, but that's the easy part and I need no help there. The intent of this class is to return all links found in the results pages and store them in a text file. I'm stuck at finding the links.

I've spent quite a bit of time ON Google today trying to find some good info on how to go about this. I found a few decent-looking approaches to crawling Google, but ultimately they all failed because Google returned its source as JavaScript. Fetching the page with cURL has worked, but I can't seem to extract what I need from it, so... help?


<?php
class crawl {
	public function __construct($keyword, $se) {
		$this->keyword 				= $keyword;
		$this->prepend_txt_raw 		= 'raw_';
 
		switch($se) {
			case 'google':
				$engine = 'google';
				$this->crawl_google($this->keyword, $this->prepend_txt_raw . $engine);
			break;
			default:
				// Unknown engine: nothing was crawled, so there is nothing to extract.
				return;
		}
 
		$this->extract_links($this->prepend_txt_raw . $engine . '.txt');
	}
 
	private function crawl_google($keyword, $fn){
		$ch 	= curl_init('http://www.google.com/search?hl=en&q='.urlencode($keyword).'&btnG=Google+Search&meta=');
		$file 	= fopen($fn . '.txt', "w");
		curl_setopt($ch, CURLOPT_FILE, $file);
		curl_setopt($ch, CURLOPT_HEADER, 0);
		curl_exec($ch);
		curl_close($ch);
		fclose($file);
	}
 
	private function extract_links($page) {
		$myFile = $page;
		$fh = fopen($myFile, 'r');
		$theData = fread($fh, filesize($myFile));
		fclose($fh);
 
		$urls = array();
		// Search the file contents read above; bail out if there is no <body>.
		if (!preg_match('#<body[^>]*>.*?</body>#is', $theData, $body)) {
			return $urls;
		}
		// Grab everything that looks like an http(s) URL inside the body.
		preg_match_all('/https?:[^\'" <>]+/i',$body[0],$matches);
	  	for ($i = 0; $i < count($matches[0]); $i++) {
	    	// Skip script and image assets.
	    	if(!preg_match('/^.*\.(?:js|gif|jpg|png|bmp)$/',$matches[0][$i])){
	      		$urls[]=$matches[0][$i];
	    	}
	  	}
	 	$urls = array_unique($urls);
 
		return $urls;
	}
 
}
 
// Pass the raw phrase; crawl_google() applies urlencode() itself.
$c = new crawl('bird is the word', 'google');
?>
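
For the link-finding step, one option is to parse the saved HTML with PHP's DOMDocument instead of a regex. The following is only a minimal sketch: it assumes the result links appear as ordinary <a href="..."> elements, which may not hold for Google's actual markup. The function name extract_hrefs and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: collect absolute http(s) links from an HTML string.
function extract_hrefs($html) {
	$doc = new DOMDocument();
	// Collect parser warnings internally; real-world markup is rarely clean.
	$previous = libxml_use_internal_errors(true);
	$doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	$urls = array();
	foreach ($doc->getElementsByTagName('a') as $a) {
		$href = $a->getAttribute('href');
		// Keep only absolute http(s) links; relative links are skipped.
		if (preg_match('#^https?://#i', $href)) {
			$urls[] = $href;
		}
	}
	// Drop duplicates and reindex from zero.
	return array_values(array_unique($urls));
}

// Made-up sample markup, not real Google output.
$sample = '<html><body><a href="http://example.com/a">A</a>'
	. '<a href="/relative">R</a>'
	. '<a href="http://example.com/a">dup</a>'
	. '<a href="https://example.org/b">B</a></body></html>';
print_r(extract_hrefs($sample));
```

In the class above, something like this could stand in for the regex part of extract_links(), with the file contents passed in as $html.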


ASKER CERTIFIED SOLUTION
Beverley Portlock
This solution is only available to members.
Thanks for the points.  

Sidebar note to Brian and all those who have ereg functions in their code (and I have a LOT of ereg functions - both code that I wrote and code that I inherited).  See the note here:

http://us2.php.net/manual/en/function.ereg.php

That change will account for probably 75% of my time in refactoring.  If somebody wanted to do the community a great favor, a function-to-function map that converted ereg to preg would be pretty spiffy!

Best to all, ~Ray
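
As a rough starting point for that kind of ereg-to-preg map, the simplest cases mostly come down to wrapping the old POSIX pattern in PCRE delimiters and adding the i flag for the eregi variants. The sketch below is an illustration, not a complete converter: ereg_to_preg is a hypothetical helper name, and it assumes the pattern contains no already-escaped slashes and relies on no POSIX-only matching behaviour.

```php
<?php
// Hypothetical helper: wrap a POSIX (ereg-style) pattern in PCRE delimiters.
function ereg_to_preg($pattern, $case_insensitive = false) {
	// Escape the delimiter so a literal '/' in the pattern stays literal.
	$escaped = str_replace('/', '\/', $pattern);
	return '/' . $escaped . '/' . ($case_insensitive ? 'i' : '');
}

// ereg('^[0-9]+$', $s) becomes:
var_dump(preg_match(ereg_to_preg('^[0-9]+$'), '12345'));
// eregi('^hello', $s) becomes:
var_dump(preg_match(ereg_to_preg('^hello', true), 'Hello there'));
// ereg_replace('[aeiou]', '*', $s) becomes:
var_dump(preg_replace(ereg_to_preg('[aeiou]'), '*', 'regex'));
// split('[,;]', $s) becomes:
var_dump(preg_split(ereg_to_preg('[,;]'), 'a,b;c'));
```

One thing to watch when porting by hand: ereg() returns the match length or false, while preg_match() returns 1, 0, or false, so any code that uses the return value as a length needs a second look.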
Ray - same problem here with ereg.  I always liked ereg; it was less of a PITA than preg...
Yep, me too.  But if PHP can have a GOTO statement, maybe we can get ereg carried forward.  

Feh.

~Ray