SOakley54
asked on
Scrape Google SERPs
Hi,
I've been at this for a bit today, but I had the bright idea to build a script that would crawl Google SERPs to see how well we're ranking for certain keywords. It's going to do a thing or two extra, but that's easy stuff; no help is needed there. The intent of this class is to return all links found in the results pages and store them in a text file. I'm stuck at finding the links.
I've spent quite a bit of time on Google today trying to find some good info on how to go about this. I've investigated and found a few decent approaches to crawling Google, but ultimately they have all failed because Google returns its source as JavaScript. I've found that using cURL works, but I can't seem to extract what I need, so... help?
<?php
class crawl {
    public function __construct($keyword, $se) {
        $this->keyword = $keyword;
        $this->prepend_txt_raw = 'raw_';
        switch ($se) {
            case 'google':
                $engine = 'google';
                $this->crawl_google($this->keyword, $this->prepend_txt_raw . $engine);
                break;
            default:
                // Unknown engine: nothing was crawled, so nothing to extract
                return;
        }
        $this->extract_links($this->prepend_txt_raw . $engine . '.txt');
    }

    private function crawl_google($keyword, $fn) {
        $ch = curl_init('http://www.google.com/search?hl=en&q=' . urlencode($keyword) . '&btnG=Google+Search&meta=');
        $file = fopen($fn . '.txt', "w");
        curl_setopt($ch, CURLOPT_FILE, $file);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_exec($ch);
        curl_close($ch);
        fclose($file);
    }

    private function extract_links($page) {
        $fh = fopen($page, 'r');
        $theData = fread($fh, filesize($page));
        fclose($fh);
        // Isolate the <body> of the page, then pull out every http(s) URL.
        // Note: the original passed an undefined $data here; it must be $theData.
        preg_match('#<body[^>]*>.*?</body>#is', $theData, $body);
        preg_match_all('/https?:[^\'" <>]+/i', $body[0], $matches);
        $urls = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            // Skip links to scripts and images
            if (!preg_match('/^.*\.(?:js|gif|jpg|png|bmp)$/', $matches[0][$i])) {
                $urls[] = $matches[0][$i];
            }
        }
        return array_unique($urls);
    }
}
$c = new crawl('bird+is+the+word', 'google');
?>
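For the link-extraction step, a DOM parser tends to be more robust than regexes against messy SERP markup. Here is a rough sketch of what `extract_links()` could look like using PHP's built-in `DOMDocument`; the inline `$html` sample is my own stand-in for the file that `crawl_google()` would have saved, just so the snippet runs on its own:

```php
<?php
// Sketch: extract links with DOMDocument instead of regex scraping.
// The $html string below is a made-up sample standing in for the
// saved SERP file; in the class you would load that file instead.
$html = '<html><body>'
      . '<a href="http://example.com/page1">One</a>'
      . '<a href="http://example.com/logo.png">Logo</a>'
      . '<a href="https://example.org/page2">Two</a>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // SERP markup is rarely valid HTML; suppress parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

$urls = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Keep absolute http(s) links, and skip script/image targets
    // the same way the class does.
    if (preg_match('#^https?://#i', $href)
        && !preg_match('/\.(?:js|gif|jpg|png|bmp)$/i', $href)) {
        $urls[] = $href;
    }
}
$urls = array_unique($urls);
print_r($urls);
```

The upside is that you only ever look at real `href` attributes, so stray URLs inside inline JavaScript or CSS never make it into the list.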
Ray - same problem here with ereg. I always liked ereg; it was less of a PITA than preg...
Yep, me too. But if PHP can have a GOTO statement, maybe we can get ereg carried forward.
Feh.
~Ray
Sidebar note to Brian and all those who have ereg functions in their code (and I have a LOT of ereg functions - both code that I wrote and code that I inherited). See the note here:
http://us2.php.net/manual/en/function.ereg.php
That change will account for probably 75% of my time in refactoring. If somebody wanted to do the community a great favor, a function-to-function map that converted ereg to preg would be pretty spiffy!
Best to all, ~Ray
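In the spirit of the function-to-function map Ray asked for, here are a few of the common conversions. These pairs are my own examples, not an official list; the recurring changes are that preg patterns need delimiters (e.g. `/.../#...#`) and that `eregi`'s case-insensitivity becomes the `/i` flag:

```php
<?php
// A few ereg-family calls and their preg equivalents (my own
// illustrative map, not exhaustive).

// ereg('^[0-9]+$', $s)           -> preg_match('/^[0-9]+$/', $s)
assert(preg_match('/^[0-9]+$/', '12345') === 1);

// eregi('hello', $s)             -> preg_match('/hello/i', $s)
assert(preg_match('/hello/i', 'Say HELLO') === 1);

// ereg_replace('[0-9]', 'X', $s) -> preg_replace('/[0-9]/', 'X', $s)
assert(preg_replace('/[0-9]/', 'X', 'a1b2') === 'aXbX');

// split(':', $s)                 -> preg_split('/:/', $s)
assert(preg_split('/:/', 'a:b:c') === array('a', 'b', 'c'));

echo "all conversions check out\n";
```

One gotcha worth remembering when refactoring: `ereg()` returned the length of the match (or `false`), while `preg_match()` returns 1, 0, or `false`, so code that used the ereg return value as a length needs more than a mechanical swap.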