Solved

Scrape Google SERPs

Posted on 2009-04-23
Last Modified: 2012-05-06
Hi,

I've been at this for a bit today: I had the bright idea to build a script that crawls Google SERPs to see how well we're ranking for certain keywords. It will do a thing or two extra, but that's the easy part and no help is needed there. The intent of this class is to return all links found in the results pages and store them in a text file. I'm stuck at finding the links.

I've spent quite a bit of time on Google today trying to find good info on how to go about this. I found a few decent approaches to crawling Google, but ultimately they all failed because Google returns its source as JavaScript. Using cURL has worked for fetching the page, but I can't seem to extract what I need, so... help?


<?php
class crawl {
	public function __construct($keyword, $se) {
		$this->keyword 				= $keyword;
		$this->prepend_txt_raw 		= 'raw_';
 
		switch($se) {
			default:break;
			case 'google';
				$engine = 'google';
				$this->crawl_google($this->keyword, $this->prepend_txt_raw . $engine);
			break;
		}
 
		$this->extract_links($this->prepend_txt_raw . $engine . '.txt');
	}
 
	private function crawl_google($keyword, $fn){
		$ch 	= curl_init('http://www.google.com/search?hl=en&q='.urlencode($keyword).'&btnG=Google+Search&meta=');
		$file 	= fopen($fn . '.txt', "w");
		curl_setopt($ch, CURLOPT_FILE, $file);
		curl_setopt($ch, CURLOPT_HEADER, 0);
		curl_exec($ch);
		curl_close($ch);
		fclose($file);
	}
 
	private function extract_links($page) {
		$myFile = $page;
		$fh = fopen($myFile, 'r');
		$theData = fread($fh, filesize($myFile));
 
		preg_match('#<body[^>]*>.*?</body>#is', $data, $body);
		preg_match_all('/https?:[^\'" <>]+/i',$body[0],$matches);
	  	for ($i = 0; $i < count($matches[0]); $i++) {
	    	if(!preg_match('/^.*\.(?:js|gif|jpg|png|bmp)$/',$matches[0][$i])){
	      		$urls[]=$matches[0][$i];
	    	}
	  	}
	 	$urls = array_unique($urls);
 
		fclose($fh);
 
		return $urls;
	}
 
}
 
$c = new crawl('bird+is+the+word', 'google');
?>

Question by:SOakley54
 
LVL 34

Accepted Solution

by: Beverley Portlock (earned 1000 total points)
ID: 24220039
Use this pattern

"return clk\(this\.href,'','','res','([0-9]+)',''\)\"

and the position in the Google rankings is captured by the ([0-9]+) group. This was pulled from an eregi() call rather than a preg one, so a bit of recoding is probably needed.
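
A minimal sketch of that recoding, assuming the result anchors still carry a handler of the form shown in the pattern (Google's markup changes often, so this may no longer hold). The function name extract_rank_positions and the sample $html string are made up for illustration; only the pattern itself comes from the post above, with the trailing quote dropped.

```php
<?php
// Hypothetical helper: returns the rank positions captured by the pattern above,
// recoded for preg_match_all() instead of eregi().
function extract_rank_positions(string $html): array
{
    // '#' delimiters avoid clashing with the pattern's own characters;
    // the trailing 'i' mirrors eregi's case-insensitive matching.
    $pattern = "#return clk\\(this\\.href,'','','res','([0-9]+)',''\\)#i";
    preg_match_all($pattern, $html, $matches);

    // $matches[1] holds the text captured by ([0-9]+) for every hit.
    return array_map('intval', $matches[1]);
}

// Invented sample markup, only for demonstration.
$html = '<a href="http://example.com/" onmousedown="return clk(this.href,\'\',\'\',\'res\',\'1\',\'\')">A</a>'
      . '<a href="http://example.org/" onmousedown="return clk(this.href,\'\',\'\',\'res\',\'2\',\'\')">B</a>';

print_r(extract_rank_positions($html));
```

To get the URL together with its position, the same pattern could be extended with a capture group for the href attribute, but that depends on the attribute order in the real markup.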
 
LVL 111

Assisted Solution

by: Ray Paseur (earned 1000 total points)
ID: 24226253
Check Line 31:   $theData = fread($fh, filesize($myFile));
Check Line 33:    preg_match('#<body[^>]*>.*?</body>#is', $data, $body);

Should both of those be using the same variable - $theData vs $data - ?
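
If they should, a corrected version might look like the sketch below. It is written as a standalone function rather than a class method to keep it short, and it swaps fopen/fread for file_get_contents; both are my choices, not something from the original class. The regular expressions are unchanged.

```php
<?php
// Sketch of extract_links() with the variable mismatch fixed: the page is read
// into $data and the same $data is what gets searched.
function extract_links(string $file): array
{
    $data = file_get_contents($file);
    if ($data === false) {
        return []; // file missing or unreadable
    }

    // Fall back to the whole document if no <body> element is found.
    $body = preg_match('#<body[^>]*>.*?</body>#is', $data, $m) ? $m[0] : $data;

    preg_match_all('/https?:[^\'" <>]+/i', $body, $matches);

    $urls = []; // initialised so array_unique() never receives null
    foreach ($matches[0] as $url) {
        // Skip scripts and images, as the original loop did.
        if (!preg_match('/^.*\.(?:js|gif|jpg|png|bmp)$/', $url)) {
            $urls[] = $url;
        }
    }

    // array_values() re-indexes after duplicates are removed.
    return array_values(array_unique($urls));
}
```

Initialising $urls also avoids a warning when the page contains no links at all.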
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 25344862
Thanks for the points.  

Sidebar note to Brian and all those who have ereg functions in their code (and I have a LOT of ereg functions - both code that I wrote and code that I inherited).  See the note here:

http://us2.php.net/manual/en/function.ereg.php

That change will account for probably 75% of my time in refactoring.  If somebody wanted to do the community a great favor, a function-to-function map that converted ereg to preg would be pretty spiffy!
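
As a very rough starting point for such a map, here is a sketch of a wrapper that turns an ereg-style pattern into something preg_match() accepts. The name ereg_to_preg is hypothetical, and the helper only adds delimiters and an optional case-insensitive flag. It does not translate places where POSIX and PCRE syntax differ (for example, backslashes inside bracket expressions), so every converted pattern still needs testing.

```php
<?php
// Hypothetical helper: wraps a POSIX (ereg-style) pattern for use with preg_*.
// Pass true as the second argument when converting an eregi() call.
function ereg_to_preg(string $pattern, bool $caseInsensitive = false): string
{
    $out = '';
    $len = strlen($pattern);
    for ($i = 0; $i < $len; $i++) {
        $c = $pattern[$i];
        if ($c === '\\' && $i + 1 < $len) {
            // Copy an existing escape pair untouched.
            $out .= $c . $pattern[++$i];
        } elseif ($c === '#') {
            // Escape the character used as the preg delimiter.
            $out .= '\\#';
        } else {
            $out .= $c;
        }
    }
    return '#' . $out . '#' . ($caseInsensitive ? 'i' : '');
}

// eregi('^[a-z]+#[0-9]+$', $s) would become:
var_dump(preg_match(ereg_to_preg('^[a-z]+#[0-9]+$', true), 'ABC#42'));
```

Note that preg_match() returns 1, 0, or false, whereas ereg() returned the match length or false, so callers that use the return value as a number need a second look.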

Best to all, ~Ray
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 25345053
Ray - same problem here with ereg.  I always liked ereg; it was less of a PITA than preg...
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 25345069
Yep, me too.  But if PHP can have a GOTO statement, maybe we can get ereg carried forward.  

Feh.

~Ray