SOakley54
asked on
Scrape Google SERPs
Hi,
I've been at this for a bit today, but I had the bright idea to build a script that would crawl Google SERPs to see how well we're ranking for certain keywords. It's going to do a thing or two extra, but that's easy stuff; no help is needed there. The intent of this class is to return all links found in the results pages and store them in a text file. I'm stuck at finding the links.
I've spent quite a bit of time on Google today trying to find some good info on how to go about this. I've investigated and found a few decent approaches to crawling Google, but ultimately they have all failed because Google returns its source as JavaScript. I've found that using cURL works, but I can't seem to extract what I need, so... help?
<?php
class crawl {
    public function __construct($keyword, $se) {
        $this->keyword = $keyword;
        $this->prepend_txt_raw = 'raw_';
        switch ($se) {
            case 'google':
                $engine = 'google';
                $this->crawl_google($this->keyword, $this->prepend_txt_raw . $engine);
                break;
            default:
                // Unknown engine: nothing was crawled, so nothing to extract
                return;
        }
        $this->extract_links($this->prepend_txt_raw . $engine . '.txt');
    }

    private function crawl_google($keyword, $fn) {
        $ch = curl_init('http://www.google.com/search?hl=en&q=' . urlencode($keyword) . '&btnG=Google+Search&meta=');
        $file = fopen($fn . '.txt', "w");
        curl_setopt($ch, CURLOPT_FILE, $file);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_exec($ch);
        curl_close($ch);
        fclose($file);
    }

    private function extract_links($page) {
        $fh = fopen($page, 'r');
        $theData = fread($fh, filesize($page));
        fclose($fh);
        // Isolate the <body> of the page, then pull out every http(s) URL.
        // Note: the original passed an undefined $data here; it must be $theData.
        preg_match('#<body[^>]*>.*?</body>#is', $theData, $body);
        preg_match_all('/https?:[^\'" <>]+/i', $body[0], $matches);
        $urls = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            // Skip links to scripts and images
            if (!preg_match('/^.*\.(?:js|gif|jpg|png|bmp)$/', $matches[0][$i])) {
                $urls[] = $matches[0][$i];
            }
        }
        return array_unique($urls);
    }
}
$c = new crawl('bird+is+the+word', 'google');
?>
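For the link-extraction step, a DOM parser tends to be more robust than regexes against messy SERP markup. Here is a rough sketch of what `extract_links()` could look like using PHP's built-in `DOMDocument`; the inline `$html` sample is my own stand-in for the file that `crawl_google()` would have saved, just so the snippet runs on its own:

```php
<?php
// Sketch: extract links with DOMDocument instead of regex scraping.
// The $html string below is a made-up sample standing in for the
// saved SERP file; in the class you would load that file instead.
$html = '<html><body>'
      . '<a href="http://example.com/page1">One</a>'
      . '<a href="http://example.com/logo.png">Logo</a>'
      . '<a href="https://example.org/page2">Two</a>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // SERP markup is rarely valid HTML; suppress parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

$urls = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Keep absolute http(s) links, and skip script/image targets
    // the same way the class does.
    if (preg_match('#^https?://#i', $href)
        && !preg_match('/\.(?:js|gif|jpg|png|bmp)$/i', $href)) {
        $urls[] = $href;
    }
}
$urls = array_unique($urls);
print_r($urls);
```

The upside is that you only ever look at real `href` attributes, so stray URLs inside inline JavaScript or CSS never make it into the list.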
Ray - same problem here with ereg. I always liked ereg; it was less of a PITA than preg...
Yep, me too. But if PHP can have a GOTO statement, maybe we can get ereg carried forward.
Feh.
~Ray
Sidebar note to Brian and all those who have ereg functions in their code (and I have a LOT of ereg functions - both code that I wrote and code that I inherited). See the note here:
http://us2.php.net/manual/en/function.ereg.php
That change will account for probably 75% of my time in refactoring. If somebody wanted to do the community a great favor, a function-to-function map that converted ereg to preg would be pretty spiffy!
Best to all, ~Ray
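In the spirit of the function-to-function map Ray asked for, here are a few of the common conversions. These pairs are my own examples, not an official list; the recurring changes are that preg patterns need delimiters (e.g. `/.../#...#`) and that `eregi`'s case-insensitivity becomes the `/i` flag:

```php
<?php
// A few ereg-family calls and their preg equivalents (my own
// illustrative map, not exhaustive).

// ereg('^[0-9]+$', $s)           -> preg_match('/^[0-9]+$/', $s)
assert(preg_match('/^[0-9]+$/', '12345') === 1);

// eregi('hello', $s)             -> preg_match('/hello/i', $s)
assert(preg_match('/hello/i', 'Say HELLO') === 1);

// ereg_replace('[0-9]', 'X', $s) -> preg_replace('/[0-9]/', 'X', $s)
assert(preg_replace('/[0-9]/', 'X', 'a1b2') === 'aXbX');

// split(':', $s)                 -> preg_split('/:/', $s)
assert(preg_split('/:/', 'a:b:c') === array('a', 'b', 'c'));

echo "all conversions check out\n";
```

One gotcha worth remembering when refactoring: `ereg()` returned the length of the match (or `false`), while `preg_match()` returns 1, 0, or `false`, so code that used the ereg return value as a length needs more than a mechanical swap.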