Solved

Download Google search links to CSV

Posted on 2011-02-21
Last Modified: 2013-11-19
Hi,

I would like to download all the results for a search term to csv or xls. Ideally this would just be the links and not the description.

Anyone know a way of doing this?

thanks.

w
Question by:wilflife

Expert Comment

by:Aaron Tomosky
Use cURL to fetch the results page and preg_match to pull out the links.
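
A minimal sketch of that suggestion, assuming the results page is ordinary HTML whose links appear in href attributes. The helper names fetchHtml and extractLinks are made up for illustration, and the pattern is a generic one that has not been matched against Google's actual markup.

```php
<?php
// Hypothetical helper: fetch a page body with cURL; returns null on failure.
function fetchHtml($url, $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: collect absolute http(s) URLs found in href attributes.
function extractLinks($html)
{
    preg_match_all('%href="(https?://[^"]+)"%i', $html, $matches);
    // Drop duplicates and reindex so the result is a plain list
    return array_values(array_unique($matches[1]));
}

// Offline demonstration on a small HTML fragment (no network access needed)
$sample = '<a href="http://example.com/a">A</a> '
        . '<a href="https://example.org/">B</a> '
        . '<a href="http://example.com/a">A again</a>';
print_r(extractLinks($sample));
```

To use it on a live page, pass the string returned by fetchHtml to extractLinks.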

Assisted Solution

by:onemadeye
onemadeye earned 250 total points
Okay ... here's a script modified from one of my raw development projects.
Save it as a PHP file on your server and run it from a browser.

Before you run the script, please adjust the params at the top of the file to suit your needs.
Here they are:
$q              = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find
$avoidUrl = array(  // URL(s) listed here will not be included in the result
                              'amazon.com',
                              'google.com',
                              'googleusercontent.com',
                              'wikipedia.org',
                              'youtube.com',
                          );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

Let me know...
<?php // MEDY @ EE googsearch.php 23-02-2011

########## EDIT BELOW PARAMS TO SUIT YOUR NEEDS ##########

$q		  = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find // Too many searches might hang your server :p
$avoidUrl = array(  // URL(s) listed here will not be included in the result
					'amazon.com',
					'google.com',
					'googleusercontent.com',
					'wikipedia.org',
					'youtube.com',
				  );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

########## STOP EDITING unless you know what you're doing ##########

function cURLcheckBasicFunctions()
{
	// Return false if any of the required cURL functions is missing
  if( !function_exists("curl_init") ||
      !function_exists("curl_setopt") ||
      !function_exists("curl_exec") ||
      !function_exists("curl_close") ) return false;
  else return true;
}

function getPage($url, $referer, $agent, $header, $timeout) 
{ 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 
    $result['EXE'] = curl_exec($ch);
    $result['INF'] = curl_getinfo($ch);
    $result['ERR'] = curl_error($ch);
 
    curl_close($ch);
 
    return $result;
}

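function getUrl($url) // Turns anysite.com/whatever.htm to anysite.com
{
    // Strip an optional http:// or https:// scheme and keep only the host part
    if ( !preg_match("%^(https?://)?([^/]+)%si", $url, $matches) ) return "";
    $host = trim($matches[2]);
    // Reduce the host to its domain; return "" when nothing matches
    if ( !preg_match("%[^./]+\.(....|...|..|za.org|travel|co.?\...)$%si", $host, $matches) ) return "";
    return trim($matches[0]);
}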

// Check if the server supports the cURL functions
$cURL_supported = cURLcheckBasicFunctions();
if ( !$cURL_supported ) { die("Cannot run script because the server does not support cURL!"); }

// Check that a search term was supplied
if ( !$q ) { die("Have you specified a keyword?"); }

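		// Initialise the working arrays so count() below is safe
		$aLink      = array(); // Presumably a previous result set; empty by default
		$readyArray = array(); // Final list of collected URLs

		$i = 0;
		while( $i < count($aLink) )  // Adds URLs from a previous result set to the avoid list
		{
			$avoidUrl[] = $aLink[$i]['url'];
			$i++;
		}
		//print_r($avoidUrl);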

       $j = 1;
       while (count($readyArray) < $findLink )
       {
           $startPage = ($findLink * $j) - ($findLink);
           $num = $findLink * 2;

	$wordQuery = urlencode($q);
	$fileTemp = "temp.txt"; // Define name for temporary TXT file

	$resPage = getPage(
    //'', // use valid proxy [proxy IP]:[port]
    'http://74.125.87.104/search?hl=en&q='.$wordQuery.'&start='.$startPage.'&num='.$num, // 74.125.87.104 = Google IP
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    200); 

if ( empty($resPage['ERR']) && $resPage['INF']['http_code'] == 200 ) { // No cURL error and HTTP 200
	$fpWrite = $resPage['EXE'];
    // Job's done! Parse, save, etc.
			// First, save the scrape results to a temporary TXT file (works best on some hosting)
				if ( !file_exists($fileTemp) ) {
				        touch ($fileTemp);
						chmod ($fileTemp,0666); 
				}
			// 'w' truncates the file, so a shorter page cannot leave stale content from the previous pass
			$is_open = fopen ($fileTemp, 'w');
			fwrite ($is_open, $fpWrite);
			fclose ($is_open);
} else { 
	$findLink = -1; // Get out of while loop if no connection
	die("Problems occurred on the server!");
}
//print_r($resPage);

@$fp = file_get_contents($fileTemp); // debug
//file_put_contents('temp.txt', $fp, FILE_APPEND); // debug 

           if(!$fp)
           {
               $findLink = -1; // Get out of while loop if no connection
               die("Problems occurred!");
           }
           else
           {
               if ($j > 1) // If used in subsequent iterations, clear $urlsFoundArray
               {
                   // Check to make sure subsequent iterations array content is not the same
                   // If the same, break out of loop - this is to stop infinite loops
                   $prevUrlsFoundArray = $urlsFoundArray;
                   unset($urlsFoundArray);
               }

			   // preg_match from google search results in temporary file
               preg_match_all('%<div class="?s"?>.*?http://?(.*?)/?".*?>.*?</a>%si', $fp, $urlsFoundArray); 
               // Full Urls are located in $urlsFoundArray[1][$a]
               //print_r($urlsFoundArray);

               if(($j > 1) && ($prevUrlsFoundArray==$urlsFoundArray))
               {      
                   break; // echo "Found duplicated entries";
               }

               if(count($urlsFoundArray)  < 1)
               { 
				   die("Cannot scrape Google at the moment. Try again later!");
               }
               elseif(count($urlsFoundArray[1]) == 0)
               { 
				   die("Could not find any links with specified keyword!");
               }
               else
                {
                   // Assign full URL to index and cleaned URL to value and clean duplicate values
                   foreach ($urlsFoundArray[1] as $value)
                   {
                       $cleanURLArray[$value] =  getUrl($value);

                       if ($cleanURLArray[$value] == "")
                       {
                           unset($cleanURLArray[$value]);
                       }
                   }
                   $cleanURLArray = array_unique($cleanURLArray);
                   //print_r($cleanURLArray);

                   // Creates readyArray with full and unique (recently found and in db) urls
                   $i = 0;
                   foreach( $cleanURLArray as $key => $value)
                   { 
                       if ( count($avoidUrl) > 0 )
                       {
                           if ( !in_array($value, $avoidUrl) ) // Make sure not in forbidden link list
                           {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                           }
                       }
                       else
                       {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                       }
					  $i++;
						if ( count($readyArray) == $findLink)
						{
							break;
						}
                   }// End of foreach
               } // End of else
           }
           $j++;
       } // End of while

	$countReadyArray = count($readyArray);
	//print_r($readyArray);
	echo 'Total <b>'.$countReadyArray.'</b> URLs have been saved to '.$csvFile.'<br />';
	// Saving results to file: open once, append one URL per line
	$handle = fopen($csvFile,'a');
		foreach ($readyArray as $value)
           {
               $printme['url'] = $value["URL"];
			   echo $printme['url'].'<br />';
				fwrite($handle, $printme['url']."\r\n");
		   }
	fclose($handle);
	// LAST, Delete temporary file
	if ( file_exists($fileTemp) ) { 
		unlink ($fileTemp); 
	}

/* 
 * ###############################
 * 23-02-2011 by onemadeye @ EE
 * ###############################
 */
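
The script above writes one URL per line with fwrite, which is enough for a single column. If more columns are ever needed (a title next to each link, say), PHP's fputcsv takes care of the quoting. A minimal sketch, assuming a flat list of URLs; the function name saveLinksCsv and the temporary file name are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: write one URL per CSV row, opening the file only once.
function saveLinksCsv(array $links, $path)
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        return false; // Could not open the target file
    }
    foreach ($links as $link) {
        // Delimiter, enclosure and escape character are passed explicitly
        fputcsv($handle, array($link), ',', '"', '\\');
    }
    fclose($handle);
    return true;
}

// Demonstration: write two rows to a file in the system temp directory
$file = sys_get_temp_dir() . '/links-example.csv';
saveLinksCsv(array('example.com', 'example.org'), $file);
echo file_get_contents($file);
```

Opening the file once and closing it after the loop also avoids reopening it for every row.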


Accepted Solution

by:Geoff Kenyon
Geoff Kenyon earned 250 total points
You can scrape the SERPs pretty easily using a spreadsheet in Google Docs (kinda ironic, as they don't like people scraping the SERPs :))

Here is a pretty good guide to getting started scraping using Google Docs.

You use the query URL: http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10

You can change the number of results that you scrape by changing &num=10 to however many results you want (so &num=30 would scrape 30 results).
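
If you end up building that query URL from a script instead of typing it by hand, a small helper keeps the encoding right. A minimal sketch in PHP, assuming only the parameters shown in the URL above (q, pws, hl, num); the function name buildSearchUrl is made up for illustration.

```php
<?php
// Hypothetical helper: build the search URL for a query and a result count.
function buildSearchUrl($query, $num = 10)
{
    $params = array(
        'q'   => $query,     // Search terms; spaces are encoded as "+"
        'pws' => 0,          // Same value as in the URL above
        'hl'  => 'en',       // Interface language
        'num' => (int) $num, // Number of results to request
    );
    // Pass the separator explicitly so the result does not depend on php.ini
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo buildSearchUrl('scrape google using google docs', 30), "\n";
```

With the arguments shown, this reproduces the URL above with num=30 in place of num=10.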