
Download Google search links to CSV

Hi,

I would like to download all of the results for a search term to CSV or XLS. Ideally this would be just the links, not the descriptions.

Anyone know a way of doing this?

thanks.

Asked by: wilflife
2 Solutions
 
Aaron Tomosky (SD-WAN Simplified) commented:
cURL + preg_match().
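In other words: fetch the results page with cURL and pull the links out with a regex. A minimal sketch of that idea follows (the search URL, user agent, regex, and output file name are illustrative assumptions; Google's result markup changes over time, so the pattern will almost certainly need adjusting):

<?php
// Sketch only: fetch one Google results page and write the links to a CSV file.
$query = urlencode('learn php curl');                 // search term (assumed)
$ch = curl_init('http://www.google.com/search?q=' . $query . '&num=10');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the HTML instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');   // Google may block requests with no user agent
$html = curl_exec($ch);
curl_close($ch);

// Pull href attributes out of the result HTML; treat this pattern as a placeholder.
preg_match_all('%<a href="(https?://[^"]+)"%i', $html, $matches);

// One link per line in the CSV.
$fp = fopen('links.csv', 'w');
foreach (array_unique($matches[1]) as $link) {
    fputcsv($fp, array($link));
}
fclose($fp);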
 
onemadeye commented:
Okay ... here's a modified version of one of my rough development scripts.
Save this as a PHP file on your server and run it from browser.

Before you run the script please adjust some params (at top of the file) to suit your needs.
Here it is:
$q        = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find
$avoidUrl = array( // URL(s) listed here will not be included in the results
                   'amazon.com',
                   'google.com',
                   'googleusercontent.com',
                   'wikipedia.org',
                   'youtube.com',
                 );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

Let me know...
<?php // MEDY @ EE googsearch.php 23-02-2011

########## EDIT BELOW PARAMS TO SUIT YOUR NEEDS ##########

$q        = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find // Too many searches might hang your server :p
$avoidUrl = array( // URL(s) listed here will not be included in the results
                   'amazon.com',
                   'google.com',
                   'googleusercontent.com',
                   'wikipedia.org',
                   'youtube.com',
                 );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

########## STOP EDITING unless you know what you're doing ##########

function cURLcheckBasicFunctions()
{
	// Check that the required cURL functions exist
	if( !function_exists("curl_init")   ||
	    !function_exists("curl_setopt") ||
	    !function_exists("curl_exec")   ||
	    !function_exists("curl_close") ) return false;
	else return true;
}

function getPage($url, $referer, $agent, $header, $timeout) 
{ 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 
    $result['EXE'] = curl_exec($ch);
    $result['INF'] = curl_getinfo($ch);
    $result['ERR'] = curl_error($ch);
 
    curl_close($ch);
 
    return $result;
}

function getUrl($url) // Turns anysite.com/whatever.htm into anysite.com
{
    preg_match("%^(http://)?([^/]+)%si", $url, $matches);
    $host = isset($matches[2]) ? trim($matches[2]) : '';
    preg_match("%[^./]+\.(....|...|..|za.org|travel|co.?\...)$%si", $host, $matches);
    return isset($matches[0]) ? trim($matches[0]) : '';
}

// Check if the server supports the cURL functions
$cURL_supported = cURLcheckBasicFunctions();
if ( !$cURL_supported ) { die("Cannot run script because the server does not support cURL!"); }

// Check if no search term was supplied
if ( !$q ) { die("Have you specified a keyword?"); }

		$aLink = isset($aLink) ? $aLink : array(); // Previous result set, if any (empty in this standalone version)
		$i = 0;
		while( $i < count($aLink) )  // Loop through the previous result set and add its URLs to the avoid list
		{
			$avoidUrl[] = $aLink[$i]['url'];
			$i++;
		}
		//print_r($avoidUrl);

       $readyArray = array(); // Collected result links
       $j = 1;
       while ( count($readyArray) < $findLink )
       {
           $startPage = ($findLink * $j) - ($findLink);
           $num = $findLink * 2;

	$wordQuery = urlencode($q);
	$fileTemp = "temp.txt"; // Define name for temporary TXT file

	$resPage = getPage(
    //'', // use valid proxy [proxy IP]:[port]
    'http://74.125.87.104/search?hl=en&q='.$wordQuery.'&start='.$startPage.'&num='.$num, // 74.125.87.104 = Google IP
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    200); 

if ( empty($resPage['ERR']) || $resPage['INF']['http_code'] == 200 ) { 
	@$fpWrite = $resPage['EXE'];
	// Job's done! Parse, save, etc.
	// First, save the scrape results to a temporary TXT file (works best on some hosting)
	if ( !file_exists($fileTemp) ) {
		touch($fileTemp);
		chmod($fileTemp, 0666);
	}
	$strTemp = $fpWrite;
	$is_open = fopen($fileTemp, 'w'); // 'w' truncates any old content left from a previous pass
	fwrite($is_open, $strTemp);
	fclose($is_open);
} else { 
	$findLink = -1; // Get out of the while loop if there is no connection
	die("Problems occurred on the server!");
}
//print_r($resPage);

@$fp = file_get_contents($fileTemp); // debug
//file_put_contents('temp.txt', $fp, FILE_APPEND); // debug 

           if(!$fp)
           {
               $findLink = -1; // Get out of the while loop if there is no connection
               die("Problems occurred!");
           }
           else
           {
               if ($j > 1) // On subsequent iterations, remember the previous results and clear $urlsFoundArray
               {
                   // If the new results are identical to the previous ones, break out of the
                   // loop further down - this is to stop infinite loops
                   $prevUrlsFoundArray = $urlsFoundArray;
                   unset($urlsFoundArray);
               }

			   // preg_match from google search results in temporary file
               preg_match_all('%<div class="?s"?>.*?http://?(.*?)/?".*?>.*?</a>%si', $fp, $urlsFoundArray); 
               // Full Urls are located in $urlsFoundArray[1][$a]
               //print_r($urlsFoundArray);

               if(($j > 1) && ($prevUrlsFoundArray==$urlsFoundArray))
               {      
                   break; // echo "Found duplicated entries";
               }

               if(count($urlsFoundArray)  < 1)
               { 
				   die("Cannot scrape Google at the moment. Try again later!");
               }
               elseif(count($urlsFoundArray[1]) == 0)
               { 
				   die("Could not find any links with specified keyword!");
               }
               else
                {
                   // Assign full URL to index and cleaned URL to value and clean duplicate values
                   foreach ($urlsFoundArray[1] as $value)
                   {
                       $cleanURLArray[$value] =  getUrl($value);

                       if ($cleanURLArray[$value] == "")
                       {
                           unset($cleanURLArray[$value]);
                       }
                   }
                   $cleanURLArray = array_unique($cleanURLArray);
                   //print_r($cleanURLArray);

                   // Creates readyArray with full and unique (recently found and in db) urls
                   $i = 0;
                   foreach( $cleanURLArray as $key => $value)
                   { 
                       if ( count($avoidUrl) > 0 )
                       {
                           if ( !in_array($value, $avoidUrl) ) // Make sure not in forbidden link list
                           {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                           }
                       }
                       else
                       {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                       }
					  $i++;
						if ( count($readyArray) == $findLink)
						{
							break;
						}
                   }// End of foreach
               } // End of else
           }
           $j++;
       } // End of while

	$countReadyArray = count($readyArray);
	//print_r($readyArray);
	echo 'Total <b>'.$countReadyArray.'</b> URLs have been collected into '.$csvFile.'<br />';
	$handle = fopen($csvFile, 'a'); // Open the output file once, in append mode
	foreach ($readyArray as $value)
	{
		$printme['url'] = $value["URL"];
		echo $printme['url'].'<br />';
		// Save the result to file, one URL per line
		fwrite($handle, $printme['url']."\r\n");
	}
	fclose($handle);
	// LAST, Delete temporary file
	if ( file_exists($fileTemp) ) { 
		unlink ($fileTemp); 
	}

/* 
 * ###############################
 * 23-02-2011 by onemadeye @ EE
 * ###############################
 */


 
Geoff Kenyon commented:
You can scrape the SERPs pretty easily using a spreadsheet in Google Docs (kind of ironic, as they don't like people scraping the SERPs :)).

Here is a pretty good guide to getting started scraping using Google Docs.

You use the query URL: http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10

You can change the number of results that you scrape by changing &num=10 to whatever number of results you want (so &num=30 would scrape 30 results).
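For example, guides like that are generally built around the ImportXML() spreadsheet function. A hedged sketch of a cell formula (the XPath below assumes the old h3.r result markup and may need adjusting for current Google HTML):

=ImportXML("http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10", "//h3[@class='r']/a/@href")

Once the links show up in the sheet, File > Download as > CSV gets you the plain list of URLs the asker wants.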