Solved

Download google search links to csv

Posted on 2011-02-21
Last Modified: 2013-11-19
Hi,

I would like to download all the results for a search term to csv or xls. Ideally this would just be the links and not the description.

Anyone know a way of doing this?

thanks.

w
Question by:wilflife
3 Comments
 
Expert Comment

by:Aaron Tomosky
ID: 34948628
Curl + pregmatch.
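
To illustrate the idea, here is a minimal sketch rather than a finished scraper. The helper names `fetchPage` and `extractLinks` are hypothetical, and the pattern assumes links appear as quoted `href` attributes with an absolute http or https URL.

```php
<?php
// Hypothetical helper: download a page with cURL and return the body (false on failure).
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // seconds
    return curl_exec($ch);
}

// Hypothetical helper: collect absolute http(s) links from an HTML string.
function extractLinks($html)
{
    // Capture the value of every quoted href that starts with http:// or https://
    preg_match_all('%href=["\'](https?://[^"\'#]+)%i', $html, $matches);
    // Drop duplicates and reindex the result
    return array_values(array_unique($matches[1]));
}

// Works on any HTML string, so it can be tried without a network connection
$sample = '<a href="http://example.com/a">A</a> <a href="http://example.com/a">A again</a> <a href="https://example.org/b">B</a>';
print_r(extractLinks($sample));
```

In practice you would pass the result of `fetchPage()` to `extractLinks()` and check for `false` first.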
Assisted Solution

by:onemadeye
onemadeye earned 250 total points
ID: 34954389
Okay ... here is a script adapted from one of my raw development projects.
Save it as a PHP file on your server and run it from a browser.

Before you run the script, adjust the parameters at the top of the file to suit your needs.
Here they are:
$q              = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find
$avoidUrl = array(  // URL(s) listed here will not be included in the result
                              'amazon.com',
                              'google.com',
                              'googleusercontent.com',
                              'wikipedia.org',
                              'youtube.com',
                          );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

Let me know...
<?php // MEDY @ EE googsearch.php 23-02-2011

########## EDIT BELOW PARAMS TO SUIT YOUR NEEDS ##########

$q		  = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find // Too many searches might hang your server :p
$avoidUrl = array(  // URL(s) listed here will not be included in the result
					'amazon.com',
					'google.com',
					'googleusercontent.com',
					'wikipedia.org',
					'youtube.com',
				  );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

########## STOP EDITING unless you know what you're doing ##########

function cURLcheckBasicFunctions()
{
	// Return false if ANY of the required cURL functions is missing
  if( !function_exists("curl_init") ||
      !function_exists("curl_setopt") ||
      !function_exists("curl_exec") ||
      !function_exists("curl_close") ) return false;
  else return true;
} 

function getPage($url, $referer, $agent, $header, $timeout) 
{ 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_PROXY, $proxy);
    //curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1); // Only meaningful together with CURLOPT_PROXY
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 
    $result['EXE'] = curl_exec($ch);
    $result['INF'] = curl_getinfo($ch);
    $result['ERR'] = curl_error($ch);
 
    curl_close($ch);
 
    return $result;
}

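function getUrl($url) // Turns anysite.com/whatever.htm to anysite.com
{
    // Strip an optional http:// or https:// scheme and keep the host part
    if ( !preg_match("%^(https?://)?([^/]+)%si", $url, $matches) ) return "";
    $host = trim($matches[2]);
    // Reduce the host to its domain (e.g. www.example.co.uk -> example.co.uk)
    if ( !preg_match("%[^./]+\.(....|...|..|za.org|travel|co.?\...)$%si", $host, $matches) ) return "";
    return trim($matches[0]);
}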

// Check if the server supports the cURL functions
$cURL_supported = cURLcheckBasicFunctions();
if ( !$cURL_supported ) { die("Cannot run script because the server does not support cURL!"); }

// Check if no search term supplied
if ( !$q ) { die("Have you specified a keyword?"); }

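		// Initialise the arrays used below so count() never receives an undefined variable.
		// (The original loop over $aLink referenced a variable that is never defined in this script.)
		$readyArray     = array();
		$cleanURLArray  = array();
		$urlsFoundArray = array();
		//print_r($avoidUrl);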

       $j = 1;
       while (count($readyArray) < $findLink )
       {
           $startPage = ($findLink * $j) - ($findLink);
           $num = $findLink * 2;

	$wordQuery = urlencode($q);
	$fileTemp = "temp.txt"; // Define name for temporary TXT file

	$resPage = getPage(
    //'', // use valid proxy [proxy IP]:[port]
    'http://74.125.87.104/search?hl=en&q='.$wordQuery.'&start='.$startPage.'&num='.$num, // 74.125.87.104 = Google IP
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    200); 

if ( empty($resPage['ERR']) && $resPage['INF']['http_code'] == 200 ) { 
    // Job's done! Parse, save, etc.
			// First, save scrape results to a temporary TXT file (works best on some hosting).
			// Mode 'w' creates the file if needed and truncates it, so a shorter page
			// never leaves stale bytes from the previous iteration behind.
			$is_open = fopen ($fileTemp, 'w');
			if ( !$is_open ) { die("Cannot write to temporary file!"); }
			fwrite ($is_open, $resPage['EXE']);
			fclose ($is_open);
} else { 
	$findLink = -1; // Get out of while loop if no connection
	die("Problems occurred on the server!");
}
//print_r($resPage);

@$fp = file_get_contents($fileTemp); // debug
//file_put_contents('temp.txt', $fp, FILE_APPEND); // debug 

           if(!$fp)
           {
               $findLink = -1; // Get out of while loop if no connection
               die("Problems occurred!");
           }
           else
           {
               if ($j > 1) // If used in subsequent iterations, clear $urlsFoundArray
               {
                   // Check to make sure subsequent iterations array content is not the same
                   // If the same, break out of loop - this is to stop infinite loops
                   $prevUrlsFoundArray = $urlsFoundArray;
                   unset($urlsFoundArray);
               }

			   // preg_match from google search results in temporary file
               preg_match_all('%<div class="?s"?>.*?http://?(.*?)/?".*?>.*?</a>%si', $fp, $urlsFoundArray); 
               // Full Urls are located in $urlsFoundArray[1][$a]
               //print_r($urlsFoundArray);

               if(($j > 1) && ($prevUrlsFoundArray==$urlsFoundArray))
               {      
                   break; // echo "Found duplicated entries";
               }

               if(count($urlsFoundArray)  < 1)
               { 
				   die("Cannot scrape Google at the moment. Try again later!");
               }
               elseif(count($urlsFoundArray[1]) == 0)
               { 
				   die("Could not find any links with specified keyword!");
               }
               else
                {
                   // Assign full URL to index and cleaned URL to value and clean duplicate values
                   foreach ($urlsFoundArray[1] as $value)
                   {
                       $cleanURLArray[$value] =  getUrl($value);

                       if ($cleanURLArray[$value] == "")
                       {
                           unset($cleanURLArray[$value]);
                       }
                   }
                   $cleanURLArray = array_unique($cleanURLArray);
                   //print_r($cleanURLArray);

                   // Adds cleaned, unique URLs that are not in the avoid list to readyArray
                   foreach( $cleanURLArray as $key => $value)
                   { 
                       if ( !in_array($value, $avoidUrl) ) // Make sure not in forbidden link list
                       {
								// Append instead of indexing with a counter that restarts on every
								// page, which would overwrite entries collected from earlier pages
								$readyArray[] = array("URL" => $value); 
								$avoidUrl[] = $value; // Prevents the same site being added twice
                       }
						if ( count($readyArray) == $findLink)
						{
							break;
						}
                   }// End of foreach
               } // End of else
           }
           $j++;
       } // End of while

	$countReadyArray = count($readyArray);
	//print_r($readyArray);
	echo 'Total <b>'.$countReadyArray.'</b> URL(s) have been collected to '.$csvFile.'<br />';
	// Open the results file once, in append mode, instead of once per URL
	$handle = fopen($csvFile,'a');
		foreach ($readyArray as $value)
           {
               $printme['url'] = $value["URL"];
			   echo $printme['url'].'<br />';
				// Saving results to file
				fwrite($handle, $printme['url']."\r\n");
		   }
	fclose($handle);
	// LAST, Delete temporary file
	if ( file_exists($fileTemp) ) { 
		unlink ($fileTemp); 
	}

/* 
 * ###############################
 * 23-02-2011 by onemadeye @ EE
 * ###############################
 */


Accepted Solution

by:
Geoff Kenyon earned 250 total points
ID: 34967800
You can scrape the SERPs pretty easily using a spreadsheet in Google Docs (kind of ironic, as they don't like people scraping the SERPs :))

Here is a pretty good guide to getting started scraping using Google Docs.

You use the query URL: http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10

You can change the number of results that you scrape by changing &num=10 to however many results you want (so &num=30 would scrape 30 results).
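
If you would rather build that query URL in a script, a rough PHP sketch could look like the one below. The helper names `buildSearchUrl` and `saveLinksToCsv` are made up for illustration. The parameters are the ones from the URL above, and the CSV layout (one link per row) is an assumption based on the original question.

```php
<?php
// Hypothetical helper: build the search URL for a query and a result count.
function buildSearchUrl($query, $num = 10)
{
    $params = array('q' => $query, 'pws' => 0, 'hl' => 'en', 'num' => $num);
    // http_build_query() URL-encodes the values; spaces become "+"
    return 'http://www.google.com/search?' . http_build_query($params);
}

// Hypothetical helper: write one link per row to a CSV file.
function saveLinksToCsv(array $links, $csvFile)
{
    $handle = fopen($csvFile, 'w');
    foreach ($links as $link) {
        // Explicit delimiter, enclosure and escape arguments keep behaviour the same across PHP versions
        fputcsv($handle, array($link), ',', '"', '\\');
    }
    fclose($handle);
}

echo buildSearchUrl('scrape google using google docs', 30), "\n";
```

Using `fputcsv()` instead of writing raw lines means a link containing a comma or quote is still quoted correctly in the file.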