Solved

Download Google search links to CSV

Posted on 2011-02-21
Medium Priority
412 Views
Last Modified: 2013-11-19
Hi,

I would like to download all the results for a search term to CSV or XLS. Ideally this would be just the links, not the descriptions.

Anyone know a way of doing this?

thanks.

w
Question by:wilflife
3 Comments
 
LVL 39

Expert Comment

by:Aaron Tomosky
ID: 34948628
cURL + preg_match.
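To spell that idea out, here's a minimal sketch. The search URL and the link regex are my assumptions about Google's markup, so treat them as a starting point rather than a drop-in:

<?php
// Minimal sketch of the cURL + preg_match approach:
// fetch a Google results page, pull out the links, write them to CSV.
$q  = urlencode('your search term');
$ch = curl_init('http://www.google.com/search?q=' . $q . '&num=10');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // Google tends to block empty user agents
$html = curl_exec($ch);
curl_close($ch);

// Pull href values out of the result anchors (hypothetical pattern --
// check the live HTML and adjust)
preg_match_all('%<h3 class="r"><a href="(http[^"]+)"%si', $html, $matches);

// One link per CSV row
$fp = fopen('links.csv', 'w');
foreach (array_unique($matches[1]) as $link) {
    fputcsv($fp, array($link));
}
fclose($fp);
?>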
 
LVL 5

Assisted Solution

by:onemadeye
onemadeye earned 1000 total points
ID: 34954389
Okay ... here's a script adapted from one of my rough development projects.
Save it as a PHP file on your server and run it from your browser.

Before you run the script, please adjust the parameters at the top of the file ($q, $findLink, $avoidUrl, $csvFile) to suit your needs.

Let me know...
<?php // MEDY @ EE googsearch.php 23-02-2011

########## EDIT BELOW PARAMS TO SUIT YOUR NEEDS ##########

$q		  = 'learn php curl'; // Search query
$findLink = 10; // Number of links to find // Too many searches might hang your server :p
$avoidUrl = array(  // URL(s) listed here will not be included in the results
					'amazon.com',
					'google.com',
					'googleusercontent.com',
					'wikipedia.org',
					'youtube.com',
				  );
$csvFile  = 'result.csv'; // Name of the file to save results (csv or txt)

########## STOP EDITING unless you know what you're doing ##########

function cURLcheckBasicFunctions()
{
	// Make sure all required cURL functions exist (fail if ANY is missing)
	if( !function_exists("curl_init") ||
	    !function_exists("curl_setopt") ||
	    !function_exists("curl_exec") ||
	    !function_exists("curl_close") ) return false;
	else return true;
}

function getPage($url, $referer, $agent, $header, $timeout) 
{ 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 
    $result['EXE'] = curl_exec($ch);
    $result['INF'] = curl_getinfo($ch);
    $result['ERR'] = curl_error($ch);
 
    curl_close($ch);
 
    return $result;
}

function getUrl($url) // Reduces anysite.com/whatever.htm to its bare domain, e.g. anysite.com
{
    preg_match("%^(http://)?([^/]+)%si", $url, $matches);
    $host = trim($matches[2]);
    // Crude TLD matcher: keeps the registrable domain for common TLD patterns
    preg_match("%[^./]+\.(....|...|..|za.org|travel|co.?\...)$%si", $host, $matches);
    return trim($matches[0]);
}

// Check that the server supports the cURL functions
$cURL_supported = cURLcheckBasicFunctions();
if ( !$cURL_supported ) { die("Cannot run script because the server does not support cURL!"); }

// Check that a search term was supplied
if ( !$q ) { die("Have you specified a keyword?"); }

// Seed the avoid list from a previous result set ($aLink), if one exists;
// this standalone version has none, so the loop is normally skipped
if ( !empty($aLink) )
{
	$i = 0;
	while( $i < count($aLink) )
	{
		$avoidUrl[] = $aLink[$i]['url'];
		$i++;
	}
}
//print_r($avoidUrl);

$readyArray = array(); // Collected result URLs
$j = 1;
while ( count($readyArray) < $findLink )
{
	$startPage = ($findLink * $j) - $findLink; // Results offset for this page
	$num = $findLink * 2; // Ask for extra results to allow for filtering

	$wordQuery = urlencode($q);
	$fileTemp = "temp.txt"; // Define name for the temporary TXT file

	$resPage = getPage(
    //'', // use valid proxy [proxy IP]:[port]
    'http://74.125.87.104/search?hl=en&q='.$wordQuery.'&start='.$startPage.'&num='.$num, // 74.125.87.104 = Google IP
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    0,    // $header = 0: don't include response headers in the output
    200); // $timeout (seconds)

if ( empty($resPage['ERR']) && $resPage['INF']['http_code'] == 200 ) { 
	// Job's done! Save the raw HTML to a temporary TXT file
	// (parsing from a file works best on some hosting)
	file_put_contents($fileTemp, $resPage['EXE']);
	@chmod($fileTemp, 0666);
} else { 
	$findLink = -1; // Get out of while loop if no connection
	die("Problems occurred on the server!");
}
//print_r($resPage);

@$fp = file_get_contents($fileTemp); // Read the scraped HTML back in
//file_put_contents('temp.txt', $fp, FILE_APPEND); // debug 

           if(!$fp)
           {
               $findLink = -1; // Get out of while loop if no connection
               die("Problems occurred!");
           }
           else
           {
               if ($j > 1) // If used in subsequent iterations, clear $urlsFoundArray
               {
                   // Check to make sure subsequent iterations array content is not the same
                   // If the same, break out of loop - this is to stop infinite loops
                   $prevUrlsFoundArray = $urlsFoundArray;
                   unset($urlsFoundArray);
               }

			   // Extract result links from the scraped Google HTML in the temporary file
			   // (pattern matches Google's result markup at the time of writing)
               preg_match_all('%<div class="?s"?>.*?http://?(.*?)/?".*?>.*?</a>%si', $fp, $urlsFoundArray); 
               // Full Urls are located in $urlsFoundArray[1][$a]
               //print_r($urlsFoundArray);

               if(($j > 1) && ($prevUrlsFoundArray==$urlsFoundArray))
               {      
                   break; // echo "Found duplicated entries";
               }

               if(count($urlsFoundArray)  < 1)
               { 
				   die("Cannot scrape Google at the moment. Try again later!");
               }
               elseif(count($urlsFoundArray[1]) == 0)
               { 
				   die("Could not find any links with specified keyword!");
               }
               else
                {
                   // Assign full URL as key and cleaned URL as value, then drop duplicates
                   $cleanURLArray = array();
                   foreach ($urlsFoundArray[1] as $value)
                   {
                       $cleanURLArray[$value] =  getUrl($value);

                       if ($cleanURLArray[$value] == "")
                       {
                           unset($cleanURLArray[$value]);
                       }
                   }
                   $cleanURLArray = array_unique($cleanURLArray);
                   //print_r($cleanURLArray);

                   // Creates readyArray with full and unique (recently found and in db) urls
                   $i = 0;
                   foreach( $cleanURLArray as $key => $value)
                   { 
                       if ( count($avoidUrl) > 0 )
                       {
                           if ( !in_array($value, $avoidUrl) ) // Make sure not in forbidden link list
                           {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                           }
                       }
                       else
                       {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                       }
					  $i++;
						if ( count($readyArray) == $findLink)
						{
							break;
						}
                   }// End of foreach
               } // End of else
           }
           $j++;
       } // End of while

	$countReadyArray = count($readyArray);
	//print_r($readyArray);
	echo 'Total <b>'.$countReadyArray.'</b> URLs have been collected into '.$csvFile.'<br />';
	$handle = fopen($csvFile,'a'); // Open the results file once, in append mode
	foreach ($readyArray as $value)
	{
		echo $value["URL"].'<br />';
		fwrite($handle, $value["URL"]."\r\n"); // One URL per line
	}
	fclose($handle);
	// LAST, Delete temporary file
	if ( file_exists($fileTemp) ) { 
		unlink ($fileTemp); 
	}

/* 
 * ###############################
 * 23-02-2011 by onemadeye @ EE
 * ###############################
 */

 
LVL 16

Accepted Solution

by:Geoff Kenyon
Geoff Kenyon earned 1000 total points
ID: 34967800
You can scrape the SERPs pretty easily using a spreadsheet doc in Google Docs (kinda ironic, as they don't like people scraping the SERPs :))

Here is a pretty good guide to getting started scraping using Google Docs.

You use the query URL: http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10

You can change the number of results that you scrape by changing &num=10 to whatever number of results you want (so &num=30 would scrape 30 results).
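If it helps, the trick in that guide boils down to Google Docs' ImportXML function. Something like this in a spreadsheet cell should pull in the result links (the XPath is my guess at Google's result markup, so it may need tweaking):

=ImportXML("http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10", "//h3[@class='r']/a/@href")

Once the links fill the column, File > Download as > CSV gets you exactly the file you asked for.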
