Solved

Download Google search links to CSV

Posted on 2011-02-21
Last Modified: 2013-11-19
Hi,

I would like to download all the results for a search term to CSV or XLS. Ideally this would be just the links, without the descriptions.

Anyone know a way of doing this?

thanks.

Question by:wilflife
3 Comments
 
LVL 39

Expert Comment

by:Aaron Tomosky
ID: 34948628
cURL + preg_match: fetch the results page with cURL, then pull the links out with preg_match.
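A minimal sketch of that idea, not a finished scraper. The helper names `fetchPage` and `extractLinks` and the sample markup are made up for illustration, and the pattern assumes plain double-quoted `href` attributes, which real result pages may not use.

```php
<?php
// Hypothetical helper: fetch a page body with cURL (returns '' on failure).
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for the connection
    ]);
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// Hypothetical helper: collect absolute http(s) link targets from an HTML string.
function extractLinks(string $html): array
{
    preg_match_all('%<a\s[^>]*href="(https?://[^"]+)"%i', $html, $m);
    // $m[1] holds the captured URLs; drop repeats and reindex.
    return array_values(array_unique($m[1]));
}

// Sample markup standing in for a fetched page (assumption, not real search output).
$sample = '<a href="http://example.com/a">A</a> '
        . '<a class="x" href="https://example.org/">B</a> '
        . '<a href="/local">C</a> '
        . '<a href="http://example.com/a">A again</a>';

$links = extractLinks($sample);
print_r($links);
```

In real use you would pass `fetchPage($url)` to `extractLinks()` instead of the sample string. Relative links such as `/local` are skipped on purpose.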
 
LVL 5

Assisted Solution

by:onemadeye
onemadeye earned 1000 total points
ID: 34954389
Okay, here is a script adapted from some of my rough development code.
Save it as a PHP file on your server and run it from a browser.

Before you run the script, adjust the parameters at the top of the file to suit your needs.
Here they are:
$q        = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find
$avoidUrl = array(  // URLs listed here will not be included in the results
                              'amazon.com',
                              'google.com',
                              'googleusercontent.com',
                              'wikipedia.org',
                              'youtube.com',
                          );
$csvFile  = 'result.csv'; // Name of the file to save results to (csv or txt)

Let me know...
<?php // MEDY @ EE googsearch.php 23-02-2011

########## EDIT BELOW PARAMS TO SUIT YOUR NEEDS ##########

$q		  = 'learn php curl'; // Search query
$findLink = '10'; // Number of links to find // Too many searches might hang your server :p
$avoidUrl = array(  // URLs listed here will not be included in the results
					'amazon.com',
					'google.com',
					'googleusercontent.com',
					'wikipedia.org',
					'youtube.com',
				  );
$csvFile  = 'result.csv'; // Name of the file to save results to (csv or txt)

########## STOP EDITING unless you know what you're doing ##########

function cURLcheckBasicFunctions()
{
	// Return false if any of the required cURL functions is missing
  if( !function_exists("curl_init") ||
      !function_exists("curl_setopt") ||
      !function_exists("curl_exec") ||
      !function_exists("curl_close") ) return false;
  else return true;
} 

function getPage($url, $referer, $agent, $header, $timeout) 
{ 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 
    $result['EXE'] = curl_exec($ch);
    $result['INF'] = curl_getinfo($ch);
    $result['ERR'] = curl_error($ch);
 
    curl_close($ch);
 
    return $result;
}

function getUrl($url) // Turns anysite.com/whatever.htm to anysite.com
{
    preg_match("%^(http://)?([^/]+)%si", $url, $matches);
    $host = trim($matches[2]);
    preg_match("%[^./]+\.(....|...|..|za.org|travel|co.?\...)$%si", $host, $matches);
    return trim($matches[0]);
}

// Check if the server supports the cURL functions
$cURL_supported = cURLcheckBasicFunctions();
if ( !$cURL_supported ) { die("Cannot run script because the server does not support cURL!"); }

// Check if no search term was supplied
if ( !$q ) { die("Did you specify a keyword?"); }

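		// Leftover from the original development code: $aLink would hold links saved by an earlier run.
		// Nothing in this script fills it, so it starts empty here.
		$aLink = array();
		$readyArray = array(); // Links collected by this run
		$i = 0;
		while( $i < count($aLink) )  // Adds previously saved links to the avoid list
		{
			$avoidUrl[] = $aLink[$i]['url'];
			$i++;
		}
		//print_r($avoidUrl);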

       $j = 1;
       while (count($readyArray) < $findLink )
       {
           $startPage = ($findLink * $j) - ($findLink);
           $num = $findLink * 2;

	$wordQuery = urlencode($q);
	$fileTemp = "temp.txt"; // Define name for temporary TXT file

	$resPage = getPage(
    //'', // use valid proxy [proxy IP]:[port]
    'http://74.125.87.104/search?hl=en&q='.$wordQuery.'&start='.$startPage.'&num='.$num, // 74.125.87.104 = Google IP
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    200); 

if ( empty($resPage['ERR']) && $resPage['INF']['http_code'] == 200 ) { 
	$strTemp = $resPage['EXE'];
	// Job's done! Parse, save, etc.
	// First, save the scraped page to a temporary TXT file (works best on some hosting)
	$isNewFile = !file_exists($fileTemp);
	$is_open = fopen($fileTemp, 'w'); // 'w' creates the file if needed and truncates the previous page
	if ( !$is_open ) { die("Cannot write to ".$fileTemp."!"); }
	fwrite($is_open, $strTemp);
	fclose($is_open);
	if ( $isNewFile ) { chmod($fileTemp, 0666); }
} else { 
	$findLink = -1; // Get out of while loop if no connection
	die("Problems occurred on the server!");
}
//print_r($resPage);

@$fp = file_get_contents($fileTemp); // debug
//file_put_contents('temp.txt', $fp, FILE_APPEND); // debug 

           if(!$fp)
           {
               $findLink = -1; // Get out of while loop if no connection
               die("Problems occurred!");
           }
           else
           {
               if ($j > 1) // If used in subsequent iterations, clear $urlsFoundArray
               {
                   // Check to make sure subsequent iterations array content is not the same
                   // If the same, break out of loop - this is to stop infinite loops
                   $prevUrlsFoundArray = $urlsFoundArray;
                   unset($urlsFoundArray);
               }

			   // preg_match from google search results in temporary file
               preg_match_all('%<div class="?s"?>.*?http://?(.*?)/?".*?>.*?</a>%si', $fp, $urlsFoundArray); 
               // Full Urls are located in $urlsFoundArray[1][$a]
               //print_r($urlsFoundArray);

               if(($j > 1) && ($prevUrlsFoundArray==$urlsFoundArray))
               {      
                   break; // echo "Found duplicated entries";
               }

               if(count($urlsFoundArray)  < 1)
               { 
				   die("Cannot scrape Google at the moment. Try again later!");
               }
               elseif(count($urlsFoundArray[1]) == 0)
               { 
				   die("Could not find any links with specified keyword!");
               }
               else
                {
                   // Assign full URL to index and cleaned URL to value and clean duplicate values
                   foreach ($urlsFoundArray[1] as $value)
                   {
                       $cleanURLArray[$value] =  getUrl($value);

                       if ($cleanURLArray[$value] == "")
                       {
                           unset($cleanURLArray[$value]);
                       }
                   }
                   $cleanURLArray = array_unique($cleanURLArray);
                   //print_r($cleanURLArray);

                   // Creates readyArray with full and unique (recently found and in db) urls
                   $i = 0;
                   foreach( $cleanURLArray as $key => $value)
                   { 
                       if ( count($avoidUrl) > 0 )
                       {
                           if ( !in_array($value, $avoidUrl) ) // Make sure not in forbidden link list
                           {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                           }
                       }
                       else
                       {
								$readyArray[$i]["URL"] = $value; 
								$avoidUrl[] = $value;
                       }
					  $i++;
						if ( count($readyArray) == $findLink)
						{
							break;
						}
                   }// End of foreach
               } // End of else
           }
           $j++;
       } // End of while

	$countReadyArray = count($readyArray);
	//print_r($readyArray);
	echo 'A total of <b>'.$countReadyArray.'</b> URLs have been saved to '.$csvFile.'<br />';
		foreach ($readyArray as $value)
           {
               $printme['url'] = $value["URL"];
			   echo $printme['url'].'<br />';
				// Saving results to file
				$handle = fopen($csvFile,'a');
				fwrite($handle, $printme['url']."\r\n");
				fclose($handle);
		   }
	// LAST, Delete temporary file
	if ( file_exists($fileTemp) ) { 
		unlink ($fileTemp); 
	}

/* 
 * ###############################
 * 23-02-2011 by onemadeye @ EE
 * ###############################
 */
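
The saving step at the end of the script reopens the CSV file for every link and appends to it, so repeated runs pile up in the same file. As a hedged alternative, here is a small sketch that opens the file once and lets `fputcsv` handle quoting. The helper name `writeUrlsToCsv` and the sample values are assumptions, and unlike the script it overwrites the file instead of appending.

```php
<?php
// Hypothetical helper: write one URL per row, opening the file only once.
// Returns the number of rows written (0 if the file cannot be opened).
function writeUrlsToCsv(array $urls, string $path): int
{
    $handle = fopen($path, 'w');  // 'w' truncates output left by an earlier run
    if ($handle === false) {
        return 0;
    }
    $written = 0;
    foreach ($urls as $url) {
        // Delimiter, enclosure and escape are passed explicitly;
        // fputcsv quotes a field only when it needs to.
        if (fputcsv($handle, [$url], ',', '"', '\\') !== false) {
            $written++;
        }
    }
    fclose($handle);
    return $written;
}

// Sample data standing in for $readyArray's URLs (assumption).
$urls  = ['example.com', 'example.org'];
$path  = tempnam(sys_get_temp_dir(), 'csv');
$count = writeUrlsToCsv($urls, $path);
$rows  = file($path, FILE_IGNORE_NEW_LINES);  // read back to check the result
unlink($path);
print_r($rows);
```

To keep the script's append behaviour, open the file with `'a'` instead of `'w'`.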

 
LVL 16

Accepted Solution

by:Geoff Kenyon
Geoff Kenyon earned 1000 total points
ID: 34967800
You can scrape the SERPs pretty easily using a spreadsheet in Google Docs (kinda ironic, as they don't like people scraping the SERPs :))

Here is a pretty good guide to getting started scraping using Google Docs.

You use the query URL: http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=10

You can change the number of results that you scrape by changing &num=10 to however many results you want (so &num=30 would scrape 30 results).
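
If you would rather build that query URL in code than edit it by hand, something like the following sketch may help. It only assembles the URL shown above from its parameters; the function name `buildSearchUrl` is made up for illustration, and it does not fetch anything.

```php
<?php
// Hypothetical helper: build the search URL from this answer with a chosen result count.
function buildSearchUrl(string $query, int $num = 10): string
{
    $params = [
        'q'   => $query,  // http_build_query encodes spaces as '+'
        'pws' => 0,
        'hl'  => 'en',
        'num' => $num,    // number of results to request
    ];
    // The '&' separator is passed explicitly so the result does not depend on php.ini.
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

$url = buildSearchUrl('scrape google using google docs', 30);
echo $url, "\n";
// prints http://www.google.com/search?q=scrape+google+using+google+docs&pws=0&hl=en&num=30
```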
Question has a verified solution.