Solved

Scrape not working consistently and as intended

Posted on 2010-09-01
Last Modified: 2012-05-10
Hi - I have a script where the user enters a keyword to search, and the script is supposed to return the number of "intitle:" phrase matches.

Right now the script seems to work great when there are more than 10 results but returns an empty result when there are fewer than 10.

If you enter "landscaper wausau" here:
http://prontopage.net/local3b3.php

you get the results generated here:
http://prontopage.net/local3b3.php?s=landscaper+wausau&Submit=Count

At the top of the result page you see the URL:
http://www.google.com/search?hl=en&q=intitle:%22landscaper+wausau%22

Following that URL produces a page with 3 results. That is the number I need to pull, so $search_number should display as:
Pages Indexed: 3   (at the bottom of the screen)

If you look at the echoed data it displays
Pages Indexed:     (empty)

If you look in the echoed data you can see  
"WebResults 1 - 3 of about 3 for"    (the 3 between "about" and "for" is what I need)

Oddly a correct number is displayed if we enter:
landscaper madison

The result is 37

The url generated is http://www.google.com/search?hl=en&q=intitle:%22landscaper+madison%22

and in the echoed data you see:
Results 1 - 10 of about 37 for

and at the bottom of the page you see:
Pages Indexed: 37

That is perfect....

Can someone please look at this and tell me how I might get the proper result when there are fewer than 10 results?

Thanks

<?php
function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}

$s = $_GET['s'];
if (isset($s))
{
echo "<p><i>Search for $s</i></p>";
	echo $url='http://www.google.com/search?hl=en&q=intitle:'.urlencode('"'.$s.'"');
	$data = my_fetch($url);


	//$data = my_fetch('http://www.google.com/search?q=intitle%3A%22'. $s .'%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a');


    //strip off HTML
    $data = strip_tags($data);
	echo '<br><br>';
	echo $data;
    //now $data only has text NO HTML
    //these have to found out in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';
    //have text beginning from $find
    $data = strstr($data, $find);
	

    //find position of $find2
    //there might be many occurence
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);

//take substring out, which'd be the number we want
$search_number=substr($data,strlen($find), $pos-strlen($find));

echo '<br><br>';
echo "Pages Indexed: $search_number";
}
else
{
    ?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>

Question by:chrisj1963
7 Comments
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 33582154
I think you're looking for the wrong thing.  Try searching for "<div id=resultStats>" and the next number after that should be the number of results found.  Notice below there's more than one format for that string.
>>>> wausau
<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>
>>> Madison
<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>
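As a minimal sketch of this suggestion, the count can be read straight from the raw HTML with one pattern that accepts both formats shown above. The function name `result_count_from_html` is hypothetical (it is not from the thread), and the pattern assumes the markup looks exactly like the two samples; Google's real markup may differ or change.

```php
<?php
// Hypothetical helper (not from the thread): read the result count from the
// resultStats div in raw HTML, before any strip_tags() call.
function result_count_from_html($html)
{
    // Accept "3 results" as well as "About 35 results"; allow thousands separators.
    $pattern = '/<div id=resultStats>\s*(?:About\s+)?([0-9][0-9,]*)\s+results?/i';
    if (preg_match($pattern, $html, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // marker not found
}

// The two sample strings quoted in the comment above.
$wausau  = '<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>';
$madison = '<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>';

var_dump(result_count_from_html($wausau));
var_dump(result_count_from_html($madison));
var_dump(result_count_from_html('<p>no stats here</p>'));
```

Returning null when the marker is missing lets the caller tell "no match in the page" apart from a genuine count.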

 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 33582164
Oh, you may have to find that before you strip the HTML.  That tag tells you where the results are going to be.
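A small sketch of why the order matters, using one of the sample strings from the previous comment: once `strip_tags()` has run, the `<div id=resultStats>` marker is no longer in the text, so any search for it has to happen on the raw HTML. The variable names here are illustrative only.

```php
<?php
// Sample markup quoted earlier in the thread.
$raw    = '<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>';
$marker = '<div id=resultStats>';

// The marker is present in the raw HTML.
var_dump(strpos($raw, $marker) !== false);

// After stripping tags the marker is gone, so it can no longer be used
// to locate the result count.
$text = strip_tags($raw);
var_dump(strpos($text, $marker) !== false);
```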
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33582488
It looks like this may be something of a culprit on line 32...

$find = 'Results 1 - 10 of about ';

I'll see what I can do to help with this.
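To illustrate why that line is a problem: the needle hard-codes "1 - 10", so `strstr()` returns false whenever the first page holds fewer than 10 results, and the later `substr()` has nothing to cut from. The sketch below shows that failure and one possible alternative on the tag-stripped text. The helper name `result_count_from_text` is hypothetical, and the pattern assumes the "Results 1 - N of about M for" wording quoted in the question.

```php
<?php
// Hypothetical helper (not from the thread): read the "of about N for" count
// from tag-stripped text without assuming the page shows exactly 10 results.
function result_count_from_text($text)
{
    $pattern = '/Results\s+1\s*-\s*\d+\s+of\s+(?:about\s+)?([0-9][0-9,]*)\s+for/i';
    if (preg_match($pattern, $text, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // wording not found
}

// The two strings quoted in the question.
$few  = 'WebResults 1 - 3 of about 3 for';
$many = 'Results 1 - 10 of about 37 for';

// The fixed needle only exists when the first page is full.
var_dump(strstr($few, 'Results 1 - 10 of about '));

// The pattern accepts any upper bound for the range.
var_dump(result_count_from_text($few));
var_dump(result_count_from_text($many));
```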

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 33582504
Google returns different HTML depending on the browser you use (or simulate).  This script sends a Mozilla user agent that identifies itself as Firefox 3.5, so it gets the latest HTML and CSS.

Test cases:
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+virginia
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausau
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausaw

Best regards, ~Ray
<?php // RAY_temp_chrisj1963.php
error_reporting(E_ALL);


function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt($curl, CURLOPT_URL,            $url);
    curl_setopt($curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($curl, CURLOPT_HTTPHEADER,     $header);
    curl_setopt($curl, CURLOPT_REFERER,        'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING,       'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);
    curl_close($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        return FALSE;
    }

    // ON SUCCESS
    return $htm;
}


// THINGS TO LOOK FOR
$rs = '<div id=resultStats>';
$nb = '<nobr>';

// WHERE TO LOOK
$url = 'http://www.google.com/search?q=intitle:';

// IF THERE IS AN ARGUMENT, CALL GOOGLE
if (isset($_GET['s']))
{
    // ESCAPE THE USER INPUT BEFORE ECHOING IT BACK INTO THE PAGE
    echo "<p><i>Search for " . htmlspecialchars($_GET["s"]) . "</i></p>";

    // CONSTRUCT URL
    // URL-ENCODE THE QUOTES TOGETHER WITH THE PHRASE
    echo $url = $url . urlencode('"' . $_GET["s"] . '"');
    $htm = my_curl($url);

    // ACTIVATE THIS TO SEE THE HTML FROM GOOGLE
    // echo htmlentities($htm);

    // USE BREAKPOINTS APPROPRIATE TO THE MOZILLA BROWSER RESPONSE TEXT
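    $arr = explode($rs, $htm);
    // IF THE MARKER IS MISSING THERE IS NO SECOND ELEMENT - FALL BACK TO AN EMPTY STRING
    $arr = explode($nb, isset($arr[1]) ? $arr[1] : '');
    $str = preg_replace('/[^0-9]/', '', $arr[0]);
    // CAST TO INT SO AN EMPTY STRING BECOMES ZERO
    $num = number_format((int) $str);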

    echo '<br><br>';

    if ($num == 0)
    {
        echo "No Pages Found";
        die();
    }
    echo "Pages Indexed: $num";
    die();
}
// END OF PHP, PUT UP THE FORM
?>
<form>
<div align="center">
<p>
<input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" />
</p>
</div>
</form>

 

Author Closing Comment

by:chrisj1963
ID: 33582679
Thanks Ray!  works great.  appreciate it
 

Author Comment

by:chrisj1963
ID: 33582690
wow. I looked more closely at the code.   suuuuuper sweeet.  thanks again!!!!
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33586859
Thanks for the points - it's a great question! ~Ray