chrisj1963

Scrape not working consistently and as intended

Hi - I have a script where the user enters a keyword to search, and the script is supposed to return the number of "intitle:" phrase matches.

Right now the script seems to work great when there are more than 10 results, but it returns an empty result when there are fewer than 10.

If you enter "landscaper wausau" here:
http://prontopage.net/local3b3.php

you get the results generated here:
http://prontopage.net/local3b3.php?s=landscaper+wausau&Submit=Count 

At the top of the result page you see the URL:
http://www.google.com/search?hl=en&q=intitle:%22landscaper+wausau%22 

which, when followed, generates a page with "3" results. That's the number I need to pull. $search_number should display as:
Pages Indexed: 3   (at the bottom of the screen)

If you look at the echoed data it displays
Pages Indexed:     (empty)

If you look in the echoed data you can see  
"WebResults 1 - 3 of about 3 for"    (the 3 between "about" and "for" is what I need)

Oddly a correct number is displayed if we enter:
landscaper madison

The result is 37

The URL generated is http://www.google.com/search?hl=en&q=intitle:%22landscaper+madison%22

and in the echoed data you see:
Results 1 - 10 of about 37 for

and at the bottom of the page you see:
Pages Indexed: 37

That is perfect....

Can someone please look at this and tell me how I might get the proper result when there are fewer than 10 results?

Thanks

<?php
function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}

$s = $_GET['s'];
if (isset($s))
{
echo "<p><i>Search for $s</i></p>";
	echo $url='http://www.google.com/search?hl=en&q=intitle:'.urlencode('"'.$s.'"');
	$data = my_fetch($url);


	//$data = my_fetch('http://www.google.com/search?q=intitle%3A%22'. $s .'%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a');


    //strip off HTML
    $data = strip_tags($data);
	echo '<br><br>';
	echo $data;
    //now $data only has text NO HTML
    //these have to found out in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';
    //have text beginning from $find
    $data = strstr($data, $find);
	

    //find position of $find2
    //there might be many occurence
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);

//take substring out, which'd be the number we want
$search_number=substr($data,strlen($find), $pos-strlen($find));

echo '<br><br>';
echo "Pages Indexed: $search_number";
}
else
{
    ?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>

Dave Baldwin

I think you're looking for the wrong thing.  Try searching for "<div id=resultStats>" and the next number after that should be the number of results found.  Notice below there's more than one format for that string.
>>>> wausau
<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>
>>> Madison
<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>


Oh, you may have to find that before you strip the HTML.  That tag tells you where the results are going to be.
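The resultStats suggestion above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name extract_result_count() and the regular expression are assumptions, and the pattern is written against the two sample strings shown above, so it will need adjusting if Google's markup differs. It has to run on the raw HTML, before strip_tags() is called.

```php
<?php
// Hypothetical helper (not from the thread): read the count that follows
// the resultStats div in the raw, unstripped HTML.
function extract_result_count(string $html): ?int
{
    // Covers both sample formats: "3 results" and "About 35 results".
    // [\d,]+ also tolerates thousands separators such as "1,230".
    $pattern = '/<div id="?resultStats"?[^>]*>\s*(?:About\s+)?([\d,]+)\s+results?/i';
    if (preg_match($pattern, $html, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // marker not found: the markup changed or the fetch failed
}

// The two sample strings quoted above.
$samples = [
    '<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>',
    '<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>',
];
foreach ($samples as $html) {
    echo 'Pages Indexed: ', extract_result_count($html), "\n";
}
```

Returning null when the marker is missing makes it easy to tell "no match" apart from a genuine count of zero.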
It looks like this may be something of a culprit on line 32...

$find = 'Results 1 - 10 of about ';

I'll see what I can do to help with this.
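To illustrate why that line is a problem: the literal 'Results 1 - 10 of about ' only appears when a full page of 10 results is listed. With 3 results the page says "Results 1 - 3 of about 3 for", so strstr() returns false and nothing is left to take the number from. A minimal sketch of one way around it, keeping the stripped-text approach but letting the upper bound of the range be any number, is below. The helper name extract_count_from_text() and the pattern are assumptions for illustration, tested only against the two strings quoted in the question.

```php
<?php
// Hypothetical helper (not from the thread): same idea as the $find/$find2
// approach, but the "10" in "Results 1 - 10" is matched as any number.
function extract_count_from_text(string $text): ?int
{
    $pattern = '/Results\s+1\s*-\s*\d+\s+of\s+(?:about\s+)?([\d,]+)\s+for/i';
    if (preg_match($pattern, $text, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // phrase not present in the text
}

// The hard-coded needle does not occur in the 3-result text.
var_dump(strstr('WebResults 1 - 3 of about 3 for', 'Results 1 - 10 of about '));

// The two strings quoted in the question.
echo 'Pages Indexed: ', extract_count_from_text('WebResults 1 - 3 of about 3 for'), "\n";
echo 'Pages Indexed: ', extract_count_from_text('Results 1 - 10 of about 37 for'), "\n";
```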
ASKER CERTIFIED SOLUTION
Ray Paseur

chrisj1963

ASKER

Thanks Ray!  works great.  appreciate it
wow. I looked more closely at the code.   suuuuuper sweeet.  thanks again!!!!
Thanks for the points - it's a great question! ~Ray