chrisj1963
asked on
Scrape not working consistently and as intended
Hi - I have a script where the user enters a keyword to search and the script is supposed to return the number of "intitle:" phrase matches.
Right now the script seems to work great for results that are >10 but seems to return an empty result when <10.
if you enter landscaper wausau here:
http://prontopage.net/local3b3.php
you get the results generated here:
http://prontopage.net/local3b3.php?s=landscaper+wausau&Submit=Count
at the top of the result page you see the url:
http://www.google.com/search?hl=en&q=intitle:%22landscaper+wausau%22
which when followed generates a page with "3" results. That's the number I need to pull. $search_number should display as:
Pages Indexed: 3 (at the bottom of the screen)
If you look at the echoed data it displays
Pages Indexed: (empty)
If you look in the echoed data you can see
"WebResults 1 - 3 of about 3 for" (the 3 between "about" and "for" is what I need)
Oddly a correct number is displayed if we enter:
landscaper madison
The result is 37
The url generated is http://www.google.com/search?hl=en&q=intitle:%22landscaper+madison%22
and in the echoed data you see:
Results 1 - 10 of about 37 for
and at the bottom of the page you see:
Pages Indexed: 37
That is perfect....
can someone please look at this and tell me how I might get the proper result for <10 results?
Thanks
<?php
function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}
$s = $_GET['s'];
if (isset($s))
{
echo "<p><i>Search for $s</i></p>";
echo $url='http://www.google.com/search?hl=en&q=intitle:'.urlencode('"'.$s.'"');
$data = my_fetch($url);
//$data = my_fetch('http://www.google.com/search?q=intitle%3A%22'. $s .'%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a');
//strip off HTML
$data = strip_tags($data);
echo '<br><br>';
echo $data;
//now $data only has text NO HTML
//these have to be found in the fetched data
$find = 'Results 1 - 10 of about ';
$find2 = ' for';
//keep the text beginning from $find
$data = strstr($data, $find);
//find position of $find2
//there might be many occurrences
//but it'd give the position of the first one,
//which is what we want, anyway
$pos = strpos($data, $find2);
//take substring out, which'd be the number we want
$search_number=substr($data,strlen($find), $pos-strlen($find));
echo '<br><br>';
echo "Pages Indexed: $search_number";
}
else
{
?>
<form name="form1" id="form1" method="get" action="">
<div align="center">
<p> <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>
<?php
}
?>
Oh, you may have to find that before you strip the HTML. That tag tells you where the results are going to be.
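One side effect of stripping the HTML first is visible in the echoed data quoted in the question: "WebResults 1 - 3 of about 3 for" has "Web" and "Results" run together, because strip_tags() removes tags without leaving any whitespace behind. A minimal sketch of that effect follows. The $html string is made-up markup for illustration only, since the real Google page source is not shown in this thread.

```php
<?php
// Hypothetical markup, invented for this example; not the real Google HTML.
$html = '<td>Web</td><td>Results <b>1</b> - <b>3</b> of about <b>3</b> for</td>';

// strip_tags() deletes the tags and inserts nothing, so neighbouring text fuses.
$fused = strip_tags($html);

// Swapping each tag for a space, then collapsing runs of whitespace,
// keeps the words apart.
$spaced = trim(preg_replace('/\s+/', ' ', preg_replace('/<[^>]+>/', ' ', $html)));

echo $fused, "\n";
echo $spaced, "\n";
```

Either form still contains "of about 3 for", so the fused words are cosmetic here, but they matter if a search string is expected to start at a word boundary.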
It looks like this may be something of a culprit on line 32...
$find = 'Results 1 - 10 of about ';
I'll see what I can do to help with this.
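To spell out why that line matters: with only 3 hits the page text reads "Results 1 - 3 of about 3 for", so the fixed string 'Results 1 - 10 of about ' never appears, strstr() returns false, and the substr() call ends up with nothing to print. One possible way around it, shown here as a sketch and not as the accepted solution (which is not visible in this thread), is to match only the "of about N for" part with a regular expression. The helper name extract_result_count() is hypothetical.

```php
<?php
// Sketch only: extract_result_count() is a hypothetical helper name,
// not code taken from the accepted answer.
function extract_result_count($text)
{
    // Match "of about 37 for" whatever "Results 1 - N" range precedes it.
    if (preg_match('/of about\s+([\d,]+)\s+for/', $text, $m)) {
        // Drop thousands separators such as "1,230" before casting.
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // phrase not found in the text
}

// The two strings quoted from the echoed data in the question.
var_dump(extract_result_count('WebResults 1 - 3 of about 3 for'));  // int(3)
var_dump(extract_result_count('Results 1 - 10 of about 37 for'));   // int(37)
```

In the original script this would replace the strstr()/strpos()/substr() steps, for example $search_number = extract_result_count($data); after the strip_tags() call. Returning null when nothing matches makes it possible to tell "no count found" apart from a real count.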
ASKER CERTIFIED SOLUTION
ASKER
Thanks Ray! works great. appreciate it
ASKER
wow. I looked more closely at the code. suuuuuper sweeet. thanks again!!!!
Thanks for the points - it's a great question! ~Ray