Solved

Scrape not working consistently and as intented

Posted on 2010-09-01
7
548 Views
Last Modified: 2012-05-10
Hi - I have a script where the user enters a keyword to search and the script is supposed to return the number of  "intitle:" phrase matches.

Right now the script seems to work great for results that are >10 but seems to return an empty result when <10.

if you enter           landscaper wausau      here:  
http://prontopage.net/local3b3.php

you get the results generated here:
http://prontopage.net/local3b3.php?s=landscaper+wausau&Submit=Count 

at the top of the result page you see the url:  
http://www.google.com/search?hl=en&q=intitle:%22landscaper+wausau%22 

which when followed generates a page with "3" results.  That's the number I need to pull.   $search_number should display as:
Pages Indexed: 3   (at the bottom of the screen)

If you look at the echoed data it displays
Pages Indexed:     (empty)

If you look in the echoed data you can see  
"WebResults 1 - 3 of about 3 for"    (the 3 between "about" and "for" is what I need)

Oddly a correct number is displayed if we enter:
landscaper madison

The result is 37

The url generated is http://www.google.com/search?hl=en&q=intitle:%22landscaper+madison%22

and in the echoed data you see:
Results 1 - 10 of about 37 for

and at the bottom of the page you see:
Pages Indexed: 37

That is perfect....

can someone please look at this and tell me how I might get the proper result for <10 results?

Thanks

oddly if you enter
<?php
function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}

$s = $_GET['s'];
if (isset($s))
{
echo "<p><i>Search for $s</i></p>";
	echo $url='http://www.google.com/search?hl=en&q=intitle:'.urlencode('"'.$s.'"');
	$data = my_fetch($url);


	//$data = my_fetch('http://www.google.com/search?q=intitle%3A%22'. $s .'%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a');


    //strip off HTML
    $data = strip_tags($data);
	echo '<br><br>';
	echo $data;
    //now $data only has text NO HTML
    //these have to found out in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';
    //have text beginning from $find
    $data = strstr($data, $find);
	

    //find position of $find2
    //there might be many occurence
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);

//take substring out, which'd be the number we want
$search_number=substr($data,strlen($find), $pos-strlen($find));

echo '<br><br>';
echo "Pages Indexed: $search_number";
}
else
{
    ?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>

Open in new window

0
Comment
Question by:chrisj1963
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 2
  • 2
7 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 33582154
I think you're looking for the wrong thing.  Try searching for "<div id=resultStats>" and the next number after that should be the number of results found.  Notice below there's more than one format for that string.
>>>> wausau
<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>
>>> Madison
<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>

Open in new window

0
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 33582164
Oh, you may have to find that before you strip the HTML.  That tag tells you where the results are going to be.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 33582488
It looks like this may be something of a culprit on line 32...

$find = 'Results 1 - 10 of about ';

I'll see what I can do to help with this.
0
Don't Cry: How Liquid Web is Ensuring Security

WannaCry is just the start. Read how Liquid Web is protecting itself and its customers against new threats.

 
LVL 110

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 33582504
Google returns different HTML depending on the browser you use (or simulate).  This script uses Mozilla and looks like FF3, so it gets the latest HTML and CSS.

Test cases:
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+virginia
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausau
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausaw

Best regards, ~Ray
<?php // RAY_temp_chrisj1963.php
error_reporting(E_ALL);


function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt($curl, CURLOPT_URL,            $url);
    curl_setopt($curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($curl, CURLOPT_HTTPHEADER,     $header);
    curl_setopt($curl, CURLOPT_REFERER,        'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING,       'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);
    curl_close($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        return FALSE;
    }

    // ON SUCCESS
    return $htm;
}


// THINGS TO LOOK FOR
$rs = '<div id=resultStats>';
$nb = '<nobr>';

// WHERE TO LOOK
$url = 'http://www.google.com/search?q=intitle:';

// IF THERE IS AN ARGUMENT, CALL GOOGLE
if (isset($_GET['s']))
{
    echo "<p><i>Search for {$_GET["s"]}</i></p>";

    // CONSTRUCT URL
    echo $url = $url . '"' . urlencode($_GET["s"]) . '"';
    $htm = my_curl($url);

    // ACTIVATE THIS TO SEE THE HTML FROM GOOGLE
    // echo htmlentities($htm);

    // USE BREAKPOINTS APPROPRIATE TO THE MOZILLA BROWSER RESPONSE TEXT
    $arr = explode($rs, $htm);
    $arr = @explode($nb, $arr[1]);
    $str = preg_replace('/[^0-9]/', '', $arr[0]);
    $num = number_format($str);

    echo '<br><br>';

    if ($num == 0)
    {
        echo "No Pages Found";
        die();
    }
    echo "Pages Indexed: $num";
    die();
}
// END OF PHP, PUT UP THE FORM
?>
<form>
<div align="center">
<p>
<input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" />
</p>
</div>
</form>

Open in new window

0
 

Author Closing Comment

by:chrisj1963
ID: 33582679
Thanks Ray!  works great.  appreciate it
0
 

Author Comment

by:chrisj1963
ID: 33582690
wow. I looked more closely at the code.   suuuuuper sweeet.  thanks again!!!!
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 33586859
Thanks for the points - it's a great question! ~Ray
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Things That Drive Us Nuts Have you noticed the use of the reCaptcha feature at EE and other web sites?  It wants you to read and retype something that looks like this. Insanity!  It's not EE's fault - that's just the way reCaptcha works.  But it i…
This article discusses four methods for overlaying images in a container on a web page
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

729 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question