Solved

Scrape not working consistently and as intended

Posted on 2010-09-01
Medium Priority
552 Views
Last Modified: 2012-05-10
Hi - I have a script where the user enters a keyword to search and the script is supposed to return the number of  "intitle:" phrase matches.

Right now the script seems to work great for results that are >10 but seems to return an empty result when <10.

If you enter "landscaper wausau" here:
http://prontopage.net/local3b3.php

you get the results generated here:
http://prontopage.net/local3b3.php?s=landscaper+wausau&Submit=Count 

at the top of the result page you see the url:  
http://www.google.com/search?hl=en&q=intitle:%22landscaper+wausau%22 

which when followed generates a page with "3" results.  That's the number I need to pull.   $search_number should display as:
Pages Indexed: 3   (at the bottom of the screen)

If you look at the echoed data it displays
Pages Indexed:     (empty)

If you look in the echoed data you can see  
"WebResults 1 - 3 of about 3 for"    (the 3 between "about" and "for" is what I need)

Oddly a correct number is displayed if we enter:
landscaper madison

The result is 37

The url generated is http://www.google.com/search?hl=en&q=intitle:%22landscaper+madison%22

and in the echoed data you see:
Results 1 - 10 of about 37 for

and at the bottom of the page you see:
Pages Indexed: 37

That is perfect....

can someone please look at this and tell me how I might get the proper result for <10 results?

Thanks

<?php
function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}

$s = $_GET['s'];
if (isset($s))
{
echo "<p><i>Search for $s</i></p>";
	echo $url='http://www.google.com/search?hl=en&q=intitle:'.urlencode('"'.$s.'"');
	$data = my_fetch($url);


	//$data = my_fetch('http://www.google.com/search?q=intitle%3A%22'. $s .'%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a');


    //strip off HTML
    $data = strip_tags($data);
	echo '<br><br>';
	echo $data;
    //now $data only has text NO HTML
    //these have to found out in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';
    //have text beginning from $find
    $data = strstr($data, $find);
	

    //find position of $find2
    //there might be many occurence
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);

//take substring out, which'd be the number we want
$search_number=substr($data,strlen($find), $pos-strlen($find));

echo '<br><br>';
echo "Pages Indexed: $search_number";
}
else
{
    ?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>

Question by:chrisj1963
7 Comments
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 33582154
I think you're looking for the wrong thing.  Try searching for "<div id=resultStats>" and the next number after that should be the number of results found.  Notice below there's more than one format for that string.
>>>> wausau
<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>
>>> Madison
<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>

 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 33582164
Oh, you may have to find that before you strip the HTML.  That tag tells you where the results are going to be.
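
To illustrate the suggestion above, here is a minimal sketch that reads the count straight from the raw HTML, before strip_tags() runs. The function name extract_result_count is made up for this example, and the pattern assumes the markup looks like the two sample lines quoted above; Google can change that markup at any time.

```php
<?php
// Hypothetical helper (not from the thread): pull the hit count out of the
// raw HTML, before any strip_tags() call removes the resultStats marker.
function extract_result_count($html)
{
    // Matches both "3 results" and "About 35 results", allows thousands
    // separators, and tolerates optional quotes around the id value.
    $pattern = '/<div id="?resultStats"?>\s*(?:About\s+)?([\d,]+)\s+results?/i';
    if (preg_match($pattern, $html, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // marker not found: markup changed or the request failed
}

// The two sample lines quoted in the comment above.
$wausau  = '<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>';
$madison = '<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>';

var_dump(extract_result_count($wausau));                 // int(3)
var_dump(extract_result_count($madison));                // int(35)
var_dump(extract_result_count('<p>no stats here</p>'));  // NULL
```

Returning null instead of an empty string makes it easy to tell "no marker found" apart from a genuine count of zero.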
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 33582488
It looks like this may be something of a culprit on line 32...

$find = 'Results 1 - 10 of about ';

I'll see what I can do to help with this.
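
A short sketch of why that literal is a problem, using the two status lines quoted in the question. The needle contains "1 - 10", so it only exists when a full page of ten hits comes back; with three hits strstr() returns false and the later substr() call has nothing to cut from. The function count_from_text is a hypothetical name for this example, and the pattern assumes the stripped text keeps the "Results 1 - N of about M for" wording.

```php
<?php
// The two status lines quoted in the question, as they appear after strip_tags().
$few  = 'WebResults 1 - 3 of about 3 for';
$many = 'Results 1 - 10 of about 37 for';

// The original needle is only present when ten or more hits come back.
$find = 'Results 1 - 10 of about ';
var_dump(strstr($few, $find) === false);   // bool(true): needle is missing
var_dump(strstr($many, $find) !== false);  // bool(true): needle is present

// Hypothetical replacement: let the upper bound of the range be any number.
function count_from_text($text)
{
    if (preg_match('/Results 1 - [\d,]+ of (?:about )?([\d,]+) for/', $text, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // wording not found in the text
}

var_dump(count_from_text($few));   // int(3)
var_dump(count_from_text($many));  // int(37)
```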
LVL 111

Accepted Solution

by:
Ray Paseur earned 1000 total points
ID: 33582504
Google returns different HTML depending on the browser you use (or simulate).  This script uses Mozilla and looks like FF3, so it gets the latest HTML and CSS.

Test cases:
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+virginia
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausau
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausaw

Best regards, ~Ray
<?php // RAY_temp_chrisj1963.php
error_reporting(E_ALL);


function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt($curl, CURLOPT_URL,            $url);
    curl_setopt($curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($curl, CURLOPT_HTTPHEADER,     $header);
    curl_setopt($curl, CURLOPT_REFERER,        'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING,       'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);
    curl_close($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        return FALSE;
    }

    // ON SUCCESS
    return $htm;
}


// THINGS TO LOOK FOR
$rs = '<div id=resultStats>';
$nb = '<nobr>';

// WHERE TO LOOK
$url = 'http://www.google.com/search?q=intitle:';

// IF THERE IS AN ARGUMENT, CALL GOOGLE
if (isset($_GET['s']))
{
    echo "<p><i>Search for " . htmlspecialchars($_GET["s"]) . "</i></p>"; // ESCAPE USER INPUT BEFORE ECHOING IT

    // CONSTRUCT URL
    echo $url = $url . '"' . urlencode($_GET["s"]) . '"';
    $htm = my_curl($url);

    // ACTIVATE THIS TO SEE THE HTML FROM GOOGLE
    // echo htmlentities($htm);

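    // USE BREAKPOINTS APPROPRIATE TO THE MOZILLA BROWSER RESPONSE TEXT
    // GUARD AGAINST A FAILED REQUEST OR A PAGE WITHOUT THE resultStats MARKER
    $arr = explode($rs, (string)$htm);
    $arr = explode($nb, isset($arr[1]) ? $arr[1] : '');
    $str = preg_replace('/[^0-9]/', '', $arr[0]);
    $num = number_format((int)$str);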

    echo '<br><br>';

    if ($num == 0)
    {
        echo "No Pages Found";
        die();
    }
    echo "Pages Indexed: $num";
    die();
}
// END OF PHP, PUT UP THE FORM
?>
<form>
<div align="center">
<p>
<input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" />
</p>
</div>
</form>

 

Author Closing Comment

by:chrisj1963
ID: 33582679
Thanks Ray!  works great.  appreciate it
 

Author Comment

by:chrisj1963
ID: 33582690
wow. I looked more closely at the code.   suuuuuper sweeet.  thanks again!!!!
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 33586859
Thanks for the points - it's a great question! ~Ray