Solved

Scrape not working consistently and as intended

Posted on 2010-09-01
Last Modified: 2012-05-10
Hi - I have a script where the user enters a keyword to search, and the script is supposed to return the number of "intitle:" phrase matches.

Right now the script seems to work great when there are >10 results but returns an empty result when there are <10.

If you enter "landscaper wausau" here:
http://prontopage.net/local3b3.php

you get the results generated here:
http://prontopage.net/local3b3.php?s=landscaper+wausau&Submit=Count 

at the top of the result page you see the url:  
http://www.google.com/search?hl=en&q=intitle:%22landscaper+wausau%22 

which when followed generates a page with "3" results.  That's the number I need to pull.   $search_number should display as:
Pages Indexed: 3   (at the bottom of the screen)

If you look at the echoed data it displays
Pages Indexed:     (empty)

If you look in the echoed data you can see  
"WebResults 1 - 3 of about 3 for"    (the 3 between "about" and "for" is what I need)

Oddly a correct number is displayed if we enter:
landscaper madison

The result is 37

The url generated is http://www.google.com/search?hl=en&q=intitle:%22landscaper+madison%22

and in the echoed data you see:
Results 1 - 10 of about 37 for

and at the bottom of the page you see:
Pages Indexed: 37

That is perfect....

can someone please look at this and tell me how I might get the proper result for <10 results?

Thanks

Here is the script:
<?php
function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}

$s = $_GET['s'];
if (isset($s))
{
echo "<p><i>Search for $s</i></p>";
	echo $url='http://www.google.com/search?hl=en&q=intitle:'.urlencode('"'.$s.'"');
	$data = my_fetch($url);


	//$data = my_fetch('http://www.google.com/search?q=intitle%3A%22'. $s .'%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a');


    //strip off HTML
    $data = strip_tags($data);
	echo '<br><br>';
	echo $data;
    //now $data only has text NO HTML
    //these have to found out in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';
    //have text beginning from $find
    $data = strstr($data, $find);
	

    //find position of $find2
    //there might be many occurence
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);

//take substring out, which'd be the number we want
$search_number=substr($data,strlen($find), $pos-strlen($find));

echo '<br><br>';
echo "Pages Indexed: $search_number";
}
else
{
    ?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>

Question by:chrisj1963
7 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 33582154
I think you're looking for the wrong thing.  Try searching for "<div id=resultStats>" and the next number after that should be the number of results found.  Notice below there's more than one format for that string.
>>>> wausau
<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>
>>> Madison
<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>
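A minimal sketch of pulling the count out of either format shown above. The helper name result_stats_count() is hypothetical (it is not from the thread), and the pattern assumes Google keeps the "<div id=resultStats>" marker exactly as quoted:

```php
<?php
// Hypothetical helper: read the count that follows the resultStats marker,
// whether or not the text starts with "About".
function result_stats_count($html)
{
    // Marker, optional "About ", digits with optional thousands commas, "result(s)".
    if (preg_match('/<div id=resultStats>(?:About )?([\d,]+) results?/i', $html, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // marker not found
}

$wausau  = '<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>';
$madison = '<div id=resultStats>About 35 results<nobr>  (0.31 seconds)&nbsp;</nobr></div>';

var_dump(result_stats_count($wausau));  // int(3)
var_dump(result_stats_count($madison)); // int(35)
```

Returning null when the marker is missing lets the caller tell "no marker" apart from a real count of zero.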

 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 33582164
Oh, you may have to find that before you strip the HTML.  That tag tells you where the results are going to be.
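To illustrate the point about ordering, here is a small sketch using the sample line from the previous comment. The variable names are only for illustration:

```php
<?php
$html   = '<div id=resultStats>3 results<nobr>  (0.19 seconds)&nbsp;</nobr></div>';
$marker = '<div id=resultStats>';

// Before stripping, the marker is present and can be used as an anchor.
$found_before = strpos($html, $marker) !== false;

// strip_tags() removes the <div> tag itself, so the same search now fails.
$text        = strip_tags($html);
$found_after = strpos($text, $marker) !== false;

var_dump($found_before); // bool(true)
var_dump($found_after);  // bool(false)
```

So any search for the tag has to run on the raw response, and strip_tags() (if it is still wanted) comes afterwards.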
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 33582488
It looks like this may be something of a culprit on line 32...

$find = 'Results 1 - 10 of about ';

I'll see what I can do to help with this.
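A short sketch of why that line fails for small result sets, using the two strings quoted in the question. The literal 'Results 1 - 10 of about ' only exists when a full page of ten hits comes back; with three hits the text reads "Results 1 - 3 of about 3 for", so strstr() returns FALSE. The helper about_count() is hypothetical and assumes the older "of about N for" wording:

```php
<?php
// Hypothetical helper: accept any "Results X - Y of about N for" range.
function about_count($text)
{
    if (preg_match('/Results \d+ - \d+ of about ([\d,]+) for/', $text, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // wording not found
}

$find  = 'Results 1 - 10 of about ';
$small = 'WebResults 1 - 3 of about 3 for';
$large = 'Results 1 - 10 of about 37 for';

var_dump(strstr($small, $find)); // bool(false): the fixed string is absent
var_dump(about_count($small));   // int(3)
var_dump(about_count($large));   // int(37)
```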
LVL 110

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 33582504
Google returns different HTML depending on the browser you use (or simulate).  This script sends a Firefox 3.5 User-Agent string, so it gets the latest HTML and CSS.
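Because the layout depends on the User-Agent, one hedged option is to try each known layout in turn. This is only a sketch: count_from_any_layout() is a hypothetical name, and the two patterns cover just the formats quoted in this thread, which Google may change at any time.

```php
<?php
// Hypothetical helper: try each known layout and return the first count found.
function count_from_any_layout($html)
{
    $patterns = array(
        '/<div id=resultStats>[^<\d]*([\d,]+)/',     // newer markup (resultStats div)
        '/Results \d+ - \d+ of about ([\d,]+) for/', // older "of about N for" text
    );
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $m)) {
            return (int) str_replace(',', '', $m[1]);
        }
    }
    return null; // no known layout matched
}

var_dump(count_from_any_layout('<div id=resultStats>About 35 results<nobr>')); // int(35)
var_dump(count_from_any_layout('Results 1 - 10 of about 37 for'));             // int(37)
```

Adding support for another layout is then a matter of appending one more pattern to the array.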

Test cases:
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+virginia
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausau
http://www.laprbass.com/RAY_temp_chrisj1963.php?s=landscaper+wausaw

Best regards, ~Ray
<?php // RAY_temp_chrisj1963.php
error_reporting(E_ALL);


function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt($curl, CURLOPT_URL,            $url);
    curl_setopt($curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($curl, CURLOPT_HTTPHEADER,     $header);
    curl_setopt($curl, CURLOPT_REFERER,        'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING,       'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);
    curl_close($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        return FALSE;
    }

    // ON SUCCESS
    return $htm;
}


// THINGS TO LOOK FOR
$rs = '<div id=resultStats>';
$nb = '<nobr>';

// WHERE TO LOOK
$url = 'http://www.google.com/search?q=intitle:';

// IF THERE IS AN ARGUMENT, CALL GOOGLE
if (isset($_GET['s']))
{
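    // ESCAPE THE USER INPUT BEFORE ECHOING IT BACK INTO THE PAGE
    echo "<p><i>Search for " . htmlspecialchars($_GET["s"]) . "</i></p>";

    // CONSTRUCT URL - ENCODE THE QUOTES ALONG WITH THE SEARCH TERMS
    echo $url = $url . urlencode('"' . $_GET["s"] . '"');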
    $htm = my_curl($url);

    // ACTIVATE THIS TO SEE THE HTML FROM GOOGLE
    // echo htmlentities($htm);

    // USE BREAKPOINTS APPROPRIATE TO THE MOZILLA BROWSER RESPONSE TEXT
    // GUARD AGAINST A FAILED REQUEST OR A MISSING MARKER
    $arr = explode($rs, (string)$htm);
    $arr = explode($nb, isset($arr[1]) ? $arr[1] : '');
    $str = preg_replace('/[^0-9]/', '', $arr[0]);
    $num = number_format((int)$str);

    echo '<br><br>';

    if ($num == 0)
    {
        echo "No Pages Found";
        die();
    }
    echo "Pages Indexed: $num";
    die();
}
// END OF PHP, PUT UP THE FORM
?>
<form>
<div align="center">
<p>
<input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" />
</p>
</div>
</form>

 

Author Closing Comment

by:chrisj1963
ID: 33582679
Thanks Ray!  works great.  appreciate it
 

Author Comment

by:chrisj1963
ID: 33582690
wow. I looked more closely at the code.   suuuuuper sweeet.  thanks again!!!!
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 33586859
Thanks for the points - it's a great question! ~Ray