  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 2383

PHP Solution to detect crawlers and bots

My research has revealed two options for doing this:

Checking for known bot names in $_SERVER['HTTP_USER_AGENT'].

The other alternative is to use the PHP get_browser function and check the value of crawler.
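For reference, a minimal sketch of the second option. It assumes the browscap directive in php.ini points at a valid browscap.ini file; the helper names browscap_says_crawler() and is_crawler_via_get_browser() are hypothetical, and the handling of the crawler value ('1', 'true', '' or 'false') is an assumption about how the file's values come back.

```php
<?php
// Hypothetical helper: interprets the array returned by get_browser($ua, true).
// Assumption: the 'crawler' entry is a boolean-like value such as '1', 'true', '' or 'false'.
function browscap_says_crawler($info): bool
{
    if (!is_array($info) || !array_key_exists('crawler', $info)) {
        return false; // no browscap data available: treat as "not known to be a crawler"
    }
    return filter_var($info['crawler'], FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical wrapper around the real get_browser() call.
function is_crawler_via_get_browser(string $userAgent): bool
{
    // get_browser() only works when the browscap ini directive is set.
    if (!ini_get('browscap')) {
        return false;
    }
    return browscap_says_crawler(get_browser($userAgent, true));
}
```

Whatever accuracy this has depends entirely on the browscap.ini file in use, so the result is only as current as that file.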

Naturally I would prefer to use the get_browser option rather than having to maintain a list of bots and check through an extensive list each time I need to check.

The question is - how effective is the get_browser call in identifying crawlers?

I don't need a 100% hit rate - just looking for a way to detect the majority so I don't have to waste cpu cycles and db space clogging up my visitor reports.

I would appreciate input from anyone who has had experience with this topic that can give some insight into the best option to choose.
Asked by: Julian Hansen

Dave Baldwin (Fixer of Problems) Commented:
Have you already put up an appropriate 'robots.txt' file?  The 'good' bots will pay attention to it and the 'bad' bots won't look like bots.
 
Ray Paseur Commented:
Relying on this comment...
http://us2.php.net/manual/en/function.get-browser.php#92310

I decided I could make do with something like this code.

<?php // RAY_bad_robots.php
error_reporting(E_ALL);


// USE CASE:
if (bad_robots())
{
    echo "YOU ARE A BOT";
}
else
{
    echo "YOU ARE NOT A BOT";
}


// A FUNCTION TO IDENTIFY THE BOTS
function bad_robots()
{
    // THE BOTS WE WANT TO IGNORE
    static
    $bad_robots
    = array
    ( 'crawler'
    , 'spider'
    , 'robot'
    , 'slurp'
    , 'Atomz'
    , 'googlebot'
    , 'VoilaBot'
    , 'msnbot'
    , 'Gaisbot'
    , 'Gigabot'
    , 'SBIder'
    , 'Zyborg'
    , 'FunWebProducts'
    , 'findlinks'
    , 'ia_archiver'
    , 'MJ12bot'
    , 'Ask Jeeves'
    , 'NG/2.0'
    , 'voyager'
    , 'Exabot'
    , 'Nutch'
    , 'Hercules'
    , 'psbot'
    , 'LocalcomBot'
    )
    ;

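    // COMPARE THE BOT STRINGS TO THE USER AGENT STRING (CASE-INSENSITIVE)
    // A MISSING USER AGENT HEADER IS TREATED AS "NOT A BOT"
    $agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : '';
    foreach ($bad_robots as $spider)
    {
        // stripos() MATCHES LITERALLY, SO THE "." IN 'NG/2.0' IS NOT A REGEX WILDCARD
        if (stripos($agent, $spider) !== FALSE) return TRUE;
    }
    return FALSE;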
}

HTH, ~Ray
 
Julian Hansen (Author) Commented:
@DaveBaldwin - I don't want them to stop crawling my pages I just don't want to perform unnecessary db and other type operations. When a visitor hits the site the engine does a GeoIP lookup on their IP and saves their country ID in the session. The lookup is based on a subscription service so I don't want to be doing country lookups for bots - that's just one of the reasons. We also store tracking data in our database and we don't want that getting clogged up with crawler data.
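A rough sketch of that guard, for illustration only. The function names, the short keyword pattern, and the $geoLookup callable (standing in for the subscription lookup) are all hypothetical; the session array is passed in by reference so the sketch does not depend on session_start().

```php
<?php
// Hypothetical, deliberately short keyword check; a real list would be longer.
function looks_like_bot(string $userAgent): bool
{
    return preg_match('#bot|crawler|spider|slurp#i', $userAgent) === 1;
}

// Returns the cached or freshly looked-up country ID, or null for bots.
// $geoLookup is a stand-in for the paid GeoIP subscription call.
function country_id_for_request(array &$session, string $userAgent, string $ip, callable $geoLookup)
{
    if (looks_like_bot($userAgent)) {
        return null; // bots: no paid lookup, nothing stored
    }
    if (!isset($session['country_id'])) {
        $session['country_id'] = $geoLookup($ip); // one lookup per session
    }
    return $session['country_id'];
}
```

In a live page the first argument would be $_SESSION and the same looks_like_bot() result could also gate the tracking insert.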

@Ray_Paseur - thanks Ray - as I mentioned in my opening post I am aware of the HTTP_USER_AGENT check approach - and I know it works - I would just like to avoid having to maintain a list of bots. I did implement the sample from here:

http://mattgeri.com/blog/2009/01/how-to-detect-a-search-engine-spidercrawler-with-php/

I was hoping to find someone out there who has used the get_browser() PHP function before and could comment on its accuracy. I can write a test script, but to see how effective it is the script needs to run for a couple of weeks to build up some data, and I don't have that amount of time.
 
Ray Paseur Commented:
My point about the PHP.net comment was that get_browser() must search a 300K file, perhaps of dubious quality.  I do not have any idea how, from this side of the search process, I could discern the accuracy of the get_browser() function.  What would I compare the data to?

Now that I understand your objective better, this article might be helpful.  You do not need to be dependent on a call to a subscription service - you could just integrate this API into your application.  You would still need to update the data base periodically, but that's not a painful process at all.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/PHP_Databases/A_3437-IP-Address-to-Country-in-PHP.html

Another possibility would be to install a script that actually used get_browser() and that wrote the browser data into a data base table.  You might put a human-invisible link into your site so the spiders could follow it, but people would mostly overlook it.  You would be able to see what 'bots are hitting your site.  You could make a SELECT from the table to populate the $bad_robots array, or you could periodically condense the information from the table into a hard-coded $bad_robots array.
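As a sketch of the "condense the table" step: assume the user agent strings recorded by the trap link have already been read into a plain array (the data base part is left out here). The function name summarize_trap_hits() and the $minHits threshold are hypothetical.

```php
<?php
// Hypothetical helper: condense logged user agent strings into a list of
// distinct agents, most frequent first, keeping those seen at least $minHits times.
function summarize_trap_hits(array $loggedAgents, int $minHits = 1): array
{
    // Count occurrences of each distinct string; non-string entries are ignored.
    $counts = array_count_values(array_filter($loggedAgents, 'is_string'));
    arsort($counts); // highest count first
    $frequent = array_filter($counts, function ($n) use ($minHits) {
        return $n >= $minHits;
    });
    // Array keys that look numeric become integers, so cast back to strings.
    return array_map('strval', array_keys($frequent));
}
```

The output still needs a human pass to pick out the distinguishing token of each agent before it goes into $bad_robots.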

In my sites I load the hard-coded $bad_robots array as part of my framework.  Performance is fine, since there is very little to look up.
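If the list ever grows long enough for the loop to matter, one option is to fold it into a single case-insensitive pattern built once. This is only a sketch: build_bot_pattern() and is_known_bot() are hypothetical names, and the tokens in the usage note are a small sample from the list above.

```php
<?php
// Hypothetical helper: build one alternation pattern from the bot tokens.
function build_bot_pattern(array $tokens): string
{
    if (count($tokens) === 0) {
        return '#(?!)#'; // never matches; an empty alternation would match everything
    }
    // preg_quote() escapes regex metacharacters such as the "." in 'NG/2.0'.
    $quoted = array_map(function ($t) {
        return preg_quote($t, '#');
    }, $tokens);
    return '#' . implode('|', $quoted) . '#i';
}

// Hypothetical helper: a missing or empty user agent is treated as "not a bot".
function is_known_bot(?string $userAgent, string $pattern): bool
{
    if ($userAgent === null || $userAgent === '') {
        return false;
    }
    return preg_match($pattern, $userAgent) === 1;
}
```

Usage would be along the lines of $pattern = build_bot_pattern(array('googlebot', 'msnbot', 'NG/2.0')); followed by is_known_bot($agent, $pattern), where $agent is the request's user agent string or null when the header is absent.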
 
Julian Hansen (Author) Commented:
@Ray, Thanks for that. We are already using the Maxmind database but for various reasons the client does not want to go with the update database file solution (I have used that approach several times on other projects so I am aware of it). The client has decided they want to go the subscription route.

"Another possibility would be to install a script that actually used get_browser() and that wrote the browser data into a data base table."

This is what I would normally do but it is not going to give me any real results by tomorrow - hence the post here - trying to short cut the process.

"In my sites I load the hard-coded $bad_robots array as part of my framework.  Performance is fine, since there is very little to look up."

I am sure this is a good solution - I was just looking for some insight into how effective the PHP one was.
 
Julian Hansen (Author) Commented:
@DaveBaldwin , @Ray_Paseur,

Thanks for the feedback - I think I have what I was looking for.

Go with the HTTP_USER_AGENT approach - I have done some more research and I am not convinced that the get_browser approach is reliable enough to justify avoiding the minor inconvenience of maintaining a bot list.
 
Ray Paseur Commented:
Hmm... Maybe there is another argument against get_browser().  

"In order for this to work, your browscap configuration setting in php.ini must point to the correct location of the browscap.ini file on your system.

"browscap.ini is not bundled with PHP, but you may find an up-to-date » php_browscap.ini file here."

The link is 404.
Firefox can't find the server at browsers.garykeith.com.

In any case, I think you're headed in the right direction.  Thanks for the points, ~Ray
 
Julian Hansen (Author) Commented:
No problem - I also saw that just before closing this - it did not fill me with confidence - but it did help to crystallise a solution going forward.

I think I might have been able to convince these guys to go with a static database on the GeoIP rather than the subscription - showed them your article - so thanks for that.