PHP Solution to detect crawlers and bots

Posted on 2012-08-19
Last Modified: 2012-08-19
My research has revealed two options for doing this:

Checking for known bot names in $_SERVER['HTTP_USER_AGENT'].

The other alternative is to use the PHP get_browser() function and check the value of its crawler property.
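
For reference, a minimal sketch of that second option might look like this (it assumes a browscap file is configured in php.ini - get_browser() returns FALSE otherwise):

<?php
// Match the current visitor against browscap.ini.
// First argument null  = use $_SERVER['HTTP_USER_AGENT'];
// second argument true = return an array instead of an object.
$info = get_browser(null, true);

// The full browscap data exposes a 'crawler' flag for known bots.
if (is_array($info) && !empty($info['crawler']))
{
    // Treat this visitor as a bot: skip the expensive per-visitor work.
}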

Naturally I would prefer to use the get_browser() option rather than having to maintain a list of bots and check through an extensive list each time.

The question is - how effective is the get_browser call in identifying crawlers?

I don't need a 100% hit rate - just looking for a way to detect the majority so I don't have to waste CPU cycles and DB space clogging up my visitor reports.

I would appreciate input from anyone who has experience with this topic and can give some insight into the best option to choose.
Question by: Julian Hansen
    LVL 82

    Assisted Solution by: Dave Baldwin
    Have you already put up an appropriate 'robots.txt' file?  The 'good' bots will pay attention to it and the 'bad' bots won't look like bots.
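
    For example, a robots.txt that still lets the 'good' bots crawl the site but keeps them away from expensive endpoints might look like this (the paths here are hypothetical):

    # robots.txt - served from the site root
    User-agent: *
    Disallow: /track/
    Disallow: /geoip-lookup/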
    LVL 107

    Accepted Solution by: Ray Paseur

    Relying on this comment...

    I decided I could make do with something like this code.

    <?php // RAY_bad_robots.php
    // USE CASE:
    if (bad_robots())
    {
        echo "YOU ARE A BOT";
    }
    else
    {
        echo "YOU ARE NOT A BOT";
    }

    // Return TRUE if the User-Agent string contains any known bot signature
    function bad_robots()
    {
        $bad_robots = array
        ( 'crawler'
        , 'spider'
        , 'robot'
        , 'slurp'
        , 'Atomz'
        , 'googlebot'
        , 'VoilaBot'
        , 'msnbot'
        , 'Gaisbot'
        , 'Gigabot'
        , 'SBIder'
        , 'Zyborg'
        , 'FunWebProducts'
        , 'findlinks'
        , 'ia_archiver'
        , 'MJ12bot'
        , 'Ask Jeeves'
        , 'NG/2.0'
        , 'voyager'
        , 'Exabot'
        , 'Nutch'
        , 'Hercules'
        , 'psbot'
        , 'LocalcomBot'
        )
        ;
        foreach ($bad_robots as $spider)
        {
            // preg_quote() escapes regex metacharacters, e.g. the dot in 'NG/2.0'
            $spider = '#' . preg_quote($spider, '#') . '#i';
            if (preg_match($spider, $_SERVER["HTTP_USER_AGENT"])) return TRUE;
        }
        return FALSE;
    }


    HTH, ~Ray
    LVL 49

    Author Comment by: Julian Hansen
    @DaveBaldwin - I don't want them to stop crawling my pages; I just don't want to perform unnecessary DB and other operations. When a visitor hits the site, the engine does a GeoIP lookup on their IP and saves their country ID in the session. The lookup is based on a subscription service, so I don't want to be doing country lookups for bots - that's just one of the reasons. We also store tracking data in our database, and we don't want that getting clogged up with crawler data.
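
    A sketch of the kind of gate I mean, assuming a bot check like the bad_robots() helper above and a hypothetical lookup_country() wrapper around the subscription service:

    <?php
    session_start();

    // Only pay for a GeoIP lookup for human visitors, and cache the result
    // in the session so each visitor is looked up at most once.
    // lookup_country() is a hypothetical wrapper around the subscription service.
    if (!bad_robots() && !isset($_SESSION['country_id']))
    {
        $_SESSION['country_id'] = lookup_country($_SERVER['REMOTE_ADDR']);
    }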

    @Ray_Paseur - thanks Ray. As I mentioned in my opening post, I am aware of the HTTP_USER_AGENT check approach - and I know it works - I would just like to avoid having to maintain a list of bots. I did implement the sample from here.

    I was hoping to find someone out there who has used the get_browser() PHP function before and could comment on its accuracy. I can write a test script, but to see how effective it is, it would need to run for a couple of weeks to build up some data, and I don't have that amount of time.
    LVL 107

    Expert Comment by: Ray Paseur
    My point about the comment was that get_browser() must search a 300K file, perhaps of dubious quality.  I do not have any idea how, from this side of the search process, I could discern the accuracy of the get_browser() function.  What would I compare the data to?

    Now that I understand your objective better, this article might be helpful.  You do not need to be dependent on a call to a subscription service - you could just integrate this API into your application.  You would still need to update the database periodically, but that's not a painful process at all.

    Another possibility would be to install a script that actually used get_browser() and wrote the browser data into a database table.  You might put a human-invisible link into your site so the spiders could follow it, but people would mostly overlook it.  You would be able to see what bots are hitting your site.  You could make a SELECT from the table to populate the $bad_robots array, or you could periodically condense the information from the table into a hard-coded $bad_robots array.
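
    A minimal sketch of such a logging script, assuming a PDO connection and a spider_log table (both hypothetical):

    <?php // spider_trap.php - the target of the human-invisible link
    // Hypothetical connection credentials and table name
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    $ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $info = get_browser($ua, true); // needs browscap configured; returns FALSE otherwise

    // Record the raw User-Agent plus browscap's crawler verdict for later review.
    $sth = $pdo->prepare('INSERT INTO spider_log (user_agent, is_crawler, seen_at) VALUES (?, ?, NOW())');
    $sth->execute(array($ua, (is_array($info) && !empty($info['crawler'])) ? 1 : 0));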

    In my sites I load the hard-coded $bad_robots array as part of my framework.  Performance is fine, since there is very little to look up.
    LVL 49

    Author Comment by: Julian Hansen
    @Ray, thanks for that. We are already using the MaxMind database, but for various reasons the client does not want to go with the updatable database-file solution (I have used that approach several times on other projects, so I am aware of it). The client has decided they want to go the subscription route.

    "Another possibility would be to install a script that actually used get_browser() and wrote the browser data into a database table."

    This is what I would normally do, but it is not going to give me any real results by tomorrow - hence the post here, trying to shortcut the process.

    "In my sites I load the hard-coded $bad_robots array as part of my framework.  Performance is fine, since there is very little to look up."

    I am sure this is a good solution - I was just looking for some insight into how effective the PHP get_browser() one is.
    LVL 49

    Author Closing Comment by: Julian Hansen
    @DaveBaldwin, @Ray_Paseur,

    Thanks for the feedback - I think I have what I was looking for.

    Go with the HTTP_USER_AGENT approach - I have done some more research and I am not convinced that the get_browser approach is reliable, so the minor inconvenience of maintaining a bot list is worth it.
    LVL 107

    Expert Comment by: Ray Paseur
    Hmm... maybe there is another argument against get_browser().  From the PHP manual:

    "In order for this to work, your browscap configuration setting in php.ini must point to the correct location of the browscap.ini file on your system.

    "browscap.ini is not bundled with PHP, but you may find an up-to-date » php_browscap.ini file here."

    The link is 404 - Firefox can't find the server.
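
    Still, for reference, the setting the manual is describing looks something like this in php.ini (the path is hypothetical):

    [browscap]
    ; Point this at a local copy of browscap.ini
    browscap = /usr/local/etc/php_browscap.ini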

    In any case, I think you're headed in the right direction.  Thanks for the points, ~Ray
    LVL 49

    Author Comment by: Julian Hansen
    No problem - I also saw that just before closing this - it did not fill me with confidence - but it did help to crystallize a solution going forward.

    I think I might have been able to convince these guys to go with a static GeoIP database rather than the subscription - I showed them your article - so thanks for that.
