Solved

How to block this crawler from crawling my website

Posted on 2012-03-22
10
491 Views
Last Modified: 2012-04-25
Hi,

I want to block this crawler from crawling my website, see the image attached to see the agent info.

Thank you
agent.jpg
0
Comment
Question by:Fernanditos
  • 4
  • 3
  • 2
  • +1
10 Comments
 
LVL 6

Expert Comment

by:Tomislavj
ID: 37751204
try with adding to your robots.txt file:

User-agent: Attributor
Disallow: /
0
 

Author Comment

by:Fernanditos
ID: 37751213
This crawler ignores the robots.txt
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37751887
# quick&dirty
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]
RewriteRule ^.*$ /robots.txt [L,R=400]
0
 

Author Comment

by:Fernanditos
ID: 37752529
@ahoffmann, could you please explain what does this rule?

thank you for the solution.
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37752633
when the User-Agent header in the request matches "Attributor" the request will be redirected to /robots.txt and a HTTP status 400 will be returned
as I said, this is a quick&dirty solution using mod_rewrite, better approaches would be to use a WAF or IPS which could block the calling client completely
but keep in mind: the nature of a public website is to be connected by everybody, if you don't want that, either remove the site from internet or use proper protection with credentials
0
Microsoft Certification Exam 74-409

Veeam® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 37753381
You might use something like this as part of the common headers for your web page.  You can change the echo to die();
<?php // RAY_bad_robots.php
error_reporting(E_ALL);


// USE CASE:
if (bad_robots())
{
    echo "YOU ARE A BOT";
} 
else
{
    echo "YOU ARE NOT A BOT";
}


// A FUNCTION TO IDENTIFY THE BOTS
function bad_robots()
{
    // THE BOTS WE WANT TO IGNORE
    static
    $bad_robots
    = array
    ( 'crawler'
    , 'spider'
    , 'robot'
    , 'slurp'
    , 'Atomz'
    , 'googlebot'
    , 'VoilaBot'
    , 'msnbot'
    , 'Gaisbot'
    , 'Gigabot'
    , 'SBIder'
    , 'Zyborg'
    , 'FunWebProducts'
    , 'findlinks'
    , 'ia_archiver'
    , 'MJ12bot'
    , 'Ask Jeeves'
    , 'NG/2.0'
    , 'voyager'
    , 'Exabot'
    , 'Nutch'
    , 'Hercules'
    , 'psbot'
    , 'LocalcomBot'
    )
    ;

    // COMPARE THE BOT STRINGS TO THE USER AGENT STRING
    foreach ($bad_robots as $spider)
    {
        $spider = '#' . $spider . '#i';
        if (preg_match($spider, $_SERVER["HTTP_USER_AGENT"])) return TRUE;
    }
    return FALSE;
}

Open in new window

0
 

Author Comment

by:Fernanditos
ID: 37753396
@ahoffmann I know the nature of a website and I don't want to remove it from internet, that argument is really useless. I do have strong reasons to exclude that crawler which is really hurting my business.

I would be interested in a professional solutions instead of "quick&dirty" solution although I do appreciate the solution you posted, I learned something new with it.

thank you.
0
 

Author Comment

by:Fernanditos
ID: 37753402
@Ray_Paseur I did not see your comment before I replied. I will check your solution now, by the way, I love them.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 37753483
Not that I have anything against any of those 'bots - I just use them as demonstration data.  You can make up your own list from the general design.  This might be useful...
http://www.robotstxt.org/db.html
0
 
LVL 51

Assisted Solution

by:ahoffmann
ahoffmann earned 250 total points
ID: 37754840
@Fernanditos, hope you didn't take my comment as offence ;-)
> .. argument is really useless.
hmm, probaly I should have expressed more clearly that my RewriteRule suggestion may have performance issues, that's why I marked it quick&dirty
I also pointed out what would be a more professional solution: WAF or IDS
so it's up to you to make a decission which way to go
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

When it comes to showing a 404 error page to your visitors, you do not want that generic page to show, and you especially do not want your hosting provider’s ad error page to show either. In this article, I will show you how to enable the custom 40…
This article discusses how to create an extensible mechanism for linked drop downs.
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

920 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now