Solved

How to block this crawler from crawling my website

Posted on 2012-03-22
Last Modified: 2012-04-25
Hi,

I want to block this crawler from crawling my website. The attached image shows its user agent info.

Thank you
agent.jpg
Question by:Fernanditos
 
LVL 6

Expert Comment

by:Tomislavj
ID: 37751204
Try adding this to your robots.txt file:

User-agent: Attributor
Disallow: /
 

Author Comment

by:Fernanditos
ID: 37751213
This crawler ignores robots.txt.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37751887
# quick&dirty
# mod_rewrite must be switched on before the rule can take effect
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]
RewriteRule ^.*$ /robots.txt [L,R=400]

Author Comment

by:Fernanditos
ID: 37752529
@ahoffmann, could you please explain what this rule does?

Thank you for the solution.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37752633
When the User-Agent header in the request matches "Attributor" (case-insensitively, because of [NC]), the rule answers the request with HTTP status 400. Because 400 is not a redirect status, Apache drops the /robots.txt target rather than redirecting to it.
As I said, this is a quick&dirty solution using mod_rewrite. A better approach would be to use a WAF or IPS, which could block the calling client completely.
But keep in mind: a public website is by nature reachable by everybody. If you don't want that, either take the site off the internet or protect it properly with credentials.
 
LVL 109

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 37753381
You might use something like this as part of the common headers for your web page.  You can change the echo to die();
<?php // RAY_bad_robots.php
error_reporting(E_ALL);


// USE CASE:
if (bad_robots())
{
    echo "YOU ARE A BOT";
} 
else
{
    echo "YOU ARE NOT A BOT";
}


// A FUNCTION TO IDENTIFY THE BOTS
function bad_robots()
{
    // THE BOTS WE WANT TO IGNORE
    static
    $bad_robots
    = array
    ( 'crawler'
    , 'spider'
    , 'robot'
    , 'slurp'
    , 'Atomz'
    , 'googlebot'
    , 'VoilaBot'
    , 'msnbot'
    , 'Gaisbot'
    , 'Gigabot'
    , 'SBIder'
    , 'Zyborg'
    , 'FunWebProducts'
    , 'findlinks'
    , 'ia_archiver'
    , 'MJ12bot'
    , 'Ask Jeeves'
    , 'NG/2.0'
    , 'voyager'
    , 'Exabot'
    , 'Nutch'
    , 'Hercules'
    , 'psbot'
    , 'LocalcomBot'
    )
    ;

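    // COMPARE THE BOT STRINGS TO THE USER AGENT STRING
    // A MISSING USER AGENT HEADER IS TREATED AS "NOT A BOT" HERE
    $agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : '';
    foreach ($bad_robots as $spider)
    {
        // preg_quote() KEEPS CHARACTERS LIKE THE "." IN 'NG/2.0' LITERAL
        $spider = '#' . preg_quote($spider, '#') . '#i';
        if (preg_match($spider, $agent)) return TRUE;
    }
    return FALSE;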
}
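
To refuse the request outright instead of printing a message, the same idea could be sketched roughly as below. This is only a minimal sketch under assumptions, not part of the original answer: the helper name is_blocked_agent() is hypothetical, and 'Attributor' is assumed to be a substring of the crawler's user agent (it is the token used in the robots.txt suggestion earlier in this thread). Check your access log for the real string before relying on it.

```php
<?php
// Hypothetical helper (not from the thread): returns TRUE when the user
// agent contains any of the given substrings, compared case-insensitively.
function is_blocked_agent($agent, array $blocked)
{
    foreach ($blocked as $needle)
    {
        if (stripos($agent, $needle) !== FALSE) return TRUE;
    }
    return FALSE;
}

// ASSUMPTION: 'Attributor' appears somewhere in the crawler's user agent
$blocked = array('Attributor');

// A missing header is treated as an empty string, i.e. not blocked
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_blocked_agent($agent, $blocked))
{
    // Refuse the request before any page output has been sent
    header('HTTP/1.1 403 Forbidden');
    die('Forbidden');
}
```

Included at the top of the common header, this has to run before any output so that the status line can still be sent. Note that a client can send any user agent it likes, so this only stops crawlers that identify themselves honestly.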

 

Author Comment

by:Fernanditos
ID: 37753396
@ahoffmann I know the nature of a website and I don't want to remove it from the internet, so that argument is really useless. I do have strong reasons to exclude that crawler, which is really hurting my business.

I would be interested in a professional solution instead of a "quick&dirty" one, although I do appreciate the solution you posted. I learned something new from it.

Thank you.
 

Author Comment

by:Fernanditos
ID: 37753402
@Ray_Paseur I did not see your comment before I replied. I will check your solution now. By the way, I love your solutions.
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 37753483
Not that I have anything against any of those 'bots - I just use them as demonstration data.  You can make up your own list from the general design.  This might be useful...
http://www.robotstxt.org/db.html
 
LVL 51

Assisted Solution

by:ahoffmann
ahoffmann earned 250 total points
ID: 37754840
@Fernanditos, I hope you didn't take offence at my comment ;-)
> .. argument is really useless.
Hmm, I probably should have said more clearly that my RewriteRule suggestion may have performance issues; that's why I marked it quick&dirty.
I also pointed out what a more professional solution would be: a WAF or IDS.
So it's up to you to decide which way to go.

Question has a verified solution.
