Solved

How to block this crawler from crawling my website

Posted on 2012-03-22
Last Modified: 2012-04-25
Hi,

I want to block this crawler from crawling my website. See the attached image for the user-agent info.

Thank you
agent.jpg
Question by:Fernanditos
10 Comments
 
LVL 6

Expert Comment

by:Tomislavj
ID: 37751204
Try adding this to your robots.txt file:

User-agent: Attributor
Disallow: /

Author Comment

by:Fernanditos
ID: 37751213
This crawler ignores robots.txt.
LVL 51

Expert Comment

by:ahoffmann
ID: 37751887
# quick&dirty
RewriteEngine On
# match the User-Agent header, case-insensitive
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]
RewriteRule ^.*$ /robots.txt [L,R=400]

Author Comment

by:Fernanditos
ID: 37752529
@ahoffmann, could you please explain what this rule does?

thank you for the solution.
LVL 51

Expert Comment

by:ahoffmann
ID: 37752633
When the User-Agent header in the request matches "Attributor", the request is answered with HTTP status 400. Because 400 is not a redirect (3xx) status, mod_rewrite drops the /robots.txt target, so no actual redirect happens.
As I said, this is a quick&dirty solution using mod_rewrite. Better approaches would be a WAF or IPS, which could block the calling client completely.
But keep in mind: the nature of a public website is to be reachable by everybody. If you don't want that, either remove the site from the internet or protect it properly with credentials.
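For comparison, here is a minimal sketch of the same idea that answers with 403 Forbidden instead of 400. It assumes mod_rewrite is enabled and that the rules go into the site's .htaccess or virtual host configuration; neither is stated in this thread, so treat it as a starting point rather than a tested setup.

```apache
# Sketch only: assumes mod_rewrite is loaded and these lines live in the
# site's .htaccess or <VirtualHost> section (an assumption, not from the thread).
RewriteEngine On

# Match "Attributor" anywhere in the User-Agent header, case-insensitively
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]

# "-" means no substitution; [F] returns 403 Forbidden and stops rewriting
RewriteRule ^ - [F]
```

Like the original rule, this relies only on the User-Agent header, so a crawler that changes or fakes its agent string would not be caught.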
LVL 110

Accepted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 37753381
You might use something like this as part of the common headers for your web page.  You can change the echo to die();
<?php // RAY_bad_robots.php
error_reporting(E_ALL);


// USE CASE:
if (bad_robots())
{
    echo "YOU ARE A BOT";
} 
else
{
    echo "YOU ARE NOT A BOT";
}


// A FUNCTION TO IDENTIFY THE BOTS
function bad_robots()
{
    // THE BOTS WE WANT TO IGNORE
    static
    $bad_robots
    = array
    ( 'crawler'
    , 'spider'
    , 'robot'
    , 'slurp'
    , 'Atomz'
    , 'googlebot'
    , 'VoilaBot'
    , 'msnbot'
    , 'Gaisbot'
    , 'Gigabot'
    , 'SBIder'
    , 'Zyborg'
    , 'FunWebProducts'
    , 'findlinks'
    , 'ia_archiver'
    , 'MJ12bot'
    , 'Ask Jeeves'
    , 'NG/2.0'
    , 'voyager'
    , 'Exabot'
    , 'Nutch'
    , 'Hercules'
    , 'psbot'
    , 'LocalcomBot'
    )
    ;

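    // COMPARE THE BOT STRINGS TO THE USER AGENT STRING
    // A MISSING USER AGENT IS TREATED AS AN EMPTY STRING
    $agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : '';
    foreach ($bad_robots as $spider)
    {
        // QUOTE REGEX METACHARACTERS SUCH AS THE DOT IN 'NG/2.0'
        $spider = '#' . preg_quote($spider, '#') . '#i';
        if (preg_match($spider, $agent)) return TRUE;
    }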
    return FALSE;
}

 

Author Comment

by:Fernanditos
ID: 37753396
@ahoffmann I know the nature of a website, and I don't want to remove it from the internet; that argument is really useless. I have strong reasons to exclude that crawler, which is really hurting my business.

I would be interested in a professional solution instead of a "quick&dirty" one, although I do appreciate the solution you posted. I learned something new from it.

thank you.

Author Comment

by:Fernanditos
ID: 37753402
@Ray_Paseur I did not see your comment before I replied. I will check your solution now. By the way, I love your solutions.
LVL 110

Expert Comment

by:Ray Paseur
ID: 37753483
Not that I have anything against any of those 'bots - I just use them as demonstration data.  You can make up your own list from the general design.  This might be useful...
http://www.robotstxt.org/db.html
LVL 51

Assisted Solution

by:ahoffmann
ahoffmann earned 250 total points
ID: 37754840
@Fernanditos, I hope you didn't take my comment as an offence ;-)
> .. argument is really useless.
Hmm, I probably should have said more clearly that my RewriteRule suggestion may have performance issues; that's why I marked it quick&dirty.
I also pointed out what a more professional solution would be: a WAF or IPS.
So it's up to you to decide which way to go.