Solved

How to block this crawler from crawling my website

Posted on 2012-03-22
Last Modified: 2012-04-25
Hi,

I want to block this crawler from crawling my website. See the attached image for the agent info.

Thank you
agent.jpg
Question by:Fernanditos
10 Comments
 
LVL 6

Expert Comment

by:Tomislavj
ID: 37751204
Try adding this to your robots.txt file:

User-agent: Attributor
Disallow: /
 

Author Comment

by:Fernanditos
ID: 37751213
This crawler ignores robots.txt.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37751887
# quick&dirty
# RewriteEngine On is needed once, if rewriting is not already enabled
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]
RewriteRule ^.*$ /robots.txt [L,R=400]
 

Author Comment

by:Fernanditos
ID: 37752529
@ahoffmann, could you please explain what this rule does?

thank you for the solution.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37752633
When the User-Agent header in the request matches "Attributor" (case-insensitively, because of [NC]), the request is answered with HTTP status 400. Because 400 is not a redirect status (3xx), mod_rewrite drops the /robots.txt target instead of redirecting to it.
As I said, this is a quick&dirty solution using mod_rewrite. Better approaches would be a WAF or IPS, which could block the calling client completely.
But keep in mind: the nature of a public website is that anybody can connect to it. If you don't want that, either remove the site from the internet or use proper protection with credentials.
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 1000 total points
ID: 37753381
You might use something like this as part of the common headers for your web page.  You can change the echo to die();
<?php // RAY_bad_robots.php
error_reporting(E_ALL);


// USE CASE:
if (bad_robots())
{
    echo "YOU ARE A BOT";
} 
else
{
    echo "YOU ARE NOT A BOT";
}


// A FUNCTION TO IDENTIFY THE BOTS
function bad_robots()
{
    // THE BOTS WE WANT TO IGNORE
    static
    $bad_robots
    = array
    ( 'crawler'
    , 'spider'
    , 'robot'
    , 'slurp'
    , 'Atomz'
    , 'googlebot'
    , 'VoilaBot'
    , 'msnbot'
    , 'Gaisbot'
    , 'Gigabot'
    , 'SBIder'
    , 'Zyborg'
    , 'FunWebProducts'
    , 'findlinks'
    , 'ia_archiver'
    , 'MJ12bot'
    , 'Ask Jeeves'
    , 'NG/2.0'
    , 'voyager'
    , 'Exabot'
    , 'Nutch'
    , 'Hercules'
    , 'psbot'
    , 'LocalcomBot'
    )
    ;

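    // COMPARE THE BOT STRINGS TO THE USER AGENT STRING (CASE-INSENSITIVE)
    // A MISSING USER AGENT HEADER IS TREATED AS AN EMPTY STRING
    $agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : '';
    foreach ($bad_robots as $spider)
    {
        // preg_quote() KEEPS ENTRIES LIKE 'NG/2.0' FROM BEING READ AS REGEX SYNTAX
        $pattern = '#' . preg_quote($spider, '#') . '#i';
        if (preg_match($pattern, $agent)) return TRUE;
    }
    return FALSE;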
}

 

Author Comment

by:Fernanditos
ID: 37753396
@ahoffmann I know the nature of a website, and I don't want to remove it from the internet; that argument is really useless. I do have strong reasons to exclude that crawler, which is really hurting my business.

I would be interested in a professional solution instead of a "quick&dirty" one, although I do appreciate the solution you posted; I learned something new from it.

thank you.
 

Author Comment

by:Fernanditos
ID: 37753402
@Ray_Paseur I did not see your comment before I replied. I will check your solution now. By the way, I love your solutions.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 37753483
Not that I have anything against any of those 'bots - I just use them as demonstration data.  You can make up your own list from the general design.  This might be useful...
http://www.robotstxt.org/db.html
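
A minimal sketch of making up your own list, assuming the crawler's User-Agent string contains "Attributor" (the name used in the robots.txt suggestion earlier in this thread). The helper name is_blocked_agent() is made up for illustration, and answering with 403 Forbidden is just one possible choice. It would sit at the top of a common include, before any page output:

```php
<?php
// Hypothetical helper, not part of the original answer: returns true when
// the user agent contains any of the listed substrings (case-insensitive).
function is_blocked_agent($userAgent, array $blocked)
{
    foreach ($blocked as $needle) {
        // Skip empty entries so stripos() never receives an empty needle.
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Your own list. 'Attributor' is an assumption taken from this thread.
$blocked = array('Attributor');

// A missing User-Agent header is treated as an empty string.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_blocked_agent($agent, $blocked)) {
    // Refuse the request and stop before any page content is produced.
    header('HTTP/1.1 403 Forbidden');
    die('Forbidden');
}
```

Plain substring matching with stripos() avoids regex escaping problems for entries such as 'NG/2.0'. Keep in mind that a crawler can send any User-Agent it likes, so this only stops clients that identify themselves honestly.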
 
LVL 51

Assisted Solution

by:ahoffmann
ahoffmann earned 1000 total points
ID: 37754840
@Fernanditos, I hope you didn't take my comment as an offence ;-)
> .. argument is really useless.
Hmm, probably I should have expressed more clearly that my RewriteRule suggestion may have performance issues; that's why I marked it quick&dirty.
I also pointed out what would be a more professional solution: a WAF or IPS.
So it's up to you to decide which way to go.