Solved

How to block this crawler from crawling my website

Posted on 2012-03-22
Last Modified: 2012-04-25
Hi,

I want to block this crawler from crawling my website. See the attached image for the user-agent info.

Thank you
agent.jpg
Question by:Fernanditos
10 Comments
 
LVL 6

Expert Comment

by:Tomislavj
ID: 37751204
Try adding this to your robots.txt file:

User-agent: Attributor
Disallow: /
 

Author Comment

by:Fernanditos
ID: 37751213
This crawler ignores robots.txt.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37751887
# quick&dirty
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]
RewriteRule ^.*$ /robots.txt [L,R=400]
 

Author Comment

by:Fernanditos
ID: 37752529
@ahoffmann, could you please explain what this rule does?

Thank you for the solution.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 37752633
When the User-Agent header in the request matches "Attributor", the request is redirected to /robots.txt and an HTTP status 400 is returned.
As I said, this is a quick&dirty solution using mod_rewrite. Better approaches would be to use a WAF or IPS, which could block the calling client completely.
But keep in mind: the nature of a public website is to be reachable by everybody. If you don't want that, either remove the site from the internet or use proper protection with credentials.
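For reference, a slightly more explicit variant of the rule above might look like the sketch below. It assumes mod_rewrite is enabled and that the directives go in an .htaccess file or a virtual host block, neither of which is stated in this thread. As far as the mod_rewrite documentation describes it, an R status outside the 300-399 range causes the substitution to be dropped, so the /robots.txt target in the original rule is probably never used and the client simply receives the 400. The [F] flag states that intent directly.

```apache
# Assumption: mod_rewrite is loaded; place in .htaccess or a <VirtualHost> block.
RewriteEngine On

# Match the User-Agent request header; [NC] makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} Attributor [NC]

# "-" means "no substitution"; [F] answers 403 Forbidden and stops rewriting.
RewriteRule ^ - [F]
```

This still depends on the crawler sending a recognizable User-Agent string. A crawler that changes the header will not be caught by this rule.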
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 37753381
You might use something like this as part of the common headers for your web page.  You can change the echo to die();
<?php // RAY_bad_robots.php
error_reporting(E_ALL);


// USE CASE:
if (bad_robots())
{
    echo "YOU ARE A BOT";
} 
else
{
    echo "YOU ARE NOT A BOT";
}


// A FUNCTION TO IDENTIFY THE BOTS
function bad_robots()
{
    // THE BOTS WE WANT TO IGNORE
    static
    $bad_robots
    = array
    ( 'crawler'
    , 'spider'
    , 'robot'
    , 'slurp'
    , 'Atomz'
    , 'googlebot'
    , 'VoilaBot'
    , 'msnbot'
    , 'Gaisbot'
    , 'Gigabot'
    , 'SBIder'
    , 'Zyborg'
    , 'FunWebProducts'
    , 'findlinks'
    , 'ia_archiver'
    , 'MJ12bot'
    , 'Ask Jeeves'
    , 'NG/2.0'
    , 'voyager'
    , 'Exabot'
    , 'Nutch'
    , 'Hercules'
    , 'psbot'
    , 'LocalcomBot'
    )
    ;

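    // COMPARE THE BOT STRINGS TO THE USER AGENT STRING
    // (A MISSING USER-AGENT HEADER IS TREATED AS AN EMPTY STRING)
    $agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : '';
    foreach ($bad_robots as $spider)
    {
        // QUOTE THE NAME SO CHARACTERS LIKE "." MATCH LITERALLY
        $spider = '#' . preg_quote($spider, '#') . '#i';
        if (preg_match($spider, $agent)) return TRUE;
    }
    return FALSE;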
}

 

Author Comment

by:Fernanditos
ID: 37753396
@ahoffmann I know the nature of a website, and I don't want to remove it from the internet; that argument is really useless. I do have strong reasons to exclude that crawler, which is really hurting my business.

I would be interested in a professional solution instead of a "quick&dirty" one, although I do appreciate the solution you posted; I learned something new from it.

Thank you.
 

Author Comment

by:Fernanditos
ID: 37753402
@Ray_Paseur I did not see your comment before I replied. I will check your solution now. By the way, I love your solutions.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 37753483
Not that I have anything against any of those 'bots - I just use them as demonstration data.  You can make up your own list from the general design.  This might be useful...
http://www.robotstxt.org/db.html
 
LVL 51

Assisted Solution

by:ahoffmann
ahoffmann earned 250 total points
ID: 37754840
@Fernanditos, I hope you didn't take my comment as an offence ;-)
> .. argument is really useless.
Hmm, I probably should have expressed more clearly that my RewriteRule suggestion may have performance issues; that's why I marked it quick&dirty.
I also pointed out what would be a more professional solution: a WAF or IPS.
So it's up to you to decide which way to go.
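As one possible illustration of the WAF route, here is a minimal sketch that assumes ModSecurity 2.x is installed as an Apache module with SecRuleEngine On; the thread does not say which WAF, if any, is available. The rule id 1000001 and the message text are arbitrary placeholders.

```apache
# Assumption: ModSecurity 2.x is loaded and SecRuleEngine is On.
# Deny any request whose User-Agent header contains "attributor".
# The header is lowercased first (t:lowercase), so the match ignores case.
SecRule REQUEST_HEADERS:User-Agent "@contains attributor" \
    "id:1000001,phase:1,t:lowercase,deny,status:403,log,msg:'Blocked Attributor crawler'"
```

Like the mod_rewrite rule, this relies on the User-Agent the crawler reports about itself. Blocking a client that changes that header would need a different criterion, such as its source IP addresses at the firewall.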