Chris Kenward (United Kingdom)

asked:

Help with 503 in ROBOTS.TXT

I want to create a robots.txt file which will disable crawling of my site by returning a 503 Service Unavailable error. I cannot find the syntax for this anywhere. Can someone help with this one?

Many thanks
Chris
Dr. Klahn

This cannot be done.  The Robots Exclusion Standard allows only for the specification of robot names and limited URLs.  The hope is that the robot will see it is excluded in robots.txt and then go away quietly.
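For illustration, an exclusion is all the standard lets you express.  A complete robots.txt that asks every robot to stay away from the entire site looks like this:

User-agent: *
Disallow: /

There is no directive for returning an HTTP status code such as 503; status codes come from the web server itself, never from robots.txt.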

In order for that to work, the robot must adhere to the standard -- and most do not.  Even some of the halfway respectable robots look at robots.txt, see they are excluded, then ignore it and pull pages anyway.

Refusing requests according to the agent string doesn't work either.  All bad robots fake their user-agent strings so that they look like real browsers run by humans.

There is no simple solution to this problem and there is not even one that works half the time.  Multiple approaches are required and even then 80% recognition should be considered unbelievable success.  Have a look at the page below where I detail some travails with bots.

http://www.miim.com/thebside/security/dumbbottricks.shtml
Chris Kenward (Asker)

Hi there and thank you very much for the fast response. This is what I received from my client:

" A 503 Service Unavailable http status code just pauses the crawls and doesn't impact on existing results. This is therefore what is needed for the moment."

Are you saying that this is not possible at all?

Regards
Chris
It cannot be done in robots.txt at all.

If the robot in question is well-behaved, and identifies itself properly with a unique User-Agent string, then it is possible (in Apache) to write a mod_rewrite rule to force a 503 Service Unavailable for any URL requested by that robot.  But this handles only that one specific robot.
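For example, a minimal sketch, assuming Apache 2.4 with mod_rewrite enabled -- "BadBot" is a placeholder for the robot's actual User-Agent token:

# In httpd.conf or .htaccess; "BadBot" is a hypothetical UA token
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
# An R= status outside the 3xx range ends the request with that status
RewriteRule ^ - [R=503,L]

Every URL the matching robot requests then comes back 503 Service Unavailable, while ordinary visitors are untouched.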
Gosh, that's a shame. The site is a WordPress site and had been hacked. What we wanted to do was try to deter all crawlers from the site until we were sure we had removed all the malware and stopped the creation of any further bad posts.

Many thanks - I guess it's back to the drawing board. In the meantime I have ticked the box in WordPress that requests crawlers NOT to index the site.
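(As I understand it, that checkbox just makes WordPress emit a robots meta tag in each page's head, along the lines of <meta name='robots' content='noindex, nofollow' />, so it's another polite request rather than a hard block.)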

Best wishes
Chris
Adding the following comment, for future similar queries...

1) As Dr. Klahn stated, this cannot be done via robots.txt. That file is only a hint, and evil Bots read it as a map of exactly which paths you've listed as blocked.

In other words, if you block a directory /foo, evil Bots will visit /foo first, looking for data to scrape, so...

Using robots.txt for this is a fool's errand.

2) This said, you can accomplish what you're trying to accomplish... if you must...

By "if you must" I mean: given sufficient will + budget + time, this can be done.

This can easily be accomplished by sensing when Bots visit, using a combination of Fail2Ban + other technologies.

Note: You can only block brain-dead, simple Bots using this tech.
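To illustrate #2, a rough Fail2Ban sketch. The filter is hypothetical -- "BadBot" + "EvilScraper" are placeholder User-Agent tokens -- and the log path assumes a Debian/Ubuntu Apache layout:

# /etc/fail2ban/filter.d/badbots-custom.conf
[Definition]
# Ban any client whose access-log line ends with a quoted
# User-Agent containing one of the placeholder tokens
failregex = ^<HOST> .+"[^"]*(?:BadBot|EvilScraper)[^"]*"$

# /etc/fail2ban/jail.d/badbots-custom.conf
[badbots-custom]
enabled  = true
port     = http,https
filter   = badbots-custom
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

Fail2Ban then inserts the firewall rule for you. Anything smart enough to rotate IPs or fake a mainstream browser User-Agent sails straight past this, which is the "brain-dead Bots only" caveat above.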

3) You can also block IP ranges owned by companies like Google. For example, it's trivial to scrape a copy of the GoogleBot IP blocks out of DNS daily, add these to an iptables + ipset firewall setup, then block all incoming requests from those ranges.

There is no way to easily produce an HTTP-level 503 using this tech, only an ICMP reject (disallowing any connection at all).

If you need to return an actual 503, you'll have to use #2.
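A rough sketch of #3, assuming iptables + ipset are available. This pulls Google's published IPv4 netblocks out of their SPF records in DNS; whether that list covers every crawler you care about is an assumption you'd want to verify:

# Run daily from cron
ipset create google-nets hash:net -exist
for net in $(dig +short TXT _netblocks.google.com | tr ' ' '\n' | sed -n 's/^ip4://p')
do
    ipset add google-nets "$net" -exist
done
# Reject anything arriving from those ranges (append the rule only once)
iptables --wait -C INPUT -m set --match-set google-nets src -j REJECT --reject-with icmp-host-prohibited 2>/dev/null ||
iptables --wait -A INPUT -m set --match-set google-nets src -j REJECT --reject-with icmp-host-prohibited

As noted, this rejects at the network level; the crawler never gets an HTTP response at all, let alone a 503.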

4) Either of these two options can be implemented. The primary consideration is whether the money returned for doing this work justifies the time + cost of doing it.

If sufficient return exists, then you can hire someone to do this work fairly easily.

Caveat: Getting #2 or #3 to work quickly + consistently requires a good starting point, preferably kernel version 4.15+ along with iptables/ipset, so you'll be using a distro like Ubuntu Bionic or CentOS 8.

Then whichever option you choose will have its own unique additional requirements.

Also, be sure your budget contains sufficient funding for managing this tech, as some number of hours/month are required to keep everything running.
Thanks to you both! The robots.txt file is history. :)

Regards
Chris
ASKER CERTIFIED SOLUTION
The solution is... generating a 503 is an incorrect solution.

Better, to me, to use a target like...

iptables --wait -A DROP-BOT -j REJECT --reject-with icmp-host-prohibited

Then route Bot requests through this target.

Or, better, use the TARPIT target, which very quickly discourages any Bot from visiting your site.
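For example -- a hypothetical wiring, assuming the TARPIT target from xtables-addons is installed and a "badbots" ipset is populated by whatever detection you use (TARPIT only works for TCP):

iptables --wait -N DROP-BOT 2>/dev/null || true
iptables --wait -A INPUT -p tcp -m set --match-set badbots src -j DROP-BOT
iptables --wait -A DROP-BOT -p tcp -j TARPIT

TARPIT accepts the TCP connection, shrinks the window to zero and never lets go, so the Bot's own resources stay tied up -- far more discouraging than a clean reject.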