Optimal Robots.txt

Is this an optimal robots.txt file?

# This file can be used to affect how search engines and other web site crawlers see your site.
# For more information, please see http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1
# WebMatrix 2.0

# ----------
# -- bingbot, microsoft indexer
User-agent: Bingbot
Crawl-delay: 10
# ----------
# -- msnbot, microsoft indexer
User-agent: msnbot
Crawl-delay: 10
# ----------
# -- Googlebot
User-agent: googlebot
# ----------
# -- Slurp, Yahoo indexer
Crawl-delay: 10
# ----------
# -- Default for all others
User-agent: *
Disallow: /

Asked by: Bob Schneider, Co-Owner

Dr. Klahn, Principal Software Engineer, commented:
I would suggest something more along these lines:

# ----------
# -- bingbot, microsoft indexer
User-agent: Bingbot
Disallow: /images
Disallow: /cgi
Crawl-delay: 10
# ----------
# -- msnbot, microsoft indexer
User-agent: msnbot
Disallow: /images
Disallow: /cgi
Crawl-delay: 10
# ----------
# -- Googlebot
User-agent: googlebot
Disallow: /images
Disallow: /cgi
User-agent: googlebot-image
Disallow: /
# ----------
# -- Slurp, Yahoo indexer
User-agent: slurp
Disallow: /images
Disallow: /cgi
Crawl-delay: 10
# ----------
# -- Default for all others
User-agent: *
Disallow: /


Every site has some things that bots should not index; on my site it's the /images and /cgi subdirectories.  (And more, but only those two are shown here for illustrative purposes.)
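
One way to sanity-check rules like these before publishing them is to run them through Python's standard-library `urllib.robotparser`. The sketch below is only an approximation of how a crawler might read the files: the sample paths and the `SomeOtherBot` name are made up for illustration, and real crawlers may interpret the rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The file from the question (decorative comment lines trimmed).
ORIGINAL = """\
User-agent: Bingbot
Crawl-delay: 10
User-agent: msnbot
Crawl-delay: 10
User-agent: googlebot
# -- Slurp, Yahoo indexer
Crawl-delay: 10
User-agent: *
Disallow: /
"""

# The suggested file from this answer (comment lines trimmed).
SUGGESTED = """\
User-agent: Bingbot
Disallow: /images
Disallow: /cgi
Crawl-delay: 10
User-agent: msnbot
Disallow: /images
Disallow: /cgi
Crawl-delay: 10
User-agent: googlebot
Disallow: /images
Disallow: /cgi
User-agent: googlebot-image
Disallow: /
User-agent: slurp
Disallow: /images
Disallow: /cgi
Crawl-delay: 10
User-agent: *
Disallow: /
"""

def load(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

original = load(ORIGINAL)
suggested = load(SUGGESTED)

# Hypothetical sample requests, one per line of output.
for agent, path in [("Bingbot", "/index.html"),
                    ("Bingbot", "/images/logo.png"),
                    ("Slurp", "/index.html"),
                    ("SomeOtherBot", "/index.html")]:
    print(agent, path,
          "original:", original.can_fetch(agent, path),
          "suggested:", suggested.can_fetch(agent, path))
```

Under this parser, the `Crawl-delay` that sits below the Slurp comment in the question's file has no `User-agent` line of its own, so it attaches to the googlebot group and Slurp falls through to the catch-all `Disallow: /`. Overlapping names such as googlebot and googlebot-image are another place where parsers can disagree, so treat the output as a hint rather than a guarantee.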

Remember that robots.txt is a voluntary standard.  As such, many robots choose to completely ignore it.  Some of those can be excluded by matching regexes against their User-Agent strings and refusing service to any match.  As you can see from the Apache rules file below, there are a lot of them.

There's not much that can be done about bots that fake their User-Agent string, except place honeypots and hope that they fall into them.  Quite often they do.
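
As a rough illustration of the honeypot idea, the sketch below scans access-log lines for requests to a trap path. Everything specific here is an assumption: the `/trap/` path is hypothetical (it would be listed under `Disallow` in robots.txt and linked only where a human would not click), the log lines are invented samples, and the regex assumes Apache's common/combined log layout.

```python
import re

# Hypothetical trap path; not taken from the site discussed above.
TRAP_PREFIX = "/trap/"

# Common/combined log layout: client address first, request line in quotes.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)[^"]*"')

def trapped_clients(log_lines, trap_prefix=TRAP_PREFIX):
    """Return the set of client addresses that requested the trap path."""
    caught = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(trap_prefix):
            caught.add(match.group(1))
    return caught

# Invented sample lines using documentation-range IP addresses.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /trap/index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [10/Oct/2023:13:55:40 +0000] '
    '"GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

print(trapped_clients(sample))
```

Because the trap works on behaviour rather than on the User-Agent string, it can catch a bot that claims to be an ordinary browser. What to do with the collected addresses (firewall rule, deny list, or just monitoring) is left open here.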

# ===================== EXCLUDE AGENTS  =====================
#
#                     All virtual hosts
#
# ===================== EXCLUDE AGENTS  =====================

#
# ===========================================================
# Alphabetical list of user agent ID strings to be refused service
# ===========================================================
#

RewriteEngine on

# 0-9

RewriteCond %{HTTP_USER_AGENT} 4SeoHunt                         [NC,OR]

# A ==========

RewriteCond %{HTTP_USER_AGENT} ^A6-Indexer                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Aboundex                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AboutUs                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Acoonbot                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Accoona                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Adnorm                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ADSARobot                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ah-ha                               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AHC                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AhrefsBot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} aiHitBot                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^aktuelles                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^amzn_assoc                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} almaden\.ibm                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Apache-HttpClient                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} archive\.org                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ASPSeek                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ASSORT                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ATHENS                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^attache                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^autoemailspider                    [NC,OR]

# B ==========

RewriteCond %{HTTP_USER_AGENT} Baidu                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^betaBot                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bew                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^big.brother                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BingPreview                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BoardReader                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BrowserSpy                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bullseye                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee                       [NC,OR]

# C ==========

RewriteCond %{HTTP_USER_AGENT} ca-crawler                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Camont                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CATExplorador                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ccbot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^cfetch                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} chlooe                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CIS                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Clicksense                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Cliqzbot                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} cmscrawler                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^COIParse                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Comodo                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CompSpyBot                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Content.Crawler                 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ContextAd                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^crawler4j                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Crazyweb                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Curl                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CyberPatrol                      [NC,OR]

# D ==========

# RewriteCond %{HTTP_USER_AGENT} Dataprovider                   [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ^del.icio.us                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DavClnt                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DEVONthink                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Deweb                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} developers\.google\.com          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Diffbot                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Digimarc                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Dillo                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} disco                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Dispatch                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^dnstwist                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} docs.google                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Domain                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DotBot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Downloader                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DTS.Agent                        [NC,OR]

# E ==========

RewriteCond %{HTTP_USER_AGENT} ^EasyBib                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ecollector                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Edister                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^elefent                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Email                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EVV/3\.0/EAK01AG9/LE             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} exabot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro                    [NC,OR]

# F ==========

RewriteCond %{HTTP_USER_AGENT} ^facebook                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Fairshare                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} favicon                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FavOrg                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Favorites.Sweeper               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Feedly                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Fetch                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FEZhead                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^'firefox                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Firefox.Addon                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} filterdb                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} fluffy                           [NC,OR]

# G ==========

RewriteCond %{HTTP_USER_AGENT} Genieo                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Generic                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Get                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigablast                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GigaMega                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Gimme                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Girafa                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Go.1\.1                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^go-ahead-got-it                 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-http-client                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} code\.google\.com                [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Google\sBot                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Goose                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grammarly                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GuzzleHttp                      [NC,OR]

# H ==========

RewriteCond %{HTTP_USER_AGENT} Harvest                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} hbtronix                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} heritrix                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HomePageSearch                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HoundDog                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTMLParser                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTP::                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTPClient                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HuaweiSymantec                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} human_curl                          [NC,OR]

# I ==========

RewriteCond %{HTTP_USER_AGENT} ia_archiver                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} iaskspider                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^IBM_Planetwide                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^IncyWincy                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indeedbot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\sLibrary                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InetURL                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ingelin                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InstantName                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} integromedb                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^intelium                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ip-web-crawler                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IPTCBOT                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IRLBot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ISRCCrawler                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^iZSearch                        [NC,OR]

# J ==========

RewriteCond %{HTTP_USER_AGENT} ^Java                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar                          [NC,OR]

# K ==========

RewriteCond %{HTTP_USER_AGENT} ^Keycdn                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Kilomonkey                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} KKman2                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} KomodiaBot                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^KWebGet                         [NC,OR]

# L ==========

RewriteCond %{HTTP_USER_AGENT} larbin                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^leech                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libcurl                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Lightspeedsystems                [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Link                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linkchecker                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linksmanager                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LMQueueBot                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LocalcomBot                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LWP::                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LWP|Digger                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Lynx                               [NC,OR]

# M ==========

RewriteCond %{HTTP_USER_AGENT} ^Masscan                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MCspider                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} meanpathbot                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mechanize                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MegaIndex                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MEGAUPLOAD                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaIntelligence                 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MetaURI                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.Data.Access           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.Office                [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mirror                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MixrankBot                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mnogosearch                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mojolicious                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[3-8]\.0\s\(compatible\)$       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MrCarlito                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mshots                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MS.Search                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot-media                    [NC,OR]
# Missing space between ";" and "Windows" means it's a bot
RewriteCond %{HTTP_USER_AGENT} MSIE\s[3-9]\.[0-9];Windows          [OR]

# N ==========

RewriteCond %{HTTP_USER_AGENT} ^NetAnts                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetCarta                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} netcraft                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^netprospector                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetResearchServer                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Netseer                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net.Vampire                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} news.bot                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nexen                               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^niki-bot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NimbleCrawler                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nmap                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^node-urllib                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nokia                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ning                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nominet                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^nost\.info                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nutch                            [NC,OR]

# O ==========

RewriteCond %{HTTP_USER_AGENT} ^Offline.Explorer                [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^okhttp                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OpaL                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Openfind                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OpenTextSiteCrawler                [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OppO                               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OrangeBot                          [NC,OR]

# P ==========

RewriteCond %{HTTP_USER_AGENT} ^PackRat                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Page.Speed                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PagesInventory                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^panscient                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Pcore                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PECL::HTTP                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Perl                             [OR]
RewriteCond %{HTTP_USER_AGENT} PhantomJS                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Pingdom                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} pinyin                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Plesk                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Plukkie                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^POE                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Prlog                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Protocol.Discovery               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PSurf                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Purebot                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PushSite                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Python                              [NC,OR]

# Q ==========

# R ==========

RewriteCond %{HTTP_USER_AGENT} Raloco                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^redback                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^reget                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} roboto                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Rover                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Rsync                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ruby                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla                       [NC,OR]

# S ==========

RewriteCond %{HTTP_USER_AGENT} sai-crawler                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Scrap(e|y)                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ScoutAbout                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} searchhippo                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SearchMonkey                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^searchterms\.it                 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sees\.co                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SEO.Robot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SEO.Spider                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^seostats                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} servernfo                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Setooz                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Shai                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sistrix                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Siteimprove\.com                 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sitesell                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sitetruth                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SMTBot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SolomonoBot                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} spbot                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Spegla                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Spiderbook                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SpiderBot                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SpyOnWeb                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SqWorm                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Statools                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SurfWalker                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Surveybot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Synapse                          [NC,OR]

# T ==========

RewriteCond %{HTTP_USER_AGENT} taptu                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tarspider                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Templeton                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} TheRarest                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Thunderstone                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} T-H-U-N-D-E-R-S-T-O-N-E             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} TrueRobot                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} TweetmemeBot                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twiceler                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Twisted                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot                          [NC,OR]

# U ==========

RewriteCond %{HTTP_USER_AGENT} ^UIowaCrawler                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} UNTRUSTED                        [OR]
RewriteCond %{HTTP_USER_AGENT} ^UnwindFetchor                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^URLAppendBot                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^urlresolver                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^User.Agent                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^UtilMind                        [NC,OR]

# V ==========

RewriteCond %{HTTP_USER_AGENT} ^VidibleScraper                  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} visaduhoc                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} voilabot                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^voyager                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^vspider                         [NC,OR]

# W ==========

RewriteCond %{HTTP_USER_AGENT} ^W3C_Validator                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^W3M                             [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WBSearchBot                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^w3mir                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebBandit                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCop                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCollage                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^web.by.mail                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webcraft@bea\.com                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebDataCentreBot                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebDAV                              [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WEBMASTERS                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebMiner                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSnake                           [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebThumbnail                    [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^webvac                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^webwalk                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^wget                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WhatsApp                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} whois                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^who.is                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WhosTalking                     [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WinHTTP                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WISEbot                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^woobot                          [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Wordpress                        [NC,OR]
RewriteCond %{HTTP_USER_AGENT} woriobot                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^wsr-agent                       [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^wscheck                         [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WUMPUS                           [NC,OR]

# X ==========

RewriteCond %{HTTP_USER_AGENT} ^Xenu                               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^XGET                               [NC,OR]

# Y ==========

RewriteCond %{HTTP_USER_AGENT} ^yacybot                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^YahooCache                      [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yahoo.Link.Preview               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Yandex                          [NC,OR]

# Z ==========

RewriteCond %{HTTP_USER_AGENT} ^Zend                               [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster                      [NC,OR]

#
# ===========================================================
# Special user-agent IDs recognizable by regexes
# ===========================================================
#

# Void/empty
RewriteCond %{HTTP_USER_AGENT} ^$                               [NC,OR]
# Any series of blanks
RewriteCond %{HTTP_USER_AGENT} ^\s+$                            [NC,OR]
# Any series of hyphens/dashes
RewriteCond %{HTTP_USER_AGENT} ^-+$                             [NC,OR]
# Dummies who typed in the ID wrong to a bot
RewriteCond %{HTTP_USER_AGENT} ^=                               [NC,OR]

# ===========================================================
# Stuff that should never appear in the user-agent string
# ===========================================================

RewriteCond %{HTTP_USER_AGENT} Accept-Encoding:                 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Content-Type:                    [NC]

#
# ===========================================================
# Flush 'em all
# ===========================================================
#

# Don't log these, and tell the requester it's [G]one
RewriteRule .* - [C,E=nonlog-refer:1]
RewriteRule .* - [C,E=nonlog-request:1]
RewriteRule .* - [G,L]
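
The logic of the rule file can be approximated in a few lines of Python: each `RewriteCond` is a regex tested against the User-Agent, `[NC]` makes the test case-insensitive, `[OR]` means any single match is enough, and the final `[G]` rule answers with 410 Gone. The sketch below copies five patterns from the list above; the sample User-Agent strings are invented, and Python's `re` is only a stand-in for the regex engine Apache uses.

```python
import re

# (pattern, has_NC_flag) pairs copied from the rule file above.
RULES = [
    (r"AhrefsBot", True),
    (r"^Curl", True),
    (r"^Mozilla/[3-8]\.0\s\(compatible\)$", True),
    (r"MSIE\s[3-9]\.[0-9];Windows", False),  # no [NC]: case-sensitive
    (r"^$", True),                           # empty User-Agent
]

def is_refused(user_agent):
    """Return True if any pattern matches, mirroring the [OR] chain."""
    for pattern, nocase in RULES:
        flags = re.IGNORECASE if nocase else 0
        if re.search(pattern, user_agent, flags):
            return True
    return False

# Invented sample User-Agent strings.
samples = [
    "curl/7.68.0",
    "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
    "",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0",
]

for ua in samples:
    print(repr(ua), "-> 410 Gone" if is_refused(ua) else "-> served")
```

Patterns without a leading `^` match anywhere in the string, which is why a short unanchored word in the list can catch far more agents than the one it was written for.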


See also these short monographs:

http://www.miim.com/thebside/privacy/nsftrapping.html
http://www.miim.com/thebside/privacy/spidertrap.html

Dave Baldwin, Fixer of Problems, commented:
On my customer's site, Facebook does not identify its crawler except by IP address.  Our tracking shows some Amazon hosts but they look like client sites and not Amazon itself.

It also depends on your purpose for putting 'robots.txt' on your site.  At this point, I believe that Google and some others like Baidu crawl ALL pages and use 'robots.txt' only to decide which to show to the public.  They are trying to catalog everything on the internet, especially Baidu, which probably feeds results with certain keywords to the Chinese government.

Bob Schneider, Co-Owner (author), commented:
Good information!  Thank you!
1) Should I put the Apache rules file in my site?  If so, where, and what would I name it?
2) To Dave's point, if I have <meta name="robots" content="noindex, follow"> on pages that I don't want crawled, will the bots respect that?  The vast majority of these pages are dynamic pages that may appear as duplicates to the bots, and they really have no impact on SEO.

Dr. Klahn, Principal Software Engineer, commented:
"Should I put the Apache rules file in my site?  If so, where, and what would I name it?"

Depends on how strict you want to be.  As always when filtering, a certain amount of grain will be thrown out with the chaff.  You need to decide how much of both can be tolerated.  Fortunately it is a quick matter to disable any offending rule by prefixing it with # and restarting Apache.
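
To find which rule is throwing out the grain, it can help to test a User-Agent from the access log against the rule file itself. The following is a rough sketch, not part of the original setup: it reads `RewriteCond` lines from a config text, skips commented-out ones, and lists every pattern that matches. The sample User-Agent is invented, and Python's `re` only approximates Apache's regex engine.

```python
import re

# Matches an active (uncommented) RewriteCond on the User-Agent header.
COND = re.compile(
    r"^\s*RewriteCond\s+%\{HTTP_USER_AGENT\}\s+(\S+)(?:\s+\[([^\]]*)\])?\s*$",
    re.IGNORECASE,  # Apache directive names are case-insensitive
)

def matching_rules(conf_text, user_agent):
    """Return (line_number, pattern) for each rule that matches user_agent."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        match = COND.match(line)
        if not match:
            continue  # comment, blank line, or some other directive
        pattern = match.group(1)
        flags = [f.strip().upper() for f in (match.group(2) or "").split(",")]
        re_flags = re.IGNORECASE if "NC" in flags else 0
        if re.search(pattern, user_agent, re_flags):
            hits.append((lineno, pattern))
    return hits

# Four lines taken from the rule file shown earlier.
conf = """\
RewriteCond %{HTTP_USER_AGENT} ^Dillo                           [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} Dataprovider                   [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nokia                            [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Perl                             [OR]
"""

# Invented User-Agent resembling a phone browser.
phone = ("Mozilla/5.0 (Linux; Android 10; Nokia 7.2) AppleWebKit/537.36 "
         "(KHTML, like Gecko) Chrome/96.0.4664.104 Mobile Safari/537.36")

print(matching_rules(conf, phone))
```

In this sample the unanchored `Nokia` pattern catches what looks like an ordinary phone browser, which is the kind of rule you might decide to comment out.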

On my site (Debian) the Apache configuration files are found in

/usr/local/apache2/conf

I keep the exclusion rule files in the same directory.  The exclusion rule files are included from the vhost configuration files, which themselves are in turn included from httpd.conf.  If you have no vhosts you can include them directly from httpd.conf.  A convenient location is right after any mod_rewrite rules in httpd.conf.

# ... end of preceding mod_rewrite rules

# Include the unwanted agent exclusion list
Include conf/exclude_agents.conf

Question has a verified solution.
