Robots.txt question

Hello,

If I want to ONLY allow SE bots to index files that are in directory "dir1" and match the filename pattern *_files.asp, what would the syntax for that be in the robots.txt file?
skbohler asked:
Dave Baldwin (Fixer of Problems) commented:
I don't know what "SE bots" are, but you need to understand that anything in 'robots.txt' is only a request, not a command. There are many bots that will access every file anyway; 'robots.txt' is not any kind of 'protection'.
Dr. Klahn (Principal Software Engineer) commented:
Dave is correct. The only way to have absolute control over search engine bots is to watch everything they do and lock them out when they step over the line. Both bingbot and googlebot follow links into prohibited areas. Google in particular frequently ran "experimental" bots against my site that ignored robots.txt entirely, such that I eventually had to ban most IP blocks under Google's control.

If you want to try it, here's an example. As you can see, the exclusion rules must be replicated for every bot that is to be allowed in. The problem with your request is that the original robots.txt standard has no way to allow access to a directory; you can only disallow access. So you have to disallow access to everything bots should not see.

But don't be surprised when bots ignore robots.txt and survey everything anyway.  There are even some bots that read robots.txt looking for exclusions and then go to the forbidden areas and slurp up the contents.

root:/www> cat robots.txt
# ----------
# -- bingbot, microsoft indexer
User-agent: Bingbot
Disallow: /parttob/
Disallow: /errors/
Disallow: /graphics/
Disallow: /security/docfiles/
Disallow: /documents/security/
Disallow: /scripts/
Disallow: /sql/
Disallow: /perl/
Disallow: /cgi/
Disallow: /blog/
Disallow: /sqllite/
Disallow: /java/
Disallow: /admin/
Disallow: /executables/
Crawl-delay: 10
# ----------
# -- msnbot, microsoft indexer
User-agent: msnbot
Disallow: /parttob/
Disallow: /errors/
Disallow: /graphics/
Disallow: /security/docfiles/
Disallow: /documents/security/
Disallow: /scripts/
Disallow: /sql/
Disallow: /perl/
Disallow: /cgi/
Disallow: /blog/
Disallow: /sqllite/
Disallow: /java/
Disallow: /admin/
Disallow: /executables/
Crawl-delay: 10
# ----------
# -- Googlebot
User-agent: googlebot
Disallow: /parttob/
Disallow: /errors/
Disallow: /graphics/
Disallow: /security/docfiles/
Disallow: /documents/security/
Disallow: /scripts/
Disallow: /sql/
Disallow: /perl/
Disallow: /cgi/
Disallow: /blog/
Disallow: /sqllite/
Disallow: /java/
Disallow: /admin/
Disallow: /executables/
User-agent: googlebot-image
Disallow: /
# ----------
# -- Slurp, Yahoo indexer
User-agent: slurp
Disallow: /parttob/
Disallow: /errors/
Disallow: /graphics/
Disallow: /security/docfiles/
Disallow: /documents/security/
Disallow: /scripts/
Disallow: /sql/
Disallow: /perl/
Disallow: /cgi/
Disallow: /blog/
Disallow: /sqllite/
Disallow: /java/
Disallow: /admin/
Disallow: /executables/
Crawl-delay: 10
# ----------
# -- Default for all others
User-agent: *
Disallow: /
# ---------
# Specific others
User-agent: mediapartners
Disallow: /
User-agent: Mediapartners-Google
Disallow: /
User-agent: MSNBot-Media
Disallow: /
User-agent: msnbot-media
Disallow: /
User-agent: BingPreview
Disallow: /
#
# Now the bad bots
#
User-agent: 80legs
Disallow: /
User-agent: Aboundex
Disallow: /
User-agent: AdnormCrawler
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: aiHitBot
Disallow: /
User-agent: archive.org_bot
Disallow: /
User-agent: Baidu
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: Butterfly
Disallow: /
User-agent: Catchbot
Disallow: /
User-agent: CCbot
Disallow: /
User-agent: CityReview
Disallow: /
User-agent: CityReviewBot
Disallow: /
User-agent: coccoc
Disallow: /
User-agent: crawler4j
Disallow: /
User-agent: Dataprovider
Disallow: /
User-agent: del.icio.us
Disallow: /
User-agent: DiffBot
Disallow: /
User-agent: DomainCrawler
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: EC2LinkFinder
Disallow: /
User-agent: Edisterbot
Disallow: /
User-agent: Ezooms
Disallow: /
User-agent: gigabot
Disallow: /
User-agent: GrapeshotCrawler
Disallow: /
User-agent: HuaweiSymantecSpider
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: intelium_bot
Disallow: /
User-agent: Lipperhey
Disallow: /
User-agent: litefinder
Disallow: /
User-agent: MixrankBot
Disallow: /
User-agent: MLBot
Disallow: /
User-agent: Mojeek
Disallow: /
User-agent: MojeekBot
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-agent: netEstate
Disallow: /
User-agent: Netseer
Disallow: /
User-agent: NextGenSearchBot
Disallow: /
User-agent: orangeask
Disallow: /
User-agent: ozzie
Disallow: /
User-agent: Plukkie
Disallow: /
User-agent: psbot
Disallow: /
User-agent: ScoutJet
Disallow: /
User-agent: Sezbot
Disallow: /
User-agent: Sistrix
Disallow: /
User-agent: Search17Bot
Disallow: /
User-agent: SMTBot
Disallow: /
User-agent: SEOENGWorldBot
Disallow: /
User-agent: Skimbot
Disallow: /
User-agent: SWEbot
Disallow: /
User-agent: Thunderstone
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: TweetmemeBot
Disallow: /
User-agent: Twiceler
Disallow: /
User-agent: Twitterbot
Disallow: /
User-agent: voilabot
Disallow: /
User-agent: voyager
Disallow: /
User-agent: WBSearchBot
Disallow: /
User-agent: woriobot
Disallow: /
User-agent: wotbox
Disallow: /
User-agent: yacybot
Disallow: /
User-agent: Yandexbot
Disallow: /
User-agent: Yeti
Disallow: /

Bryr de Gray (SEO Technician) commented:
It would be better to use both the Allow and Disallow directives for this one. Like this:

example:

User-agent: *
Allow: /dir1/*_files.asp$
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/

After uploading/updating the robots.txt, check it with the robots.txt Tester in Google Search Console and see if it works. Or run a search using the syntax site:yoursite.com in the search bar. Please remember that it may take some time for Google to refresh its index, so you may still see some pages that were indexed before you updated the robots.txt.
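
For what it's worth, the "only allow" requirement can be expressed more directly by disallowing everything and then allowing just the matching files. A minimal sketch, assuming the directory is literally named dir1; note that the * and $ wildcards are extensions honored by the major crawlers such as Googlebot and Bingbot, not part of the original robots.txt standard:

User-agent: *
# Allow only files in /dir1/ whose names end in _files.asp.
# The $ anchors the pattern at the end of the URL.
Allow: /dir1/*_files.asp$
# Block everything else.
Disallow: /

Google resolves conflicts between rules by taking the most specific (longest) match, so the Allow wins for matching URLs regardless of order; listing it first also helps simpler crawlers that evaluate rules top to bottom.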
Dave Baldwin (Fixer of Problems) commented:
Dr. Klahn, nice list of bots.  Too bad it is so useless for blocking.  As I always tell people, if you want something to be Private, do Not put it anywhere on the internet.  Other people have access to everything you put on the internet and none of them work for you.
Dr. Klahn (Principal Software Engineer) commented:
Absolutely correct, Dave. I personally rely mostly on blocking countries and server-farm IP blocks; robots.txt is very poor protection for any server. I also have browser ID blocking in place, along with other measures.

But there is no stopping a malbot that emanates from a clean IP block, fakes a valid browser ID, does not exceed the page throttle limit, and does not fall into a honeypot - in other words, one that acts like a normal human being using a browser.  And to some degree that's OK, because if it follows the rules then it's probably not going to create trouble.
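
For reference, browser ID blocking of this kind is enforced at the web server rather than in robots.txt, so a bot cannot simply ignore it. A minimal sketch, assuming Apache with mod_rewrite enabled; the bot names are illustrative examples taken from the list above, not a complete set:

# .htaccess - refuse requests whose User-Agent matches known bad bots.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|80legs|Twiceler) [NC]
RewriteRule .* - [F,L]

This returns a 403 Forbidden before any content is served. Of course, a malbot that fakes a normal browser User-Agent sails straight through, which is exactly the limitation described above.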