• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1071

ROBOTS.TXT - Allow and Disallow

I want to allow a few bots but deny others. I know how to disallow:

User-agent: baiduspider
User-agent: asterias
User-agent: ASPSeek
Disallow: /

But if I want, as an example, Googlebot to index the site (except for the two directories below), do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?

And I'd like to be sure I place these two lines correctly in relation to the other lines:

Disallow: /about/
Disallow: /help/

I want to be sure I get this correct the first time, because I've been told that if it's wrong, it may take a long time before a bot that was denied comes back. Would someone give me an example of the entire combined robots.txt for this situation? I'm awarding 500 points because I want to be sure, so please be specific and detailed. And yes, I've read the links and other sites, but this is not clear to me, so I'd like it explained here, please.

 - Georgia



2 Solutions
 
MessiadbunnyCommented:

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

would be the correct syntax.

Then if you wanted to allow certain bots full access, such as the ones you listed, and disallow any other bots...

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

User-agent: baiduspider
Disallow:
User-agent: asterias
Disallow:
User-agent: ASPSeek
Disallow:

User-agent: *
Disallow: /
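
If you want to double-check a draft like this before publishing it, one option (outside of robots.txt itself) is Python's standard-library parser, urllib.robotparser. This is just a sketch, trimmed to three of the records above to keep it short:

import urllib.robotparser

rules = """\
User-agent: Googlebot
Disallow: /about/
Disallow: /help/

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/"))         # True  - allowed in general
print(rp.can_fetch("Googlebot", "/about/"))   # False - listed under its Disallow lines
print(rp.can_fetch("baiduspider", "/help/"))  # True  - an empty Disallow allows everything
print(rp.can_fetch("SomeOtherBot", "/"))      # False - caught by the "User-agent: *" record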




 
RollinNowAuthor Commented:
Well yes, but that just restates what I already had. I'm asking to understand exactly what it is, in the way the lines are arranged, that allows and disallows. What are the rules? I've read them but cannot understand them. That's what I was hoping for: some detail, in a narrative, from you or someone.

We have:

User-agent: Googlebot

and we also have:

User-agent: *
Disallow: /

I'm not totally certain what it all means, and I want to be, so I can avoid any confusion and errors.

 - Georgia


 
humeniukCommented:
User-agent: Googlebot
Disallow:

User-agent: Otherbot
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /private/


The above config will allow Googlebot to crawl the entire site, block Otherbot from crawling any of the site, and block all other bots (*) from crawling the /tmp/ and /private/ directories. You can use that format to add other bots to allow or disallow.
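
If it helps your confidence, once the file is published you can also point Python's standard urllib.robotparser at the live copy and ask it the same questions (www.example.com below is only a placeholder for your own domain):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder URL
rp.read()  # fetch and parse the published robots.txt

# With the file above in place, these should print True, False, False:
print(rp.can_fetch("Googlebot", "http://www.example.com/private/"))  # Googlebot may go anywhere
print(rp.can_fetch("Otherbot", "http://www.example.com/"))           # Otherbot is blocked entirely
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/tmp/"))   # everyone else blocked from /tmp/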

 
fostejoCommented:
RollinNow,

Have a look at http://en.wikipedia.org/wiki/Robots.txt, which describes the format and provides many other snippets of information about the robots.txt file, along with a useful list of external links (at the bottom of the page).

cheers,

 
fostejoCommented:
RollinNow,

And after reading your question again (more closely this time!), have a look at the following page, which shows some particular examples that relate to what you want to achieve: http://www.robotstxt.org/wc/norobots.html

snippet:

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
--------------------------------------------------------------------------------

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:


Hope that helps
 
RollinNowAuthor Commented:
I asked:

"I'm awarding 500 points because I want to be sure so please be specific and detailed."

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"

If the basis of that question is not obvious, with an easy answer, then perhaps explain how this is used:

User-agent: Googlebot
Disallow:

Does leaving the Disallow line blank mean it is allowed, or if not, then what? I don't need all the rules explained, just those in regard to my question.

I appreciate the examples, but I need to understand without hesitation so I can decide, not simply copy examples blindly. You see what I mean? All I'm getting are examples. I already have examples, tons of them. I need someone who will take two minutes for 500 points and write several sentences of explanation, perhaps about as much as I've written in this reply.

I've read the links, and others like them, before I asked this question. So, I still need an answer.

 - Georgia

 
RollinNowAuthor Commented:
I don't seem to be getting a detailed response about my "implicit" question so let's see if I can ask this a bit differently:

If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is there a blank line required before or after?

User-agent: moget
User-agent: moget/2.1
User-agent: pavuk
User-agent: pcBrowser
User-agent: psbot
User-agent: searchpreview
User-agent: spanner
User-agent: suzuran
User-agent: tAkeOut
User-agent: toCrawl/UrlDispatcher
User-agent: turingos
User-agent: webfetch/2.1.0
User-agent: wget
Disallow: /
 
RollinNowAuthor Commented:
And the same question goes for allowing a group of bots:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /

These last two comments are just examples. I'm feeling that I would rather disallow specific bots than allow specific bots and eliminate all the rest. The reason is that there may be some "good" bots I don't know about, and I'd not want to eliminate them before I discover them.

I hope this helps to better illustrate what I'm looking for, which is not examples, but dialog.

  - Georgia
 
RollinNowAuthor Commented:
Just to be accurate (I hope), I understand that last comment should be:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:

Instead of:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /
 
humeniukCommented:
Correct - "Disallow: /" is the correct syntax to exclude.  "Disallow:" will not exclude.

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"
A robots.txt file can only be used to exclude, so the default is to allow. If you disallow a list of other bots, then Googlebot is allowed. If you use a wildcard, i.e. "User-agent: *", then you have to specify that Googlebot is an exception to it or it will also be disallowed.

"If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is there a blank line required before or after?"
I prefer to have an entry for each bot, but I'm not sure it's necessary.  Also, it is conventional to leave a blank line between, but it isn't a functional necessity.  Rather, it is a matter of keeping the file clear and easy to read.
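
For what it's worth, the grouped form (several User-agent lines sharing one Disallow) is accepted by at least one real parser. Here is a small sketch using Python's standard urllib.robotparser, with a few bots from your earlier list:

import urllib.robotparser

rules = """\
User-agent: moget
User-agent: pavuk
User-agent: wget
Disallow: /

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("wget", "/"))       # False - shares the grouped "Disallow: /"
print(rp.can_fetch("moget", "/"))      # False - same record
print(rp.can_fetch("Googlebot", "/"))  # True  - not in the group, falls through to "*"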
 
fostejoCommented:
RollinNow,

Yes, you can list numerous user agents and then specify the 'Disallow' once, as you've listed in your comment of 7:40 GMT - as long as each line ends with a CR, CR+LF or just LF, they should be interpreted correctly.

In your last comment, you've altered the 'Disallow: /' to a 'Disallow:' - the effect of this is that there is NO actual restriction on what the bots can trawl - an empty Disallow field indicates that ALL URLs can be retrieved, as per the description of the 'Disallow' field in the original link I posted, http://www.robotstxt.org/wc/norobots.html

So, from your original question you want to "allow a few bots but deny others" - to achieve this you'd have a robots.txt file like:

# Disallow ALL robots not matched by any other records..
User-agent: *
Disallow: /    

# Allow these specific robots access to everything..
User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:


However, from your later posts, you state that you feel you'd be better off disallowing the known 'bad ones' and allowing everything else - in that case, your robots.txt file should probably look more like:

# Disallow these SPECIFIC robots, but allow everything else
User-agent: BadRobot
User-agent: HorridRobot
User-agent: NastyRobot
Disallow: /
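
As a quick check of the point that anything you don't name stays allowed (note there is no "User-agent: *" record in this version), here's the same kind of sketch with Python's standard urllib.robotparser - BadRobot and friends are of course just placeholder names:

import urllib.robotparser

rules = """\
User-agent: BadRobot
User-agent: HorridRobot
User-agent: NastyRobot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("BadRobot", "/"))       # False - named in the record
print(rp.can_fetch("Googlebot", "/"))      # True  - not named, and no "*" record, so allowed
print(rp.can_fetch("UnknownNewBot", "/"))  # True  - the default is to allow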

Also, as you mention, there are probably a number of robots out there that you're not aware of... this web page lists about 300! http://www.robotstxt.org/wc/active/html/index.html

Hope that helps.
 
RollinNowAuthor Commented:
Okay, got it. I feel a little safer and a bit more confident now that someone else has the responsibility of robots.txt working, or not.  :)

 - Georgia
