RollinNow

asked on

ROBOTS.TXT - Allow and Disallow

I want to allow a few bots but deny others. I know how to disallow:

User-agent: baiduspider
User-agent: asterias
User-agent: ASPSeek
Disallow: /

But if I want, as an example, Googlebot to index the site (except for the two directories below), do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?

And I'd like to be sure I place these two lines correctly in relation to the other lines:

Disallow: /about/
Disallow: /help/

I want to be sure I get this correct the first time because I've been told that if it's wrong, a bot that has been denied may take a very long time to return. Would someone give me an example of the entire combination for robots.txt for this situation? I'm awarding 500 points because I want to be sure, so please be specific and detailed. And yes, I've read the links and other sites, but this is not clear to me, so I'd like it here, please.

 - Georgia



Messiadbunny


User-agent: Googlebot
Disallow: /about/
Disallow: /help/

would be the correct syntax.

Then if you wanted to allow certain bots full access, such as the ones you listed, and disallow any other bots...

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

User-agent: baiduspider
Disallow:
User-agent: asterias
Disallow:
User-agent: ASPSeek
Disallow:

User-agent: *
Disallow: /
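
If you want to sanity-check a file like this before uploading it, Python's standard-library urllib.robotparser will tell you whether a given user-agent may fetch a given path. A minimal sketch (only baiduspider of the three allowed bots is shown, and the expected results are noted in the comments):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /about/
Disallow: /help/

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot: allowed everywhere except the two listed directories.
print(rp.can_fetch("Googlebot", "/index.html"))        # True
print(rp.can_fetch("Googlebot", "/about/team.html"))   # False

# baiduspider: a blank Disallow line means nothing is off limits.
print(rp.can_fetch("baiduspider", "/about/team.html")) # True

# Any bot not named above falls through to the * record and is blocked.
print(rp.can_fetch("SomeOtherBot", "/index.html"))     # False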




RollinNow

ASKER

Well yes, but that's just restating what I already had. I'm asking to understand exactly what it is, in the way the lines are arranged, that allows and disallows. What are the rules? I've read them but cannot understand them. That's what I was hoping for: some detail, in a narrative, from you or someone.

We have:

User-agent: Googlebot

and we also have:

User-agent: *
Disallow: /

I'm not totally certain what it all means, and I want to avoid any confusion and errors.

 - Georgia


User-agent: Googlebot
Disallow:

User-agent: Otherbot
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /private/


The above config will allow Googlebot to crawl the entire site, block Otherbot from crawling any of the site, and block all others (*) from crawling the /tmp/ and /private/ directories. You can use that format to add other bots to allow or disallow.
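
This also answers the "implicit allow" part: a crawler obeys only the one record that names it (or the * record if nothing does), so records don't combine. You can see that with the same standard-library parser; in this sketch Googlebot may fetch /tmp/ even though the * record disallows it, because Googlebot never reads the * record:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow:

User-agent: Otherbot
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /private/
""".splitlines())

print(rp.can_fetch("Googlebot", "/tmp/file.html"))   # True - its own (empty) record wins
print(rp.can_fetch("Otherbot", "/index.html"))       # False - blocked everywhere
print(rp.can_fetch("UnknownBot", "/tmp/file.html"))  # False - falls back to *
print(rp.can_fetch("UnknownBot", "/index.html"))     # True - * only blocks /tmp/ and /private/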
RollinNow,

Have a look at http://en.wikipedia.org/wiki/Robots.txt, which describes the format and provides many other snippets of information about the robots.txt file, along with a useful list of external links (at the bottom of the page).

cheers,

RollinNow,

And after reading your question again (more closely this time!), have a look at the following specific page, which shows some particular examples that relate to what you want to achieve: http://www.robotstxt.org/wc/norobots.html

snippet:

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
--------------------------------------------------------------------------------

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
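
If you'd rather test against the robots.txt your server actually returns (inline comments like the ones in the snippet above are stripped automatically), the same standard-library parser can fetch it over HTTP. A sketch, with www.example.com standing in as the placeholder domain from the example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # download and parse the live file

# cybermapper has its own record with a blank Disallow; everyone else hits the * record.
print(rp.can_fetch("cybermapper", "http://www.example.com/cyberworld/map/index.html"))   # True
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/cyberworld/map/index.html"))  # False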


Hope that helps
I asked:

"I'm awarding 500 points because I want to be sure so please be specific and detailed."

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"

If the basis of that question is not obvious, with an easy answer, then perhaps explain how this is used:

User-agent: Googlebot
Disallow:

Does leaving the Disallow line blank mean it is allowed, or if not, then what? I don't need all the rules explained, just those in regard to my question.

I appreciate the examples, but I need to understand without hesitation so I can decide, not simply copy examples blindly. You see what I mean? All I'm getting are examples. I already have examples, tons of them. I need someone who will take two minutes for 500 points and write several sentences of explanation, perhaps about as much as I've written in this reply.

I've read the links, and others like them, before I asked this question. So, I still need an answer.

 - Georgia

I don't seem to be getting a detailed response about my "implicit" question, so let's see if I can ask this a bit differently:

If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is a blank line required before or after?

User-agent: moget
User-agent: moget/2.1
User-agent: pavuk
User-agent: pcBrowser
User-agent: psbot
User-agent: searchpreview
User-agent: spanner
User-agent: suzuran
User-agent: tAkeOut
User-agent: toCrawl/UrlDispatcher
User-agent: turingos
User-agent: webfetch/2.1.0
User-agent: wget
Disallow: /
And the same question goes for allowing a group of bots:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /

These last two comments are just examples. I'm feeling that I would rather use a disallow statement than allow specific bots and eliminate all the rest. The reason is that there may be some "good" bots I don't know about, and I'd not want to eliminate them before I discover them.

I hope this helps to better illustrate what I'm looking for, which is not examples, but dialog.

  - Georgia
Just to be accurate (I hope), I understand that last comment should be:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:

Instead of:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /
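
For what it's worth, the grouped form can be checked the same way. In this sketch (standard-library urllib.robotparser again, with a * record added just to show the fallback), all three named bots come back allowed and an unnamed one does not:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:

User-agent: *
Disallow: /
""".splitlines())

for bot in ("Googlebot", "Yahoo", "MSN", "RandomBot"):
    print(bot, rp.can_fetch(bot, "/index.html"))
# Googlebot True / Yahoo True / MSN True / RandomBot False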
SOLUTION
humeniuk

ASKER CERTIFIED SOLUTION
Okay, got it. I feel a little safer and a bit more confident now that someone else has the responsibility of robots.txt working, or not.  :)

 - Georgia