Solved

ROBOTS.TXT - Allow and Disallow

Posted on 2006-11-16
Medium Priority
1,037 Views
Last Modified: 2008-02-01
I want to allow a few bots but deny others. I know how to disallow:

User-agent: baiduspider
User-agent: asterias
User-agent: ASPSeek
Disallow: /

But if I want, as an example, Googlebot to index the site (except for the two directories below), do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?

And I'd like to be sure I place these two lines correctly in relation to the other lines:

Disallow: /about/
Disallow: /help/

I want to be sure I get this correct the first time because I've been told that if it's wrong, it may take forever before a bot returns once it has been denied. Would someone give me an example of the entire combined robots.txt for this situation? I'm awarding 500 points because I want to be sure, so please be specific and detailed. And yes, I've read the links and other sites, but this is not clear to me, so I'd like it here, please.

 - Georgia



Question by:RollinNow
12 Comments
 
LVL 2

Expert Comment

by:Messiadbunny
ID: 17963206

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

would be the correct syntax.

Then if you wanted to allow certain bots full access, such as the ones you listed, and disallow any other bots...

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

User-agent: baiduspider
Disallow:
User-agent: asterias
Disallow:
User-agent: ASPSeek
Disallow:

User-agent: *
Disallow: /




 

Author Comment

by:RollinNow
ID: 17963286
Well yes, but that's just restating what I already had. I'm asking to understand exactly what it is, in the way the lines are arranged, that allows and disallows. What are the rules? I've read them but cannot understand them. That's why I was hoping for some detail in a narrative from you or someone.

We have:

User-agent: Googlebot

and we also have:

User-agent: *
Disallow: /

I'm not totally certain what it all means, and I want to avoid any confusion and errors.

 - Georgia


 
LVL 33

Expert Comment

by:humeniuk
ID: 17964995
User-agent: Googlebot
Disallow:

User-agent: Otherbot
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /private/


The above config will allow Googlebot to crawl the entire site, block Otherbot from crawling any of the site, and block all others (*) from crawling the /tmp/ and /private/ directories. You can use that format to add other bots to allow or disallow.
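
If you want to sanity-check rules like these before deploying them, one option (a minimal sketch, not something this thread depends on) is Python's standard-library urllib.robotparser. Real crawlers may interpret robots.txt slightly differently, and "RandomBot" below is just a stand-in for any bot not named in the file:

# Feed the example rules above to Python's built-in robots.txt parser
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: Otherbot
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot may crawl everything, including the otherwise restricted directories.
print(parser.can_fetch("Googlebot", "/private/page.html"))   # True
# Otherbot is blocked from the whole site.
print(parser.can_fetch("Otherbot", "/index.html"))           # False
# Any unlisted bot falls through to the * record.
print(parser.can_fetch("RandomBot", "/tmp/file.html"))       # False
print(parser.can_fetch("RandomBot", "/index.html"))          # True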
 
LVL 10

Expert Comment

by:fostejo
ID: 17966252
RollinNow,

Have a look at http://en.wikipedia.org/wiki/Robots.txt, which describes the format and provides many other snippets of information about the robots.txt file, along with a useful list of external links (at the bottom of the page).

cheers,

 
LVL 10

Expert Comment

by:fostejo
ID: 17966369
RollinNow,

And after reading your question again (more closely this time!), have a look at the following page, which shows some specific examples that relate to what you want to achieve: http://www.robotstxt.org/wc/norobots.html

snippet:

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
--------------------------------------------------------------------------------

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:


Hope that helps
 

Author Comment

by:RollinNow
ID: 17966443
I asked:

"I'm awarding 500 points because I want to be sure so please be specific and detailed."

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"

If the basis of that question is not obvious enough for an easy answer, then perhaps explain how this is used:

User-agent: Googlebot
Disallow:

Does leaving the Disallow line blank mean it is allowed, or if not, then what? I don't need all the rules explained, just those in regard to my question.

I appreciate the examples, but I need to understand without hesitation so I can decide, not simply copy examples blindly. You see what I mean? All I'm getting are examples. I already have examples, tons of them. I need someone who will take two minutes for 500 points and write several sentences of explanation, perhaps about as much as I've written in this reply.

I've read the links, and others like them, before I asked this question. So, I still need an answer.

 - Georgia

 

Author Comment

by:RollinNow
ID: 17967977
I don't seem to be getting a detailed response about my "implicit" question so let's see if I can ask this a bit differently:

If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is there a blank line required before or after?

User-agent: moget
User-agent: moget/2.1
User-agent: pavuk
User-agent: pcBrowser
User-agent: psbot
User-agent: searchpreview
User-agent: spanner
User-agent: suzuran
User-agent: tAkeOut
User-agent: toCrawl/UrlDispatcher
User-agent: turingos
User-agent: webfetch/2.1.0
User-agent: wget
Disallow: /
 

Author Comment

by:RollinNow
ID: 17968075
And the same question goes for allowing a group of bots:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /

These last two comments are just examples. I'm feeling that I would rather use a disallow statement than allow specific bots and eliminate all the rest. The reason is that there may be some "good" bots I don't know about, and I wouldn't want to eliminate them before I discover them.

I hope this helps to better illustrate what I'm looking for, which is not examples, but dialog.

  - Georgia
 

Author Comment

by:RollinNow
ID: 17968086
Just to be accurate (I hope), I understand that my last comment should have been:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:

Instead of:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /
 
LVL 33

Assisted Solution

by:humeniuk
humeniuk earned 600 total points
ID: 17969177
Correct - "Disallow: /" is the correct syntax to exclude.  "Disallow:" will not exclude.

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"
A robots.txt file can only be used to exclude, so the default is to allow. If you disallow a list of other bots, then Googlebot is allowed. If you use a wildcard, i.e. "User-agent: *", then you have to specify that Googlebot is an exception to this or it will also be disallowed.

"If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is there a blank line required before or after?"
I prefer to have an entry for each bot, but I'm not sure it's necessary. Also, it is conventional to leave a blank line between entries, but it isn't a functional necessity; rather, it is a matter of keeping the file clear and easy to read.
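
To make the "Disallow:" vs "Disallow: /" distinction concrete, here is a small sketch (assuming Python is available; it uses the standard-library urllib.robotparser, whose behaviour may not match every real crawler exactly, and "SomeBot" is just a placeholder for any unlisted bot):

# Empty Disallow excludes nothing for Googlebot; the * record blocks everyone else.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines())

print(parser.can_fetch("Googlebot", "/about/"))   # True  - blank Disallow allows everything
print(parser.can_fetch("SomeBot", "/about/"))     # False - falls through to the * record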
 
LVL 10

Accepted Solution

by:
fostejo earned 1400 total points
ID: 17969299
RollinNow,

Yes, you can list numerous user agents and then specify the 'Disallow' once, as you've listed in your comment of 7:40 GMT - as long as each line ends with a CR, CR+LF or just LF, they should be interpreted correctly.

In your last comment, you've altered the 'Disallow: /' to 'Disallow:' - the effect of this is that there is NO actual restriction on what the bots can crawl - an empty Disallow field indicates that ALL URLs can be retrieved, as per the description of the 'Disallow' field on the original link I posted, http://www.robotstxt.org/wc/norobots.html

So, from your original question, you want to "allow a few bots but deny others" - to achieve this, you'd have a robots.txt file like:

# Disallow ALL robots not matched by any other records..
User-agent: *
Disallow: /    

# Allow these specific robots access to everything..
User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:


However, from your later posts, you state that you feel you'd be better off disallowing the known 'bad ones' and allowing everything else - in that case, your robots.txt file should probably look more like:

# Disallow these SPECIFIC robots, but allow everything else
User-agent: BadRobot
User-agent: HorridRobot
User-agent: NastyRobot
Disallow: /
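
As a quick check of that last record (a sketch only, using Python's standard-library urllib.robotparser with the placeholder bot names from the example above), the grouped User-agent lines do share the single Disallow, and any bot not listed is left alone:

# Several User-agent lines grouped over one Disallow; everything else allowed
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: BadRobot
User-agent: HorridRobot
User-agent: NastyRobot
Disallow: /
""".splitlines())

print(parser.can_fetch("HorridRobot", "/help/"))  # False - covered by the shared Disallow
print(parser.can_fetch("Googlebot", "/help/"))    # True  - no record matches, so access is allowed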

Also, as you mention, there are probably a number of robots out there that you're not aware of - this web page lists about 300!   http://www.robotstxt.org/wc/active/html/index.html

Hope that helps.
 

Author Comment

by:RollinNow
ID: 17969518
Okay, got it. I feel a little safer and a bit more confident now that someone else has the responsibility of robots.txt working, or not.  :)

 - Georgia