Solved

ROBOTS.TXT - Allow and Disallow

Posted on 2006-11-16
977 Views
Last Modified: 2008-02-01
I want to allow a few bots but deny others. I know how to disallow:

User-agent: baiduspider
User-agent: asterias
User-agent: ASPSeek
Disallow: /

But if I want, as an example, Googlebot to index the site (except for the two directories below), do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?

And I'd like to be sure I place these two lines correctly in relation to the other lines:

Disallow: /about/
Disallow: /help/

I want to be sure I get this correct the first time because I've been told that if it's wrong, it may take a very long time before a denied bot returns. Would someone give me an example of the entire robots.txt combination for this situation? I'm awarding 500 points because I want to be sure, so please be specific and detailed. And yes, I've read the links and other sites, but this is not clear to me, so I'd like it here, please.

 - Georgia



Question by:RollinNow
12 Comments
 
LVL 2

Expert Comment

by:Messiadbunny

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

would be the correct syntax.

Then if you wanted to allow certain bots full access, such as the ones you listed, and disallow any other bots...

User-agent: Googlebot
Disallow: /about/
Disallow: /help/

User-agent: baiduspider
Disallow:
User-agent: asterias
Disallow:
User-agent: ASPSeek
Disallow:

User-agent: *
Disallow: /




 

Author Comment

by:RollinNow
Well yes, but that's just restating what I already had. I'm asking to understand exactly what it is, in the way the lines are arranged, that allows and disallows. What are the rules? I've read them but cannot understand them. That's why I was hoping for some detail, in a narrative, from you or someone.

We have:

User-agent: Googlebot

and we also have:

User-agent: *
Disallow: /

I'm not totally certain what it all means, and I want to be sure so I can avoid any confusion and errors.

 - Georgia


 
LVL 33

Expert Comment

by:humeniuk
User-agent: Googlebot
Disallow:

User-agent: Otherbot
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /private/


The above config allows Googlebot to crawl the entire site, blocks Otherbot from crawling any of the site, and blocks all others (*) from crawling the /tmp/ and /private/ directories.  You can use that format to add other bots to allow or disallow.
 
LVL 10

Expert Comment

by:fostejo
RollinNow,

Have a look at http://en.wikipedia.org/wiki/Robots.txt, which describes the format and provides many other snippets of information about the robots.txt file, along with a useful list of external links (at the bottom of the page).

cheers,

 
LVL 10

Expert Comment

by:fostejo
RollinNow,

And after reading your question again (more closely this time!), have a look at the following page, which shows some particular examples that relate to what you want to achieve:  http://www.robotstxt.org/wc/norobots.html

snippet:

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
--------------------------------------------------------------------------------

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:


Hope that helps
 

Author Comment

by:RollinNow
I asked:

"I'm awarding 500 points because I want to be sure so please be specific and detailed."

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"

If the basis of that question is not obvious enough for an easy answer, then perhaps explain how this is used:

User-agent: Googlebot
Disallow:

Does leaving the Disallow line blank mean it is allowed, or if not, then what? I don't need all the rules explained, just those in regard to my question.

I appreciate the examples, but I need to understand without hesitation so I can decide, not simply copy examples blindly. You see what I mean? All I'm getting are examples. I already have examples, tons of them. I need someone who will take two minutes for 500 points and write several sentences of explanation, perhaps about as much as I've written in this reply.

I've read the links, and others like them, before I asked this question. So, I still need an answer.

 - Georgia


Author Comment

by:RollinNow
I don't seem to be getting a detailed response to my "implicit" question, so let's see if I can ask this a bit differently:

If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is there a blank line required before or after?

User-agent: moget
User-agent: moget/2.1
User-agent: pavuk
User-agent: pcBrowser
User-agent: psbot
User-agent: searchpreview
User-agent: spanner
User-agent: suzuran
User-agent: tAkeOut
User-agent: toCrawl/UrlDispatcher
User-agent: turingos
User-agent: webfetch/2.1.0
User-agent: wget
Disallow: /
 

Author Comment

by:RollinNow
And the same question goes for allowing a group of bots:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /

These last two comments are just examples. I feel I would rather use a disallow statement than allow specific bots and eliminate all the rest. The reason is that there may be some "good" bots I don't know about, and I wouldn't want to block them before I discover them.

I hope this helps to better illustrate what I'm looking for, which is not examples, but dialog.

  - Georgia
 

Author Comment

by:RollinNow
Just to be accurate (I hope), I understand that last comment should be:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:

Instead of:

User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow: /
 
LVL 33

Assisted Solution

by:humeniuk
humeniuk earned 150 total points
Correct - "Disallow: /" is the correct syntax to exclude.  "Disallow:" will not exclude.

"Do I need some kind of statement that Googlebot is Allowed, or is that implicit by not including Googlebot in the Disallow line?"
A robots.txt file can only be used to exclude, so the default is to allow.  If you disallow a list of other bots, then Googlebot is allowed.  If you use a wildcard, i.e. "User-agent: *", then you have to specify that Googlebot is an exception to this or it will also be disallowed.
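For instance, a minimal sketch of that exception (assuming Googlebot is the one bot you want to let in):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Per the spec at robotstxt.org, the "*" record only applies to robots that don't match any other record, so Googlebot obeys its own (empty) Disallow while every other bot falls under the wildcard.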

"If I want to disallow multiple bots, do I have to use the "Disallow: /" line after each User-agent line, or can I group them like this? And is there a blank line required before or after?"
I prefer to have an entry for each bot, but I'm not sure it's necessary.  Also, it is conventional to leave a blank line between entries, but that is largely a matter of keeping the file clear and easy to read.
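For example, the one-entry-per-bot style would look like this (the bot names here are placeholders, not real crawlers):

User-agent: badbot-one
Disallow: /

User-agent: badbot-two
Disallow: /

One caution on blank lines: the spec at robotstxt.org treats one or more blank lines as the separator between records, so keep a blank line between records, but avoid putting one in the middle of a record (for example, between grouped User-agent lines and their Disallow).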
 
LVL 10

Accepted Solution

by:fostejo
fostejo earned 350 total points
RollinNow,

Yes, you can list numerous user agents and then specify the 'disallow' as you've listed in your comment of 7:40 GMT - as long as each line ends with a CR, CR+LF, or just LF, they should be interpreted correctly.

In your last comment, you've altered the 'Disallow: /' to 'Disallow:' - the effect of this is that there is NO actual restriction on what the bots can crawl - an empty Disallow field indicates that ALL URLs can be retrieved, as per the description of the 'Disallow' field in the original link I posted, http://www.robotstxt.org/wc/norobots.html

So, from your original question you want to "allow a few bots but deny others" - to achieve this you'd have a robots.txt file like:

# Disallow ALL robots not matched by any other records..
User-agent: *
Disallow: /    

# Allow these specific robots access to everything..
User-agent: Googlebot
User-agent: Yahoo
User-agent: MSN
Disallow:


However, from your later posts, you state that you feel you'd be better off disallowing the known 'bad ones' and allowing everything else - in that case, your robots.txt file should probably look more like:

# Disallow these SPECIFIC robots, but allow everything else
User-agent: BadRobot
User-agent: HorridRobot
User-agent: NastyRobot
Disallow: /

Also, as you mention, there are probably a number of robots out there that you're not aware of. This web page lists about 300!   http://www.robotstxt.org/wc/active/html/index.html
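Putting that together with your original question (allow everything, including Googlebot, except /about/ and /help/, while shutting out the bots you named), a sketch of the complete file might be:

# Deny these specific robots everything
User-agent: baiduspider
User-agent: asterias
User-agent: ASPSeek
Disallow: /

# Every other robot may crawl the whole site
# except these two directories
User-agent: *
Disallow: /about/
Disallow: /help/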

Hope that helps.
 

Author Comment

by:RollinNow
Okay, got it. I feel a little safer and a bit more confident now that someone else has the responsibility of robots.txt working, or not.  :)

 - Georgia