robots.txt and wild cards

I don't want Google to index my "clients" folder.  Can I use a wildcard, or is it even necessary?

Should it be:

User-agent: *
Disallow: /clients/*

or:

User-agent: *
Disallow: /clients/

Thanks


jason94024 Asked:
 
DooDah Commented:


How do users FIND your site?   Google, Yahoo and MSN?

When Google, Yahoo and MSN index your site, how do you want them to proceed?

We can go with the HAIL-MARY variation of my original post, just to make sure every BOT that cares to listen gets the message ( not all do ); for the HONEST OPERATORS, put it all in there:

User-agent: *
Disallow: /clients
Disallow: /clients/
Disallow: /clients/*
Disallow: /clients/*?

You're really ONLY talking to the SEARCH ENGINE BOTS that YOUR users use to find your site... Google, Yahoo, MSN, and the others as they drop off in popularity...

IMPORTANT:  Robots.Txt will NOT SECURE your site...  so if you are trying to Block Malicious Robots, you're just telling Malicious Robots exactly where to look.

 
Fayaz Commented:
To prevent all robots from indexing a page on your site, use a noindex meta tag (for example, <meta name="robots" content="noindex"> in the page's <head>):
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156412
 
torimar Commented:
Use the second version without the wildcard.
 
DooDah Commented:

Hello jason94024,

Put this in your robots.txt:

User-agent: *
Disallow: /clients/*?
 
jason94024 (Author) Commented:
So if I use the second version without the wildcard, it will disallow everything in that folder, including other folders inside it?
 
Fayaz Commented:
What do you want to remove? My entire site or directory:
To prevent robots from crawling your site, add the following directive to your robots.txt file:

User-agent: *
Disallow: /

To prevent just Googlebot from crawling your site in the future, use the following directive:

User-agent: Googlebot
Disallow: /

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt directives below.

For your http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /
 
jason94024 (Author) Commented:
I want search engines to not index the folder on my site called "clients", including all the folders inside it. But I want the search engines to index everything else.

So the folder is www.mysite.com/clients, and www.mysite.com/clients/example/index.htm should not be indexed.

Right now I just want to stop them from indexing; I will deal with removing already-indexed pages later.  So this is the right robots.txt:

User-agent: *
Disallow: /clients/

?
 
DooDah Commented:

EVEN GOOGLE HAS PLACES on their site DISALLOWED.
Use "http://www.google.com/robots.txt" as an example...

User-agent: *
Disallow: /clients?
Disallow: /clients/?
 
jason94024 (Author) Commented:
so the wildcare is a "?" not a *?
 
jason94024 (Author) Commented:
I mean wildcard
 
DooDah Commented:


http://www.google.com/robots.txt

It evidently works for GOOGLE.   They are blocking all BOTS to several directories and their subdirectories.
 
torimar Commented:
If you wish to deny bot access to the folder /clients, then this is the way to go:
User-agent: *
Disallow: /clients/

(check: http://www.robotstxt.org/robotstxt.html)

Do not use wildcards, because a) most bots do not understand them, and b) you can't be sure what the bots that do understand them will actually do with them.
 
torimar Commented:
ps:
The above will also deny access to subfolders.
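For anyone who wants to double-check that locally, here is a minimal sketch using Python's standard urllib.robotparser (the hostname and the /about.html path are just made-up examples based on this thread):

from urllib.robotparser import RobotFileParser

# The exact rules recommended above.
rules = [
    "User-agent: *",
    "Disallow: /clients/",
]

rp = RobotFileParser()
rp.parse(rules)

# The /clients/ folder and everything below it is blocked...
print(rp.can_fetch("*", "http://www.mysite.com/clients/example/index.htm"))  # False
# ...while the rest of the site stays crawlable.
print(rp.can_fetch("*", "http://www.mysite.com/about.html"))                 # True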
 
DooDah Commented:

Jason,

In the end, you are going to have to make up your own mind, but I would compare what the BIG GUYS are doing on their WebSites...    Google and Yahoo might be a good place to start, as they are not talking to themselves all day long... As for "most bots do not understand them" -- I doubt that...

EVEN GOOGLE & YAHOO have PLACES on their sites DISALLOWED
Use " http://www.google.com/robots.txt " or  " http://m.www.yahoo.com/robots.txt " as an example...

GOOGLE => http://www.google.com/robots.txt 
User-agent: *
Disallow: /clients?
Disallow: /clients/?

YAHOO => http://m.www.yahoo.com/robots.txt
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

IMPORTANT:  Robots.Txt will NOT SECURE your site...  so if you are trying to Block Malicious Robots, you're just telling Malicious Robots exactly where to look.

 
torimar Commented:
http://www.robotstxt.org/robotstxt.html -- this is the most official documentation that there is.
See the following:
"Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif"."

Also check out http://en.wikipedia.org/wiki/Robots.txt.
On this page (http://www.mcanerin.com/EN/search-engine/robots-txt.asp) you will learn that, in general, wildcards should not be used, but that certain bots support certain wildcards; Google, Yahoo and MSN, for instance, support the "*" for any number of random characters.

But there are hundreds of bots, not only 3. So you would have to list all of them as separate User-agents, otherwise it will not work. And when writing specific instructions for specific user-agents you face the problem that you need to understand the implementation that each respective bot uses.
Just an example: when Google uses a '?' in their robots.txt, it is not a wildcard at all; it refers to a literal question mark in the URL.
Check out the Google bot documentation here: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449 -> see: "Manually create a robots.txt file"

There is also a test tool for robots.txt files on that Google page. Use it, and you will see that
User-agent: *
Disallow: /clients/
is by far the easiest and most complete solution.
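As a rough illustration of why the wildcard variants are unreliable, here is a sketch using Python's urllib.robotparser, a parser that follows the original exclusion standard and does not expand '*' in a Disallow path as a wildcard (the URL is a made-up example from this thread):

from urllib.robotparser import RobotFileParser

def blocked(disallow_rule, url):
    # True if a standard-only parser would refuse to fetch the URL.
    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_rule])
    return not rp.can_fetch("*", url)

url = "http://www.mysite.com/clients/example/index.htm"

print(blocked("/clients/",   url))  # True  - the plain rule blocks the folder
print(blocked("/clients/*",  url))  # False - the '*' is treated as a literal character, so the rule never matches
print(blocked("/clients/*?", url))  # False - same problem

A wildcard-aware bot like Googlebot would treat the last two rules differently, which is exactly the inconsistency described above.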
 
torimar Commented:
> Disallow: /clients/
This command blocks the folder /clients and everything inside it. For Google, MSN, Yahoo *and* for all the other bots.

Hence the commands:
> Disallow: /clients/*
> Disallow: /clients/*?
are simply superfluous.

And the command:
> Disallow: /clients
is a plain prefix match: it blocks the folder /clients, but also any other path that merely starts with "/clients" (a file called "clients" in the web root, for example) - which is broader than the asker's intention.
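To make the prefix-matching point concrete, the same kind of sketch (again Python's urllib.robotparser, with made-up paths) shows how the two spellings differ:

from urllib.robotparser import RobotFileParser

def blocked(disallow_rule, url):
    # True if a standard robots.txt parser would refuse to fetch the URL.
    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_rule])
    return not rp.can_fetch("*", url)

# Without the trailing slash the rule is a bare prefix match, so it also
# catches sibling paths that merely start with "/clients":
print(blocked("/clients",  "http://www.mysite.com/clients-archive.html"))       # True  (over-blocks)
print(blocked("/clients/", "http://www.mysite.com/clients-archive.html"))       # False
# Both spellings block the folder itself and its subfolders:
print(blocked("/clients",  "http://www.mysite.com/clients/example/index.htm"))  # True
print(blocked("/clients/", "http://www.mysite.com/clients/example/index.htm"))  # True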