Solved

robots.txt and wildcards

Posted on 2010-01-03
Medium Priority
333 Views
Last Modified: 2013-11-18
I don't want Google to index my "clients" folder. Can I use a wildcard, or is it even necessary?

Should it be:

User-agent: *
Disallow: /clients/*

or:

User-agent: *
Disallow: /clients/

Thanks


Question by:jason94024
17 Comments
 
LVL 10

Expert Comment

by:Fayaz
ID: 26169109
To prevent all robots from indexing a page on your site, use a noindex meta tag
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156412
 
LVL 35

Expert Comment

by:torimar
ID: 26169180
Use the second version without the wildcard.

 
LVL 3

Expert Comment

by:DooDah
ID: 26169284

Hello jason94024,

In your robots.txt:

User-agent: *
Disallow: /clients/*?
 

Author Comment

by:jason94024
ID: 26169374
So if I use the second version, without the wildcard, it will disallow everything in that folder, including subfolders?
 
LVL 10

Expert Comment

by:Fayaz
ID: 26169392
What do you want to remove? My entire site, or a directory?

To prevent robots from crawling your site, add the following directive to your robots.txt file:

User-agent: *
Disallow: /

To prevent just Googlebot from crawling your site in the future, use the following directive:

User-agent: Googlebot
Disallow: /

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt directives below.

For the http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /
 

Author Comment

by:jason94024
ID: 26169453
I want search engines not to index the folder on my site called "clients", including all the folders inside it. But I want them to index everything else.

So www.mysite.com/clients and www.mysite.com/clients/example/index.htm should not be indexed.

Right now I just want to stop them from indexing; I will deal with removing already-indexed pages later. So this is the right txt:

User-agent: *
Disallow: /clients/

?
 
LVL 3

Expert Comment

by:DooDah
ID: 26169459

EVEN GOOGLE HAS PLACES on their site DISALLOWED
Use "http://www.google.com/robots.txt" as an examle...

User-agent: *
Disallow: /clients?
Disallow: /clients/?
 

Author Comment

by:jason94024
ID: 26169469
So the wildcard is a "?" and not a "*"?
 
LVL 3

Expert Comment

by:DooDah
ID: 26169599


http://www.google.com/robots.txt

It evidently works for GOOGLE. They are blocking all BOTS from several directories and their subdirectories.

User-agent: *
Disallow: /clients?
Disallow: /clients/?
 
LVL 35

Expert Comment

by:torimar
ID: 26170238
If you wish to deny bot access to the folder /clients, then this is the way to go:
User-agent: *
Disallow: /clients/

(check: http://www.robotstxt.org/robotstxt.html)

Do not use wildcards, because a) most bots do not understand them, and b) you cannot be sure how the bots that do understand them will interpret them.
 
LVL 35

Expert Comment

by:torimar
ID: 26170242
ps:
The above will also deny access to subfolders.
 
LVL 3

Expert Comment

by:DooDah
ID: 26172553

Jason,

In the end, you are going to have to make up your own mind, but I would compare what the BIG GUYS are doing on their websites... Google and Yahoo might be a good place to start, as they are not talking to themselves all day long... As for "most bots do not understand them": I doubt that...

EVEN GOOGLE & YAHOO have PLACES on their sites DISALLOWED.
Use http://www.google.com/robots.txt or http://m.www.yahoo.com/robots.txt as examples...

GOOGLE => http://www.google.com/robots.txt 
User-agent: *
Disallow: /clients?
Disallow: /clients/?

YAHOO => http://m.www.yahoo.com/robots.txt
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

IMPORTANT: robots.txt will NOT SECURE your site... so if you are trying to block malicious robots, you're just telling them exactly where to look.

 
LVL 35

Expert Comment

by:torimar
ID: 26172837
http://www.robotstxt.org/robotstxt.html -- this is the most official documentation that there is.
See the following:
"Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif"."

Also check out http://en.wikipedia.org/wiki/Robots.txt.
On this page (http://www.mcanerin.com/EN/search-engine/robots-txt.asp) you will learn that, in general, wildcards should not be used, but that certain bots support certain wildcards; for example, Google, Yahoo and MSN support "*" for any number of arbitrary characters.

But there are hundreds of bots, not only those 3. You would have to address each of them as a separate User-agent, otherwise it will not work. And when writing instructions for specific user-agents, you need to understand each bot's particular implementation.
Just an example: when Google uses a '?' in its robots.txt, that is not a wildcard at all; it matches a literal question mark in the URL.
Check out the Google bot documentation here: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449 -> see: "Manually create a robots.txt file"

There is also a test tool for robots.txt files on that Google page. Use it, and you will see that

User-agent: *
Disallow: /clients/

is by far the simplest and most complete solution.
 
LVL 3

Accepted Solution

by:
DooDah earned 1000 total points
ID: 26173087


How do users FIND your site? Google, Yahoo and MSN?

When Google, Yahoo and MSN index your site, how do you want them to proceed?

We can go with the HAIL-MARY variation of my original post, just to make sure every bot that cares to listen gets the message (not all do); for the honest operators, put it all in there:

User-agent: *
Disallow: /clients
Disallow: /clients/
Disallow: /clients/*
Disallow: /clients/*?

You're really ONLY talking to the search engine bots that YOUR users use to find your site... Google, Yahoo, MSN, and the others in decreasing order of popularity...

IMPORTANT: robots.txt will NOT SECURE your site... so if you are trying to block malicious robots, you're just telling them exactly where to look.

 
LVL 35

Assisted Solution

by:torimar
torimar earned 1000 total points
ID: 26173244
> Disallow: /clients/
This rule blocks the folder /clients and everything inside it. For Google, MSN, Yahoo *and* for all the other bots.

Hence the rules:
> Disallow: /clients/*
> Disallow: /clients/*?
are simply superfluous.

And the rule:
> Disallow: /clients
is a plain prefix match: besides the folder, it would also block any other path that merely starts with "/clients", such as a file called "clients" in the web root - which is not what the asker intends.
