Could you help me write a custom robots.txt file?

I want to
allow
index.php
/ (main page)
about.php
privacy.php
contact.php


and disallow all the other folders.

How could I write a robots.txt file that does this?
Asked by: rgb192

Gary Commented:
User-agent: *
Disallow: /
Allow: /index.php
Allow: /about.php
Allow: /privacy.php
Allow: /contact.php

Ray Paseur Commented:
Reference data here: http://www.robotstxt.org/

You know that only well-behaved spiders will follow robots.txt, right? Any rogue web crawler will still be able to get to anything that is linked. That covers anything linked from within your web site and anything on your site that is linked from external sources, such as the Wayback Machine.

rgb192 (Author) Commented:
Disallow: /
Allow: /index.php
Allow: /about.php
Allow: /privacy.php
Allow: /contact.php


But how do I allow the main page? It is index.php, but the URL is
website.com/


I want this disallow to block all future folders


>>You know that only well-behaved spiders will follow robots.txt, right?
My client only cares about Bing and Google.

Gary Commented:
The domain name itself is ignored in the robots.txt - it is irrelevant.
The example I posted above allows index.php
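
One detail worth spelling out: crawlers match rules against the URL path, so website.com/ (path /) and website.com/index.php are two different URLs, and Allow: /index.php does not cover the bare /. RFC 9309 and Google's documentation both describe a `$` end-of-path anchor, so an extra `Allow: /$` line should cover the home page alone. That line is an addition for illustration, not something posted in this thread. The Python sketch below is a simplified model of the "longest matching rule wins" precedence, not any crawler's real parser; `RULES`, `pattern_to_regex` and `is_allowed` are made-up names.

```python
import re

# Rules from the thread, plus "Allow: /$" (an assumption added here).
RULES = [
    ("disallow", "/"),
    ("allow", "/$"),
    ("allow", "/index.php"),
    ("allow", "/about.php"),
    ("allow", "/privacy.php"),
    ("allow", "/contact.php"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex anchored at the start.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if `path` may be crawled under longest-match precedence."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            # The longest pattern wins; on a tie, prefer the Allow rule.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


without_home = [rule for rule in RULES if rule[1] != "/$"]

for path in ("/", "/index.php", "/about.php", "/future-October/"):
    print(path, is_allowed(path, RULES), is_allowed(path, without_home))
```

The second printed column shows what happens without the `Allow: /$` line: under this model the bare / is then caught by Disallow: /, while future folders are blocked either way.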

rgb192 (Author) Commented:
>>The domain name itself is ignored in the robots.txt - it is irrelevant.
>>The example I posted above allows index.php

okay I understand

but what about future folders

website.com/future-October

website.com/future-November

website.com/future-December

I want these files/folders blocked

Gary Commented:
The example I posted above will block everything except the specific files that are listed as allowed. So whatever files or folders you add later, they will be blocked.
But as Ray has pointed out, bad robots will ignore your robots.txt file. They don't care about it because in general they are trying to copy your site or hijack your content.

rgb192 (Author) Commented:
disallow: / may generate errors in google webmaster tools

Disallow: /
Allow: /index.php
Allow: /about.php
Allow: /privacy.php
Allow: /contact.php
Allow: /robots.txt
Allow: /sitemap.xml

Gary Commented:
Try moving the Disallow so it is the last line.
How many folders/files are you talking about?
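
Line order should matter only to parsers that apply the first matching rule in file order; Google documents that it picks the most specific (longest) matching rule regardless of order. Python's standard-library urllib.robotparser is, as far as I know, a first-match parser, so it is used below as a stand-in for that kind of crawler. A minimal sketch; the bot name and the shortened rule lists are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


DISALLOW_FIRST = """\
User-agent: *
Disallow: /
Allow: /index.php
Allow: /about.php
"""

DISALLOW_LAST = """\
User-agent: *
Allow: /index.php
Allow: /about.php
Disallow: /
"""

for label, text in (("Disallow first", DISALLOW_FIRST),
                    ("Disallow last", DISALLOW_LAST)):
    parser = build_parser(text)
    for path in ("/index.php", "/future-October/"):
        print(label, path, parser.can_fetch("ExampleBot", path))
```

With the Disallow on the last line, the Allow lines are reached first, so both kinds of parser should agree that /index.php is crawlable and /future-October/ is not.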

rgb192 (Author) Commented:
I moved it to the last line.
I will wait a few minutes.

There are many files that should be blocked and many test folders that the client adds that I do not know about.  And future files / folders

Gary Commented:
If not then try it this way - I thought Google understood Allow

Disallow: /~/index.php
Disallow: /~/about.php
Disallow: /~/privacy.php
Disallow: /~/contact.php
Disallow: /~/robots.txt
Disallow: /~/sitemap.xml

Bernard S. (CTO) Commented:
I'd like to reinforce Ray's caveat: rogue crawlers will not respect robots.txt disallows.

To which you said "my client cares just about Google and Bing"

Sure.
However, if you expect to hide things or to block access to them... remember they are in fact still open

Seems you need to educate your client so that s/he understands that you cannot block access to files just with robots.txt, and that if they add files these might get indexed

A last note: be aware that even well-behaved crawlers might index a disallowed file on some occasions, for example when other pages link to it.

rgb192 (Author) Commented:

>>If not then try it this way - I thought Google understood Allow

>>Disallow: /~/index.php
>>Disallow: /~/about.php
>>Disallow: /~/privacy.php
>>Disallow: /~/contact.php
>>Disallow: /~/robots.txt
>>Disallow: /~/sitemap.xml

I do not understand

I want to allow index, about, privacy, robots, sitemap

and I do not know the folder_names and file_names of future folders


Do you think taking out the disallow will work?

Allow: /index.php
Allow: /about.php
Allow: /privacy.php
Allow: /contact.php
Allow: /robots.txt
Allow: /sitemap.xml

Note that there is no Disallow: / here, so it is a robots.txt with only Allows.

Gary Commented:
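The ~ was meant to exclude the file from the disallow, but robots.txt gives ~ no special meaning, so those lines only block the literal /~/ paths.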
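
For what it's worth, neither the original robots.txt convention nor RFC 9309 gives `~` any special meaning (the only pattern characters are `*` and `$`), so a line such as Disallow: /~/index.php should simply match paths that begin with the literal text /~/index.php. A quick check with Python's standard urllib.robotparser, used here only as a stand-in for a real crawler; the bot name and the shortened rule list are made up.

```python
from urllib.robotparser import RobotFileParser

TILDE_RULES = """\
User-agent: *
Disallow: /~/index.php
Disallow: /~/about.php
"""

parser = RobotFileParser()
parser.parse(TILDE_RULES.splitlines())  # parse from a string, no network access

for path in ("/index.php", "/future-October/", "/~/index.php"):
    print(path, parser.can_fetch("ExampleBot", path))
```

Under this reading only the literal /~/ paths are blocked; everything else, including folders added later, stays crawlable.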

Bernard S. (CTO) Commented:
Well, if you only have Allows, why take the pain to list the allowed files? With no Disallow, nothing is blocked anyway. (Btw, Allow was not in the original standard, although Google and Bing both honor it.)

If you want to allow everything, either create an empty robots.txt file or one that contains:
User-agent: *
Disallow:


If you add specific Disallow lines after that, anything they do not match is still fair game.
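
The point about Allow-only files can be checked with a few lines of code: with no Disallow line there is nothing for the Allows to override, so every path stays crawlable, just as with an empty Disallow. A small sketch using Python's standard urllib.robotparser as a stand-in crawler; `can_crawl`, the bot name and the paths are illustrative.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


ALLOW_ONLY = """\
User-agent: *
Allow: /index.php
Allow: /about.php
"""

EMPTY_DISALLOW = """\
User-agent: *
Disallow:
"""

for name, text in (("allow-only", ALLOW_ONLY),
                   ("empty disallow", EMPTY_DISALLOW)):
    for path in ("/index.php", "/future-October/secret.php"):
        print(name, path, can_crawl(text, path))
```

Both files leave the future folder open, which is why the Disallow: / line is needed for the original goal.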

rgb192 (Author) Commented:
maybe the
The ~ excludes the file from the disallow.

It is difficult to test.

So the answer would be
'disallow ~' could be 'allow'

thanks

Bernard S. (CTO) Commented:
B-) glad we could help. Thx for the grade and points.

One last note: Google will look for robots.txt every time, unless a copy is already in its own cache. So always provide a robots.txt file, even an empty one, so that it finds it fast.