Mike Waller (United States of America) asked:

remove https from robots.txt file

From my robots.txt file, I need to prevent crawlers from indexing all pages that are served over https.  How would I code that?  If it means anything, I'm using ColdFusion 8.
Jon500 (Brazil):

The easiest way is to ensure that your http and https root folders have their own copy of robots.txt.

Do your https pages have their own root folder or web server?
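
For example (just a sketch -- the paths are made up and the SSL certificate directives are omitted), separate document roots would let each protocol serve its own copy of robots.txt:

# httpd.conf: one DocumentRoot per protocol, each with its own robots.txt
<VirtualHost *:80>
    DocumentRoot /var/www/http_root
</VirtualHost>
<VirtualHost *:443>
    SSLEngine on
    DocumentRoot /var/www/https_root
</VirtualHost>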

Regards,
Jon
xpert13:
I don't think you can do this with robots.txt alone, but you can do it with .htaccess.

This code should work:

# Forbid known crawlers from fetching anything over https.
# Note: RewriteRule only ever sees the URL path, never the scheme,
# so the https check has to be a RewriteCond on %{HTTPS}.
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|spider|Twiceler|heritrix|\
Combine|appie|boitho|e-SocietyRobot|Exabot|Nutch|OmniExplorer|\
MJ12bot|ZyBorg/1|Ask\ Jeeves|AskJeeves|ActiveTouristBot|\
JemmaTheTourist|agadine3|BecomeBot|Clustered-Search-Bot|\
MSIECrawler|freefind|galaxy|genieknows|INGRID|grub-client|\
MojeekBot|NaverBot|NetNose-Crawler|OnetSzukaj|PrassoSunner|\
Asterias\ Crawler|T-H-U-N-D-E-R-S-T-O-N-E|GeorgeTheTouristBot|\
VoilaBot|Vagabondo|fantomBrowser|stealthBrowser|cloakBrowser|\
fantomCrew\ Browser|Girafabot|Indy\ Library|Intelliseek|Zealbot|\
Windows\ 95|^Mozilla/4\.05\ \[en\]$|^Mozilla/4\.0$) [NC]
RewriteRule .* - [F]


Mike Waller (Asker):

xpert13, what does your code do exactly?

Jon500, there is no separate root folder or web folder.

xpert13:

It denies all search bots access to https links. But I didn't test it.

Mike Waller (Asker):

What if I do the following:

add to .htaccess:

# Serve robots_ssl.txt whenever robots.txt is requested over SSL (port 443)
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

In robots_ssl.txt, add:
User-agent: *
Disallow: /
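
If that's right, I suppose I could verify it by fetching the file over both protocols, something like this (example.com standing in for my real domain):

curl http://example.com/robots.txt
curl -k https://example.com/robots.txt

The second request should come back with the block-everything rules from robots_ssl.txt.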

Should the above work?  Also, will all major search engines crawl the robots file though?
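
Alternatively, since I'm on ColdFusion 8, could I serve the rules from a template instead? A rough sketch (robots.cfm is a made-up name, and it would still need a rewrite or web-server mapping so that /robots.txt actually reaches it):

<!--- robots.cfm: return protocol-specific robots rules --->
<cfif cgi.server_port_secure>
	<!--- https request: block all crawlers --->
	<cfset rules = "User-agent: *#chr(10)#Disallow: /">
<cfelse>
	<!--- plain http: allow everything --->
	<cfset rules = "User-agent: *#chr(10)#Disallow:">
</cfif>
<cfcontent type="text/plain" reset="true"><cfoutput>#rules#</cfoutput>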
ASKER CERTIFIED SOLUTION

xpert13:

[solution text available to Experts Exchange members only]
"Also, will all major search engines crawl the robots file though?"
All search engines read robots.txt, but the file is treated as a recommendation, not a rule.
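
If you want something stronger than a recommendation, one option is an X-Robots-Tag response header on https pages, which the major engines treat as a noindex directive. A sketch, assuming Apache with mod_headers enabled (I can't verify your setup):

# Flag https requests, then mark their responses noindex
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteRule .* - [E=HTTPS_ON:1]
Header set X-Robots-Tag "noindex, nofollow" env=HTTPS_ON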


Mike Waller (Asker):

I saw it here: http://www.webmasterworld.com/google/3876287.htm (look at key_master's post).

So I'm assuming what it does is: when robots.txt is requested over https, it serves robots_ssl.txt instead?

Obviously, I want all my other normal http pages to be crawled, just not the https pages. Will this still work then?
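
For reference, my plain-http copy of robots.txt would stay permissive, something like:

User-agent: *
Disallow: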

SOLUTION

[solution text available to Experts Exchange members only]
SOLUTION

SidFishes (Canada):

[solution text available to Experts Exchange members only]
Mike Waller (Asker):

The problem is that Bing.com indexed a secured version of my site that does not exist; in other words, the site exists on http but not on https. Why would they do that? I'm trying to find preventive measures so it doesn't happen in the future. I don't want to buy another cert; I just want to prevent all search engines from indexing the https pages.

One last question on this: if I block crawlers from indexing the https pages, but I already have an existing page with a Google PageRank of 4 that is currently on https, will that page get dinged by Google and not rank as high?
Thanks!