Mike Waller (United States of America) asked:

remove https from robots.txt file

From my robots.txt file, I need to prevent crawlers from indexing all pages that are served over https.  How would I code that?  If it means anything, I'm using ColdFusion 8.
Jon500 (Brazil):

The easiest way is to ensure that your http and https root folders have their own copy of robots.txt.

Do your https pages have their own root folder or web server?
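
For example (just a sketch -- the paths are made up and the SSL certificate directives are omitted), separate document roots would let each protocol serve its own copy of robots.txt:

# httpd.conf: one DocumentRoot per protocol, each with its own robots.txt
<VirtualHost *:80>
    DocumentRoot /var/www/http_root
</VirtualHost>
<VirtualHost *:443>
    SSLEngine on
    DocumentRoot /var/www/https_root
</VirtualHost>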

Regards,
Jon
xpert13:
I don't think you can do this with robots.txt alone, but you can do it with .htaccess.

This code should work:

# Forbid known crawlers from fetching anything over https.
# Note: RewriteRule only ever sees the URL path, never the scheme,
# so the https check has to be a RewriteCond on %{HTTPS}.
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|spider|Twiceler|heritrix|\
Combine|appie|boitho|e-SocietyRobot|Exabot|Nutch|OmniExplorer|\
MJ12bot|ZyBorg/1|Ask\ Jeeves|AskJeeves|ActiveTouristBot|\
JemmaTheTourist|agadine3|BecomeBot|Clustered-Search-Bot|\
MSIECrawler|freefind|galaxy|genieknows|INGRID|grub-client|\
MojeekBot|NaverBot|NetNose-Crawler|OnetSzukaj|PrassoSunner|\
Asterias\ Crawler|T-H-U-N-D-E-R-S-T-O-N-E|GeorgeTheTouristBot|\
VoilaBot|Vagabondo|fantomBrowser|stealthBrowser|cloakBrowser|\
fantomCrew\ Browser|Girafabot|Indy\ Library|Intelliseek|Zealbot|\
Windows\ 95|^Mozilla/4\.05\ \[en\]$|^Mozilla/4\.0$) [NC]
RewriteRule .* - [F]


Mike Waller (Asker):

xpert13, what does your code do exactly?

Jon500, there is no separate root folder or web folder.

xpert13:

It denies all search bots access to https links. But I didn't test it.

Mike Waller (Asker):

What if I do the following:

add to .htaccess:

# Serve robots_ssl.txt whenever robots.txt is requested over SSL (port 443)
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

In robots_ssl.txt, add:
User-agent: *
Disallow: /
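
If that's right, I suppose I could verify it by fetching the file over both protocols, something like this (example.com standing in for my real domain):

curl http://example.com/robots.txt
curl -k https://example.com/robots.txt

The second request should come back with the block-everything rules from robots_ssl.txt.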

Should the above work?  Also, will all major search engines crawl the robots file though?
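
Alternatively, since I'm on ColdFusion 8, could I serve the rules from a template instead? A rough sketch (robots.cfm is a made-up name, and it would still need a rewrite or web-server mapping so that /robots.txt actually reaches it):

<!--- robots.cfm: return protocol-specific robots rules --->
<cfif cgi.server_port_secure>
	<!--- https request: block all crawlers --->
	<cfset rules = "User-agent: *#chr(10)#Disallow: /">
<cfelse>
	<!--- plain http: allow everything --->
	<cfset rules = "User-agent: *#chr(10)#Disallow:">
</cfif>
<cfcontent type="text/plain" reset="true"><cfoutput>#rules#</cfoutput>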
ASKER CERTIFIED SOLUTION

xpert13:

[solution text available to Experts Exchange members only]
"Also, will all major search engines crawl the robots file though?"
All search engines read robots.txt, but the file is treated as a recommendation, not a rule.
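
If you want something stronger than a recommendation, one option is an X-Robots-Tag response header on https pages, which the major engines treat as a noindex directive. A sketch, assuming Apache with mod_headers enabled (I can't verify your setup):

# Flag https requests, then mark their responses noindex
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteRule .* - [E=HTTPS_ON:1]
Header set X-Robots-Tag "noindex, nofollow" env=HTTPS_ON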


Mike Waller (Asker):

I saw it here: http://www.webmasterworld.com/google/3876287.htm (look at key_master's post).

So I'm assuming what it does is: when robots.txt is requested over https, it serves robots_ssl.txt instead?

Obviously, I want all my other normal http pages to be crawled, just not the https pages. Will this still work then?
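
For reference, my plain-http copy of robots.txt would stay permissive, something like:

User-agent: *
Disallow: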

SOLUTION

[solution text available to Experts Exchange members only]
SOLUTION

SidFishes (Canada):

[solution text available to Experts Exchange members only]
Mike Waller (Asker):

The problem is that Bing.com indexed a secured version of my site that does not exist; in other words, the site exists on http but not on https. Why would they do that? I'm trying to find preventive measures so it doesn't happen in the future. I don't want to buy another cert; I just want to prevent all search engines from indexing the https pages.

One last question on this: if I block crawlers from indexing the https pages, but I already have an existing page with a Google PageRank of 4 that is currently on https, will that page get dinged by Google and not rank as high?
Thanks!