Mike Waller
asked on
remove https from robots.txt file
From my robots.txt file, I need to prevent crawlers from indexing all pages served over https. How would I code that? If it means anything, I'm using ColdFusion 8.
I don't think you can do that with robots.txt alone, but you can use .htaccess for this.
This code should work:
RewriteEngine On
# Only act on requests that arrive over SSL.
RewriteCond %{HTTPS} on
# Known crawler user agents; in .htaccess this pattern must stay on a single line.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|spider|Twiceler|heritrix|Combine|appie|boitho|e-SocietyRobot|Exabot|Nutch|OmniExplorer|MJ12bot|ZyBorg/1|Ask\ Jeeves|AskJeeves|ActiveTouristBot|JemmaTheTourist|agadine3|BecomeBot|Clustered-Search-Bot|MSIECrawler|freefind|galaxy|genieknows|INGRID|grub-client|MojeekBot|NaverBot|NetNose-Crawler|OnetSzukaj|PrassoSunner|Asterias\ Crawler|T-H-U-N-D-E-R-S-T-O-N-E|GeorgeTheTouristBot|VoilaBot|Vagabondo|fantomBrowser|stealthBrowser|cloakBrowser|fantomCrew\ Browser|Girafabot|Indy\ Library|Intelliseek|Zealbot|Windows\ 95|^Mozilla/4\.05\ \[en\]$|^Mozilla/4\.0$) [NC]
# Refuse the request with 403 Forbidden.
RewriteRule .* - [F]
ASKER
xpert13, what does your code do exactly?
Jon500, there is no separate root folder or web folder.
It should deny all search bots access to https URLs, but I haven't tested it.
ASKER
what if I do the following..
add to .htaccess:
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]
In robots_ssl.txt, add:
User-agent: *
Disallow: /
Should the above work? Also, will all major search engines crawl the robots file though?
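Putting the pieces together, a minimal .htaccess sketch of that approach (assuming Apache with mod_rewrite enabled and a robots_ssl.txt file sitting in the web root; example.com is a placeholder) would be:

RewriteEngine On
# When the request comes in on the SSL port, serve the blocking
# robots file instead of the normal one.
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

With this in place, a crawler fetching https://example.com/robots.txt receives the contents of robots_ssl.txt (which disallows everything), while a fetch of http://example.com/robots.txt still returns the regular robots.txt.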
ASKER CERTIFIED SOLUTION
"Also, will all major search engines crawl the robots file though?"
All major search engines read robots.txt, but the file is only a recommendation, not an enforceable rule.
ASKER
I saw it here.. http://www.webmasterworld.com/google/3876287.htm (look at key_master).
So I'm assuming what it does is: if robots.txt is requested over https, it serves robots_ssl.txt instead?
Obviously, I want all my other normal http pages to be crawled, just not the https pages. Will this still work then?
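Yes — the http and https sites each get their own robots rules, so blocking one doesn't affect the other. The behaviour of the two robots files can be sanity-checked with Python's standard urllib.robotparser (file contents mirror the thread; the URLs and a fully permissive http robots.txt are assumptions for the example):

```python
from urllib.robotparser import RobotFileParser

# robots_ssl.txt as proposed in the thread: block everything.
ssl_rules = RobotFileParser()
ssl_rules.parse(["User-agent: *", "Disallow: /"])

# An assumed permissive robots.txt for the plain-http site.
http_rules = RobotFileParser()
http_rules.parse(["User-agent: *", "Disallow:"])

# Crawlers fetching robots.txt over https are blocked everywhere...
print(ssl_rules.can_fetch("Googlebot", "https://example.com/page.cfm"))   # False
# ...while the same page over http stays crawlable.
print(http_rules.can_fetch("Googlebot", "http://example.com/page.cfm"))   # True
```

So as long as the rewrite only swaps the robots file on port 443, the http pages keep their normal crawl rules.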
ASKER
The problem is that Bing.com indexed a secured version of a site of mine that does not exist; in other words, the site exists on http but not https. Why would they do that? I'm trying to find preventive measures so it doesn't happen in the future. I don't want to buy another cert, just prevent all search engines from indexing https pages.
ASKER
One last question on this: if I block crawlers from indexing https pages, but I already have an existing https page with a Google PageRank of 4, will that page get dinged by Google and not rank as high?
ASKER
Thanks!
Do your https pages have their own root folder or web server?
Regards,
Jon