Solved

remove https from robots.txt file

Posted on 2010-01-11
Medium Priority
417 Views
Last Modified: 2013-12-24
From my robots.txt file, I need to prevent crawlers from indexing all pages that are https.  How would I code that?  If it means anything, I'm using coldfusion 8.
Question by:COwebmaster
13 Comments
 
Expert Comment by:Jon500 (LVL 8), ID: 26286513
The easiest way is to ensure that your http and https root folders have their own copy of robots.txt.
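For example, the https copy could simply deny everything; a minimal sketch, assuming the https site is served from its own document root:

User-agent: *
Disallow: /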

Do your https pages have their own root folder or web server?

Regards,
Jon

Expert Comment by:xpert13 (LVL 2), ID: 26286571
I don't think you can do this with robots.txt alone, but you can use .htaccess for it.

Something like this should work:

# Return 403 Forbidden to known crawlers when the request arrives on the
# SSL port (mod_rewrite rules never see the "https://" scheme in the URL,
# so the check is done against the server port instead).
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|spider|Twiceler|heritrix|\
Combine|appie|boitho|e-SocietyRobot|Exabot|Nutch|OmniExplorer|\
MJ12bot|ZyBorg/1|Ask\ Jeeves|AskJeeves|ActiveTouristBot|\
JemmaTheTourist|agadine3|BecomeBot|Clustered-Search-Bot|\
MSIECrawler|freefind|galaxy|genieknows|INGRID|grub-client|\
MojeekBot|NaverBot|NetNose-Crawler|OnetSzukaj|PrassoSunner|\
Asterias\ Crawler|T-H-U-N-D-E-R-S-T-O-N-E|GeorgeTheTouristBot|\
VoilaBot|Vagabondo|fantomBrowser|stealthBrowser|cloakBrowser|\
fantomCrew\ Browser|Girafabot|Indy\ Library|Intelliseek|Zealbot|\
Windows\ 95|^Mozilla/4\.05\ \[en\]$|^Mozilla/4\.0$) [NC]
RewriteRule ^ - [F]

Author Comment by:COwebmaster, ID: 26287383
xpert13, what does your code do exactly?

Jon500, there is no separate root folder or web folder.

Expert Comment by:xpert13 (LVL 2), ID: 26287428
It should deny all search bots access to https links.

But note that I didn't test it.

Author Comment by:COwebmaster, ID: 26287541
What if I do the following:

add to robots.txt:
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

In robots_ssl.txt, add:
User-agent: *
Disallow: /

Should the above work?  Also, will all major search engines crawl the robots file though?

Accepted Solution by:xpert13 (LVL 2, earned 1336 total points), ID: 26287627
Those lines:
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

need to be added to .htaccess (not to robots.txt), and then it should work fine.
You can test it by requesting https://your-site.com/robots.txt
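For reference, the complete .htaccess would look like the sketch below (it assumes mod_rewrite is enabled and that robots_ssl.txt sits next to robots.txt in the web root):

RewriteEngine On
# Any request for robots.txt that arrives on the SSL port (443) is
# answered with the restrictive robots_ssl.txt instead.
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]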

Expert Comment by:xpert13 (LVL 2), ID: 26287660
"Also, will all major search engines crawl the robots file though?"
All search engines read robots.txt, but the file is a recommendation, not a rule.


Author Comment by:COwebmaster, ID: 26287669
I saw it here: http://www.webmasterworld.com/google/3876287.htm (look at key_master's posts).

So I'm assuming that when robots.txt is requested over https, it serves robots_ssl.txt instead?

Obviously, I want all my other normal http pages to be crawled, just not the https pages. Will this still work then?

Assisted Solution by:xpert13 (LVL 2, earned 1336 total points), ID: 26287735
Yes. All https pages are served on port 443.

---

In .htaccess:
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

This rule means that any request for robots.txt that arrives on port 443 is served robots_ssl.txt instead.
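Since you're on ColdFusion 8, you could verify both variants from CF itself; a quick sketch, with your-site.com standing in for your own domain:

<!--- Fetch robots.txt over http and https and show what each one returns --->
<cfhttp url="http://your-site.com/robots.txt" method="get" result="plainRobots">
<cfhttp url="https://your-site.com/robots.txt" method="get" result="sslRobots">
<cfoutput>
	<h3>http version</h3><pre>#plainRobots.FileContent#</pre>
	<h3>https version</h3><pre>#sslRobots.FileContent#</pre>
</cfoutput>

The https version should come back with "Disallow: /" while the http version returns your normal rules.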

Assisted Solution by:SidFishes (LVL 36, earned 664 total points), ID: 26288468
The rewrite might work for you, but I wouldn't count on robots.txt; as noted above, it's only a "suggestion".

The only way to properly protect secure pages is with some kind of login/session tracking, i.e. a spider can't crawl pages that require it to be logged in.

and you can also use something like this:

<cfif cgi.SERVER_PORT NEQ "443">
	<!--- Not on the SSL port: deny the request or send it to the secure URL --->
	<cflocation url="https://#cgi.SERVER_NAME##cgi.SCRIPT_NAME#" addtoken="no">
<cfelse>
	<!--- Already on 443: let the request through --->
</cfif>
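A minimal sketch of the login check SidFishes describes (the session.isLoggedIn flag and login.cfm are assumed names, not from the original post, and sessionManagement is assumed to be enabled in Application.cfm):

<!--- At the top of each protected page: bounce anyone who isn't logged in --->
<cfif NOT StructKeyExists(session, "isLoggedIn") OR NOT session.isLoggedIn>
	<cflocation url="/login.cfm" addtoken="no">
</cfif>

Since a crawler never logs in, it can never reach (or index) pages protected this way.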

Author Comment by:COwebmaster, ID: 26288864
The problem is that Bing.com indexed a secured version of my site that doesn't exist; in other words, the site exists on http but not on https. Why would they do that? I'm trying to find preventive measures so that it doesn't happen in the future. I don't want to buy another cert, just prevent all search engines from indexing the https pages.

Author Comment by:COwebmaster, ID: 26294391
One last question on this: if I block crawlers from indexing https pages, yet I already have an existing https page with a Google PageRank of 4, will that page get dinged by Google and not rank as high?

Author Closing Comment by:COwebmaster, ID: 31675684
Thanks!