12. How to prevent my website from being indexed by Baidu?
http://help.baidu.com/question?prod_en=master&class=Baiduspider&id=1000973
Baidu strictly complies with the robots.txt protocol. For details, see http://www.robotstxt.org/.
You can use robots.txt to prevent all of your pages, or only some of them, from being crawled and indexed by Baidu. For the specific method, refer to How to Write Robots.txt.
If you set robots.txt to restrict crawling after Baidu has already indexed your website, it usually takes about 48 hours for the updated robots.txt to take effect, after which new pages will no longer be indexed. Note that it may take several months for content that Baidu indexed before the robots.txt restriction to disappear from search results.
If you urgently need to restrict crawling, you can report it at http://webmaster.baidu.com/feedback/index and we will deal with it as soon as possible.
14. I have set robots.txt to restrict Baidu's crawling, so why doesn't it take effect?
Baidu strictly complies with the robots.txt protocol, but our DNS updates only periodically, so after you set robots.txt, Baidu needs some time to stop crawling your site.
If you urgently need to restrict crawling, you can report it at http://webmaster.baidu.com/feedback/index.
Also, check that your robots.txt is correctly formatted.
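For reference, a well-formed robots.txt pairs each User-agent line with its own rules, one directive per line (the /private/ path below is purely illustrative):
Code:
# Comments start with '#'; each group begins with a User-agent line
User-agent: Baiduspider
Disallow: /

# A second group; paths must begin with '/'
User-agent: *
Disallow: /private/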
You should be able to block all of them by adding the following to your .htaccess file (from https://forums.powweb.com/showpost.php?s=d6cae04dddea865d9ad15fd555f32913&p=493285&postcount=3):
Code:
RewriteEngine On
# Return 403 Forbidden when the User-Agent begins with "Baiduspider"
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
RewriteRule .* - [F]
Or
Code:
RewriteEngine On
# Redirect any Baidu request to robots.txt (excluding robots.txt itself, to avoid a loop)
RewriteCond %{HTTP_USER_AGENT} (baidu) [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://yoursite.com/robots.txt [R=302,L]
You can add other "rogue" bots and spiders to the list by using the "pipe" character between the names. Code:
RewriteCond %{HTTP_USER_AGENT} (Exabot|baidu|siclab|SIBot|SearchmetricsBot) [NC]
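That condition only takes effect when paired with a rule; a minimal sketch of the complete block, using the same bot list:
Code:
RewriteEngine On
# Return 403 Forbidden to any bot in the pipe-separated list
RewriteCond %{HTTP_USER_AGENT} (Exabot|baidu|siclab|SIBot|SearchmetricsBot) [NC]
RewriteRule .* - [F]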
# Match on the robot's token ("Baiduspider"), not its full HTTP User-Agent string,
# which is Baiduspider+(+http://www.baidu.com/search/spider.htm)
User-agent: Baiduspider
Disallow: /
Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions and crawl the entire site. Also note that:
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
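As an illustration, using the hosts from the paragraph above, each origin must serve its own file at its own root:
Code:
# Served at http://example.com/robots.txt (applies only to example.com)
User-agent: Baiduspider
Disallow: /

# A separate file served at http://a.example.com/robots.txt (applies only to a.example.com)
User-agent: Baiduspider
Disallow: /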
There are a lot of bad bots that crawl web pages to gather sensitive information written on them; these crawlers don't respect a site's robots.txt file and are a security risk (see http://www.puntapirata.com/ModSec-Rules.php).
A lot of sites on the web recommend using .htaccess to block these bots, but that only protects one directory or site, so a server-wide rule is far better, since it blocks bad bots across the whole server.
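A minimal sketch of the server-wide approach, placed in httpd.conf rather than .htaccess (Apache 2.2-style access control, matching the directives used later in this thread):
Code:
# Tag any request whose User-Agent contains "baidu" (case-insensitive)
BrowserMatchNoCase baidu bad_bot
<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>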
From http://www.inmotionhosting.com/support/website/server-usage/identify-and-block-bad-robots-from-website:
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine On
# Answer any request whose User-Agent contains "baidu" with the 503 defined above
RewriteCond %{HTTP_USER_AGENT} (baidu) [NC]
RewriteRule .* - [R=503,L]
I do not have the ability to restart Apache, as I do not personally host my website.
From the Apache documentation: Most commonly, the problem is that AllowOverride is not set such that your configuration directives are being honored. Make sure that you don't have an AllowOverride None in effect for the file scope in question. A good test for this is to put garbage in your .htaccess file and reload; if a server error is not generated, then you almost certainly have AllowOverride None in effect. If possible, check your httpd.conf file for "AllowOverride" and make sure it is set to All.
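A concrete version of that test (the directive name below is deliberately invalid):
Code:
# Put this line alone in .htaccess and reload the page.
# A "500 Internal Server Error" means .htaccess is being read;
# no error means AllowOverride None is in effect. Remove the line afterwards.
ThisIsNotARealDirective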
On some servers, Apache is configured to ignore some or all directives in .htaccess files for security reasons. The AllowOverride directive controls which features are allowed in .htaccess files; for example, AllowOverride None turns off .htaccess files for a folder and its subfolders. Also verify that mod_rewrite is installed and enabled. If there is another .htaccess file in another directory in the path to your web page, that file may be overriding the settings in the one you're looking at.
Check your Apache configuration file to see which AllowOverride directive applies to the directory containing the problem .htaccess file.
If you're not sure which configuration file to look in, start with the main Apache configuration file, httpd.conf or apache2.conf. If your website is configured in a file included by httpd.conf (e.g. a virtual hosts configuration file), you will need to look in that file.
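A sketch of what to look for (the /var/www/html path is only a typical example; substitute your site's document root):
Code:
# httpd.conf: permit .htaccess overrides, including mod_rewrite directives, under the docroot
<Directory "/var/www/html">
    AllowOverride All
</Directory>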
.htaccess
RewriteEngine On
# Block requests referred from baidu.* or sent by any Baiduspider version
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?baidu\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]
Or
.htaccess
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
RewriteRule .* - [F]
Or ban the IPs:
order allow,deny
deny from .baidu.com
deny from 203.125.234.
deny from 220.181.7.
deny from 123.125.66.
deny from 123.125.71.
deny from 119.63.192.
deny from 119.63.193.
deny from 119.63.194.
deny from 119.63.195.
deny from 119.63.196.
deny from 119.63.197.
deny from 119.63.198.
deny from 119.63.199.
deny from 180.76.5.
deny from 202.108.249.185
deny from 202.108.249.177
deny from 202.108.249.182
deny from 202.108.249.184
deny from 202.108.249.189
deny from 61.135.146.200
deny from 61.135.145.221
deny from 61.135.145.207
deny from 202.108.250.196
deny from 68.170.119.76
deny from 207.46.199.52
allow from all
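Note that order/deny/allow is Apache 2.2 syntax. On Apache 2.4 the equivalent uses mod_authz_host's Require directives; a minimal sketch carrying over two entries from the list above:
Code:
<RequireAll>
    Require all granted
    Require not host baidu.com
    Require not ip 180.76.5
</RequireAll>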
Or use robots.txt:
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-image
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-news
Disallow: /
User-agent: Baiduspider-favo
Disallow: /
User-agent: Baiduspider-cpro
Disallow: /
User-agent: Baiduspider-ads
Disallow: /
User-agent: Baidu
Disallow: /