Blocking baiduspider bot with .htaccess

My site is crawled by baiduspider very frequently, and I would like to stop it from using bandwidth.  However, the exact IP of the bot varies from hour to hour.  Could you provide code for disallowing this, using .htaccess?
ddantes Asked:
Insoftservice Commented:
There are a few methods; please try the following.

.htaccess
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?baidu\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baidu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]

Or:
.htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]

Or ban the IPs:

order allow,deny
deny from .baidu.com
deny from 203.125.234.
deny from 220.181.7.
deny from 123.125.66.
deny from 123.125.71.
deny from 119.63.192.
deny from 119.63.193.
deny from 119.63.194.
deny from 119.63.195.
deny from 119.63.196.
deny from 119.63.197.
deny from 119.63.198
deny from 119.63.199.
deny from 180.76.5.
deny from 202.108.249.185
deny from 202.108.249.177
deny from 202.108.249.182
deny from 202.108.249.184
deny from 202.108.249.189
deny from 61.135.146.200
deny from 61.135.145.221
deny from 61.135.145.207
deny from 202.108.250.196
deny from 68.170.119.76
deny from 207.46.199.52
allow from all
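Since the exact IP varies from hour to hour, you can also deny whole network ranges in one line instead of listing prefixes individually. A rough sketch in the same style (the /16 and /21 ranges here are assumptions that generalise the 180.76.x and 119.63.192-199 prefixes listed above):

order allow,deny
# 180.76.0.0/16 covers the 180.76.x.x crawl addresses
deny from 180.76.0.0/16
# 119.63.192.0/21 covers 119.63.192 through 119.63.199
deny from 119.63.192.0/21
allow from all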

Or use robots.txt:
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-image
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-news
Disallow: /
User-agent: Baiduspider-favo
Disallow: /
User-agent: Baiduspider-cpro
Disallow: /
User-agent: Baiduspider-ads
Disallow: /
User-agent: Baidu
Disallow: /
ddantes (Author) Commented:
Thank you for those suggestions.  Please allow me some time to test the solutions and verify that the bot has been blocked.  Then I will post again.
ddantes (Author) Commented:
I've tried all the .htaccess rules you specified, including banning the IPs. This hasn't stopped the bot's access. I also updated robots.txt, but the spider's visits are continuing. Here's a typical excerpt from the server log:

baiduspider-180-76-15-145.crawl.baidu.com - - [14/Aug/2015:23:21:54 -0400] "GET / HTTP/1.1" 301 322 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
baiduspider-180-76-15-22.crawl.baidu.com - - [14/Aug/2015:23:21:55 -0400] "GET / HTTP/1.1" 301 322 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

Any other suggestions?
btan (Exec Consultant) Commented:
They seem to have limitless ISPs and IPs to draw from, so blocking by IP is not going to be effective. As for why robots.txt did not seem to work, see Baidu's FAQ below. Keep in mind that only the real Baidu bots respect robots.txt; a bot can spoof the user agent and simply ignore your robots.txt.

From Baidu's official FAQ:
12. How to prevent my website from being indexed by Baidu?
Baidu strictly complies with robots.txt protocol. For detailed information, please visit http://www.robotstxt.org/.
You can prevent all your pages or parts of them from being crawled and indexed by Baidu with robots.txt. For specific method, please refer to How to Write Robots.txt.
If you set robots.txt to restrict crawling after Baidu has indexed your website, it usually takes 48 hours for the updated robots.txt to take effect and then the new pages won’t be indexed. Note that it may take several months for the contents which have been indexed by Baidu before the restriction of robots.txt to disappear from search results.
If you are in urgent need of restricting crawling, you can report to http://webmaster.baidu.com/feedback/index (arab,thai,Português) and we will deal with it as soon as possible.

14. I have set robots.txt to restrict the crawling of Baidu, but why it doesn’t take effect?
Baidu strictly complies with robots.txt protocol. But our DNS updates periodically. If you have set robots.txt, due to the updating, Baidu needs some time to stop crawling your site.
If you are in urgent need of restricting crawling, you can report to http://webmaster.baidu.com/feedback/index (arab,thai,Português).
Besides, please check whether your robots.txt is correct in format.
http://help.baidu.com/question?prod_en=master&class=Baiduspider&id=1000973

But you may want to try this; you should be able to block all of them by adding the following to your .htaccess file:
Code:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
RewriteRule .* - [F]

Or

Code:
RewriteEngine On
# Send Baidu requests (other than for robots.txt itself) to robots.txt instead of returning 403
RewriteCond %{HTTP_USER_AGENT} (baidu) [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://yoursite.com/robots.txt [R,L]

You can add other "rogue" bots and spiders to the list by using the "pipe" character between the names. Code:
RewriteCond %{HTTP_USER_AGENT} (Exabot|baidu|siclab|SIBot|SearchmetricsBot) [NC]
https://forums.powweb.com/showpost.php?s=d6cae04dddea865d9ad15fd555f32913&p=493285&postcount=3

If you use caching plugins or a CDN, make sure to clear all your caches.
ddantes (Author) Commented:
Thank you for your input.  I have seen the FAQ on baidu.com, but several days have passed since robots.txt was updated, and the spider is still crawling.  I'll try the .htaccess code which you kindly supplied, although it is very close to versions offered in previous comments which were not successful.  Then I'll post again...
David Johnson, CD, MVP (Owner) Commented:
You've already tried the robots.txt approach?
User-agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)
Disallow: /


ddantes (Author) Commented:
I did try robots.txt, without success.  However, there have been no instances of that bot in my server log for four hours, after adding RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]  to .htaccess.   Please allow me until tomorrow to verify that baidu is no longer accessing the site.
ddantes (Author) Commented:
Unfortunately the baidu spider resumed crawling our site after a couple of hours.
btan (Exec Consultant) Commented:
You can try removing it and monitoring the traffic as well, for verification... As mentioned, robots.txt is no surety, since it need not be obeyed, especially by spoofed client agents or bots.
ddantes (Author) Commented:
Sorry, I didn't understand the last comment. What does "try removing it and monitoring the traffic as well, for verification" mean? What is meant by "robots.txt is no surety"? I apologize for my difficulty.
btan (Exec Consultant) Commented:
Remove the RewriteCond and see whether those spider bots still show up in the traffic.

A user agent is under no obligation to obey the Disallow rules stated in robots.txt.
Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the web owner wishes to provide no specific instructions, and crawl the entire site.
Also note that
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
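For illustration, a minimal sketch assuming a hypothetical shop.example.com subdomain - the rules have to be present in each origin's own robots.txt:

# http://example.com/robots.txt
User-agent: Baiduspider
Disallow: /

# http://shop.example.com/robots.txt (must repeat the rules; the parent file is not inherited)
User-agent: Baiduspider
Disallow: /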
Insoftservice Commented:
I have not tried this method. The most effective approach is to use your server's capabilities: add the following rule to your nginx.conf file to block Baidu at the server level.

if ($http_user_agent ~* ^Baiduspider) {
  return 403;
}
ddantes (Author) Commented:
Thank you for your comment.  I have an .htaccess file on the Apache server, but I'm not familiar with nginx.conf.
btan (Exec Consultant) Commented:
Like robots.txt, an .htaccess file applies to a single domain only, and mod_rewrite rules are not inherited by default into other vhosts in your Apache config. To cover your entire web server, we can try using Apache's httpd.conf to block spiders, listing the pertinent User-Agent header fields there.

# This should be changed to whatever you set DocumentRoot to.
#

...
# Substitute/add other User-Agent strings by adding lines with a new "SetEnvIfNoCase"
SetEnvIfNoCase User-Agent "^Baiduspider" bad_bot
# this one is IP-based; we can leave it out for now
# SetEnvIf Remote_Addr "212\.100\.254\.105" bad_bot

<LocationMatch "/">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</LocationMatch>

...
btan (Exec Consultant) Commented:
Remember to restart Apache.

There are also mod_security rules.
A lot of bad bots crawl web pages to gather sensitive information written on them; these crawlers don't respect a site's robots.txt file and are a security risk.

A lot of sites on the web recommend using .htaccess to block these bots, but that only protects one directory or site, so a server-wide mod_security rule is far better.
http://www.puntapirata.com/ModSec-Rules.php
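The linked page carries a full rule set; as a minimal sketch of what a single server-wide rule of this kind might look like (ModSecurity 2.x syntax assumed, and the rule id is an arbitrary placeholder):

# Assumes the mod_security2 module is installed and enabled server-wide
SecRuleEngine On
# Deny any request whose User-Agent contains "baiduspider" (case-insensitive)
SecRule REQUEST_HEADERS:User-Agent "@rx (?i)baiduspider" \
    "id:1000001,phase:1,t:none,deny,status:403,msg:'Blocked Baiduspider user agent'"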
ddantes (Author) Commented:
Thank you for your guidance.  I'm just concerned with this bot visiting the main domain.  However, none of the recommended code lines in .htaccess have stopped this.  I don't believe I have access to my web host's httpd.conf file.
btan (Exec Consultant) Commented:
Which is why the best one-stop means is to use httpd.conf and/or mod_security, as .htaccess is not guaranteed. We can also try putting the .htaccess in the directories and sub-directories under the domain to verify - though it may well still have no effect on blocking.

I am thinking of the below too (from the "Block a bad robot" section of the link, with Google replaced by baidu):
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine On
# Return 503 (temporarily unavailable) so a well-behaved crawler backs off and retries later
RewriteCond %{HTTP_USER_AGENT} (baidu) [NC,OR]
RewriteCond %{REMOTE_ADDR} ^180\.76\.
RewriteRule .* - [R=503,L]
http://www.inmotionhosting.com/support/website/server-usage/identify-and-block-bad-robots-from-website
ddantes (Author) Commented:
Thank you.  I tested the code in your last message...   Unfortunately the bot is still crawling the root domain, where .htaccess resides.
ddantes (Author) Commented:
To recap, I have tried every line of .htaccess code which has been suggested so far. I've added the following code to robots.txt:
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-image
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-news
Disallow: /
User-agent: Baiduspider-favo
Disallow: /
User-agent: Baiduspider-cpro
Disallow: /
User-agent: Baiduspider-ads
Disallow: /
User-agent: Baidu
Disallow: /


I do not have the ability to restart Apache, as I do not personally host my website.
btan (Exec Consultant) Commented:
Actually, a restart is not necessarily needed.
From the Apache documentation: Most commonly, the problem is that AllowOverride is not set such that your configuration directives are being honored. Make sure that you don't have an AllowOverride None in effect for the file scope in question. A good test for this is to put garbage in your .htaccess file and reload. If a server error is not generated, then you almost certainly have AllowOverride None in effect.
If possible, check your httpd.conf file for "AllowOverride" and make sure it is set to All.
On some servers, Apache is configured to ignore some or all directives in .htaccess files. This is for security reasons. The AllowOverride directive controls which features will be allowed in .htaccess files. For example AllowOverride None can turn off htaccess files for a folder and its subfolders.

Check your Apache configuration file for which AllowOverride directive is applied to the directory containing your problem htaccess file.

If you’re not sure which configuration file to look in, start with the main Apache configuration file httpd.conf or apache2.conf. If your website is configured in a file included by httpd.conf (e.g. a virtual hosts configuration file), you will need to look in that file.
Also verify that mod_rewrite is installed and enabled. If there is another .htaccess file in another directory in the path to your webpage, that .htaccess file may be overriding the settings in the one you're looking at.
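A minimal sketch of what that setting looks like in httpd.conf (the directory path is a placeholder for your actual DocumentRoot; Apache 2.2-style access directives used to match the examples above):

<Directory "/var/www/html">
    Options FollowSymLinks
    # Allow .htaccess files under this directory to use all directive types, including mod_rewrite rules
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>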
ddantes (Author) Commented:
Thank you.  Garbage in .htaccess always generates a server error.  mod_rewrite is definitely enabled.  The .htaccess successfully blocks many bad referrers, as well as bad IP addresses.  It just doesn't block baidu.   There is only a single .htaccess, and it is in the root directory.
btan (Exec Consultant) Commented:
Please see the following - someone shared what he did, which is quite similar to your rules and was likewise not effective at first, so he added more:
I then added into the list of bad bots to block in htaccess file

 # Block Bad Bots
 RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]

 and then added to the bottom of the htaccess file

 <Files 403.shtml>
 order allow,deny
 allow from all
 </Files>
 deny from 180.76.0.0/16

 For the time being it seems to have stopped visiting. I dare say it will start again. The only other thing i could add is that it took a few weeks for them to stop.
http://forums.oscommerce.com/topic/382923-baiduspider-using-multiple-user-agents-how-to-stop-them/?p=1618626
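Putting the quoted pieces together, a minimal sketch of what that combined .htaccess might look like (the 403.shtml filename is whatever your host uses for its custom 403 page, and the 180.76.0.0/16 range is the one mentioned in the quote):

# Block Baiduspider by User-Agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]

# Let the custom 403 error page itself be served to denied clients
<Files 403.shtml>
order allow,deny
allow from all
</Files>

# Deny Baidu's crawl range by IP as well
order allow,deny
deny from 180.76.0.0/16
allow from all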

ddantes (Author) Commented:
I came across that article, but was discouraged by the length of time it took to stop the crawling, and by his prediction that it would resume. I'm going to abandon this project for now, but I appreciate all your suggestions!