Solved

What is the best way to counteract spiders, crawlers, and bots on our website?

Posted on 2006-07-13
628 Views
Last Modified: 2010-04-11
Folks,

We're running Windows Small Business Server 2003, and we're having problems with various crawlers sucking up bandwidth (particularly Googlebot, MSNBot, and Yahoo's Inktomisearch).  What are the best ways to counteract their usage?

We've started blocking IP ranges, but that seems to help only a little, and I figure it's not a permanent solution anyways.

We've got robots.txt set properly as well as the Meta tags in the header of each page.

I've read about using traps like a 1 x 1 px transparent bitmap image link to another page that has redirects back into itself with like a 20-second delay.  Is this still a good solution, or have spiders been made smarter?  Any other ways to make bad bots pay for their crimes?
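
As I understand it, the setup would be something like the snippet below (the file names are just placeholders I'm making up), with the trap page also disallowed in robots.txt so the well-behaved spiders never follow it:

<!-- on a normal page: a link no human would ever click, hidden behind a 1 x 1 transparent image -->
<a href="/trap.html"><img src="/images/clear.gif" width="1" height="1" border="0" alt=""></a>

<!-- trap.html: refreshes back into itself every 20 seconds, so a bot that ignores robots.txt keeps re-requesting the same tiny page -->
<meta http-equiv="refresh" content="20; url=/trap.html">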

I'm not the main network person here, but I am his b----, so let me know if I can provide any more information.

--J
Question by:jammerms
8 Comments
 
LVL 9

Expert Comment

by:blandyuk
ID: 17104853
Are you running ASP pages? You could read the "User-Agent" header in the HTTP request. Most spiders include in that header a link to a page describing their crawler, like Google's:

http://www.google.com/bot.html

It would look something like:

User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1, http://www.google.com/bot.html)

Once you have compiled a database of spiders, you can search for them in the header and simply "Response.End()", saving the bandwidth.

Not an easy method but at least you wouldn't have to worry about finding out all the IP ranges they have, which I can imagine is a lot!
 
LVL 1

Expert Comment

by:PugnaciousOne
ID: 17108044
Most spiders (not all) respect the robots.txt file as well.  You can create one to disallow specific bots.  Here's an easy tool for generating one: http://www.mcanerin.com/EN/search-engine/robots-txt.asp
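
For example, if you wanted to shut out MSN's and Yahoo's crawlers entirely while leaving everyone else alone, the file would look something like this (msnbot and Slurp are the commonly documented User-agent tokens for MSN and Yahoo/Inktomi, but double-check them against what actually shows up in your logs):

User-agent: msnbot
Disallow: /

User-agent: Slurp
Disallow: /

User-agent: *
Disallow:

Both of those bots are also supposed to honor a non-standard Crawl-delay: line, if you'd rather slow them down than block them outright.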
 

Author Comment

by:jammerms
ID: 17110743
PugnaciousOne,
We've got the robots.txt set.  It seems that Inktomisearch and msnbot are the big culprits.  The googlebots seem to respect robots.txt.

blandyuk,

I'll definitely follow through with this suggestion if I can.  That's an interesting approach.




Keep the good ideas a-comin'.

--J
 
LVL 38

Accepted Solution

by:Rich Rumble
Rich Rumble earned 100 total points
ID: 17111506
There are a number of files and meta tags you can add (noindex, nofollow, robots.txt: http://www.robotstxt.org/wc/faq.html#prevent ), but all of them can be, and are, ignored by spiders; maybe not by default, but a spider can be set to do so. Detection, account lockout (if possible), and IP blocking are the tried and true methods. Our corporation looked into this extensively, and it's all about detection and reaction. We lock out the accounts of abusers and block their IPs indefinitely, and per the contract they've signed, we get paid to allow them back in.
Here are some other interesting approaches as well: http://palisade.plynt.com/issues/2006Jul/anti-spidering/
http://www.robotstxt.org/wc/meta-user.html
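For reference, the meta tag version of noindex/nofollow is just a line in the <head> of each page, something like the one below; the polite crawlers honor it, the abusive ones ignore it:

<meta name="robots" content="noindex, nofollow">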
-rich

 

Author Comment

by:jammerms
ID: 17111700
richrumble,

We've got robots.txt set properly as well as the Meta tags in the header of each page.


That palisade.plynt.com link is really interesting.



Everyone,
I've read about using traps like a 1 x 1 px transparent bitmap image link to another page that has redirects back into itself with like a 20-second delay.  Is this still a good solution, or have spiders been made smarter?  Any other ways to make bad bots pay for their crimes?

Thanks again for the input.
 

Author Comment

by:jammerms
ID: 17111943
richrumble,

I see the part about traps in the Palisade article.  Thanks again for the pointer.




I'll give this until after the weekend to see if any new ideas get posted in the meantime.

Thanks,
J
 
LVL 9

Assisted Solution

by:blandyuk
blandyuk earned 300 total points
ID: 17113989
With regards to the ASP code to get the User-Agent:

Request.ServerVariables("HTTP_USER_AGENT")

You could simply do an "InStr" on the User-Agent for particular strings associated with bots. If the result is greater than 0 (i.e. there's a match), Response.End() it. Three easy ones to block:

http://www.google.com/bot.html
stumbleupon.com
Girafabot;

Here are some User-Agents I've taken from my tracking logs which are clearly bots:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; stumbleupon.com 1.926; VNIE5 RefIE5; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; Girafabot; girafabot at girafa dot com; http://www.girafa.com)

I'll post some more when I find them.

Obviously you're going to have to be careful about what you specify, as you could easily block actual users :( If you are specific, you shouldn't have a problem.
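
Here's a rough sketch of what I mean in classic ASP / VBScript, to drop at the top of a page or into an include file (the signature list is just an example, build yours from your own logs):

<%
' Substrings that identify known bots in the User-Agent header.
' Example signatures only - add / remove entries based on your tracking logs.
Dim botSigs, ua, i
botSigs = Array("http://www.google.com/bot.html", "stumbleupon.com", "Girafabot")

ua = Request.ServerVariables("HTTP_USER_AGENT")

For i = 0 To UBound(botSigs)
    ' InStr returns the position of the match (1 or more), or 0 if there's no match
    If InStr(1, ua, botSigs(i), vbTextCompare) > 0 Then
        Response.End() ' stop processing, nothing more gets sent to the bot
    End If
Next
%>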
 

Author Comment

by:jammerms
ID: 17124658
It turns out we're just doing HTML for our website, so the ASP solutions will have to wait.

I did notice that our robots.txt had a capital R, so I changed it to lowercase to see if that would help.

Thanks for the pointers, people.
