Link to home
Start Free TrialLog in
Avatar of jammerms
jammerms

asked on

What is the best way to counteract spiders, crawlers, and bots on our website?

Folks,

We're running Windows Small Business Server 2003, and we're having problems with various crawlers sucking up bandwidth (particularly Googlebot, MSNBot, and Yahoo's Inktomisearch).  What are the best ways to counteract their usage?

We've started blocking IP ranges, but that seems to help only a little, and I figure it's not a permanent solution anyways.

We've got robots.txt set properly as well as the Meta tags in the header of each page.

I've read about using traps like a 1 x 1 px transparent bitmap image link to another page that has redirects back into itself with like a 20-second delay.  Is this still a good solution, or have spiders been made smarter?  Any other ways to make bad bots pay for their crimes?

I'm not the main network person here, but I am his b----, so let me know if I can provide any more information.

--J
Avatar of blandyuk
blandyuk
Flag of United Kingdom of Great Britain and Northern Ireland image

Are you running ASP pages? You could read the "User-Agent" header in the HTTP request. Most spiders specify a link to a page with regards to spidering pages like Google:

http://www.google.com/bot.html

It would look something like:

User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1, http://www.google.com/bot.html)

Once you have compiled a database of spiders, you can simply search for them in the header and simply "Response.End()" so saving bandwidth.

Not an easy method but at least you wouldn't have to worry about finding out all the IP ranges they have, which I can imagine is a lot!
Avatar of PugnaciousOne
PugnaciousOne

Most spiders (not all) respect the robots.txt file as well.  You can create one to disallow specific bots. here's an easy tool.  ( http://www.mcanerin.com/EN/search-engine/robots-txt.asp )
Avatar of jammerms

ASKER

PugnaciousOne,
We'be got the robots.txt set.  It seems that Inktomisearch and msnbot are the big culprits.  The googlebots seem to repect robots.txt.

blandyuk,

I'll definitely follow through with this suggestion if I can.  That's an interesting approach.




Keep the good ideas a-comin'.

--J
ASKER CERTIFIED SOLUTION
Avatar of Rich Rumble
Rich Rumble
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
richrumble,

We've got robots.txt set properly as well as the Meta tags in the header of each page.


That palisade.plynt.com link is really interesting.



Everyone,
I've read about using traps like a 1 x 1 px transparent bitmap image link to another page that has redirects back into itself with like a 20-second delay.  Is this still a good solution, or have spiders been made smarter?  Any other ways to make bad bots pay for their crimes?

Thanks again for the input.
richrumble,

I see the part about traps in the Palisade article.  Thanks again for the pointer.




I'll give this over the weekend to see if any new ideas get posted in the meantime.

Thanks,
J
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
It turns out we're just doing HTML for our website, so the ASP solutions will have to wait.

I did notice that our robots.txt had a capital R, so I changed it to lowercase to see if that would help.

Thanks for the pointers, people.