Solved

differences between bots and real browsers

Posted on 2003-11-12
Last Modified: 2013-12-24
For the custom reports on traffic to our site, I need to determine whether a hit comes from a bot such as Googlebot or from a real person looking at the site. What is the best way to do this? I'm currently using a list of user agents that I manually mark as human or bot. Is there a better way?
Question by:cdillon
14 Comments
 
LVL 9

Expert Comment

by:shooksm
ID: 9732904
The hard thing is that I can write a bot that mimics a common browser. Here are a couple of suggestions for some other filters.

Bots should look for a robots.txt at the root of your server. You could mark any IP address that requests robots.txt as a bot.

You could set a threshold on the percentage of the site that has been viewed. Your average users are going to go to the page, or quickly click through to the piece of information they want, and then leave. For a site with 100 pages, I highly doubt someone who has viewed 75 of those pages in one session is a real person.

You can cross off strange requests or vulnerability checks as bots too. For instance, the common check to see whether cmd.exe is accessible.

Just a couple of ideas, although I think your current method will work for the majority of requests.
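
The three filters above could be sketched roughly as follows in Python. This is only an illustration, not a tested log analyzer: the `flag_probable_bots` function, the `(ip, path)` input shape, the `cmd.exe` probe marker, and the 0.75 threshold are all assumptions, and hits are grouped per IP address rather than per session for simplicity.

```python
from collections import defaultdict

# Assumption: substrings that typically show up in vulnerability probes.
PROBE_MARKERS = ("cmd.exe",)
# Assumption: fraction of the site viewed, taken from the 75-of-100 example.
COVERAGE_THRESHOLD = 0.75


def flag_probable_bots(hits, total_pages):
    """Return the set of IPs that trip at least one heuristic.

    hits is an iterable of (client_ip, requested_path) pairs that have
    already been parsed out of the access log (hypothetical format).
    """
    pages_by_ip = defaultdict(set)
    flagged = set()
    for ip, path in hits:
        lowered = path.lower()
        if lowered == "/robots.txt":
            flagged.add(ip)  # asked for robots.txt
        elif any(marker in lowered for marker in PROBE_MARKERS):
            flagged.add(ip)  # looks like a vulnerability probe
        else:
            pages_by_ip[ip].add(lowered)
    for ip, pages in pages_by_ip.items():
        if len(pages) / total_pages >= COVERAGE_THRESHOLD:
            flagged.add(ip)  # viewed most of the site
    return flagged
```

Several real users can share one IP address behind a proxy, so a flag from this sketch is better treated as a hint for review than as a verdict.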
 
LVL 19

Expert Comment

by:cheekycj
ID: 9740328
A lot of bots/spiders will have their identity alongside the browser, like this:

Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)
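
A string like the one above can be checked for a declared identity with a small pattern match. The Python sketch below is a rough illustration only: the `declared_identity` function and the hint words in `HINT` are assumptions, not an established list.

```python
import re

# Assumption: self-identifying crawlers tend to put a name containing one of
# these words, a contact e-mail, or a URL inside the parenthesised comment.
HINT = re.compile(r"bot|crawl|spider|slurp|@|https?://", re.IGNORECASE)
COMMENT = re.compile(r"\(([^)]*)\)")


def declared_identity(user_agent):
    """Return the comment tokens that look like a crawler identity, if any."""
    tokens = []
    for comment in COMMENT.findall(user_agent):
        # Comment fields are separated by semicolons or commas.
        for token in re.split(r"[;,]", comment):
            token = token.strip()
            if token and HINT.search(token):
                tokens.append(token)
    return tokens
```

For the FastCrawler3 string this returns the crawler name and its contact address, while a plain `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` yields an empty list. Some legitimate browser add-ons also put URLs in the string, so matches probably still deserve a manual look.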


A good ref:
http://www.psychedelix.com/agents1.html

How are you tracking the reports or parsing your logs?

CJ
 
LVL 3

Author Comment

by:cdillon
ID: 9740378
We screen out bots that have their identity stated in the browser string. The problem is that the list keeps changing and growing, so when a new bot finds our site, the reports suddenly show many more hits.
 
LVL 19

Expert Comment

by:cheekycj
ID: 9740434
How are you screening them? Is the list inclusive or exclusive?

CJ
 
LVL 3

Author Comment

by:cdillon
ID: 9740516
We exclude browsers whose user_agent contains the words googlebot or scooter or ask jeeves or ....
 
LVL 19

Expert Comment

by:cheekycj
ID: 9740868
Instead of excluding those, why not keep an inclusive list of tracked browsers? New browsers can easily be added, and any new bots are left out automatically without your needing to keep track of them.
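
A minimal Python sketch of the inclusive approach, combined with the existing exclusion words, might look like this. The `classify` function and the `BROWSER_MARKERS` entries are illustrative assumptions; only the bot keywords come from this thread.

```python
# Bot names mentioned earlier in this thread.
BOT_KEYWORDS = ("googlebot", "scooter", "ask jeeves")
# Assumption: an illustrative allowlist of browser markers, not a vetted one.
BROWSER_MARKERS = ("msie", "gecko", "opera")


def classify(user_agent):
    """Label a user-agent string as 'bot', 'human', or 'unknown'."""
    ua = user_agent.lower()
    if any(keyword in ua for keyword in BOT_KEYWORDS):
        return "bot"
    if any(marker in ua for marker in BROWSER_MARKERS):
        return "human"
    # Anything unrecognised stays out of the human count until reviewed.
    return "unknown"
```

Because unrecognised strings fall into "unknown" instead of "human", a new bot would not inflate the report. A bot that copies a browser's string would still be counted as human, though.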

CJ
 
LVL 3

Author Comment

by:cdillon
ID: 9741002
Every toolbar adds its own portion to the browser string, and so do site providers and others. I started keeping track, and so far we've had over 10,000 distinct browsers looking at our site. It's not very handy to have to determine whether every new browser is a bot or not.
 
LVL 19

Expert Comment

by:cheekycj
ID: 9773412
10,000 distinct browsers - does that include the bots? Or are you saying that legit browsers have over 10K variations?

CJ
 
LVL 3

Author Comment

by:cdillon
ID: 9773774
Mostly legit browsers, with some bots mixed in. By distinct browsers, I mean that the cgi.user_agent string is different.
 
LVL 19

Expert Comment

by:cheekycj
ID: 9773807
This is a tough one. You should be able to look through your history and find a robust set of CGI.user_agent strings that you want to track, and go with that. Being comprehensive would be very difficult.
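
One way to pick that set from history is to count how often each string occurs and keep the most frequent ones until they cover most of the traffic. The Python sketch below assumes the historical user-agent strings are already available as a list; the `agents_covering` function and the 0.95 default are hypothetical choices.

```python
from collections import Counter


def agents_covering(user_agents, target_share=0.95):
    """Return the most frequent user-agent strings that together account
    for at least target_share of all hits."""
    counts = Counter(user_agents)
    total = sum(counts.values())
    selected, covered = [], 0
    for agent, hits in counts.most_common():
        if covered / total >= target_share:
            break  # the strings chosen so far already cover enough traffic
        selected.append(agent)
        covered += hits
    return selected
```

The selected strings would still need to be marked as human or bot by hand, but the list to review should be far shorter than the full set of distinct strings.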

CJ
 
LVL 3

Author Comment

by:cdillon
ID: 10986176
I recommend a point refund.
 

Accepted Solution

by:
modulo earned 0 total points
ID: 11052448
PAQed, with points refunded (500)

modulo
Community Support Moderator

Question has a verified solution.
