differences between bots and real browsers

For the custom reports on traffic to our site, I need to determine whether a hit comes from a bot such as Googlebot or from a real person looking at the site.  What is the best way to do this?  I'm currently using a list of user agents that I manually mark as human or bot.  Is there a better way?
cdillon Asked:

shooksm Commented:
The hard thing is that I can write a bot that mimics a common browser.  Here are a couple of suggestions for some other filters.

Well-behaved bots should look for a robots.txt file at the root of your server.  You could mark any IP address that requests robots.txt as a bot.
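
A minimal sketch of the robots.txt filter, assuming the log has already been parsed into (ip, path) pairs. The function name and the sample hits are hypothetical, not from the original setup.

```python
from typing import Iterable, List, Set, Tuple

Hit = Tuple[str, str]  # hypothetical (ip, path) pair parsed from the access log


def ips_requesting_robots(hits: Iterable[Hit]) -> Set[str]:
    """Return every IP that requested /robots.txt at least once."""
    return {
        ip
        for ip, path in hits
        # Drop any query string before comparing the path.
        if path.split("?", 1)[0].lower() == "/robots.txt"
    }


hits: List[Hit] = [
    ("10.0.0.1", "/robots.txt"),
    ("10.0.0.1", "/products.cfm"),
    ("10.0.0.2", "/index.cfm"),
]
bot_ips = ips_requesting_robots(hits)
# Keep only hits from IPs that never asked for robots.txt.
human_hits = [(ip, path) for ip, path in hits if ip not in bot_ips]
```

Note that several people behind one proxy share an IP, so this can also discard some real visitors.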

You could set a threshold on the percentage of the site viewed in one session.  Your average users are going to go to the page or quickly click through to the piece of information they want, then leave.  For a site with 100 pages, I highly doubt someone who has viewed 75 of those pages in one session is a real person.
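
The coverage threshold could be sketched like this, assuming each hit is a (session_id, path) pair and that the total page count of the site is known. The names and the 0.75 default are illustrative, taken from the 75-of-100 example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def sessions_over_coverage(
    hits: Iterable[Tuple[str, str]], total_pages: int, threshold: float = 0.75
) -> Set[str]:
    """Return session ids that viewed at least `threshold` of the site's pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    pages_by_session: Dict[str, Set[str]] = defaultdict(set)
    for session_id, path in hits:
        pages_by_session[session_id].add(path)  # count distinct pages only
    return {
        sid
        for sid, pages in pages_by_session.items()
        if len(pages) / total_pages >= threshold
    }


sample = [("a", "/1"), ("a", "/2"), ("a", "/3"), ("a", "/3"), ("b", "/1")]
suspected = sessions_over_coverage(sample, total_pages=4)
```

Repeat views of the same page do not raise the coverage, so a visitor reloading one page is not flagged.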

You can cross off strange requests or vulnerability checks as bots too.  For instance, the common check to see whether cmd.exe is accessible.
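
A rough sketch of a probe filter. The marker list is an assumption for illustration (a couple of well-known worm probe targets plus directory traversal) and would need tuning against real logs.

```python
from urllib.parse import unquote

# Assumed, illustrative markers; not an exhaustive list.
PROBE_MARKERS = ("cmd.exe", "root.exe", "../", "..\\")


def looks_like_probe(path: str) -> bool:
    """Return True if the requested path contains a known probe marker."""
    # Decode twice so double-encoded sequences such as %255c are caught.
    decoded = unquote(unquote(path)).lower()
    return any(marker in decoded for marker in PROBE_MARKERS)


probe = looks_like_probe("/scripts/..%255c../winnt/system32/cmd.exe?/c+dir")
normal = looks_like_probe("/products/list.cfm?id=3")
```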

Just a couple of ideas, although I think your current method will work for the majority of requests.
cheekycj Commented:
A lot of bots/spiders will state their identity alongside the browser name in the user-agent string, like this:

Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)
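
One way to catch self-identifying agents like the one above without listing each by name is to look for contact details or crawler-style words. This is only a heuristic sketch; the patterns are assumptions, and a legitimate browser string that embeds a URL would be misclassified.

```python
import re

# Assumed heuristics: an e-mail address or URL, or a crawler-style keyword.
CONTACT_RE = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|https?://|www\.", re.IGNORECASE
)
KEYWORD_RE = re.compile(r"bot|crawl|spider", re.IGNORECASE)


def self_identifies_as_bot(user_agent: str) -> bool:
    """Return True if the user-agent string carries contact info or a bot keyword."""
    return bool(CONTACT_RE.search(user_agent) or KEYWORD_RE.search(user_agent))


crawler = self_identifies_as_bot(
    "Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)"
)
browser = self_identifies_as_bot("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
```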


A good ref:
http://www.psychedelix.com/agents1.html

How are you tracking the reports or parsing your logs?

CJ
cdillon (Author) Commented:
We screen out bots that state their identity in the browser string.  The problem is that the list keeps changing and growing, and when a new bot finds our site, the reports suddenly show many more hits.
cheekycj Commented:
How are you screening them?  Is it inclusive or exclusive?

CJ
cdillon (Author) Commented:
We exclude browsers whose user_agent contains the words googlebot or scooter or ask jeeves or ....
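
The exclusion approach described here amounts to a case-insensitive substring check. A minimal sketch follows; only the three keywords named above are included, and the real list is longer.

```python
from typing import Sequence

# Only the keywords mentioned in the thread; the actual list is longer.
BOT_KEYWORDS = ("googlebot", "scooter", "ask jeeves")


def is_listed_bot(user_agent: str, keywords: Sequence[str] = BOT_KEYWORDS) -> bool:
    """Return True if the user agent contains any known bot keyword."""
    ua = user_agent.lower()
    return any(keyword in ua for keyword in keywords)


listed = is_listed_bot("Googlebot/2.1 (+http://www.googlebot.com/bot.html)")
unlisted = is_listed_bot("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
```

Any bot whose keyword is not yet in the list is counted as a human, which is the weakness discussed in this thread.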
cheekycj Commented:
Instead of excluding those, why not keep an inclusive list of tracked browsers?  New browsers can easily be added, and anything not on the list, including any new bot, is left out of the reports without your needing to keep track of it.
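
An inclusive list could match on a browser token rather than on the whole string, so extra text appended to the user agent does not matter. The patterns below are assumptions for illustration, not a vetted list.

```python
import re

# Hypothetical allowlist of browser tokens to track.
TRACKED_BROWSER_PATTERNS = [
    re.compile(r"\bMSIE \d+\.\d+"),
    re.compile(r"\bFirefox/\d+"),
    re.compile(r"\bOpera[/ ]\d+"),
]


def is_tracked_browser(user_agent: str) -> bool:
    """Return True only if the user agent matches a tracked browser token."""
    return any(p.search(user_agent) for p in TRACKED_BROWSER_PATTERNS)


with_toolbar = is_tracked_browser(
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SomeToolbar 1.0)"
)
crawler = is_tracked_browser(
    "Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)"
)
```

A bot that copies a real browser string exactly would still pass this check, so it complements the other filters rather than replacing them.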

CJ
cdillon (Author) Commented:
Every toolbar adds its own portion to the browser string, and so do site providers and others.  I started keeping track, and so far we've had over 10,000 distinct browsers looking at our site.  It's not very practical to have to decide whether every new browser is a bot or not.
cheekycj Commented:
10,000 distinct browsers: does that include the bots?  Or are you saying that legit browsers have over 10K variations?

CJ
cdillon (Author) Commented:
Mostly legit browsers, with some bots mixed in.  By distinct browsers, I mean that the cgi.user_agent string is different.
cheekycj Commented:
This is a tough one.  You should be able to look through your history and find a robust set of CGI.user_agent strings that you want to track, and go with that.  Being comprehensive would be very difficult.
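
One possible way to pick that set from history is to take the most frequent user-agent strings until they cover a chosen share of past hits. This is a sketch under that assumption; the function name and the 0.9 default are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def agents_covering(history: Iterable[str], coverage: float = 0.9) -> List[str]:
    """Return the most common user agents that together reach `coverage` of hits."""
    counts = Counter(history)
    total = sum(counts.values())
    chosen: List[str] = []
    running = 0
    for agent, n in counts.most_common():
        if running >= coverage * total:
            break  # target share already reached
        chosen.append(agent)
        running += n
    return chosen


history = ["A"] * 7 + ["B"] * 2 + ["C"] * 1
tracked = agents_covering(history, coverage=0.9)
```

The resulting candidates would still need a manual look, since a heavy crawler can also rank near the top by hit count.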

CJ
cdillon (Author) Commented:
I recommend a point refund.
modulo Commented:
PAQed, with points refunded (500)

modulo
Community Support Moderator