OliWarner
asked on
User-agent strings
I need a list of the most popular spider user-agent strings. I've got several items on my website that log or increment things and I really only want some of those to be logging if the thing hitting the page is a real person and not a bot. So I'm left checking the user-agents.
I can either grab the most popular browsers or the most popular bots... Whatever is most efficient -- you decide!
Either way, time-complexity is an issue as it is a fairly busy site, so the shortest and most effective list wins =)
Have you tried setting up robots.txt to block these spiders? This is by far the simplest way.
ASKER
I don't want to block them from viewing the pages -- just stop my logging script counting hits from them.
I've got them in a db, Oli. Give me a minute to extract the bots from the browsers.
> shortest and most effective list wins
So the list of all popular bot agent strings is:
*Bot*
Most spider user-agent strings contain the substring "Bot".
Ordinary user-agent strings from IE, FF, NS, etc. do not contain "Bot".
Try filtering on this.
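For anyone who wants to try the substring filter, here is a minimal sketch in Python, not a definitive implementation. Only the "bot" token comes from this thread. The other tokens, the function names `is_probable_bot` and `record_hit`, and the choice to treat an empty user agent as a bot are all assumptions made for illustration. The match is case-insensitive, so "Bot", "bot" and "BOT" are all caught.

```python
import re

# "bot" is the substring suggested in the thread. The remaining tokens are
# illustrative assumptions and should be adjusted against your own logs.
BOT_TOKENS = ("bot", "crawl", "spider", "slurp")

# One precompiled, case-insensitive pattern, so each request costs a single
# scan of the user-agent string.
BOT_PATTERN = re.compile("|".join(re.escape(t) for t in BOT_TOKENS), re.IGNORECASE)


def is_probable_bot(user_agent):
    """Return True when the user-agent string contains any bot token."""
    if not user_agent:
        # Assumption: a missing or empty user agent is not a real browser.
        return True
    return BOT_PATTERN.search(user_agent) is not None


def record_hit(counters, page, user_agent):
    """Increment the hit counter for a page unless the visitor looks like a bot."""
    if is_probable_bot(user_agent):
        return
    counters[page] = counters.get(page, 0) + 1
```

The pages are still served to spiders either way; only the counting is skipped. Keeping the tokens in one tuple means new ones can be added as they turn up without touching the check itself.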
I'm already using it.
From the above list I found that "CrawlerBot" is also a substring of bot agent strings.
No. In the list above, "CrawlerBot" is not part of the user-agent string; it is a field from my database of live user agents, collected from dozens of my web sites and hand-categorized. Ignore: ,"CrawlerBot"
ASKER
Yeah, I can parse those out without issue. Thanks, Rod, that looks like it'll do the job perfectly.
Those were as of January. There could always be a few new ones cropping up.