Identifying robots

yonat asked:

Is there a good way to tell whether the user-agent accessing my script is a robot? How?

ahoffmann commented (accepted solution):
most browsers can be configured to send whatever you like as HTTP_USER_AGENT and/or HTTP_REFERER (lynx, w3m, wget, Konqueror, Mozilla, Netscape)
These variables are not reliable (as I said before); they're just a hint ... BTW, I don't know of any header variable that is reliable (I can fake them all :-)
 
ahoffmann commented:
check HTTP_USER_AGENT (it arrives in the HTTP request header),
and keep in mind that a client can write anything there
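For example, in a CGI script the header shows up as an environment variable. A minimal Python sketch (illustration only):

    import os

    # CGI exposes the User-Agent request header as HTTP_USER_AGENT.
    user_agent = os.environ.get("HTTP_USER_AGENT", "")
    print("Content-Type: text/plain\n")
    print("Your user-agent:", user_agent)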
 
yonat (Author) commented:
Right, but what should I look for in there? It seems the list is endless!
 
ahoffmann commented:
yes, it is endless, somehow ..
well-behaved robots/spiders use their own identifiers, but there are a lot that identify themselves as Netscape or IE, for obvious reasons ...
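A simple substring match catches the well-behaved ones. A minimal Python sketch (the identifier list below is a tiny, made-up sample, not an exhaustive database):

    import os

    # Illustrative sample of crawler identifiers, not a complete list.
    KNOWN_BOTS = ("googlebot", "slurp", "spider", "crawler")

    def is_known_bot(user_agent):
        ua = user_agent.lower()
        return any(name in ua for name in KNOWN_BOTS)

    if is_known_bot(os.environ.get("HTTP_USER_AGENT", "")):
        pass  # e.g. serve a lightweight page instead of the heavy one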

If a robot camouflages itself as a browser, are you still interested in it then?
 
yonat (Author) commented:
Yes - they overload the server. I already did some things to block most of their requests (mostly URL fiddling), but I wonder if there is a way to keep them away from all the heavy processing requests.
 
ahoffmann commented:
a combination of HTTP_USER_AGENT and HTTP_REFERER
might narrow down the robots, 'cause they will probably not have set HTTP_REFERER.
But again, keep in mind that HTTP_REFERER can be set to anything; it may even be missing entirely.
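Something like this, purely as a heuristic (Python sketch; both values can be faked, as said):

    import os

    user_agent = os.environ.get("HTTP_USER_AGENT", "")
    referer = os.environ.get("HTTP_REFERER", "")

    # Heuristic only: many robots send no Referer at all, while most
    # interactive browsers do when the user follows a link.
    probably_robot = (referer == "") and ("Mozilla" not in user_agent)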
 
yonat (Author) commented:
That's interesting. I'll check the logs for empty referers.
Any other ideas?
 
samri commented:
And remember that some browsers do not send a proper HTTP_REFERER header (or none at all!), if I recall reading that somewhere.

cheers.
 
Droby10 commented:
Ethical robots are supposed to check for robots.txt before making any subsequent requests, so you could match those entries in your logs and filter by the patterns they leave. For unethical robots, you might find something telling in the User-Agent header, but you'll probably also have a few that use common browser values.
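For example, you could harvest the client addresses that fetched /robots.txt from the access log and treat later requests from them as robots. A Python sketch (the log path and Common Log Format are assumptions about your setup):

    # Collect the addresses of clients that requested /robots.txt.
    robot_ips = set()
    with open("/var/log/apache/access_log") as log:
        for line in log:
            fields = line.split()
            # Common Log Format: host ident user [date] "method path proto" ...
            if len(fields) > 6 and fields[6] == "/robots.txt":
                robot_ips.add(fields[0])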
 
yonat (Author) commented:
Thanks everyone for your replies!
I followed the server logs for some time, and based on the data I decided to do two things:

1. Stop invoking "heavy" queries via GET, so that robots will only fetch those pages once. (I pass the arguments by other methods instead.)

2. Mangle email addresses for any client with no cookies. If that proves insufficient, I will use the cookie to count how many page views the client makes in, say, 30 seconds, and decide whether it is a robot based on that.
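For the second measure, a rough Python sketch of the rate check (the 30-second window and the request limit are made-up values for illustration):

    import time

    WINDOW = 30   # seconds to look back
    LIMIT = 20    # arbitrary threshold: more requests than this looks robotic
    hits = {}     # cookie value -> recent request timestamps (in-memory only)

    def too_many_requests(cookie_id):
        now = time.time()
        recent = [t for t in hits.get(cookie_id, []) if now - t < WINDOW]
        recent.append(now)
        hits[cookie_id] = recent
        return len(recent) > LIMIT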
 
yonat (Author) commented:
ahoffmann, do you want a B or would you rather pass?
 
ahoffmann commented:
if my comments helped solve your problem, grade a comment; otherwise ask support @ EE to turn this into a 0-point PAQ and they'll refund your points.
 
samri commented:
yonat,

I think ahoffmann's comments are on the right track: it's possible to work with HTTP_USER_AGENT and HTTP_REFERER, but the drawback is that those headers can be faked.

One approach (IMHO) would be to compare the HTTP_USER_AGENT against a list of "valid" agents, and then try to verify the validity of HTTP_REFERER.  If I'm not mistaken, the value of HTTP_REFERER should be the full URL of the referring page, so it's possible to backtrack the user's path -- and again, this can be faked (note ahoffmann's comment).
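A quick way to check that the referer at least looks like a full URL (Python sketch; a real check would also verify the host against your own pages):

    import os
    from urllib.parse import urlparse

    referer = os.environ.get("HTTP_REFERER", "")
    parsed = urlparse(referer)
    # A plausible referer is a complete URL with a scheme and a host.
    plausible = parsed.scheme in ("http", "https") and parsed.netloc != ""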

I'd be pleased to give an A.  Be appreciative :)

my 2 cents.
 
samri commented:
- hope ahoffmann won't jump at me for the B. :(
 