Solved

Identifying robots

Posted on 2002-07-24
Last Modified: 2013-12-25
Is there a good way to know if the user-agent accessing my script is a robot? How?
Question by:yonat
15 Comments
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7176692
Check HTTP_USER_AGENT (the User-Agent request header).
Keep in mind that the client can put anything it likes there.
 
LVL 5

Author Comment

by:yonat
ID: 7177190
Right, but what should I look for in there? It seems the list is endless!
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7178400
Yes, it is more or less endless.
Well-behaved robots/spiders use their own identifier, but there are a lot that identify themselves as Netscape or IE, for obvious reasons ...

If a robot camouflages itself as a browser, are you still interested in it?
 
LVL 5

Author Comment

by:yonat
ID: 7178683
Yes - they overload the server. I already did some stuff to prevent most of their requests (URL fiddling mostly), but I wonder if there is a way to ignore them for all heavy processing requests.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7179490
A combination of HTTP_USER_AGENT and HTTP_REFERER
might narrow down the robots, because they probably will not have set HTTP_REFERER.
But again, keep in mind that HTTP_REFERER can be set to anything; it may even be missing.
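
As a rough illustration of combining the two headers, here is a minimal Python sketch. The token list and the function name robot_score are made up for this example, and the score is only a hint, since both headers can be forged or left out.

```python
# Hypothetical substrings; self-identifying robots use many other names too.
ROBOT_TOKENS = ("bot", "crawler", "spider", "slurp")


def robot_score(environ):
    """Return 0, 1 or 2: how many header hints suggest a robot.

    `environ` is a CGI-style mapping such as os.environ.
    """
    score = 0
    user_agent = environ.get("HTTP_USER_AGENT", "").lower()

    # Hint 1: the User-Agent is missing or names itself as a robot.
    if not user_agent or any(token in user_agent for token in ROBOT_TOKENS):
        score += 1

    # Hint 2: no Referer was sent (some real browsers omit it as well).
    if not environ.get("HTTP_REFERER"):
        score += 1

    return score
```

In a CGI script this could be called as robot_score(os.environ), skipping the heavy processing only when the score is 2.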
 
LVL 5

Author Comment

by:yonat
ID: 7180548
That's interesting. I'll check the logs for empty referers.
Any other ideas?
 
LVL 15

Expert Comment

by:samri
ID: 7184623
And remember that some browsers do not send a proper HTTP_REFERER header (or no HTTP_REFERER header at all!), if I recall correctly from something I read.

cheers.
 
LVL 51

Accepted Solution

by:
ahoffmann earned 100 total points
ID: 7184786
Most browsers and clients can be configured to send whatever you like as HTTP_USER_AGENT and/or HTTP_REFERER (lynx, w3m, wget, Konqueror, Mozilla, Netscape).
These variables are not reliable (as I said before); they are just a hint ... BTW, I don't know of any variable in the header that is reliable (I can fake them all :-)
 
LVL 5

Expert Comment

by:Droby10
ID: 7247476
Ethical robots are supposed to request robots.txt before making any other requests. You could search your logs for those requests and filter by the clients that made them. For unethical robots, you might find something telling in the User-Agent header, but you'll probably also have a few that use common browser values.
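
A minimal Python sketch of that log check, assuming the server writes Common Log Format lines. The function name and the regular expression are illustrative only and would need adjusting to the actual log layout.

```python
import re

# Start of a Common Log Format line: host ident user [date] "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)'
)


def hosts_that_fetched_robots_txt(log_lines):
    """Return the set of client hosts that requested /robots.txt."""
    hosts = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # Skip lines that are not in the assumed format.
        # Ignore any query string when comparing the path.
        path = match.group("path").split("?", 1)[0]
        if path == "/robots.txt":
            hosts.add(match.group("host"))
    return hosts
```

Hosts in the returned set are likely well-behaved robots. Robots that never ask for robots.txt will not show up this way.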
 
LVL 5

Author Comment

by:yonat
ID: 7365286
Thanks everyone for your replies!
I followed the server logs for some time, and based on the data I decided to do two things:

1. Not to invoke "heavy" queries using GET, so that robots will only get these pages once. (I pass the arguments using other methods instead.)

2. Mangle email addresses for any client with no cookies. If this proves insufficient, I will use the cookie to count how many page views the client makes in, say, 30 seconds, and decide whether it is a robot based on that.
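
A minimal Python sketch of step 2, under some assumptions: the mangling format, the class name PageViewCounter, and the threshold of 20 views are all invented for illustration, and the counter keeps its state in memory, so a plain CGI script would need a persistent store instead.

```python
import time
from collections import defaultdict, deque


def mangle_email(address):
    """Obfuscate an address for clients that sent no cookie."""
    return address.replace("@", " at ").replace(".", " dot ")


class PageViewCounter:
    """Count page views per cookie inside a sliding time window."""

    def __init__(self, window_seconds=30, max_views=20):
        self.window_seconds = window_seconds
        self.max_views = max_views  # Hypothetical threshold.
        self._views = defaultdict(deque)

    def record(self, cookie_id, now=None):
        """Record one view; return True if the client exceeds the limit."""
        now = time.time() if now is None else now
        views = self._views[cookie_id]
        views.append(now)
        # Drop timestamps that have fallen out of the window.
        while views and now - views[0] > self.window_seconds:
            views.popleft()
        return len(views) > self.max_views
```

A request handler could call record() with the cookie value on every page view and treat the client as a robot once it returns True.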
 
LVL 5

Author Comment

by:yonat
ID: 7365297
ahoffmann, do you want a B or would you rather pass?
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7365721
If my comments helped solve your problem, grade a comment; otherwise ask support @ EE to make it a 0-point PAQ, and they'll refund your points.
 
LVL 15

Expert Comment

by:samri
ID: 7365915
yonat,

I think ahoffmann's comments are on the right track. It's possible to work with HTTP_USER_AGENT and HTTP_REFERER, but the drawback is still that those headers could be faked.

One approach (IMHO) would be to compare HTTP_USER_AGENT against a list of "valid" agents, and then try to verify the validity of HTTP_REFERER. If I'm not mistaken, the value of HTTP_REFERER would be the full URL of the referring page, so it's possible to backtrack the user's path -- and again, this could be faked (note ahoffmann's comment).
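
A minimal Python sketch of that approach. The allow-list prefixes and the host name www.example.com are placeholders, not values from this thread, and a client that forges both headers will still pass.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of User-Agent prefixes considered "valid".
KNOWN_BROWSER_PREFIXES = ("Mozilla/", "Opera/", "Lynx/")
# Hypothetical host names of the site's own pages.
OWN_HOSTS = {"www.example.com"}


def referer_is_own_page(referer):
    """Return True if the Referer URL points at one of our own hosts."""
    if not referer:
        return False
    return urlsplit(referer).hostname in OWN_HOSTS


def passes_header_checks(environ):
    """Return True if both headers look like a browser following our links."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    referer = environ.get("HTTP_REFERER", "")
    # str.startswith accepts a tuple and matches any of the prefixes.
    return user_agent.startswith(KNOWN_BROWSER_PREFIXES) and referer_is_own_page(referer)
```

Note that entry pages reached from a bookmark or a search engine will fail the Referer check, so this only makes sense for pages that are normally reached through internal links.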

I'd be pleased to give an A.  Be appreciative :)

my 2 cents.
 
LVL 15

Expert Comment

by:samri
ID: 7366116
- hope ahoffmann won't jump at me for the B. :(
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7368579
Question has a verified solution.
