Solved

Prevent robots' visit to be registered in database

Posted on 2014-03-22
Last Modified: 2014-03-22
Hi all.

On my site I used a robots.txt to prevent crawlers' visits from being recorded in the database.

I used this code

$allrobots = file_get_contents( 'allrobots.txt' ); //robot-name:
preg_match_all( '/(?<=robot-id:\s).*(?=$)/im', $allrobots, $crawlers );

if ( !in_array( strtolower( $_SERVER['HTTP_USER_AGENT'] ), $crawlers[0] ) )
{
    //here write the visitor's data to the database
}


But this seems to fail, since I still get recorded visits from crawlers: for instance, I still see in the database visits to pages that no longer exist, coming from Mountain View (that is Google, isn't it?)

So what is the best way to accomplish my goal?

Thanks to all for any advice.

Cheers
Question by:Marco Gasi
6 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39947262
The search robots for Google and Bing and Baidu make a direct request for 'robots.txt' so I'm not sure what your code above would do for you.

This page http://www.robotstxt.org/robotstxt.html explains that to tell all (obedient) robots not to scan your pages, you should put the following in a file called 'robots.txt' in the root of your web directory.
User-agent: *
Disallow: /

 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947267
Hi, Dave, thanks for your reply.

What I need is not a way to prevent robots from scanning my pages. I only want to avoid storing in the database visits made by crawlers and, more generally, by non-human visitors, but I don't know if that's possible.
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39947364
"A way to prevent robots..." simply doesn't exist at the 100% level. If you're willing to tolerate a little "slop" in the process, you can look for the substring "bot" in the HTTP_USER_AGENT. That is almost always a strong clue. If you have a common script that starts all of your web pages (something that starts the session, connects to the database, etc.) you can put code into it that will test the user agent and simply return a blank page to the spiders, or redirect them to the home page.
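The substring test described above could be sketched roughly like this. The function name looks_like_bot() and the extra markers beyond "bot" are assumptions made for illustration, and treating an empty user agent as non-human is also just one possible choice.

```php
<?php
// Hypothetical helper (not from the original post): returns true when the
// user agent contains one of the given markers, compared case-insensitively.
// Only "bot" comes from the advice above; "crawler" and "spider" are assumed.
function looks_like_bot(string $userAgent, array $markers = ['bot', 'crawler', 'spider']): bool
{
    // Assumption: a request with no user agent at all is not a normal browser.
    if (trim($userAgent) === '') {
        return true;
    }
    foreach ($markers as $marker) {
        // stripos() ignores case, so "Googlebot" and "bingbot" both match "bot".
        if (stripos($userAgent, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// Usage in the common script that starts every page.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (!looks_like_bot($agent)) {
    // here write the visitor's data to the database
}
```

Unlike an exact in_array() comparison against a list of robot IDs, this matches a marker anywhere inside the full user-agent string. It will still let through agents that carry no such marker, so it is a filter with some slop, not a guarantee.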

In my experience with this, I have found that the overwhelming majority of 'bots obey robots.txt with only a few from Venezuela, China and Bulgaria that ignore the directives.  But this is the internet and there is no 100% certain way to identify 'bots.  I can write a cURL script that will look exactly like a Firefox browser referred by Google, and your server will not be able to detect the fact that there is no human behind the request.  And just today I got two requests from agent Java/1.6.0_34 somewhere in Sweden.  These are mostly edge cases.

If you want to do an experiment that will help you identify the good vs bad traffic, record all of the HTTP_USER_AGENT values in a small database table over a period of time, perhaps a couple of weeks.  Then normalize the values to uppercase, sort them and count them.  You'll be able to see what's going on and you'll know exactly which requests to ignore.
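The counting step of that experiment might look something like the sketch below. It assumes the recorded user-agent strings have already been read into a PHP array; the function name count_user_agents() and the sample values are hypothetical, and the same tally could equally be done in SQL with UPPER() and GROUP BY.

```php
<?php
// Hypothetical helper: normalize user-agent strings to uppercase,
// count identical values, and return them with the most frequent first.
function count_user_agents(array $agents): array
{
    $normalized = array_map(
        fn(string $ua): string => strtoupper(trim($ua)),
        $agents
    );
    $counts = array_count_values($normalized); // value => number of occurrences
    arsort($counts);                           // sort by count, descending, keys kept
    return $counts;
}

// Sample values standing in for rows read from the logging table.
$sample = [
    'Mozilla/5.0 (compatible; Googlebot/2.1)',
    'mozilla/5.0 (compatible; googlebot/2.1)',
    'Java/1.6.0_34',
];
foreach (count_user_agents($sample) as $agent => $hits) {
    echo $hits, "\t", $agent, PHP_EOL;
}
```

Reading the resulting list from the top shows which agents account for most of the traffic, which is the information needed to decide which requests to leave out of the visit log.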
 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947523
Hi, Ray. I don't need a 100% level, and I'm sure your suggestion will meet my needs as well as possible. I'll certainly run the suggested tests.
Thank you.

Marco
 
LVL 31

Author Closing Comment

by:Marco Gasi
ID: 39947525
Thank you both for your help. Have a nice week-end.
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39947720
Thanks, Marco.  You too!