Solved

Prevent robots' visits from being recorded in the database

Posted on 2014-03-22
Last Modified: 2014-03-22
Hi all.

On my site I used a robots list (allrobots.txt) to keep crawlers' visits from being recorded in the database.

I used this code

$allrobots = file_get_contents( 'allrobots.txt' ); //robot-name:
preg_match_all( '/(?<=robot-id:\s).*(?=$)/im', $allrobots, $crawlers );

if ( !in_array( strtolower( $_SERVER['HTTP_USER_AGENT'] ), $crawlers[0] ) )
{
    //here write to the database the visitor's data
}


But this seems to fail, since I still get recorded visits from crawlers: for instance, I still see in the database visits to pages that no longer exist, coming from Mountain View (that is Google, isn't it?)

So what is the best way to accomplish my goal?

Thanks to all for any advice.

Cheers
Question by:Marco Gasi
6 Comments
 
LVL 82

Expert Comment

by:Dave Baldwin
The search robots for Google and Bing and Baidu make a direct request for 'robots.txt' so I'm not sure what your code above would do for you.

This page http://www.robotstxt.org/robotstxt.html explains that to tell all (obedient) robots not to scan your pages, you should put the following in a file called 'robots.txt' in the root of your web directory.
User-agent: *
Disallow: /

 
LVL 30

Author Comment

by:Marco Gasi
Hi, Dave, thanks for your reply.

What I need is not a way to prevent robots from scanning my pages. I only want to avoid storing in the database the visits made by crawlers and, more generally, by anything that is not a human being, but I don't know if that's possible.
 
LVL 108

Accepted Solution

by:Ray Paseur (earned 500 total points)
"A way to prevent robots..." simply doesn't exist at the 100% level. If you're willing to tolerate a little "slop" in the process you can look for the substring "bot" in the HTTP_USER_AGENT. That is almost always a strong clue. If you have a common script that starts all of your web pages (something that starts session, connects database, etc.) you can put code into it that will test for the user agent and simply return a blank page to the spiders, or redirect to the home page.
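
As a rough illustration of the substring test described above, here is a minimal sketch in PHP. The function name looksLikeBot() is hypothetical, and the extra keywords "crawl" and "spider", as well as treating an empty user agent as a robot, are assumptions added for the example rather than part of the suggestion itself.

```php
<?php
// Hypothetical helper: the name and the keyword list are assumptions
// made for this sketch, not an established API.
function looksLikeBot(?string $userAgent): bool
{
    // Assumption: a missing or empty user agent is not a normal browser.
    if ($userAgent === null || trim($userAgent) === '') {
        return true;
    }

    // Case-insensitive substring search. "bot" is the clue named above;
    // "crawl" and "spider" are extra assumed keywords.
    foreach (['bot', 'crawl', 'spider'] as $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }

    return false;
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;

if (!looksLikeBot($userAgent)) {
    // here write the visitor's data to the database
}
```

This will still let some automated clients through (an agent such as Java/1.6.0_34 contains none of these keywords), which is the "slop" mentioned above.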

In my experience with this, I have found that the overwhelming majority of 'bots obey robots.txt with only a few from Venezuela, China and Bulgaria that ignore the directives.  But this is the internet and there is no 100% certain way to identify 'bots.  I can write a cURL script that will look exactly like a Firefox browser referred by Google, and your server will not be able to detect the fact that there is no human behind the request.  And just today I got two requests from agent Java/1.6.0_34 somewhere in Sweden.  These are mostly edge cases.

If you want to do an experiment that will help you identify the good vs bad traffic, record all of the HTTP_USER_AGENT values in a small database table over a period of time, perhaps a couple of weeks. Then normalize the values to uppercase, sort them and count them. You'll be able to see what's going on and you'll know which requests to ignore.
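
The counting step of that experiment might look something like the sketch below. It assumes the recorded user agent strings have already been read back from the table into a plain PHP array; the function name countUserAgents() and the sample values are hypothetical.

```php
<?php
// Hypothetical helper for the experiment: takes the recorded
// HTTP_USER_AGENT strings and returns "normalized agent => count",
// with the most frequent agents first.
function countUserAgents(array $rows): array
{
    // Normalize to uppercase (and trim) so variants group together.
    $normalized = array_map(
        function (string $ua): string {
            return strtoupper(trim($ua));
        },
        $rows
    );

    // Count identical values, then sort by count, highest first.
    $counts = array_count_values($normalized);
    arsort($counts);

    return $counts;
}

// Made-up sample data standing in for the rows of the logging table.
$recorded = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)',
    'Java/1.6.0_34',
];

foreach (countUserAgents($recorded) as $agent => $count) {
    echo $count, "\t", $agent, PHP_EOL;
}
```

The same grouping could be done in the database with a GROUP BY query; doing it in PHP here just keeps the sketch independent of any particular table layout.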
 
LVL 30

Author Comment

by:Marco Gasi
Hi, Ray. I don't need a 100% level, and I'm sure your suggestion will satisfy my needs in the best possible way. I'll certainly run the suggested tests.
Thank you.

Marco
 
LVL 30

Author Closing Comment

by:Marco Gasi
Thank you both for your help. Have a nice weekend.
 
LVL 108

Expert Comment

by:Ray Paseur
Thanks, Marco.  You too!