Solved

Prevent robots' visit to be registered in database

Posted on 2014-03-22
6
256 Views
Last Modified: 2014-03-22
Hi all.

In my site I used a robots.txt to prevent crawlers' visit to be recorded in the database.

I used this code

$allrobots = file_get_contents( 'allrobots.txt' ); //robot-name:
preg_match_all( '/(?<=robot-id:\s).*(?=$)/im', $allrobots, $crawlers );

if ( !in_array( strtolower( $_SERVER['HTTP_USER_AGENT'] ), $crawlers[0] ) )
{
    //here write to the database the visitor's data

Open in new window


But this seems to fail since I still get recorded visits from crawlers: for instance I still se in database visits of no more existent pages from Mountain View (that is by Google, isn't it?)

So what is the best way to accomplish my goal?

Thanks to all for any advice.

Cheers
0
Comment
Question by:Marco Gasi
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 2
6 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39947262
The search robots for Google and Bing and Baidu make a direct request for 'robots.txt' so I'm not sure what your code above would do for you.

This page http://www.robotstxt.org/robotstxt.html tells you that to tell all (obedient) robots to not scan your pages, you should use the following code in a file called 'robots.txt' in the root of your web directories.
User-agent: *
Disallow: /

Open in new window

0
 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947267
Hi, Dave, thanks for your reply.

What I ned is not a way to prevent robots to scan my pages. I would only avoid to store in the database visits made by crawlers and generally by not human beings, but I don't know if it's possible.
0
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39947364
A way to prevent robots...
simply doesn't exist at the 100% level.   If you're willing to tolerate a little "slop" in the process you can look for the substring "bot" in the HTTP_USER_AGENT.  That is almost always a strong clue.  If you have a common script that starts all of your web pages (something that starts session, connects database, etc.) you can put code into it that will test for the user agent and simply return a blank page to the spiders, or redirect to the home page.

In my experience with this, I have found that the overwhelming majority of 'bots obey robots.txt with only a few from Venezuela, China and Bulgaria that ignore the directives.  But this is the internet and there is no 100% certain way to identify 'bots.  I can write a cURL script that will look exactly like a Firefox browser referred by Google, and your server will not be able to detect the fact that there is no human behind the request.  And just today I got two requests from agent Java/1.6.0_34 somewhere in Sweden.  These are mostly edge cases.

If you want to do an experiment that will help you identify the good vs bad traffic, record all of the HTTP_USER_AGENT values in a small data base table over a period of time, perhaps a couple of weeks.  Then normalize the values to uppercase and sort them and count them.  You'll be able to see what's going on and you'll know exactly which requests to ignore.
0
Are You Using the Best Web Development Editor?

The worlds of web hosting and web development are constantly evolving. Every year we see design trends change, coding standards adapt and new frameworks/CMS created. With such a quick pace of change it’s easy to get lost trying to keep up.

See if your editor made the list.

 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947523
Hi, Ray. I don't need a 100% level and I'm sure your suggestion will satisfy my needs the best possible way. I'll sure do suggested tests.
Thank you.

Marco
0
 
LVL 31

Author Closing Comment

by:Marco Gasi
ID: 39947525
Thank you both for your help. Have a nice week-end.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39947720
Thanks, Marco.  You too!
0

Featured Post

Secure Your WordPress Site: 5 Essential Approaches

WordPress is the web's most popular CMS, but its dominance also makes it a target for attackers. Our eBook will show you how to:

Prevent costly exploits of core and plugin vulnerabilities
Repel automated attacks
Lock down your dashboard, secure your code, and protect your users

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Password hashing is better than message digests or encryption, and you should be using it instead of message digests or encryption.  Find out why and how in this article, which supplements the original article on PHP Client Registration, Login, Logo…
This article discusses four methods for overlaying images in a container on a web page
The viewer will learn how to count occurrences of each item in an array.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

623 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question