
Solved

Prevent robot visits from being registered in the database

Posted on 2014-03-22
Medium Priority
261 Views
Last Modified: 2014-03-22
Hi all.

On my site I try to prevent crawlers' visits from being recorded in the database, using a list of known robots (allrobots.txt).

I used this code:

$allrobots = file_get_contents( 'allrobots.txt' ); // list of known robots, one "robot-id:" field per entry
preg_match_all( '/(?<=robot-id:\s).*(?=$)/im', $allrobots, $crawlers );

if ( !in_array( strtolower( $_SERVER['HTTP_USER_AGENT'] ), $crawlers[0] ) )
{
    // here write to the database the visitor's data
}



But this seems to fail, since crawler visits are still being recorded: for instance, I still see database entries for pages that no longer exist, coming from Mountain View (that's Google, isn't it?).

So what is the best way to accomplish my goal?

Thanks to all for any advice.

Cheers
Question by:Marco Gasi
6 Comments
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 39947262
The search robots for Google, Bing and Baidu make a direct request for 'robots.txt', so I'm not sure what your code above would do for you.

This page http://www.robotstxt.org/robotstxt.html explains that to tell all (obedient) robots not to scan your pages, you should put the following in a file called 'robots.txt' in the root of your web directory:
User-agent: *
Disallow: /


 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947267
Hi, Dave, thanks for your reply.

What I need is not a way to prevent robots from scanning my pages. I only want to avoid storing in the database the visits made by crawlers and, generally, by non-humans, but I don't know if that's possible.
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 39947364
A way to prevent robots...
simply doesn't exist at the 100% level.  If you're willing to tolerate a little "slop" in the process, you can look for the substring "bot" in the HTTP_USER_AGENT.  That is almost always a strong clue.  If you have a common script that starts all of your web pages (something that starts the session, connects to the database, etc.) you can put code into it that will test for the user agent and simply return a blank page to the spiders, or redirect to the home page.
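A minimal sketch of that substring test, in the spirit of the answer above (the function name and the extra tokens beyond "bot" are illustrative choices, not part of the original suggestion):

```php
<?php
// Heuristic check: does this user agent look like a crawler?
// Case-insensitive search for "bot" catches Googlebot, Bingbot,
// DuckDuckBot, etc.; the extra tokens catch a few other common spiders.
function looks_like_bot(string $userAgent): bool
{
    $needles = ['bot', 'crawl', 'spider', 'slurp'];
    foreach ($needles as $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// In a common bootstrap script you would then guard the logging step:
// if (!looks_like_bot($_SERVER['HTTP_USER_AGENT'] ?? '')) { /* record visit */ }
```

As Ray notes, this will never be 100% accurate: a well-behaved browser can be spoofed and a rude crawler can lie, so treat it as filtering noise from the log, not as security.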

In my experience with this, I have found that the overwhelming majority of 'bots obey robots.txt with only a few from Venezuela, China and Bulgaria that ignore the directives.  But this is the internet and there is no 100% certain way to identify 'bots.  I can write a cURL script that will look exactly like a Firefox browser referred by Google, and your server will not be able to detect the fact that there is no human behind the request.  And just today I got two requests from agent Java/1.6.0_34 somewhere in Sweden.  These are mostly edge cases.

If you want to do an experiment that will help you identify the good vs. bad traffic, record all of the HTTP_USER_AGENT values in a small database table over a period of time, perhaps a couple of weeks.  Then normalize the values to uppercase, sort them and count them.  You'll be able to see what's going on and you'll know exactly which requests to ignore.
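That logging experiment might be sketched like this (SQLite via PDO keeps the example self-contained; the table and function names are made up for illustration):

```php
<?php
// Record every user agent, then count the normalized values.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE IF NOT EXISTS agent_log (ua TEXT, seen TEXT)');

// Call this from the common script on every request for a couple of weeks.
function log_agent(PDO $pdo, string $ua): void
{
    $stmt = $pdo->prepare('INSERT INTO agent_log (ua, seen) VALUES (?, ?)');
    $stmt->execute([$ua, date('c')]);
}

// Afterwards: normalize to uppercase, group, sort and count.
function agent_counts(PDO $pdo): array
{
    $sql = 'SELECT UPPER(ua) AS agent, COUNT(*) AS hits
            FROM agent_log
            GROUP BY UPPER(ua)
            ORDER BY hits DESC, agent';
    return $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
}
```

Reading the resulting list top to bottom makes the regular visitors and the regular crawlers stand out, which is exactly the information needed to decide what to exclude from the visit log.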

 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947523
Hi, Ray. I don't need a 100% level, and I'm sure your suggestion will satisfy my needs in the best possible way. I'll certainly run the suggested tests.
Thank you.

Marco
 
LVL 31

Author Closing Comment

by:Marco Gasi
ID: 39947525
Thank you both for your help. Have a nice weekend.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39947720
Thanks, Marco.  You too!

Question has a verified solution.
