Solved

Prevent robots' visits from being recorded in the database

Posted on 2014-03-22
Last Modified: 2014-03-22
Hi all.

On my site I used a robots.txt file to prevent crawlers' visits from being recorded in the database.

I used this code:

$allrobots = file_get_contents( 'allrobots.txt' ); //robot-name:
preg_match_all( '/(?<=robot-id:\s).*(?=$)/im', $allrobots, $crawlers );

if ( !in_array( strtolower( $_SERVER['HTTP_USER_AGENT'] ), $crawlers[0] ) )
{
    // write the visitor's data to the database here
}


But this seems to fail, since I still get recorded visits from crawlers: for instance, I still see in the database visits to pages that no longer exist, coming from Mountain View (that is Google, isn't it?).

So what is the best way to accomplish my goal?

Thanks to all for any advice.

Cheers
Question by:Marco Gasi
6 Comments
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 39947262
The search robots for Google and Bing and Baidu make a direct request for 'robots.txt' so I'm not sure what your code above would do for you.

This page http://www.robotstxt.org/robotstxt.html explains that, to tell all (obedient) robots not to scan your pages, you should put the following in a file called 'robots.txt' in the root of your web directories.
User-agent: *
Disallow: /

 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947267
Hi, Dave, thanks for your reply.

What I need is not a way to prevent robots from scanning my pages. I only want to avoid storing in the database visits made by crawlers and, more generally, by non-human visitors, but I don't know if that's possible.
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 39947364
A way to prevent robots...
simply doesn't exist at the 100% level.   If you're willing to tolerate a little "slop" in the process you can look for the substring "bot" in the HTTP_USER_AGENT.  That is almost always a strong clue.  If you have a common script that starts all of your web pages (something that starts session, connects database, etc.) you can put code into it that will test for the user agent and simply return a blank page to the spiders, or redirect to the home page.
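
As a rough sketch of the substring test described above: the helper name looks_like_bot is hypothetical, and treating a missing user agent as non-human is an assumption rather than something stated in this thread.

```php
<?php
// Hypothetical helper (not from the thread): treats any user agent that
// contains "bot", in any letter case, as a crawler.
function looks_like_bot(?string $userAgent): bool
{
    // Assumption: a request with no user agent at all is not a human visitor.
    if ($userAgent === null || $userAgent === '') {
        return true;
    }
    // stripos() is case-insensitive, so "Googlebot" and "bingbot" both match.
    return stripos($userAgent, 'bot') !== false;
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;
if (!looks_like_bot($userAgent)) {
    // write the visitor's data to the database here
}
```

Unlike in_array(), which only matches when the whole user agent string equals a list entry, this matches a fragment anywhere in the string. Agents that do not contain "bot", such as Java/1.6.0_34, would still get through.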

In my experience with this, I have found that the overwhelming majority of 'bots obey robots.txt with only a few from Venezuela, China and Bulgaria that ignore the directives.  But this is the internet and there is no 100% certain way to identify 'bots.  I can write a cURL script that will look exactly like a Firefox browser referred by Google, and your server will not be able to detect the fact that there is no human behind the request.  And just today I got two requests from agent Java/1.6.0_34 somewhere in Sweden.  These are mostly edge cases.

If you want to do an experiment that will help you identify the good vs bad traffic, record all of the HTTP_USER_AGENT values in a small database table over a period of time, perhaps a couple of weeks.  Then normalize the values to uppercase, sort them and count them.  You'll be able to see what's going on and you'll know exactly which requests to ignore.
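
The normalize-and-count step might look something like the sketch below. It assumes the recorded values have already been read from the table into a plain array; the function name tally_user_agents and the sample values are made up for illustration.

```php
<?php
// Hypothetical helper: uppercase each recorded user agent, then count how
// often each distinct value occurs, most frequent first.
function tally_user_agents(array $userAgents): array
{
    $normalized = array_map(
        function ($ua) {
            // trim() drops stray whitespace so near-duplicates collapse together.
            return strtoupper(trim((string) $ua));
        },
        $userAgents
    );

    $counts = array_count_values($normalized);
    arsort($counts); // sort by count, descending, keeping the agent strings as keys

    return $counts;
}

// Illustrative sample data, not real log entries.
$sample = ['Googlebot/2.1', 'googlebot/2.1', 'Java/1.6.0_34', 'GoogleBot/2.1 '];
print_r(tally_user_agents($sample));
```

The same tally could also be done in SQL with a GROUP BY on the uppercased column, if that is more convenient than loading the rows into PHP.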

 
LVL 31

Author Comment

by:Marco Gasi
ID: 39947523
Hi, Ray. I don't need a 100% level, and I'm sure your suggestion will satisfy my needs in the best possible way. I'll certainly run the suggested tests.
Thank you.

Marco
 
LVL 31

Author Closing Comment

by:Marco Gasi
ID: 39947525
Thank you both for your help. Have a nice weekend.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39947720
Thanks, Marco.  You too!