
Dynamic Robots.txt with Search Engine Modifiers

Michael Worsham
Well-rounded and highly experienced with a professional background in cloud/infrastructure solutions and project management.
Wikipedia describes 'robots.txt' as follows: the robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.

The 'robots.txt' file can be structured in many different ways (depending on standard and non-standard extensions); however, rather than try to explain them all here, I will direct your attention to the Wikipedia article instead -- http://en.wikipedia.org/wiki/Robots.txt -- as it gives better case-by-case examples.
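For reference, a conventional static robots.txt is just a plain-text file served from the web root. A minimal example (the paths here are illustrative, not from any particular site):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Allow: /stories/
```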

Back to the issue at hand -- sometimes when websites are developed, the infamous 'robots.txt' file is missing from the site. As a result, search engine bots (e.g. GoogleBot, MSNBot) will detect that the robots.txt file is missing and will then attempt to scan and crawl each and every directory on your web server, looking for underlying information to categorize, archive and post to the world.

In response to this conundrum, I have a Perl script on my Apache web server that will -- in a way -- mimic the presence of a robots.txt file for the site, thus restricting what the search bots can and cannot access (i.e. directories, files), when a bot is allowed to visit the site, and for how long. The cool thing about the script is that it can be geared to any particular search engine bot (e.g. MSNBot), restricting that bot's actions even further than the default setup. The script is also totally dynamic: you can change the Disallow or Allow exclusions and save the script, and the next time a search engine bot requests robots.txt, it will receive the updated rules.

 
#!/usr/bin/perl
# Dynamically generated robots.txt, served by Apache as a CGI script.
use strict;
use warnings;

$| = 1;    # unbuffer STDOUT so the response is flushed immediately

# CGI environment variables that Apache sets for each request. Note that
# REMOTE_HOST is only populated when Apache performs reverse-DNS lookups
# (HostnameLookups On); otherwise it will be empty.
my $host  = $ENV{'REMOTE_HOST'}     || '';
my $addr  = $ENV{'REMOTE_ADDR'}     || '';   # unused below, but handy for logging
my $agent = $ENV{'HTTP_USER_AGENT'} || '';

# robots.txt must be served as plain text
print "Content-type: text/plain\n\n";

# Serve the stricter msnbot rule set only when both the reverse-DNS host
# and the user-agent string identify the client as msnbot.
if ($host =~ /\.msn\.com$/i && $agent =~ /^msnbot/) {
    print <<'EOF';
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /css/
Disallow: /taxonomy/term/
Allow: /stories/
Request-rate: 1/1h
Visit-time: 0600-0900
EOF
}
else {
    # Default rule set for every other visitor
    print <<'EOF';
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /css/
Disallow: /taxonomy/term/
Allow: /stories/
Request-rate: 2/1h
Visit-time: 0600-0900
EOF
}



The Perl script (above) should be downloaded and renamed as 'robots.pl'. Then place it in /var/www/cgi-bin (or wherever your global/world readable and executable CGI files are found) and make it executable (e.g. 'chmod 775 robots.pl'). Inside Apache's httpd.conf, make sure that 'Options +ExecCGI' is also enabled for the /var/www/cgi-bin files.
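As a sketch, the relevant httpd.conf stanza might look like the following (Apache 2.4 syntax; the directory path is taken from above, so adjust it to your own layout):

```apache
<Directory "/var/www/cgi-bin">
    Options +ExecCGI
    AddHandler cgi-script .pl
    Require all granted
</Directory>
```

On Apache 2.2 and earlier, replace 'Require all granted' with the older 'Order allow,deny' / 'Allow from all' pair.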

To utilize the dynamic robots.pl script, place the RewriteRule (as seen below) in either httpd.conf or the virtual host entry, pointing requests for robots.txt back to the robots.pl Perl script. Restart your Apache process and you should be ready to go!

 
RewriteEngine On
RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]



 