Dynamic Robots.txt with Search Engine modifiers

Michael Worsham
Well-rounded and highly experienced with a professional background in cloud/infrastructure solutions and project management.
As Wikipedia explains, 'robots.txt' -- the robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol -- is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.

The 'robots.txt' file can be broken down in many different ways (depending on standards and non-standard extensions); however, rather than trying to explain them all here, I will direct your attention to the Wikipedia article instead -- http://en.wikipedia.org/wiki/Robots.txt -- as it gives better 'case-by-case' examples.

Back to the issue at hand -- sometimes when websites are developed, the infamous 'robots.txt' file is missing from the site. As a result, search engine bots (e.g. GoogleBot, MSNBot) will detect that the robots.txt file is missing and will attempt to scan and search through each and every directory on your web server, looking for information to categorize, archive and post to the world.

In response to this conundrum, I have a Perl script on my Apache web server that will -- in a way -- mimic the presence of a robots.txt file for the site, restricting what the search bots can and cannot access (i.e. directories, files), when a bot is allowed to visit the site, and for how long. The cool thing about the script is that it can be geared to a specific search engine bot (e.g. MSNBot), restricting that bot's actions even further than the default setup. The script is also totally dynamic -- you can change the Disallow or Allow exclusions and simply save the script. The next time a search engine bot requests robots.txt, the script serves the updated rules.

 
#!/usr/bin/perl
#
# Dynamically generates a robots.txt response based on the requesting bot.
use strict;
use warnings;

# Disable output buffering so the response is sent immediately.
$| = 1;

# CGI environment supplied by Apache.
# Note: REMOTE_HOST is only populated when HostnameLookups is On;
# otherwise the msnbot-specific branch below will never match.
my $host  = $ENV{'REMOTE_HOST'}     || '';
my $addr  = $ENV{'REMOTE_ADDR'}     || '';   # client IP (handy for logging)
my $agent = $ENV{'HTTP_USER_AGENT'} || '';

# robots.txt must be served as plain text.
print "Content-type: text/plain\n\n";

# MSNBot (verified by reverse DNS and user agent) gets a stricter rule set;
# all other crawlers fall through to the default rules.
if ($host =~ /\.msn\.com$/i && $agent =~ /^msnbot/) {
print <<'EOF';
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /css/
Disallow: /taxonomy/term/
Allow: /stories/
Request-rate: 1/1h
Visit-time: 0600-0900
EOF
} else {
print <<'EOF';
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /css/
Disallow: /taxonomy/term/
Allow: /stories/
Request-rate: 2/1h
Visit-time: 0600-0900
EOF
}



The Perl script (above) should be downloaded and saved as 'robots.pl'. Place it in /var/www/cgi-bin (or wherever your world-readable and executable CGI files are kept) and make it executable (e.g. 'chmod 775 robots.pl'). Inside Apache's httpd.conf, make sure that 'Options +ExecCGI' is also enabled for the /var/www/cgi-bin directory.
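
As a minimal sketch (assuming the default /var/www/cgi-bin path and Apache 2.2-era directives -- adjust to match your own layout and Apache version), the relevant httpd.conf section might look something like this:

<Directory "/var/www/cgi-bin">
    # Permit CGI execution for scripts in this directory.
    Options +ExecCGI
    AddHandler cgi-script .pl
    # Apache 2.2-style access control; on Apache 2.4 use "Require all granted" instead.
    Order allow,deny
    Allow from all
</Directory>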

To utilize the dynamic robots.pl script, place the RewriteRule (as seen below) in either the httpd.conf or the virtual host entry pointing back to the robots.pl Perl script. Restart your Apache process and you should be ready to go!

 
RewriteEngine On
RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
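
If you would rather scope the rule per site, a sketch of a virtual host entry (the ServerName and DocumentRoot below are placeholders) could look like this:

<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example

    # Serve the dynamically generated robots.txt for this site.
    RewriteEngine On
    RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
</VirtualHost>

A quick sanity check is to request http://www.example.com/robots.txt in a browser (or with a spoofed msnbot user agent) and confirm that the expected rule set comes back.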



 
4 Comments
Author Comment by Michael Worsham:

The article reads fine to me -- it even passed MS Word grammar/spelling checks. I let a couple of my co-workers read it as well, and they didn't see any issues. Sure, point out what looks incorrect or needs to be re-worded.

Since explaining things in a simple manner isn't my forte, I admit it requires a bit of technical knowledge to use the script/code.

-- Michael

PS: I also found a bug in the article editor -- I cannot post 'Options +ExecCGI' (with the plus symbol) as the preview & editor remove it.
Expert Comment by Tony McCreath:

Interesting, but I don't understand why one shouldn't just create a simple robots.txt file instead?
Author Comment by Michael Worsham:

This script was designed for handling several hundred virtual hosts, so that only one file has to be modified rather than an individual robots.txt file for each site.
Expert Comment by Tony McCreath:

Using a common cgi folder. Nice idea.
