Dynamic Robots.txt with Search Engine Modifiers

Michael Worsham, Chief Architect | Private Cloud Solutions Architect | Project Manager
As Wikipedia explains, robots.txt (the robot exclusion standard, also known as the Robots Exclusion Protocol) is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.

A robots.txt file can be structured in many different ways (depending on the standard and various non-standard extensions). Rather than try to explain them all here, I will direct your attention to the Wikipedia article instead (http://en.wikipedia.org/wiki/Robots.txt), as it gives good case-by-case examples.

Back to the issue at hand: sometimes when websites are developed, the infamous robots.txt file is missing from the site. As a result, search engine bots (e.g. GoogleBot, MSNBot) will detect that the file is missing and assume they are free to crawl every directory and file they can reach on your web server, categorizing, archiving, and publishing whatever they find to the world.

In response to this conundrum, I run a Perl script on my Apache web server that mimics the presence of a robots.txt file for the site, restricting what the search bots can and cannot access (i.e. directories and files), when they are allowed to visit the site, and for how long. The nice thing about the script is that it can be geared toward a specific search engine bot (e.g. MSNBot), restricting that bot's actions even further than the default rules. The script is also completely dynamic: change the Disallow or Allow exclusions, save the script, and the next time a search engine bot requests robots.txt it will be served the updated rules.

 
#!/usr/bin/perl
use strict;
use warnings;

# Flush output immediately so the response is not buffered
$| = 1;

# CGI environment variables set by Apache for the incoming request.
# Note: REMOTE_HOST is only populated when Apache's HostnameLookups is On.
my $host  = $ENV{'REMOTE_HOST'}     || '';
my $addr  = $ENV{'REMOTE_ADDR'}     || '';
my $agent = $ENV{'HTTP_USER_AGENT'} || '';

# robots.txt must be served as plain text
print "Content-type: text/plain\n\n";

# Stricter rule set for MSNBot, matched by reverse-DNS host and user agent
if ($host =~ /\.msn\.com$/i && $agent =~ /^msnbot/) {
    print <<'EOF';
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /css/
Disallow: /taxonomy/term/
Allow: /stories/
Request-rate: 1/1h
Visit-time: 0600-0900
EOF
} else {
    # Default rules for all other robots
    print <<'EOF';
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /css/
Disallow: /taxonomy/term/
Allow: /stories/
Request-rate: 2/1h
Visit-time: 0600-0900
EOF
}
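To gear the script toward another crawler, you can add a further branch of the same shape above the final 'else'. The fragment below is only a minimal sketch, not part of the original script: it assumes Googlebot's usual reverse-DNS domains (googlebot.com / google.com) and user-agent string, so verify those against Google's current documentation, and adjust the Disallow/Allow lines to suit your site.

# Hypothetical extra branch for Googlebot -- insert above the final 'else' in robots.pl
elsif ($host =~ /\.(googlebot|google)\.com$/i && $agent =~ /Googlebot/i) {
    print <<'EOF';
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /private/
Allow: /stories/
EOF
}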



The Perl script above should be saved as 'robots.pl'. Place it in /var/www/cgi-bin (or wherever your world-readable, executable CGI files live) and make it executable (e.g. 'chmod 775 robots.pl'). In Apache's httpd.conf, make sure 'Options +ExecCGI' is also enabled for the /var/www/cgi-bin directory.
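For reference, the relevant httpd.conf block typically looks something like the sketch below. This is only a minimal example assuming the stock /var/www/cgi-bin location; adjust the path and any access-control directives to match your Apache version and layout.

<Directory "/var/www/cgi-bin">
    # Permit CGI execution in this directory (path is this article's default;
    # change it to wherever your CGI scripts actually live)
    Options +ExecCGI
</Directory>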

To utilize the dynamic robots.pl script, place the RewriteRule (as seen below) in either httpd.conf or the virtual host entry, pointing requests for robots.txt back to the robots.pl Perl script (this requires mod_rewrite to be loaded). Restart your Apache process and you should be ready to go!

 
RewriteEngine On
RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
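If you prefer to scope the rewrite per site, the same two lines can sit inside a virtual host entry. The sketch below is only illustrative: the ServerName and DocumentRoot values are placeholders for your own site, while the RewriteRule itself is unchanged from above.

<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example

    # Hand robots.txt requests to the shared CGI script
    RewriteEngine On
    RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
</VirtualHost>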



 

Comments (4)

Michael Worsham (Author) commented:
The article reads fine to me; it even passed MS Word grammar and spelling checks. I let a couple of my co-workers read it as well, and they didn't see any issues. Sure, point out what looks incorrect or needs to be reworded.

Since explaining things in a simple manner isn't my forte, I admit the script/code requires a bit of technical knowledge to use.

-- Michael

PS: I also found a bug in the article editor: I cannot post 'Options +ExecCGI' (with the plus symbol), as the preview and editor strip it out.
Tony McCreath (Technical SEO Consultant) commented:
Interesting, but I don't understand why one wouldn't just create a simple robots.txt file instead?
Michael Worsham (Author) commented:
This script was designed to handle several hundred virtual hosts, so that only one file has to be modified rather than an individual robots.txt file for each site.
Tony McCreath (Technical SEO Consultant) commented:
Using a common cgi folder. Nice idea.
