Web Robots: Indexing of CGI scripted pages

Posted on 1999-06-21
Medium Priority
Last Modified: 2013-12-25
I have read that robots either ignore or have difficulty indexing pages "created on the fly" through CGI scripts. Much of my valuable content appears on such pages. How can I make the robots index those pages?

(More specifically, Excite used to index my pages, but stopped suddenly after a few months, then started again, then stopped for good.)
Question by:Mastadon

Expert Comment

ID: 1863371
Use either a correctly set up robots.txt file or the robots META element.

Author Comment

ID: 1863372
I believe my robots.txt file and META tags are set up correctly... but I'll double-check. Before I realized this was a problem, I remember reading something off the web that gave specific commands to be placed in your page source... some way of tucking the CGI content out of the spiders' view?

Expert Comment

ID: 1863373
That would probably be the META element with the "robots" name:

<META NAME="robots" CONTENT="noindex,nofollow">

(which means the page should not be indexed and links not followed)
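For completeness, here is a minimal robots.txt sketch; the /cgi-bin/ path is only an example, not from the original post. The permissive counterpart of the META tag above would be CONTENT="index,follow".

```
# robots.txt, served from the site root (a sketch only; adjust the
# paths to your own site)
User-agent: *        # these rules apply to all robots
Disallow: /cgi-bin/  # keep spiders out of raw scripts
# An empty "Disallow:" field means the whole site may be crawled.
```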


Expert Comment

ID: 1863374
What exactly are robots?

Expert Comment

ID: 1863375
I do not know the exact definition, but I believe the general definition is: a program that goes to a specified site (URL) to index its content by following the links found on said site.

A whole bunch of information regarding search-engine robots can be found at http://www.searchenginewatch.com/

Author Comment

ID: 1863376
Adjusted points to 450

Accepted Solution

pru2 earned 1800 total points
ID: 1863377
The short answer is you can't "make" robots (also known as search engine spiders) index dynamic pages.

Each search engine has its own set of rules for what and how it indexes web pages. You can find a good example of this at:


Another quote from searchenginewatch:
-- begin quote --
Dynamic Doorblock: Generating pages via CGI or database-delivery? Expect that some of the search engines won't be able to index them. Consider creating static pages whenever possible, perhaps using the database to update the pages, not to generate them on the fly. Also, avoid symbols in your URLs, especially the ? symbol. Search engines tend to choke on it.
-- end quote --

More specifically, search engines decide whether to index a page based on:
1. The path of the URL. Some search engines won't index any pages that have /cgi-bin/ in the path.
2. The page's extension. All engines will index .htm and .html pages, and none will index the contents of a .gif. For other extensions like .cgi, .asp, .pl, .php3, and .cfm, it depends on the search engine.
3. Some engines may strip everything after a ? symbol in the URL. This allows dynamic pages to be indexed, while limiting the problem of crawling too deep into a database-backed site.

Regarding the Meta tag: this URL explains the <META name=robots ..> tag very well:

The default (if no such tag is present) is to index the page and follow links, so you can't really gain much by using it in your case.

Your best bet is to create static pages. If you have total control of the web server and you're the only one using it, you can do something like having .html pages be static HTML files and having .htm actually be CGI files (you'll have to change the MIME types in your server config file). That might fool some of the search engines.
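That extension trick can be sketched in Apache terms; the directive names are Apache's, and the directory path is hypothetical, not from the original post:

```
# httpd.conf sketch (Apache): .html stays an ordinary static file,
# but .htm files are run as CGI scripts, so spiders see a "static"
# extension on dynamically generated pages.
AddHandler cgi-script .htm

<Directory "/home/site/htdocs">
    Options +ExecCGI   # permit CGI execution in this directory
</Directory>
```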


I have been implementing and running a search engine that indexes 350,000 pages on 10,000 domains over the past few months.


Expert Comment

ID: 1863378
Another idea which should work: if your pages are dynamically created but do not take any parameters (no ? after the filename in the URL), you could put each CGI in its own directory, call each one something like index.cgi, and tell the server to look for index.cgi as the default document in those directories.
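Under Apache, that suggestion might look like the following sketch; the /products path and filenames are hypothetical examples:

```
# httpd.conf sketch (Apache): requesting /products/ runs
# /products/index.cgi, so the URL a spider sees has no script
# extension and no "?" at all.
DirectoryIndex index.cgi

<Directory "/home/site/htdocs/products">
    Options +ExecCGI
    AddHandler cgi-script .cgi
</Directory>
```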

Author Comment

ID: 1863379
Very informative with lots of suggestions.  I'll have to experiment with some of your suggestions.

I think we should try to open a "Web Promotion/Position" topic area on EE that could include SE discussions.

