

Web Robots: Indexing of CGI scripted pages

Posted on 1999-06-21
Medium Priority
Last Modified: 2013-12-25
I have read that robots either ignore or have difficulty indexing pages "created on the fly" through a CGI script.
Much of my valuable content appears on such pages. How can I make the robots index those pages?

(More specifically, Excite used to index my pages, but stopped suddenly after a few months, then started again, then stopped for good...)
Question by:Mastadon

Expert Comment

ID: 1863371
Use either a correctly set up robots.txt file or the robots META element.
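For reference, a minimal robots.txt sketch (served from the root of the site); the paths here are made up for illustration. Note that a Disallow line covering your CGI directory would itself keep spiders away from those pages:

```
# robots.txt -- must live at the top level of the site
# (hypothetical paths; adjust to your own layout)
User-agent: *          # this record applies to all robots
Disallow: /private/    # keep spiders out of this directory only
# a line like "Disallow: /cgi-bin/" would block all CGI output
```

An empty `Disallow:` value (or no matching record at all) means the robot may fetch everything.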

Author Comment

ID: 1863372
I believe my robots.txt file and META tags are set up correctly... but I'll double check. Before I realized this was a problem, I remember reading something off the web that gave specific commands to be placed in your page source... some way of tucking the CGI content out of the spiders' view?

Expert Comment

ID: 1863373
that would probably be the META element with the "robots" name:

<META NAME="robots" CONTENT="noindex,nofollow">

(which means the page should not be indexed and links not followed)
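For completeness, the opposite setting explicitly invites indexing; a minimal sketch of where the tag goes in a page (a hypothetical example document):

```html
<!-- placed inside the document's HEAD; "index,follow" is also what
     robots assume when no robots META tag is present at all -->
<HEAD>
<TITLE>Example page</TITLE>
<META NAME="robots" CONTENT="index,follow">
</HEAD>
```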

Expert Comment

ID: 1863374
what exactly are robots?

Expert Comment

ID: 1863375
I do not know the exact definition, but I believe the general definition is: a program that visits a specified site (URL) and indexes its content by following the links found on that site.

A whole bunch of information regarding search engine robots can be found at:

Author Comment

ID: 1863376
Adjusted points to 450

Accepted Solution

pru2 earned 1800 total points
ID: 1863377
The short answer is you can't "make" robots (also known as search engine spiders) index dynamic pages.

Each search engine has its own set of rules for what it indexes and how. You can find a good example of this at:

Another quote from searchenginewatch:
-- begin quote --
Dynamic Doorblock: Generating pages via CGI or database-delivery? Expect that some of the search engines won't be able to index them. Consider creating static pages whenever possible, perhaps using the database to update the pages, not to generate them on the fly. Also, avoid symbols in your URLs, especially the ? symbol. Search engines tend to choke on it.
-- end quote --
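The quoted advice about using the database to update static pages, rather than generate them on the fly, can be sketched roughly like this (a hypothetical script; the table name, columns, and file layout are all made up for illustration):

```python
# generate_static.py -- regenerate plain .html files from database
# records, so spiders see ordinary static pages instead of CGI output.
# Run it whenever the database changes (e.g. from cron).
import sqlite3

TEMPLATE = """<html>
<head><title>{title}</title></head>
<body><h1>{title}</h1><p>{body}</p></body>
</html>
"""

def generate_pages(db_path, out_dir="."):
    """Write one static HTML file per row of a hypothetical
    'articles' table and return the list of paths written."""
    conn = sqlite3.connect(db_path)
    written = []
    for rowid, title, body in conn.execute(
            "SELECT id, title, body FROM articles"):
        # each record becomes one crawlable file, e.g. article-7.html
        path = "%s/article-%d.html" % (out_dir, rowid)
        with open(path, "w") as f:
            f.write(TEMPLATE.format(title=title, body=body))
        written.append(path)
    conn.close()
    return written
```

The URLs the spider then sees are plain .html files with no ? in them, which sidesteps both problems the quote describes.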

More specifically, search engines decide whether to index a page based on:
1. The path of the URL. Some search engines won't index any pages that have /cgi-bin/ in the path.
2. The page's extension. All engines will index .htm and .html pages, and won't index the contents of a .gif. Other extensions like .cgi, .asp, .pl, .php3, and .cfm: it depends on the search engine.
3. Some engines might strip everything after a ? symbol in the URL. This allows dynamic pages to be indexed, but limits the problem of crawling too deep into a database-backed site.

Regarding the META tag: this URL explains the <META NAME="robots" ...> tag very well:

The default (if no such tag is present) is to index the page and follow links, so you can't really gain much by using it in your case.

Your best bet is to create static pages. If you have total control of the web server and you're the only one using it, you can do something like having .html pages be static HTML files and having .htm actually be CGI files (you'll have to change the MIME types in your server config file). That might fool some of the search engines.
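On Apache, the .htm-as-CGI trick could be sketched like this in the server config (the directory path is hypothetical, and the exact directives depend on your server and version):

```
# httpd.conf sketch: serve .htm files as CGI scripts while
# .html files remain ordinary static pages
AddHandler cgi-script .htm
<Directory "/var/www/html">
    Options +ExecCGI
</Directory>
```

A spider requesting a .htm URL then sees nothing that marks the page as dynamic.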


For the past few months I have been implementing and running a search engine that indexes 350,000 pages on 10,000 domains.


Expert Comment

ID: 1863378
Another idea that should work:
If your pages are dynamically created but do not take any parameters (no ? after the filename in the URL), you could put each CGI in its own directory, call them something like index.cgi, and tell the server to look for index.cgi as the default document in those directories.
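On Apache, that default-document setup might look like this (a sketch for a per-directory .htaccess file; directive availability depends on your server's AllowOverride settings):

```
# .htaccess sketch: serve the CGI as the directory's default
# document, so the public URL ends in a plain "/"
DirectoryIndex index.cgi
Options +ExecCGI
AddHandler cgi-script .cgi
```

The spider then only ever sees URLs like /products/, with no script name or ? in them at all.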

Author Comment

ID: 1863379
Very informative, with lots of suggestions. I'll have to experiment with some of them.

I think we should try to open a "Web Promotion/Position" topic area on EE that could include SE discussions.
