
Preventing robots from scrolling through search results

Medium Priority
Last Modified: 2012-03-17
Working on a Real Estate website (similar to http://www.trulia.com/), I do not want people to be scraping the details of all the properties available on the site.

Therefore, I thought of having an additional "128-bit MD5 key" in the database instead of an incremental ID. (i.e. www.site.com/property?id=d41d8cd98f00b204e9800998ecf8427e instead of www.site.com/property?id=50). This way, it would be virtually impossible to walk through all the permutations and capture the lot.

However, the problem I am currently facing is : what would prevent someone from getting the whole list of MD5 keys using the search results page ? (a robot can crawl through all the pages resulting of a wide search) All the IDs would then be visible in the search results page (on the link to the detailed page).

Does that constitute a risk ? How can this be avoided ?


Well-behaved robots can be instructed to index only parts of your website by using robots.txt. For more information see http://www.robotstxt.org/robotstxt.html
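For example, a robots.txt file at the site root could keep compliant crawlers out of the search-results pages (the path is illustrative):

```
User-agent: *
Disallow: /search
```

Keep in mind this is advisory only: malicious robots simply ignore it.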

To thwart malicious robots, you could use human verification like a CAPTCHA (http://www.google.com/recaptcha/captcha) or hide/encrypt the search-result URLs with JavaScript.


Thanks for that.
How would you hide / encrypt the search results w/ javascript or jQuery ?


Also, is it a good idea to use a 32-character key instead of an ID?
Ess KayEntrapenuer
I like your idea of a key instead of an ID.

You can stop robots with .htaccess, meta tags, and robots.txt.

Blocking bad bots and site rippers (aka offline browsers)
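As a sketch of the .htaccess approach: an Apache mod_rewrite rule can refuse requests from a few well-known site rippers by User-Agent (the list here is illustrative, far from complete, and trivially spoofed):

```
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|WebZIP) [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns 403 Forbidden; [NC] makes the match case-insensitive.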

Ess KayEntrapenuer


More on hiding

Other content/link hiding techniques

Here I'll present some quick examples of hiding techniques that didn't fit into previous chapters.

1x1 pixel images/transparent images

This technique is pretty simple: a webmaster adds a very small image to a web page and uses the image's alt text in place of the usual anchor text. Since the image is tiny or transparent, such a link is almost impossible for humans to spot. This method was very popular some time ago, but search engine robots have become smarter: since it is easy to determine an image's size, a very small (1x1 pixel) image now raises a suspicion flag.
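For illustration, such a hidden link looks like this (file names and paths are made up):

```html
<a href="/hidden-page">
  <img src="/img/pixel.gif" width="1" height="1"
       alt="anchor text carried in the alt attribute" />
</a>
```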


Iframes

This technique is more effective when combined with CSS rules that make the frame appear as a consistent part of the web page. robots.txt rules can also be used to tell robots not to index the page loaded in the iframe.

Noscript tag

This tag is designed to hold alternative content shown to users who have JavaScript turned off. Since almost every user has JavaScript on, the contents of the noscript tag remain invisible to them.
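A minimal example: the content inside noscript is only rendered when scripting is off, so nearly all human visitors never see it, while simple robots that read the raw HTML do (the link target is illustrative):

```html
<noscript>
  <!-- Rendered only when JavaScript is disabled -->
  <a href="/property?id=50">This link is invisible to almost all human visitors</a>
</noscript>
```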


Cloaking

Cloaking is a method where different content versions are presented to human visitors and to search engine robots. It is almost always considered a black-hat SEO technique.

Usually it works this way: a server-side script tries to determine whether the user requesting the page is a robot or a human, either by checking the user's IP address or the HTTP User-Agent header string (or both). User agents in HTTP headers are identified in much the same way as in robots.txt rules.

A tip: I've seen the following PHP code suggested somewhere as a reliable robot detection algorithm:

$is_robot = (strpos($_SERVER['HTTP_USER_AGENT'], '+http') === false);
It is based on the presumption that all search engine robots leave their home URL in their User-agent string. This is simply false: many robots do not specify their home URLs. To learn more about User-agent identifiers you can look at your server logs; also, do not forget to check the robots.txt chapter if you skipped it.
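To see why the '+http' presumption fails, compare it against a couple of real-world User-Agent strings (a sketch; the sample strings are illustrative):

```javascript
// The naive check from the tip above, ported to JavaScript: it assumes
// every robot advertises its home URL in the User-Agent header.
function hasHomeUrl(userAgent) {
  return userAgent.indexOf('+http') !== -1;
}

// Googlebot passes the check, but plenty of automated clients do not.
var googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
var scraper = 'curl/7.68.0'; // no home URL, yet clearly not a browser
```

A client like curl would be classified as human by this rule, which is exactly the kind of false negative the paragraph above warns about.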

More advanced client detection techniques can also be based on the client's behavior analysis after a few page requests.

Flash/Java applets

Flash was considered terra incognita for search engine spiders for quite a long time, but Google has since announced that it already indexes some parts of Flash (and I believe its Flash-reading skills will improve in the future). So this technique cannot be considered very reliable.

A similar alternative is Java applets. Search engines do not index applet content yet, though I cannot be sure about the future. However, it is fairly easy to extract information from Java bytecode (e.g. by decompiling it), so Java applets should be used with care.

Robots-nocontent class

In May 2007 Yahoo introduced the robots-nocontent class, which is meant to hide any part of a page from the Yahoo robot. The class can be applied to any HTML element, like:

<div class="robots-nocontent">This content is unimportant to Yahoo's crawler.</div>
The Yahoo robot (Slurp) reads this as: the content in the div marked with this class is unimportant. The biggest downside of this technique is that it works only with Yahoo, so it is not very popular among SEO webmasters.

Final words

I tried to sketch the most popular techniques of content/link hiding. However, many other methods can be found online, and you can also invent your own. The techniques described above are more powerful when used together (e.g., JavaScript + CSS + robots.txt).

In any case, I do not advocate content/link hiding when it is used to manipulate search engine rankings unethically. Still, if you do this for a good cause, I hope this article gives you a clue on how to hide your secrets properly so as not to get penalized accidentally. It should also help you spot unethical operators and steer clear of them.
Just remember that this is an attempt to outsmart malicious robots / data miners while maintaining good usability. Robots get smarter and may defeat your defenses, or you may have to stop because the solution is more damaging than the problem.

One assumption is that robots can't evaluate JavaScript and therefore won't be able to find URLs that are encrypted and inserted that way.

A good example can be found at: http://hivelogic.com/enkoder/
I found some source code for this at https://bitbucket.org/danbenjamin/enkoder-rails/src/405db349d604/lib/enkoder.rb but there are other (PHP) adaptations.
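The core idea behind the Enkoder is that the real URL never appears verbatim in the HTML; a script reassembles it when the page loads. A toy sketch of the same principle (the encoding here is deliberately trivial and not taken from enkoder.rb, which layers several transformations):

```javascript
// Split a URL into chunks and reverse each one, so the plain string
// never appears anywhere in the page source.
function encodeUrl(url, chunkSize) {
  var parts = [];
  for (var i = 0; i < url.length; i += chunkSize) {
    parts.push(url.slice(i, i + chunkSize).split('').reverse().join(''));
  }
  return parts;
}

// At page load, reverse each chunk back and join them in order;
// the result can then be written into the DOM as a link.
function decodeUrl(parts) {
  return parts.map(function (p) {
    return p.split('').reverse().join('');
  }).join('');
}
```

A scraper doing a plain regex search for URLs in the HTML finds nothing, while any JavaScript-capable browser reconstructs the link instantly.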

Another option is to use a navigation form: update its parameters with JavaScript in your links and post the form. Example:

<form name="navFrm" action="http://yoursite.org" method="post">
  <input type="hidden" name="uid" value="" />
  <a href="#" onclick="document.navFrm.uid.value='ABCD123'; document.navFrm.submit(); return false;">Item ABCD123</a>
</form>


You can add encryption and other nifty tricks, but in the end all information is in your HTML and you use smoke and mirrors to fool the robot.


@ Ironhoofs : A robot cannot see the onclick="document.navFrm.uid.value='ABCD123';", but a page scraper can pick it up. Am I right?

I am still wondering whether I should go ahead with the 32-character IDs...
A robot / page scraper will pick up any human readable URL or email address, like the URL you put in the <FORM> tag.

The robot will also parse the onclick event, but because most robots can't execute the script, they discard it. However, smarter robots could make an educated guess about the parameter and value from the JavaScript. That's why the Hivelogic Enkoder obscures the data.

I have witnessed malicious robots submitting data after parsing the page for forms. Therefore, using large non-sequential IDs can be another small step towards hiding your content from unwanted eyes.

But in the end, you have to decide how much trouble you are willing to go through, and whether the solution is chasing your visitors away...
