
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 581

Preventing robots from crawling through search results

Hi,
I am working on a real-estate website (similar to http://www.trulia.com/), and I do not want people scraping the details of all the properties available on the site.

Therefore, I thought of storing an additional 128-bit MD5 key in the database and not exposing an incremental ID (i.e. www.site.com/property?id=d41d8cd98f00b204e9800998ecf8427e instead of www.site.com/property?id=50). This way it would be virtually impossible to iterate through all the permutations and capture the lot.
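Roughly what I have in mind (just a rough sketch; the table and column names and the $pdo connection are placeholders):

<?php
// Generate a random 32-character hex key that is NOT derived from the row ID.
$publicKey = md5(uniqid(mt_rand(), true));

// Store it alongside the property (placeholder table/column names).
$stmt = $pdo->prepare('INSERT INTO properties (title, price, public_key) VALUES (?, ?, ?)');
$stmt->execute(array($title, $price, $publicKey));

// The detail page would then look the row up by the key instead of the numeric ID:
// SELECT * FROM properties WHERE public_key = ?
?>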

However, the problem I am currently facing is: what would prevent someone from collecting the whole list of MD5 keys from the search results pages? A robot can crawl through all the pages returned by a broad search, and all the keys would then be visible in the links to the detail pages.

Does that constitute a risk? How can it be avoided?

Thanks
davidbayonchenAsked:
2 Solutions
 
IronhoofsCommented:
Well-behaved robots can be instructed to index only parts of your website by using robots.txt. For more information see http://www.robotstxt.org/robotstxt.html
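A minimal example, assuming your search result pages live under a /search path (adjust to your own URL scheme):

# keep well-behaved crawlers out of the search result pages
User-agent: *
Disallow: /search

Keep in mind that robots.txt is purely advisory: malicious robots simply ignore it.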

To thwart malicious robots, you could use human verification like a CAPTCHA (http://www.google.com/recaptcha/captcha) or hide/encrypt the search-result URLs with JavaScript.
 
davidbayonchenAuthor Commented:
Thanks for that.
How would you hide/encrypt the search results with JavaScript or jQuery?
 
davidbayonchenAuthor Commented:
Also, is it a good idea to use a 32-character key instead of an ID?
 
Ess KayEntrapenuerCommented:
I like your idea of a key instead of an ID.

You can stop robots with .htaccess, meta tags, and robots.txt.


Blocking bad bots and site rippers (aka offline browsers):
http://www.javascriptkit.com/howto/htaccess13.shtml

Hide and seek with robots:
http://3n9.org/articles/hide-and-seek-with-robots.html
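As a rough sketch along the lines of the first article (the user-agent names below are only examples, not a complete list):

RewriteEngine On
# refuse requests from a few known site rippers, matched by user-agent
RewriteCond %{HTTP_USER_AGENT} ^(HTTrack|WebCopier|WebZIP) [NC]
RewriteRule .* - [F,L]

The meta-tag equivalent for well-behaved robots is <meta name="robots" content="noindex,nofollow"> in the page head.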
 
Ess KayEntrapenuerCommented:
http://3n9.org/articles/content-links-hiding-techniques.html

More on hiding

Other content/link hiding techniques

Here I'll present some quick examples of hiding techniques that didn't fit into previous chapters.

1x1 pixel images/transparent images

This technique is pretty simple: a webmaster adds a very small image to a web page and uses the image's alt text instead of the usual anchor text. Since the image is very small or transparent, it is almost impossible for humans to spot such a link. This method was very popular some time ago, but search engine robots are getting smarter: since it is easy to check an image's dimensions, a very small (1x1 pixel) image raises a suspicion flag.
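For example (the image and page names are placeholders):

<!-- the 1x1 gif is effectively invisible; the alt text plays the role of anchor text -->
<a href="/hidden-page.html"><img src="/img/dot.gif" width="1" height="1" alt="hidden anchor text" /></a>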

Frames/iframes

This technique is more effective when used in combination with some CSS rules that make the frame appear to be a consistent part of the web page. robots.txt rules can also be used to tell robots not to index the page inside the iframe.
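For example (file names are placeholders):

<!-- the framed page carries the links; CSS makes the frame blend into the layout -->
<iframe src="/private/links.html" style="border:0;width:100%;height:300px;"></iframe>

with a matching Disallow: /private/ rule in robots.txt so that compliant robots skip the framed page.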

Noscript tag

This tag is designed to hold alternative content which is shown to users who have JavaScript turned off. Since almost every user has JavaScript enabled, the contents of the noscript tag remain invisible to them.
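For example (the hidden page name is a placeholder):

<script type="text/javascript">
  document.write('Content shown to the (JavaScript-enabled) human visitor.');
</script>
<noscript>
  <!-- practically no human ever sees this, but robots read it -->
  <a href="/hidden-page.html">hidden link</a>
</noscript>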

Cloaking

Cloaking is almost always considered a black-hat SEO technique. It is a method where different versions of the content are presented to human visitors and to search engine robots.

Usually it works like this: a server-side script tries to determine whether the user requesting the page is a robot or a human, either by checking the user's IP address or the HTTP User-Agent header (or both). User agents in the HTTP headers are identified in much the same way as in robots.txt rules.
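In its crudest form it could look like this (for illustration only; the crawler names and file names are placeholders, and user-agent checks are unreliable, as the tip below explains):

<?php
$ua        = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isCrawler = (bool) preg_match('/googlebot|bingbot|slurp/i', $ua);

if ($isCrawler) {
    include 'page-for-robots.html';    // version served to search engine robots
} else {
    include 'page-for-humans.html';    // version served to human visitors
}
?>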

A tip: I have seen the following PHP code suggested somewhere as a reliable robot-detection algorithm:

$is_robot = (strpos($_SERVER['HTTP_USER_AGENT'], '+http') !== false); // robot if the UA string contains a "+http..." home URL
It is based on the presumption that all search engine robots include their home URL in their User-Agent string. This is absolutely false: many robots do not specify their home URLs. To learn more about User-Agent identifiers you can simply look at your server logs; also do not forget to check the robots.txt chapter if you skipped it.

More advanced client detection can also be based on analysing the client's behaviour over a few page requests.
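A minimal sketch of what such behaviour analysis could look like, assuming the client keeps its session cookie (many scrapers do not, so a database or IP-based store would be more robust); the window and limit values are arbitrary:

<?php
session_start();

$now    = time();
$window = 60;   // seconds
$limit  = 30;   // page views per window we still consider "human"

if (!isset($_SESSION['hits'])) {
    $_SESSION['hits'] = array();
}
// forget page views that fell outside the window
foreach ($_SESSION['hits'] as $key => $timestamp) {
    if ($timestamp < $now - $window) {
        unset($_SESSION['hits'][$key]);
    }
}
$_SESSION['hits'][] = $now;

if (count($_SESSION['hits']) > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('You are requesting pages too quickly.');
}
?>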

Flash/Java applets

Flash was considered terra incognita for search engine spiders for quite a long time, but recently Google announced that it already indexes some parts of Flash (and I believe its Flash-reading skills will improve in the future). So this technique cannot be considered very reliable.

A similar alternative is Java applets. Search engines do not index applet content yet, and I cannot be sure about the future. However, it is fairly easy to extract information from Java applets, so they should be used with care.

Robots-nocontent class

In May 2007 Yahoo introduced the robots-nocontent class, which is meant to hide any part of a page from the Yahoo robot. The class can be applied to any HTML element, like:

<div class="robots-nocontent">This part of the page is unimportant for Slurp.</div>
Yahoo's robot (Slurp) reads this as: the content of the element marked with this class is unimportant. The biggest downside of this technique is that it works only with Yahoo, so it is not very popular among SEO webmasters.

Final words

I have tried to sketch the most popular techniques for hiding content and links. However, many other methods can be found online, and you can also invent your own. The techniques described above are more powerful when used together (e.g. JavaScript + CSS + robots.txt).

In any case, I do not advocate content/link hiding when it is used to manipulate search engine rankings in an unethical way. Still, if you are doing it for a good reason, I hope this article gives you a clue about how to hide your secrets properly so that you do not get penalized accidentally. It should also help you spot unethical operators and steer clear of them.
 
IronhoofsCommented:
Just remember that this is an attempt to outsmart malicious robots / data miners while maintaining good usability. Robots get smarter and may defeat your defences, or you may have to stop because the solution is more damaging than the problem.

One assumption is that robots cannot evaluate JavaScript and therefore will not be able to find URLs that are encrypted and inserted that way.

A good example can be found at: http://hivelogic.com/enkoder/
I found some source code for this at https://bitbucket.org/danbenjamin/enkoder-rails/src/405db349d604/lib/enkoder.rb but there are other (PHP) adaptations.

Another option is to use a navigation form: your links update its hidden parameters with JavaScript and then submit the form. Example:

<form name="navFrm" action="http://yoursite.org" method="post">
  <input type="hidden" name="uid" value="" />
</form>

<a href="#" onClick="document.navFrm.uid.value='ABCD123'; document.navFrm.submit();">Item ABCD123</a>



You can add encryption and other nifty tricks, but in the end all information is in your HTML and you use smoke and mirrors to fool the robot.
 
davidbayonchenAuthor Commented:
@Ironhoofs: A robot cannot see the onClick="document.navFrm.uid.value='ABCD123';", but a page scraper can pick it up. Am I right?

I am still wondering whether I should go ahead with the 32-character keys...
 
IronhoofsCommented:
A robot / page scraper will pick up any human-readable URL or email address, like the URL you put in the <FORM> tag.

The robot will also parse the onClick event, but because most cannot execute the script they discard it. However, smarter robots could make an educated guess about the parameter and value from the JavaScript. That's why the Hivelogic Enkoder obscures the data.

I have witnessed malicious robots submitting data after they parsed the page for forms. Therefore, using large non-sequential IDs can be another small step towards hiding your content from unwanted eyes.

But in the end, you have to decide how much trouble you are willing to go through and whether the solution is chasing your visitors away...
