Solved

Preventing robots scrolling through search results

Posted on 2012-03-14
521 Views
Last Modified: 2012-03-17
Hi,
I am working on a real estate website (similar to http://www.trulia.com/) and I do not want people scraping the details of all the properties available on the site.

Therefore, I thought of storing an additional 128-bit MD5 key in the database and not using an incremental ID (i.e. www.site.com/property?id=d41d8cd98f00b204e9800998ecf8427e instead of www.site.com/property?id=50). That way it would be virtually impossible to iterate through all the permutations and capture the lot.

However, the problem I am currently facing is: what would prevent someone from getting the whole list of MD5 keys from the search results page? A robot could crawl through all the pages returned by a wide search, and every key would then be visible in the links to the detail pages.

Does that constitute a risk? How can it be avoided?

Thanks
Question by:davidbayonchen
8 Comments
 
LVL 7

Expert Comment

by:Ironhoofs
ID: 37719604
Well-behaved robots can be instructed to index only parts of your website by using robots.txt. For more information see http://www.robotstxt.org/robotstxt.html

To thwart malicious robots, you could use human verification such as a CAPTCHA (http://www.google.com/recaptcha/captcha) or hide/encrypt the search result URLs with JavaScript.
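For reference, a minimal robots.txt sketch; the /property and /search paths are only placeholders for whatever URLs the site actually uses, and only well-behaved crawlers will honour it:

# robots.txt - honoured by well-behaved crawlers only; scrapers simply ignore it
User-agent: *
Disallow: /property
Disallow: /search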
 

Author Comment

by:davidbayonchen
ID: 37719617
Thanks for that.
How would you hide/encrypt the search result URLs with JavaScript or jQuery?
 

Author Comment

by:davidbayonchen
ID: 37719633
Also, is it a good idea to use a 32-character key instead of a numeric ID?
 
LVL 15

Assisted Solution

by:Ess Kay
Ess Kay earned 150 total points
ID: 37720157
I like your idea of a key instead of an ID.

You can stop robots with .htaccess, meta tags, and robots.txt; see the links and the sketch below.


Blocking bad bots and site rippers (aka offline browsers)
http://www.javascriptkit.com/howto/htaccess13.shtml



http://3n9.org/articles/hide-and-seek-with-robots.html
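To illustrate the .htaccess approach from the first link, a minimal sketch that rejects requests from a few known site rippers. The User-agent names are examples only, and a determined scraper can defeat this simply by changing its User-agent string:

# .htaccess sketch: refuse requests from a few well-known site rippers
# (the agent list is illustrative; scrapers can spoof any User-agent they like)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|WebZIP|Wget) [NC]
RewriteRule .* - [F,L]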
 
LVL 15

Expert Comment

by:Ess Kay
ID: 37720293
http://3n9.org/articles/content-links-hiding-techniques.html

More on hiding

Other content/link hiding techniques

Here I'll present some quick examples of hiding techniques that didn't fit into previous chapters.

1x1 pixel images/transparent images

This technique is pretty simple: a webmaster adds a very small image to a web page and uses the image's alt text instead of the usual anchor text. Since the image is tiny or transparent, it is almost impossible for humans to spot such a link. This method was very popular some time ago, but search engine robots have become smarter: since it is easy to read an image's dimensions, a very small (1x1 pixel) image raises a suspicion flag.
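A sketch of what such a link can look like; the URL, image path and alt text are placeholders:

<a href="/property?id=d41d8cd98f00b204e9800998ecf8427e">
  <!-- 1x1 transparent image; its alt text takes the place of visible anchor text -->
  <img src="/img/pixel.gif" width="1" height="1" alt="3-bedroom house, downtown" />
</a>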

Frames/iframes

This technique is more effective when used in combination with CSS rules that make the frame appear as a consistent part of the web page. robots.txt rules can also be used to tell robots not to index the page loaded in the iframe.
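A sketch of that combination (paths are placeholders): the framed page is disallowed in robots.txt, and the frame is styled to blend into the parent page:

<!-- parent page: the iframe is styled so it looks like part of the layout -->
<iframe src="/framed/results.html" style="border:none; width:100%; height:600px;"></iframe>

# robots.txt on the same site
User-agent: *
Disallow: /framed/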

Noscript tag

This tag is designed to hold alternative content that is shown to users who have JavaScript turned off. Since almost every user has JavaScript turned on, the contents of the noscript tag remain invisible to them.
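A sketch of the idea; the link is a placeholder, and it stays hidden from any visitor whose browser has JavaScript enabled:

<noscript>
  <!-- invisible to visitors with JavaScript on, but present in the raw HTML -->
  <a href="/property?id=d41d8cd98f00b204e9800998ecf8427e">Hidden listing link</a>
</noscript>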

Cloaking

Cloaking is almost always considered a black hat SEO technique. It is a method where different versions of the content are presented to human visitors and to search engine robots.

Usually it works in this way: a server-side script tries to determine whether the user requesting the page is a robot or a human, either by checking the user's IP address or the HTTP User-agent header (or both). User agents in HTTP headers are identified in a similar way as in robots.txt rules.

A tip: I have seen the following PHP code suggested somewhere as a reliable robot detection algorithm:

$is_robot = (strpos($_SERVER['HTTP_USER_AGENT'], '+http') === false);
It is based on the presumption that all search engine robots include their home URL in their User-agent string. This is absolutely false: many robots do not specify their home URL. To learn more about User-agent identifiers you can simply look at your server logs; also do not forget to check the robots.txt chapter if you skipped it.

More advanced client detection techniques can also be based on the client's behavior analysis after a few page requests.
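For illustration, a minimal PHP sketch of the User-agent approach described above; the robot names are examples only, and as noted, any detection based on the User-agent header is easy to fool:

<?php
// Naive cloaking check: compare the User-agent header against a short,
// illustrative list of crawler names. Trivial for a scraper to spoof.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$known_robots = array('Googlebot', 'Bingbot', 'Slurp');

$is_robot = false;
foreach ($known_robots as $robot) {
    if (stripos($agent, $robot) !== false) {
        $is_robot = true;
        break;
    }
}

if ($is_robot) {
    // serve the version meant for search engines
} else {
    // serve the version meant for human visitors
}
?>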

Flash/Java applets

Flash was considered terra incognita for search engine spiders for quite a long time, but Google recently announced that it already indexes some parts of Flash (and I believe its Flash-reading skills will improve in the future). So this technique cannot be considered a very reliable one.

A similar alternative technique is Java applets. Search engines do not index applet content yet, and I cannot be sure about the future. However, it is very easy to extract information from Java applets, so they should be used with care.

Robots-nocontent class

In May 2007 Yahoo introduced the robots-nocontent HTML class, which is meant to hide any part of a page from the Yahoo robot. It can be used on any HTML element, like:

<div class="robots-nocontent">
The Yahoo robot (Slurp) reads this as: the content of the div marked with this class is unimportant. The biggest downside of this technique is that it only works with Yahoo, so it is not very popular among SEO webmasters.

Final words

I have tried to sketch the most popular content/link hiding techniques. However, many other methods can be found online, and you can also invent your own. The techniques described above are more powerful when used together (e.g. JavaScript + CSS + robots.txt).

In any case, I do not advocate content/link hiding when it is used to manipulate search engine rankings in an unethical way. Still, if you are doing it for a legitimate reason, I hope this article gives you a clue on how to hide your secrets properly without getting penalized accidentally. It should also help you spot unethical operators and steer clear of them.
 
LVL 7

Expert Comment

by:Ironhoofs
ID: 37724120
Just remember that this is an attempt to outsmart malicious robots/data miners while maintaining good usability. Robots keep getting smarter and may defeat your defenses, or you may have to stop because the solution becomes more damaging than the problem.

One assumption is that robots can't evaluate JavaScript and therefore won't be able to find URLs that are encrypted and inserted that way.

A good example can be found at: http://hivelogic.com/enkoder/
I found some source code for this at https://bitbucket.org/danbenjamin/enkoder-rails/src/405db349d604/lib/enkoder.rb but there are other (PHP) adaptations.

Another option is to use a navigation form, have your links update its parameters with JavaScript, and post the form. Example:

<form name="navFrm" action="http://yoursite.org" method="post">
  <input type="hidden" name="uid" value="" />
</form>

<a href="#" onClick="document.navFrm.uid.value='ABCD123'; document.navFrm.submit();">Item ABCD123</a>



You can add encryption and other nifty tricks, but in the end all information is in your HTML and you use smoke and mirrors to fool the robot.
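To give a concrete idea of that JavaScript hiding, a deliberately simple sketch (much weaker than the Hivelogic Enkoder, which layers several encodings); the id is a placeholder, and the point is only that the finished URL never appears as plain text in the HTML:

<script type="text/javascript">
  // The URL is stored reversed; browsers rebuild it at load time,
  // while a scraper that only reads the raw HTML never sees it assembled.
  var reversed = '321DCBA=di?ytreporp/';
  var url = reversed.split('').reverse().join('');
  document.write('<a href="' + url + '">View property ABCD123</a>');
</script>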
 

Author Comment

by:davidbayonchen
ID: 37724385
@Ironhoofs: A robot cannot see the onclick="document.navFrm.uid.value='ABCD123';" but a page scraper can pick it up. Am I right?

I am still wondering whether I should go ahead with the 32-character IDs...
 
LVL 7

Accepted Solution

by:Ironhoofs
Ironhoofs earned 350 total points
ID: 37724525
A robot/page scraper will pick up any human-readable URL or email address, like the URL you put in the <form> tag.

The robot will also parse the onclick event, but because most robots can't execute the script, they discard it. However, smarter robots could make an educated guess about the parameter and value from the JavaScript. That's why the Hivelogic Enkoder obscures the data.

I have witnessed malicious robots submitting data after they parsed the page for forms. Therefore, using large non-sequential IDs can be another small step towards hiding your content from unwanted eyes.

But in the end, you have to decide how much trouble you are willing to go through and whether the solution is chasing your visitors away...
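On the recurring question of the 32-character key: a minimal PHP sketch (assuming the site runs PHP, as the thread's other snippets do) of generating a random key per property instead of exposing the auto-increment ID. Note that md5() of the numeric ID itself would be a poor choice, since anyone could recompute it from guessed IDs:

<?php
// Generate a random 32-character hexadecimal key for a new listing.
// Because it is derived from random data, it cannot be enumerated the way
// sequential ids (or md5 hashes of those ids) could be.
function generate_property_key() {
    return md5(uniqid(mt_rand(), true));
}

$key = generate_property_key();
// Store $key next to the numeric primary key, e.g.
// INSERT INTO properties (property_key, ...) VALUES ('$key', ...)
// and link to the detail page as /property?id=<property_key>
?>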