rafaelrgl

asked on 

how to stop spider software from getting my website content

Ok guys, here is the problem: I know there are lots of spider programs out there that visit your website, open every page, and grab your content. A hacker who wants your content can use that kind of program to take it with no trouble.

So how can we stop this? When those programs scrape a website they open one page after another, so maybe we could have a function that checks the rate at which a user opens pages. Say anything faster than 1 page every 5 seconds is a spider, and we redirect that user to a page where he has to type an authentication code from an image. The point is: what is the best way to stop this without losing performance, while still letting crawlers like Google through so we can have our website indexed, and blocking the unwanted crawlers?

What do you guys say?
.NET Programming · ASP.NET · C#

firstheartland

Use a robots.txt file in your main folder. Here is an article:
http://www.robotstxt.org/robotstxt.html

If you want to step up security, you can put an incoming proxy such as a Barracuda web application firewall in front of the site, but there will always be a performance hit. There may be a more specific answer, such as using Drupal with login and permissions/taxonomy to segment what is allowed for anonymous users and what is restricted to authenticated ones, but I would need more specific information.
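For reference, a minimal robots.txt along the lines that article describes; the paths and bot name below are made-up examples, not from this thread:

```
# Allow well-behaved crawlers everywhere except one folder
User-agent: *
Disallow: /private/

# Refuse one specific crawler entirely (name is illustrative)
User-agent: BadBot
Disallow: /
```

Note that robots.txt is purely advisory: polite crawlers like Googlebot honor it, but a scraper is free to ignore it, which is exactly the objection raised next in the thread.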
rafaelrgl

ASKER

robots.txt only blocks spiders that check for it first, but there is software that opens a page, follows all its links, and so on, reading the content as it goes. So robots.txt does not work in this case; it needs to be something like I said. A proxy does not work either, since those programs behave like users: they click links and read the content. The one thing that gives them away is that they do it fast, so they can read all your content with no problem.

So, let's say we write some code for this. Is there a function we can add to our aspx pages that counts the rate of pages opened per second and then redirects a user to an authentication page where he has to type the code from a security image? If he types the right words from the image he can continue opening pages, but if he hits the limit again he has to type the code again. I see this as the only way to block software from copying our content.
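A rough sketch of that page-rate check, kept as a plain class so the throttling logic stays separate from ASP.NET. The class name, the threshold, and the Blocked flag are my assumptions for illustration, not anything established in this thread:

```csharp
using System;

// Hypothetical helper: tracks the time of the last page view for one session
// and reports whether the current request arrives "too fast".
public class PageRateLimiter
{
    private readonly TimeSpan _minInterval;
    private DateTime _lastRequestUtc = DateTime.MinValue;

    // Stays true until the CAPTCHA page clears it.
    public bool Blocked { get; set; }

    public PageRateLimiter(TimeSpan minInterval)
    {
        _minInterval = minInterval;
    }

    // Returns true when the caller should be sent to the CAPTCHA page.
    public bool ShouldChallenge(DateTime nowUtc)
    {
        if (Blocked) return true;
        bool tooFast = (nowUtc - _lastRequestUtc) < _minInterval;
        _lastRequestUtc = nowUtc;
        if (tooFast) Blocked = true;
        return tooFast;
    }
}
```

In an ASPX page you would keep one instance per visitor in Session and call Response.Redirect("auth.aspx") whenever ShouldChallenge returns true; solving the CAPTCHA sets Blocked back to false.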
rafaelrgl

ASKER

I am not talking about restricting content behind a password login; I am talking about regular pages being accessed publicly.
ASKER CERTIFIED SOLUTION
HooKooDooKu

rafaelrgl

ASKER

Thanks a lot for trying to understand me. OK, I read all of your comment very carefully.

So, let's say when the user enters we start the session with something like this:


// rough C# version of the logic (the original post sketched it as pseudocode)
DateTime last = (Session["USER_RATEPAGE"] as DateTime?) ?? DateTime.MinValue;
bool blocked = (Session["BLOCK"] as bool?) ?? false;
if (blocked || (DateTime.UtcNow - last).TotalSeconds < 5)
{
    Session["BLOCK"] = true;          // set the flag first: Redirect ends the response
    Response.Redirect("auth.aspx");
}
else
{
    Session["USER_RATEPAGE"] = DateTime.UtcNow;
}

This way we are throttling the session and not the IP address, and on auth.aspx we can show one of those images with numbers and letters; when the user types it correctly we set Session["BLOCK"] = false.

By the way, I already have the user pages like login, sign-off and sign-in; what I want to protect are the public pages and all their information. Take, for example, a recipe website, or any site with public content that does not require the user to sign in. My concern is that if we implement something like the code above, it will block crawling by Google and the other search engines that really do need to crawl my content.

One thing this code does is stop any session that opens pages really fast from continuing without authentication. By the way, I didn't know how to put the code above together, so I typed out the logic.
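On the Google worry: Google's published guidance is to verify Googlebot by doing a reverse DNS lookup on the requesting IP and checking the hostname suffix, then a forward lookup to confirm the name maps back to that IP. A sketch of that check; the class and method names are my own, and the network part is illustrative only:

```csharp
using System;
using System.Linq;
using System.Net;

public static class CrawlerCheck
{
    // Genuine Googlebot IPs reverse-resolve to hosts ending in
    // googlebot.com or google.com (this suffix test is pure logic).
    public static bool LooksLikeGoogleHost(string hostName)
    {
        return hostName.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase)
            || hostName.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase);
    }

    // Full check: reverse DNS, suffix test, then forward DNS back to the IP.
    // Needs live DNS, so treat this as a sketch rather than production code.
    public static bool IsGooglebot(IPAddress ip)
    {
        try
        {
            string host = Dns.GetHostEntry(ip).HostName;
            if (!LooksLikeGoogleHost(host)) return false;
            return Dns.GetHostEntry(host).AddressList.Contains(ip);
        }
        catch (Exception)
        {
            return false;   // lookup failed: do not whitelist
        }
    }
}
```

Sessions that pass this check could be exempted from the CAPTCHA redirect, so indexing continues while unverified fast clients still get challenged.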

What do you say, HooKooDooKu?
rafaelrgl

ASKER

Another thing that came to me: what is the right time limit to set (5, 7, 10, or 15 seconds)? A spider would have to crawl slower than that rate to get my content, so I would cut off some of the easy ways my content can be taken.
Dave Baldwin

Unless you have a real problem, like excessive bandwidth use, you can waste a lot of your time on this with very little result. If someone is constantly accessing your site, you can do IP blocking.

Bots are easily made to look like real browsers and rate limiting is something they will do so they don't 'alert' you to the fact that they are downloading your content.  There is no generic protection you can put up that won't eventually interfere with your regular users.
rafaelrgl

ASKER

My problem is not bandwidth; it's someone copying my entire content so easily that it would take them one day to do it. I just hate spiders. lol.
SOLUTION
Dave Baldwin

Dave Baldwin

The best alternative at the moment is to use AJAX to load your content through JavaScript. That works because most bots and downloaders don't run JavaScript. It does mean rewriting your whole site. Very large sites like Google and Facebook use AJAX a lot, partly to prevent downloading and partly for the 'user experience' of loading new content without reloading the page.