rafaelrgl asked:

How to stop spider software from getting my website content

OK guys, here is the problem. I know there are lots of spider programs out there that visit your website, open every page, and take your content. A hacker who wants your content can use this kind of program to get it with no problem.

So how can we stop this? I know that when those programs fetch a website they open one page after another. Maybe we could have a function that checks the rate at which a user opens pages (say, anything faster than 1 page per 5 seconds is a spider) and then redirects the user to a page where he has to authenticate by typing the code shown in an image. The point is: what is the best way to stop this without losing performance, while still letting crawlers like Google get past that authentication so our website gets indexed, and blocking the unwanted crawlers?

What do you guys say?
Tags: .NET Programming, ASP.NET, C#

Last Comment
Dave Baldwin

8/22/2022 - Mon

Use a robots.txt file in your main folder. Here is an article:

If you want to increase security, you can use an incoming proxy such as a Barracuda Web Application Firewall, but there will always be a performance hit. There may be a more specific answer, such as using Drupal with login and permissions/taxonomy to separate what anonymous users may see from what is restricted to authenticated ones, but I would need more specific information.

robots.txt only blocks spiders that check for it first, but there is software that opens a page, follows all of its links, and so on, reading the content as it goes. So robots.txt does not work in this case. It needs to be something like what I described. A proxy does not work either, since this software behaves like a user: it goes clicking on links and reading the content. The one difference is that it does this fast, so it can read all your content with no problem.
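To make the point about robots.txt concrete, here is a minimal sketch in Python (not ASP.NET) of the check a well-behaved crawler runs on its own side before fetching a page. The robots.txt content, the `PoliteBot` name, and the `example.com` URLs are made up for illustration. The check happens entirely in the crawler, so a scraper that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /recipes/
"""


def polite_crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return the decision a rule-following crawler would reach for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite bot asks first; a content scraper simply skips this step.
    for path in ("/recipes/cake", "/about"):
        url = "https://example.com" + path
        print(path, polite_crawler_may_fetch(ROBOTS_TXT, "PoliteBot", url))
```

The server plays no part in enforcing these rules, which is why something on the server side (like the rate check discussed in this thread) is needed for crawlers that ignore them.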

So, let's say we write some code for this. Is there a function we can add to our aspx pages that counts the rate of pages opened per second and then redirects the user to an authentication page where he has to type a security code? If he types the words in the image correctly, he can continue opening pages, but if he hits the limit again he has to type the code again. I see this as the only way to block software from copying our content.

I am not talking about restricting content with password login permissions, but about regular pages being accessed.

Thanks a lot for trying to understand me. OK, I read your whole comment very carefully.

So, let's say that when a user enters we start the session like this:

' Pseudocode: block the session when pages are opened less than 5 seconds apart
if (GETTIME() - Session("USER_RATEPAGE")) < 5 seconds or Session("BLOCK") = True then
      Session("BLOCK") = True
end if
Session("USER_RATEPAGE") = GETTIME()

This way we are tracking the session and not the IP address. On auth.aspx we can show one of those images with numbers and letters for the user to type, and then set Session("BLOCK") = False.
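Here is a rough sketch of that logic, written in Python rather than ASP.NET just to pin the idea down. The `session` dict stands in for the ASP.NET Session object, and `check_request`, `captcha_solved`, and the constant name are names I made up. Clearing the timestamp after a solved CAPTCHA is my own assumption, added so the user is not blocked again immediately.

```python
import time
from typing import Optional

# Threshold taken from the post above; the right value is an open question.
MIN_SECONDS_BETWEEN_PAGES = 5.0


def check_request(session: dict, now: Optional[float] = None) -> str:
    """Return "auth" if this session must solve the CAPTCHA, else "ok"."""
    now = time.time() if now is None else now
    last = session.get("USER_RATEPAGE")
    too_fast = last is not None and (now - last) < MIN_SECONDS_BETWEEN_PAGES
    if too_fast or session.get("BLOCK", False):
        session["BLOCK"] = True
    # Always record the time of this request for the next comparison.
    session["USER_RATEPAGE"] = now
    return "auth" if session.get("BLOCK", False) else "ok"


def captcha_solved(session: dict) -> None:
    """Call this from the (hypothetical) auth page after a correct answer."""
    session["BLOCK"] = False
    # Assumption: forget the last timestamp so the next page is not counted as too fast.
    session["USER_RATEPAGE"] = None


if __name__ == "__main__":
    s = {}
    print(check_request(s, now=100.0))  # first page of the session
    print(check_request(s, now=102.0))  # second page only 2 seconds later
```

One caveat: this only works if the client keeps its session cookie. A scraper that discards cookies gets a fresh session on every request and would never trip the check.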

By the way, I have user pages like login, sign-off and sign-in, but what I care about here is the pages that are public, and all the information on them: for example, a recipe website or any website with public content that does not require the user to sign in. My concern is that if we implement something like the code above, it will block crawling by Google and other search engines that really need to crawl my content.
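One possible way around that concern is to exempt verified search engine crawlers from the rate check. As far as I know, Google documents a reverse DNS lookup followed by a forward lookup for this. Below is a minimal Python sketch of that idea, not a complete implementation: the function names are mine, the suffix list is an assumption that should be checked against Google's current documentation, and it only handles IPv4.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address to host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS: host name to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """True only if the IP reverse-resolves to a Google host that resolves back to it."""
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as "not verified".
        return False
```

The User-Agent header alone is not enough for this, because any bot can claim to be Googlebot. The lookups are passed in as parameters so the logic can be tried without network access, and in a real site the result would need caching, since DNS lookups on every request are slow.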

One thing this code does is stop any session that opens pages really fast from continuing without authentication. By the way, I don't know how to put the code above together, so I just typed out the logic.

What do you say, HooKooDooKu?

Another thing that came to me: what is the right interval to set, 5, 7, 10, or 15 seconds? The spider would have to crawl more slowly than this rate to get my content, so at least I would filter out some of the ways my content can walk away so easily.
Dave Baldwin

Unless you have a real problem, such as excessive bandwidth use, you can waste a lot of your time with very little result. If someone is constantly accessing your site, you can block their IP address.
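As a rough illustration of IP blocking at the application level, here is a small Python sketch. The blocklist entries are documentation-only example addresses, `is_blocked` is a made-up name, and refusing unparseable addresses is my own choice. In practice this is often done in the web server or firewall configuration instead.

```python
import ipaddress

# Hypothetical blocklist; these ranges are reserved for documentation examples.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Assumption: refuse requests whose address cannot be parsed.
        return True
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("203.0.113.45", "192.0.2.10"):
        print(ip, is_blocked(ip))
```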

Bots are easily made to look like real browsers, and they will rate-limit themselves so they don't alert you to the fact that they are downloading your content. There is no generic protection you can put up that won't eventually interfere with your regular users.

My problem is not bandwidth, but someone copying my entire content so easily that it would take them only one day to do it. I just hate spiders. lol.
Dave Baldwin

The best alternative at the moment is to use AJAX to load your content through JavaScript. That works because most bots and downloaders don't run JavaScript. It does mean rewriting your whole site. Very large sites like Google and Facebook use AJAX a lot, in part to prevent downloading and in part for the 'user experience', where they keep loading new content without reloading the page.