Protect my site from bulk download as well as possible


I want to give my visitors access to some pages that I serve using PHP.
For example, I will have links to ONE PHP page that fetches the data from the file system.

How can I block non-users like robots from bulk downloading the pages?

If, for example, I have the following on pageA.html:

<a href="pagedownloaded.php?title=page%201&file=page1.html">Page1</a>
<a href="pagedownloaded.php?title=page%202&file=page2.html">Page2</a>
<a href="pagedownloaded.php?title=page%203&file=page3.html">Page3</a>
<a href="pagedownloaded.php?title=page%204&file=page4.html">Page4</a>

what can I use to protect page1-4 from a robot or bulk downloader and force them to only look at the page from my server?

I am aware that the referrer may not be set so perhaps something with a session and a token?
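
The session-and-token idea could be sketched roughly as follows, assuming the listing page is itself generated by PHP so that it can start a session. The listing page keeps a random secret in the session and signs each file name with it; pagedownloaded.php serves the file only if the signature matches. The helper names (`newSecret`, `makeToken`, `tokenIsValid`), the session key `secret` and the extra `token` query parameter are assumptions for illustration and do not appear in the original pages.

```php
<?php
// Hypothetical helpers; the names and the "token" parameter are assumptions.

// Create a per-session secret (call once, when the listing page is rendered).
function newSecret(): string {
    return bin2hex(random_bytes(16));
}

// Sign a file name with the session secret.
function makeToken(string $secret, string $file): string {
    return hash_hmac('sha256', $file, $secret);
}

// Constant-time check that the token belongs to this file and this session.
function tokenIsValid(string $secret, string $file, string $token): bool {
    return hash_equals(makeToken($secret, $file), $token);
}

// Usage sketch (listing page):
//   session_start();
//   $_SESSION['secret'] = $_SESSION['secret'] ?? newSecret();
//   $token = makeToken($_SESSION['secret'], 'page1.html');
//   echo '<a href="pagedownloaded.php?file=page1.html&token=' . $token . '">Page1</a>';
//
// Usage sketch (pagedownloaded.php):
//   session_start();
//   $ok = isset($_SESSION['secret'], $_GET['file'], $_GET['token'])
//       && is_string($_GET['file']) && is_string($_GET['token'])
//       && tokenIsValid($_SESSION['secret'], $_GET['file'], $_GET['token']);
```

This only stops clients that request pagedownloaded.php directly. A robot that keeps cookies and loads the listing page first receives valid tokens just like a browser does.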

Michel Plungjan (IT Expert) asked:

This "bot trap" method seems interesting:

Just about any ident/captcha method can be easily defeated by a bot.
I'm not sure I understand. Are you trying to stop search engine robots from indexing your pages? If so, try looking into using robots.txt.

You can use mechanisms to stop third-party sites linking directly to your files, but usually you want people to visit your HTML pages, so you wouldn't stop people linking to those.

If you're trying to stop people from loading your pages into frames on their pages, you might be able to use Javascript to ask the browser what the address bar of the visitor's browser says, and execute a redirect if the URL isn't yours.

If you're hoping to stop robots from reading your pages and then putting up copies, well that's not easy to avoid. If you make your site so that robots can't see the actual content, Google will punish you for creating a two-faced site. Plus, a well-written robot will probably be able to use sessions.
Michel Plungjan (IT Expert, Author) commented:
I have no problem with indexing

I just do not want to show the actual scripts I have on the page unless it is loaded from my site

You cannot do that, because a script behaves exactly as a user would. You can either:

1. Limit the number of downloads per hour (or per day, or per year).

2. Set up CAPTCHA image bot prevention.
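
Option 1 could look something like the sketch below: a sliding-window counter that allows a fixed number of page views per visitor per hour. The function name `allowDownload`, the default limit of 20 and the idea of keeping the timestamps in `$_SESSION['hits']` are assumptions for illustration only.

```php
<?php
// Hypothetical helper: sliding-window limit on page views per visitor.
// $hits holds the Unix timestamps of earlier views, e.g. kept in the
// session or in a database row keyed by IP address.
// Returns [bool allowed, array updatedHits].
function allowDownload(array $hits, int $now, int $limit = 20, int $window = 3600): array {
    // Keep only the views that fall inside the window.
    $recent = array_values(array_filter($hits, function ($t) use ($now, $window) {
        return $t > $now - $window;
    }));
    $allowed = count($recent) < $limit;
    if ($allowed) {
        $recent[] = $now; // record this view
    }
    return [$allowed, $recent];
}

// Usage sketch (pagedownloaded.php):
//   session_start();
//   [$ok, $_SESSION['hits']] = allowDownload($_SESSION['hits'] ?? [], time());
//   if (!$ok) {
//       http_response_code(429);
//       exit('Too many pages requested, please try again later.');
//   }
```

Counting per session is the simplest variant, but a bot that discards its cookies starts a fresh count on every request, so keying the counter by IP address is likely to hold up better.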

What if you use $_POST instead of $_GET on pagedownloaded.php? Most, if not all, bulk downloaders work with GET requests.

In pagedownloaded.php:

<?php
// The start of this snippet is missing; a POST check is assumed here.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
 //serve the page
} else {
 //go away
}
?>

On the linking page:

<form name="link" method="post" action="pagedownloaded.php">
<input type="hidden" name="title" value="page 1">
<input type="hidden" name="file" value="page1.html">
<input type="submit" value="Page1">
</form>

Or if you still want a text link:

<form name="link" method="post" action="pagedownloaded.php">
<input type="hidden" name="title" value="page 1">
<input type="hidden" name="file" value="page1.html">
<a href="javascript:submitform()">Page1</a>
</form>
<script type="text/javascript">
function submitform() {
  document.forms["link"].submit();
}
</script>

If someone writes a bot specifically for your page, you can hardly defend against it.

If you are asking about bots that randomly crawl the web, you can defend against them easily.
Michel Plungjan (IT Expert, Author) commented:
Interesting - I may not have made myself totally clear, but I do get a few ideas..

The situation is as follows:

I have answered in excess of 50,000 questions on javascript over the last 10 years
I want to have a database of questions on my site and present them in a structured way

Each script will get its own page with the question and the answer in a folder structure
What I thought about before I started googling was the referrer, making my link look like

<a href="pagedownloaded.php?title=Refresh%20Opener%20from%20popup&file=/windows_handling/openerefresh.html">refresh opener from popup</a>

<?php .. ?>
<title><?php echo htmlspecialchars($_GET["title"]); ?></title>
<h3><?php echo htmlspecialchars($_GET["title"]); ?></h3>
<?php
goGetDescription($_GET["file"]); // THIS I want to have indexed!
$referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
if (strpos($referrer, '') !== false) { // naive method; '' stands for my site's address
  goGetAnswer($_GET["file"]);
} else { ?>
To see the answer, you need to go via my site at page .... . If you wish to bulk copy my answers, you need to contact me at .....
<?php } ?>
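
A slightly less naive referrer test compares the host part of the Referer header against a list of the site's own host names, rather than searching the whole string. The sketch below assumes a hypothetical helper `cameFromOwnSite`, and `www.example.com` is a placeholder rather than the real site. It remains a weak check: the header may be missing, and any client, cURL included, can send whatever Referer it likes.

```php
<?php
// Hypothetical helper: true only when the Referer header names one of our own hosts.
// "www.example.com" in the usage sketch is a placeholder, not the real site.
function cameFromOwnSite(?string $referrer, array $ownHosts): bool {
    if ($referrer === null || $referrer === '') {
        return false; // header missing: browsers and proxies may strip it
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // not an absolute URL with a host part
    }
    return in_array(strtolower($host), $ownHosts, true);
}

// Usage sketch (pagedownloaded.php):
//   if (cameFromOwnSite($_SERVER['HTTP_REFERER'] ?? null, ['www.example.com'])) {
//       goGetAnswer($_GET['file']);
//   }
```

Matching on the host avoids being fooled by a foreign URL that merely contains the site's address somewhere in its query string.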

So perhaps I need to look at a javascript based way - perhaps with a captcha per answer - but that is really putting a lot of work on the visitor...
Yes, JavaScript... But if anyone targets your site specifically (writing a bot just for you), this is not good protection...

And CAPTCHA... Its downside is that it can be annoying.
Michel Plungjan (IT Expert, Author) commented:
I just want people to get the code from my site and not by grabbing the scripts directly from the database

That means I want to identify on the page with the question and answer if the page has been loaded by a person in a browser. If so, I will give the code, if not I will only show the question.


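<?php if (!empty($_SESSION["from_my_site"])) { /* some session; the key name is hypothetical */ ?><iframe src="getanswer.php?idx=4"></iframe><?php } ?>

or

<?php if (!empty($_SESSION["from_my_site"])) { /* some session variable set by my site; the key name is hypothetical */ ?><textarea><?php echo htmlspecialchars($myRow[4]); ?></textarea><?php } ?>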

cURL can bypass everything except CAPTCHA... Everything: sessions, cookies, JavaScript.

Type the letters shown in the image to view the solution:
[dsdasasd] [..............]
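
To keep the burden on visitors low, the CAPTCHA could be asked once per session instead of once per answer. The sketch below shows only the checking side and assumes the installed CAPTCHA package has stored the expected text under a hypothetical session key `captcha`; the `human` flag and both function names are likewise assumptions. The helpers work on a plain array so they can be tried without a live session.

```php
<?php
// Hypothetical helpers; the session keys and function names are assumptions.

// Compare the typed text with the expected CAPTCHA text
// (case-insensitive, ignoring surrounding spaces).
function captchaMatches(string $expected, string $typed): bool {
    return $expected !== ''
        && hash_equals(strtolower($expected), strtolower(trim($typed)));
}

// Mark the session as human after one correct answer, so later
// answer pages can skip the CAPTCHA.
function unlockSession(array $session, string $typed): array {
    if (captchaMatches($session['captcha'] ?? '', $typed)) {
        $session['human'] = true;
    }
    unset($session['captcha']); // each challenge can be tried only once
    return $session;
}

// Usage sketch:
//   session_start();
//   if (isset($_POST['captcha']) && is_string($_POST['captcha'])) {
//       $_SESSION = unlockSession($_SESSION, $_POST['captcha']);
//   }
//   if (!empty($_SESSION['human'])) { /* show the answer */ }
```

The trade-off is that one solved CAPTCHA unlocks every answer for that session, so a person could solve it once and then hand the session cookie to a script.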
Michel Plungjan (IT Expert, Author) commented:
Ok, I will try captcha since I have it installed already.

ddrudik: how would a bot defeat a captcha unless it sends the image to some captcha hackers in a country with cheap labour?

And if they do, they are welcome... My code is not going to change the world order

Nobody except a human can read a CAPTCHA.

They could employ people as a cheap workforce to read and type the CAPTCHAs, but in general, a CAPTCHA protects you best.
Thanks for the question and the points.

For a method that doesn't require a human agent to decipher captchas:
Michel Plungjan (IT Expert, Author) commented:
Drat!  Find a way to protect and they find a way to break it :(((