Advice on Web Crawlers...

Hi,

I have a Drupal website and am thinking of creating some pages that provide information to some software that I am developing.  

I envisage that the software contains predefined links to specific pages and other than that there are no links to the pages.

I do NOT want the content of these pages to appear in search engine results.  My question is:  

If I create a page:  www.drTribos.com.au/<SomeRandomString> and don't make any links to it, will a search engine be able (ok let's say likely) to find it?

Are there any recommendations that people can make?  Please let me know if I need to clarify the concept.

TIA
DrTribos asked:
Dave Baldwin (Fixer of Problems) commented:
'robots.txt' is used to tell 'legitimate' search engine web crawlers to stay out.  Spam scanners will ignore it.  http://www.robotstxt.org/robotstxt.html

There is some suspicion that it doesn't really work, though.  The engines won't list the page in results, but they seem to keep track of it no matter what they say.  All browsers now check every site and page you visit against a malware database: Firefox has used Google's for years, as does Chrome, and IE uses Microsoft's.  If you bring the page up in your browser, someone somewhere will know, and you might find people you've never heard of downloading it.
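As a minimal sketch of what robots.txt does: a plain text file placed at the web root asks well-behaved crawlers to skip certain paths (the /private/ directory here is a hypothetical example, not anything from the question):

```
# robots.txt at the site root, e.g. www.example.com/robots.txt
# /private/ is a hypothetical directory used for illustration
User-agent: *
Disallow: /private/
```

Because anyone can read robots.txt, it is a politeness convention rather than access control, and the Disallow lines themselves advertise the paths you list.  For a page you genuinely want hidden, disallow only the parent directory (or use authentication) rather than listing the secret URL itself.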
COBOLdinosaur commented:
Yeah, Dave is right.  The simple answer is that if you put a page on a public-facing server, it will get discovered.  Even without a link to it, it will be found.  If it is in a directory that is secured and inaccessible it will have some protection, unless a curious hacker finds a security hole.

Of course, if it gets discovered, it is possible that someone will download it and put it on another site or post links to it.  You should never put anything on a publicly accessible server that you want kept confidential.

Cd&
Jason C. Levine (No one) commented:
Agree with Cd&.

- Password protect the page/folder to keep spiders out
- Don't post it at all if it's sensitive (unless you really, really know what you're doing)
DrTribos (Author) commented:
Thanks guys... I was actually a little surprised when I first read Dave's answer, now I think I'm surprised that I was surprised... :-/

My information is not super sensitive... I am developing some software which has automatic bug reporting.  Among other things, the bug tracker I use detects duplicates and tracks frequency.  This gives me the opportunity to notify the user (who just experienced the bug) whether:
- the bug is known
- there is a workaround
- there is an upgrade

I was planning on making some web pages to describe workaround information.   I'm just in two minds about broadcasting this to the entire web.

I think I can put pages of that nature in a specific folder protected by a .htaccess file - not sure of the best way to implement it.

Cheers,
Dave Baldwin (Fixer of Problems) commented:
If you're running on Apache, you can use '.htaccess' to implement Basic Auth security, which will keep out anyone who doesn't have the password, including search engine robots.  http://httpd.apache.org/docs/current/howto/auth.html

As far as I know, there is literally nothing you can do about the page reporting done by the browsers.  Supposedly you can turn it off, but I don't know that it really works.  Years ago there was a question here from someone who had uploaded a file by FTP and only looked at it once in their browser.  There were no links to it anywhere, so they were quite surprised to see in their logs that someone from Global Crossing had downloaded the file.  Global Crossing was bought by Level3, one of the biggest network providers you've never heard of, because they don't do residential or 'last mile' networking; they connect ISPs to each other.  The chances are very good that your request for this page went through part of Level3's network.
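To sketch the Basic Auth setup from the Apache how-to linked above (the file paths and username here are hypothetical; the password file should live outside the web root, and the server needs AllowOverride AuthConfig enabled for the directory):

```
# .htaccess placed in the directory you want to protect
AuthType Basic
AuthName "Restricted area"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The password file is created once from a shell with Apache's htpasswd tool, e.g. `htpasswd -c /home/example/.htpasswd someuser`.  Anyone without credentials - search engine robots included - then gets a 401 response instead of the page.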
DrTribos (Author) commented:
Wow...

I can probably cope with using .htaccess.  Thanks for the link :-D
Dave Baldwin (Fixer of Problems) commented:
You're welcome.  So now you understand why I say...

If you want privacy... turn off the computer and walk away.
oliverpolden commented:
I realise this has already been accepted, but I wanted to cover the options for protecting pages, of which there are loads:
 - Put the page behind a login (the obvious answer)
 - Protected pages module: https://www.drupal.org/project/protected_pages
 - Premium pages: https://www.drupal.org/project/nopremium
Plus many more.

It sounds like Protected Pages is the right one for you. You cannot use a .htaccess file to secure a "folder of pages" in Drupal, since Drupal serves all pages out of the database via index.php.

To add to the discussion of how non-linked-to pages get discovered, there are lots of ways search engines could find them:
 - Server misconfiguration that exposes directory listings
 - Automatically generated sitemaps
 - Automatically generated feeds e.g. RSS
 - Some unexpected link from elsewhere on the site

Hope that's helpful.
Oliver
DrTribos (Author) commented:
Oliver, thank you for the extra info, very much appreciated.