Solved

robots.txt: how to exclude URLs on our domain that contain a question mark?

Posted on 2016-08-10
8
44 Views
Last Modified: 2016-08-13
Hello, we have noticed that our site has been indexed in the search engines under query-string URLs. None of these URLs exist on our website; all our pages are simple static HTML pages. What instructions would we need to include in our robots.txt file so that these URLs are not indexed by the search engines?

Here are two examples of URLs we would like to exclude from being indexed:

http://www.our-website.com/?viewType=Print&amp%3Bamp%3BviewClass=Print
http://www.our-website.com/transport/page1.html?viewType=Print&amp%3Bamp%3BviewClass=Print

We have many links like this that we don't want. The equivalent links we DO want indexed would be:

http://www.our-website.com
http://www.our-website.com/transport/page1.html

We have an Apache web server and use .htaccess and robots.txt files.

Thank you
Question by:boltweb
8 Comments
 
LVL 82

Expert Comment

by:Dave Baldwin
There are two issues here.

#1. Google doesn't build its search results from the links on your site alone. If someone else has one of those links on their pages, for whatever reason, it is next to impossible to get rid of it, because the link is 'theirs' and not yours.

#2. The best way to tell Google what you do want indexed is by submitting a 'sitemap' through Google Webmaster Tools (now called Google Search Console): https://www.google.com/webmasters/tools/
 
LVL 23

Expert Comment

by:Dr. Klahn
I don't believe what you request can be done in robots.txt.

In this situation I would activate mod_rewrite and use it to either:

a) Detect any URI containing a question mark and reject the request.
b) Detect a URI containing a question mark, and forward the request to a rewritten URI with everything after the question mark stripped.
c) Rewrite the URI to strip everything after the question mark.

(b) would be my choice.  This should eventually clean up the search engine entries as the search engine bots discover that the URIs are being forwarded.
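
Option (b) could be sketched in .htaccess roughly as follows (a sketch, assuming mod_rewrite is enabled; the trailing `?` in the substitution is what tells mod_rewrite to drop the original query string rather than re-append it):

```apache
RewriteEngine On
# Act only when the request carries a non-empty query string
RewriteCond %{QUERY_STRING} .
# Redirect to the same path with the query string stripped
# (the trailing "?" discards it; without it mod_rewrite would
# re-append the original query string and cause a redirect loop)
RewriteRule ^ %{REQUEST_URI}? [R=301,L]
```

After the redirect the new request has no query string, so the condition no longer matches and rewriting stops.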
 

Author Comment

by:boltweb
Hello, thanks for your replies. I've been working full time trying to resolve this issue.

Dave: I have already been using Google sitemaps via Google Webmaster Tools since 2008, and I update the sitemap with new URLs once per week, but this issue still occurred.

Dr. Klahn, your answer looks good. I implemented solution (a) and it redirected to a 404 page, which I thought was the solution, but then I noticed that the Google search box on our site stopped working, because it produces a query string ("?") in the URL. It is the only component on our site that creates query strings; I did not notice until I blocked all query strings. I could put the Google search in a special folder called /gsearch/; if I could block all query strings on the website *except* those under a /gsearch/ folder, that would solve the problem. I have read that if Google finds a 404 page it will drop it from the index after a few attempts.

I cannot permanently remove these pages from Google using Webmaster Tools without also losing the original pages, because I would have to add tags to each page to be permanently removed. Google does have a temporary removal option, but the problem with that is that the URLs could reappear after 3 months. I would also need a list of all the affected URLs, and there are too many to list (several hundred) with no easy way to collect them.

On your suggestion (b), I am not clear on how I can forward the request to a rewritten URI when I don't know what the URI would be, because I don't have a list of all the affected URIs and I don't see how I can get such a list. How would you suggest I get this list of affected pages? I can use Google, but there are too many to copy manually.

Option (c) looks like a good solution, but it might also affect the Google search.

I think option (c) looks good if we could exclude anything in a /gsearch/ folder; option (a) also looks like a possible solution with the same exclusion. Note that the site is translated, each translated version lives in a separate subfolder for its language, and each language has its own separate /gsearch/ folder. So we would need to instruct the server to block every "?" URI except those under any folder named /gsearch/. Do you know what the code would be to do that?
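
The exclusion described above could be sketched like this (a sketch, not a tested rule set; it assumes the search pages always live under a path segment literally named /gsearch/, at any depth):

```apache
RewriteEngine On
# Skip any path containing a /gsearch/ segment, in any language subfolder
RewriteCond %{REQUEST_URI} !/gsearch/ [NC]
# Act only when a query string is present
RewriteCond %{QUERY_STRING} .
# Redirect to the same URI with the query string dropped
# (the trailing "?" prevents mod_rewrite re-appending it)
RewriteRule ^ %{REQUEST_URI}? [R=301,L]
```

The negated condition on REQUEST_URI is evaluated first, so /gsearch/ requests fall through untouched.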

best regards

JohnB
 
LVL 23

Expert Comment

by:Dr. Klahn
On your suggestion (b), I am not clear on how I can forward the request to a rewritten URI when I don't know what the URI would be, because I don't have a list of all the affected URIs and I don't see how I can get such a list. How would you suggest I get this list of affected pages? I can use Google, but there are too many to copy manually.

(This is rather ugly code)

RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
RewriteRule %{REQUEST_URI} [R,L]



If the request contains a question mark,
rewrite it to the bare URI instead, mark it as a redirect (R), and release it now without further processing (L).

REQUEST_URI
    The path component of the requested URI, such as "/index.html". This notably excludes the query string which is available as its own variable named QUERY_STRING.

THE_REQUEST
    The full HTTP request line sent by the browser to the server (e.g., "GET /index.html HTTP/1.1"). This does not include any additional headers sent by the browser. This value has not been unescaped (decoded), unlike most other variables.
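
To illustrate the difference between those variables, here is a hypothetical request against one of the example URLs (values shown as mod_rewrite would see them):

```apache
# Hypothetical request:
#   GET /transport/page1.html?viewType=Print HTTP/1.1
#
# THE_REQUEST   -> "GET /transport/page1.html?viewType=Print HTTP/1.1"
# REQUEST_URI   -> "/transport/page1.html"
# QUERY_STRING  -> "viewType=Print"
```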
 

Author Comment

by:boltweb
Hello, I tried the code you suggested and it did not do anything, i.e. the query-string page loaded just as normal.

i.e. this code did not work:
RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
RewriteRule %{REQUEST_URI} [R,L]

Is there a way to create a RewriteCond rule to redirect all requests whose URL contains a "?" to a 404 page, but exclude any pages that contain a "?" under a subfolder called /gsearch/? There could be many subfolders called /gsearch/, because there will be one for each language directory.
 
LVL 23

Accepted Solution

by:
Dr. Klahn earned 500 total points
Part of the second line appears to have been dropped.  It should read:

RewriteRule .* %{REQUEST_URI} [R,L]
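
For reference, the corrected pair as one block (a sketch: note that unless the substitution ends in "?", or the QSD flag is used on Apache 2.4+, mod_rewrite re-appends the original query string to the redirect target, which can cause a redirect loop):

```apache
# Match any request line that contains a question mark
RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
# Redirect to the path alone; the trailing "?" drops the query string
RewriteRule .* %{REQUEST_URI}? [R,L]
```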


 

Author Comment

by:boltweb
Thank you.
 

Author Closing Comment

by:boltweb
Thank you.
