
Solved

Robots.txt: how do we exclude URLs that contain a question mark anywhere on our domain?

Posted on 2016-08-10
Medium Priority
152 Views
Last Modified: 2016-08-13
Hello, we have noticed that our site has been indexed in the search engines under query-string URLs. None of these URLs exist on our website; all our pages are simple static HTML pages. What instructions should we include in our robots.txt so that these URLs are not indexed by the search engines?

Here are two examples of URLs we would like to exclude from being indexed:

http://www.our-website.com/?viewType=Print&%3Bamp%3BviewClass=Print
http://www.our-website.com/transport/page1.html?viewType=Print&%3Bamp%3BviewClass=Print

We have many links like this that we don't want. The equivalent links we DO want indexed would be:

http://www.our-website.com
http://www.our-website.com/transport/page1.html

We have an Apache web server and use .htaccess and robots.txt files.

Thank you
Question by:boltweb
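
For reference, the major crawlers (Googlebot and Bingbot) honor wildcard patterns in robots.txt even though the original robots.txt standard does not. A minimal sketch of the kind of rule being asked about; note that Disallow stops crawling but does not by itself remove URLs that are already in the index:

User-agent: *
Disallow: /*?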
8 Comments
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 41751210
There are two issues here.  

#1. Google doesn't use just the links on your site in its search results. If someone else has one of those links on their pages, for whatever reason, it is next to impossible to get rid of it, because the link is 'theirs' and not yours.

#2. The best way to tell Google what you do want indexed is to submit a 'sitemap' through Google Webmaster Tools (now called Google Search Console). https://www.google.com/webmasters/tools/
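
As an aside, the sitemap location can also be advertised directly in robots.txt with the Sitemap directive; a one-line sketch (the exact sitemap URL is an assumption):

Sitemap: http://www.our-website.com/sitemap.xml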
 
LVL 29

Expert Comment

by:Dr. Klahn
ID: 41751288
I don't believe what you request can be done in robots.txt.

In this situation I would activate mod_rewrite and use it to do one of the following:

a) Detect any URI containing a question mark and reject the request.
b) Detect a URI containing a question mark, and forward the request to a rewritten URI with everything after the question mark stripped.
c) Rewrite the URI to strip everything after the question mark.

(b) would be my choice.  This should eventually clean up the search engine entries as the search engine bots discover that the URIs are being forwarded.
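
A minimal .htaccess sketch of option (b), assuming mod_rewrite is available (on Apache 2.4 the QSD flag is an alternative to the trailing ? shown here):

RewriteEngine On
# Apply only when the request carries a query string
RewriteCond %{QUERY_STRING} .
# Redirect to the same path; the trailing ? discards the query string
RewriteRule ^(.*)$ /$1? [R=301,L]

A 301 is assumed here so the search engines treat the move as permanent; a plain [R] issues a 302 by default.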
 
LVL 1

Author Comment

by:boltweb
ID: 41754334
Hello, thanks for your replies. I've been working full time on this issue to try to resolve it.

Dave: I already have a Google sitemap, which I have been submitting through Google Webmaster Tools since 2008. This issue still occurred even though I update the sitemap with new URLs once per week.

Dr. Klahn, your answer looks good. I implemented solution (a), and it redirected to a 404 page, which I thought was the solution. But then I noticed that the Google search box on our site stopped working, because it produces a query string (?) in the URL. It is the only component on our site that creates query strings; I did not notice until I blocked all query strings. I could put the Google search in a special folder called /gsearch/; that way, if I could block all query strings on the website *except* those under a /gsearch/ folder, the problem would be solved. I have read that if Google finds a 404 page it will drop the URL from the index after a few attempts.

I cannot permanently remove these pages from Google using Webmaster Tools unless I also lose the original page, because I would have to add tags to the page being permanently removed. Google offers a temporary removal, but the problem with that is that the URLs could reappear after 3 months. I would also need a list of all the URLs, and there are too many to list (several hundred) with no easy way to obtain them.

On your suggestion (b), I am not clear on how I can forward the request to a rewritten URI if I don't know what the URI would be, because I don't have a list of all the affected URIs and I don't see how I can get such a list. How would you suggest I get this list of affected pages? I can use Google, but there are too many to copy manually.

Option (c) looks like a good solution, but it might also affect the Google search.

I think option (c) looks good if we could exclude anything in a /gsearch/ folder; option (a) also looks like a possible solution with the same exclusion. Note that we have the site translated, with each translated version in a separate subfolder for its language, and each language has its own /gsearch/ folder. So we would need to instruct the server to block the ? in all URIs except those under any folder named /gsearch/. Do you know what the code would be to do that?

best regards

JohnB

 
LVL 29

Expert Comment

by:Dr. Klahn
ID: 41754432
On your suggestion (b), I am not clear on how I can forward the request to a rewritten URI if I don't know what the URI would be, because I don't have a list of all the affected URIs and I don't see how I can get such a list. How would you suggest I get this list of affected pages? I can use Google, but there are too many to copy manually.

(This is rather ugly code)

RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
RewriteRule %{REQUEST_URI} [R,L]



If the request contains a question mark,
rewrite it to the URI instead, mark it as a redirect, and release it immediately without further processing.

REQUEST_URI
    The path component of the requested URI, such as "/index.html". This notably excludes the query string which is available as its own variable named QUERY_STRING.

THE_REQUEST
The full HTTP request line sent by the browser to the server (e.g., "GET /index.html HTTP/1.1"). This does not include any additional headers sent by the browser. This value has not been unescaped (decoded), unlike most other variables.
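
To make the distinction concrete, for a hypothetical request line of GET /transport/page1.html?viewType=Print HTTP/1.1, the variables would contain:

THE_REQUEST  = GET /transport/page1.html?viewType=Print HTTP/1.1
REQUEST_URI  = /transport/page1.html
QUERY_STRING = viewType=Print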
 
LVL 1

Author Comment

by:boltweb
ID: 41754611
Hello, I tried the code you suggested and it did not do anything, i.e. the query-string page loaded just as normal.

i.e. this code did not work:
RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
RewriteRule %{REQUEST_URI} [R,L]

Is there a way to create a RewriteCond rule to redirect all requests with a URL that contains a "?" to a 404 page, but exclude any pages that contain a ? under a subfolder called /gsearch/? There could be many subfolders called /gsearch/, because there will be one for each language directory.
 
LVL 29

Accepted Solution

by:
Dr. Klahn earned 2000 total points
ID: 41754859
Part of the second line appears to have been dropped.  It should read:

RewriteRule .* %{REQUEST_URI} [R,L]
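
For completeness, a sketch that combines the corrected rule with the /gsearch/ exclusion requested above; the folder name, the 301 status, and matching /gsearch/ anywhere in the path are assumptions:

RewriteEngine On
# Leave any URL whose path contains a /gsearch/ segment untouched,
# regardless of which language folder it sits under
RewriteCond %{REQUEST_URI} !/gsearch/
# Apply only when the request carries a query string
RewriteCond %{QUERY_STRING} .
# Redirect to the same path; the trailing ? discards the query string
RewriteRule ^ %{REQUEST_URI}? [R=301,L]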


 
LVL 1

Author Comment

by:boltweb
ID: 41755019
Thank you.
 
LVL 1

Author Closing Comment

by:boltweb
ID: 41755021
Thank you.
