Solved

robots.txt: how can we exclude URLs that contain a question mark anywhere on our domain?

Posted on 2016-08-10
126 Views
Last Modified: 2016-08-13
Hello, we have noticed that our site has been indexed in the search engines under query-string URLs. None of these URLs are on our website. All our pages are simple static HTML pages. What instructions would we need to include in our robots.txt so that these URLs would not be indexed in the search engines?

Here are two examples of URLs we would like to exclude from being indexed:

http://www.our-website.com/?viewType=Print&%3Bamp%3BviewClass=Print
http://www.our-website.com/transport/page1.html?viewType=Print&%3Bamp%3BviewClass=Print

We have many links like this that we don't want. The equivalent links we DO want indexed would be:

http://www.our-website.com
http://www.our-website.com/transport/page1.html

We have an Apache web server and use .htaccess and robots.txt files.

Thank you
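[Editor's note: on the robots.txt side of the question, the major crawlers (Google and Bing, though not all bots) honor wildcard patterns in robots.txt, so a rule along these lines is one sketch of a way to match every URL that carries a query string. Bear in mind it blocks crawling only; URLs the engine already knows can remain in the index.]

```text
User-agent: *
# Match any path followed by a question mark (wildcard support
# is a Google/Bing extension, not part of the original standard)
Disallow: /*?
```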
Question by:boltweb
8 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 41751210
There are two issues here.

#1. Google doesn't use just the links on your site in its search results. If someone else has one of those links on their pages, for whatever reason, it is next to impossible to get rid of it, essentially because it is 'theirs' and not yours.

#2. The best way to tell Google what you do want indexed is by submitting a sitemap through Google Webmaster Tools (now called Google Search Console). https://www.google.com/webmasters/tools/
 
LVL 27

Expert Comment

by:Dr. Klahn
ID: 41751288
I don't believe what you request can be done in robots.txt.

In this situation I would activate mod_rewrite and use it to either:

a) Detect any URI containing a question mark and reject the request.
b) Detect a URI containing a question mark, and forward the request to a rewritten URI with everything after the question mark stripped.
c) Rewrite the URI to strip everything after the question mark.

(b) would be my choice.  This should eventually clean up the search engine entries as the search engine bots discover that the URIs are being forwarded.
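
[Editor's note: as a sketch of option (b), assuming the rules live in the site's .htaccess with mod_rewrite available, an external redirect that strips the query string could look like this:]

```apache
RewriteEngine On
# Only act when the request actually carries a query string
RewriteCond %{QUERY_STRING} .
# Redirect to the bare path; the trailing "?" discards the query string
RewriteRule ^(.*)$ /$1? [R=301,L]
```

[On Apache 2.4 the `QSD` (query string discard) flag is an alternative to the trailing `?` in the substitution.]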
 
LVL 1

Author Comment

by:boltweb
ID: 41754334
Hello, thanks for your replies. I've been working full time on this issue to try to resolve it.

Dave: I already have Google sitemaps, which I have been using through Google Webmaster Tools since 2008. This issue still occurred even though I always update the sitemap with new URLs once per week.

Dr. Klahn, your answer looks good. I implemented solution (a) and it redirected to a 404 page, which I thought was the solution, but then I noticed that the Google search box on our site stopped working, because it produces a query string (?) in the URL. This is the only component on our site that creates query strings; I did not notice until I blocked all query strings. I could put the Google search in a special folder called /gsearch/; that way, if I could block all query strings on the website *except* those in a /gsearch/ folder, that would solve the problem. I have read that if Google finds a 404 page it will drop it from the index after a few attempts.

I cannot permanently remove these pages from Google using Webmaster Tools unless I also lose the original page, because I would have to add tags to the page for it to be permanently removed. Google has a temporary removal, but the problem with that is that the URLs could reappear after 3 months. I would also need a list of all the URLs, and there are too many to list (several hundred), with no easy way to get a list of them either.

On your suggestion (b), I am not clear on how I can forward the request to a rewritten URI if I don't know what the URI would be, because I don't have a list of all the affected URIs and I don't see how I can get such a list. How would you suggest I get this list of affected pages? I can use Google, but there are too many to copy manually.

Option (c) looks like a good solution, but it might also affect the Google search.

I think option (c) looks good if we could exclude anything in a /gsearch/ folder; option (a) also looks like a possible solution if we could exclude anything in a /gsearch/ folder. Note that we have the site translated, each translated version is in a separate subfolder for that language, and each language has its own separate /gsearch/ folder. So we would need to instruct the server to block all ? in all URIs except in any folder named /gsearch/. Do you know what the code would be to do that?

best regards

JohnB
 
LVL 27

Expert Comment

by:Dr. Klahn
ID: 41754432
On your suggestion  b) I am not clear on how I can forward the request to a rewritten URI if I don't know what the URI would be, because I don't have a list of all the URI affected and I don't see how I can get such a list. How would you suggest I get this list of affected pages?  I can use google but there are too many to copy manually.

(This is rather ugly code)

RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
RewriteRule %{REQUEST_URI} [R,L]



If the request contains a question mark,
rewrite it to the path component of the URI instead, mark it as a redirect, and release it now without further processing.

REQUEST_URI
    The path component of the requested URI, such as "/index.html". This notably excludes the query string which is available as its own variable named QUERY_STRING.

THE_REQUEST
    The full HTTP request line sent by the browser to the server (e.g., "GET /index.html HTTP/1.1"). This does not include any additional headers sent by the browser. This value has not been unescaped (decoded), unlike most other variables below.
 
LVL 1

Author Comment

by:boltweb
ID: 41754611
Hello, I tried the code you suggested and it did not do anything, i.e. the query-string page loaded just as normal.

i.e. this code did not work:
RewriteCond %{THE_REQUEST} ^.*(\?).* [NC]
RewriteRule %{REQUEST_URI} [R,L]

Is there a way to create a RewriteCond rule to redirect all requests with a URL that contains a "?" to a 404 page, but exclude any pages that contain a ? in a subfolder called /gsearch/? There could be many subfolders called /gsearch/, because there will be one for each language directory.
 
LVL 27

Accepted Solution

by:
Dr. Klahn earned 500 total points
ID: 41754859
Part of the second line appears to have been dropped.  It should read:

RewriteRule .* %{REQUEST_URI} [R,L]


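[Editor's note: for later readers, combining the corrected rule with the /gsearch/ exclusion the author asked about, a sketch (assuming the per-language site-search pages all live under folders named /gsearch/, per the author's description) might be:]

```apache
RewriteEngine On
# Leave the site-search pages alone, whatever language folder they sit in
RewriteCond %{REQUEST_URI} !/gsearch/ [NC]
# Only act when a query string is actually present
RewriteCond %{QUERY_STRING} .
# REQUEST_URI is the path without the query string; the trailing "?"
# stops mod_rewrite from re-appending the original query string to the
# redirect target (which would otherwise cause a redirect loop)
RewriteRule .* %{REQUEST_URI}? [R=301,L]
```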
 
LVL 1

Author Comment

by:boltweb
ID: 41755019
Thank you.
 
LVL 1

Author Closing Comment

by:boltweb
ID: 41755021
Thank you.

Question has a verified solution.