boltweb (Spain) asked:
Website under attack: Duplicated content in Google URLs but duplicates are URLs on our domain with a "?"

We have noticed that hundreds of our web pages are being duplicated in the Google index, even though we never published any of these URLs. The unusual thing is that every duplicated URL contains our actual domain name in the base part of the URL, as if the page were hosted on our own web server.

e.g.  our legitimate homepage is like this:

http://www.our-domain.com/legitimate-page-URL.html

and the duplicated pages that are indexed in google for the above page are like this:

http://www.our-domain.com/legitimate-page-URL.html?garbage-text-here-page2
http://www.our-domain.com/legitimate-page-URL.html?garbage-text-page3

etc. There may be dozens of these "page2" and "page3" variants, all with content duplicated from the legitimate page.


All the URLs have a "?" (question mark, without the quotes) in them. We do not publish pages with a "?" in any of our URLs. We only have static HTML pages on our website; we do not use a database or content management system, just static HTML pages.

We were alerted to the problem a week ago when we noticed our homepage had disappeared from the Google index; even a specific search for the page produces no result. Then another of our pages dropped from position 1 to position 6, and other pages started losing ranking positions as well.

All the duplicate URLs carry the content of a legitimate page on our website, but those pages are not physically hosted on our server, so we are not clear how they got into the Google index. The problem is that there are hundreds of URLs that are copies of our homepage under different addresses, all of them cached in Google, so it appears Google thinks they are legitimate pages when in fact they are not. If we click on one of the links in Google, we see the legitimate page content but under the bogus URL. None of the bogus URLs exist on our server.

Does anyone know what is happening here, what we can do to stop it, and how to ensure these duplicate pages are not indexed by Google and other search engines?

thank you
JohnB
ASKER CERTIFIED SOLUTION
Jeffrey Dake (United States of America)
boltweb (ASKER)

Hello Jeffrey, thank you for your reply. My site is already registered in Google Webmaster Tools. I can't add rel=canonical everywhere yet because my pages are all static and there are 16,500 of them in total to tag. I'm getting a script written to do this, but until it is ready I can't implement that solution across the entire site. However, I've implemented it on the homepage, and the homepage recovered its position, which was a big relief.
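For anyone facing the same task, a bulk-tagging script along these lines could work. This is only a minimal sketch: the `site` folder name and the base URL are assumptions (substitute your own), it assumes UTF-8 files, and it should be run on a backup copy first.

```python
import os
import re

SITE_ROOT = "site"                      # hypothetical folder holding the static HTML files
BASE_URL = "http://www.our-domain.com"  # the canonical domain (assumption)

def add_canonical(html, url):
    """Insert a rel=canonical link just before </head>, if none is present yet."""
    if 'rel="canonical"' in html:
        return html  # already tagged; leave the page untouched
    tag = '<link rel="canonical" href="%s">\n' % url
    # Insert before the closing </head> tag, matching case-insensitively
    # and preserving the original tag's case.
    return re.sub(r'</head>', lambda m: tag + m.group(0),
                  html, count=1, flags=re.IGNORECASE)

def tag_all_pages(root, base_url):
    """Walk the site tree and rewrite every .html file in place."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # Build the canonical URL from the file's path relative to the root.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding="utf-8") as f:
                html = f.read()
            with open(path, "w", encoding="utf-8") as f:
                f.write(add_canonical(html, base_url + "/" + rel))

if __name__ == "__main__":
    tag_all_pages(SITE_ROOT, BASE_URL)
```

Because `add_canonical` skips pages that already contain a canonical tag, the script is safe to re-run as new pages are added.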

I looked into Google URL parameters, which was something I did not know about. I found a helpful training video on YouTube (Google parameters: https://www.youtube.com/watch?v=DiEYcBZ36po), and I have now set all parameters to be excluded from indexing. Note, however, that this tool does not remove a URL; it simply excludes it from being indexed. As for Google's removal tools, there are two, but both have problems: (a) for permanent removal I would need to edit or delete the page, which I can't do since the page does not exist on my server; (b) that leaves the temporary removal tool, but I don't have a full list of the affected URLs to submit, and even if I did, the removal is only temporary and the URLs would return in about 3 months.

The real solution is rel=canonical. The other thing I can do is add a mod_rewrite rule in the .htaccess file to block URLs with query strings, but the problem with that is I would lose the Google site-search functionality, whose result URLs contain a "?". I could move the Google search to load in a special folder called /gsearch/, but I would have to do that for each language of the site, and there are 8 translations. So if there were a way to block all query strings AND exclude /gsearch/, that would solve the problem until the script for applying rel=canonical is completed.
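A rule set along these lines could do exactly that: strip the query string everywhere except under /gsearch/. This is only a sketch, assuming Apache with mod_rewrite enabled and that the hosted search really lives under a /gsearch/ path; test it on a staging copy before deploying.

```
RewriteEngine On

# Leave the search pages alone: they legitimately use "?" query strings.
RewriteCond %{REQUEST_URI} !^/gsearch/
# Any other URL that carries a query string...
RewriteCond %{QUERY_STRING} .
# ...is 301-redirected to the same path with the query string stripped
# (the trailing "?" in the target discards the original query string),
# so the bogus "?garbage" variants collapse onto the legitimate page.
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The 301 redirect also tells Google that the "?garbage" variants are permanently gone, which should help the index clean itself up over time.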
boltweb (ASKER)

Thank you Jeffrey. Got my homepage back!