Oliver2000 (Brazil) asked:

Duplicate content in Google

Hi experts,

I have a forum with more than 2 million pages indexed in Google. However, I know that a lot of these pages are duplicate content, because the same forum post opens under 3 different URLs.

For example, the same topic opens under these URLs:
1. viewtopic.php?p=xxxx (view a single post)
2. viewtopic.php?t=xxxx (view the entire topic, which of course includes the post above)
3. this_is_example_topic.html (cached version of the same topic)

I normally prefer the SEO version of the URL (no. 3), but unfortunately Google has also indexed the viewtopic versions, more than 500,000 times.

I thought about simply blocking the Google indexer in my robots.txt, denying access to all viewtopic URLs, which would filter out everything except the SEO URL version. BUT I fear this is going to hurt me and do more harm than good, because it means Google is going to kick hundreds of thousands of URLs out of the index. The content might not be 100% indexed under the SEO URL version, and once I kick the viewtopic URLs out I might lose my traffic.

I would be happy for some suggestions or ideas on how to clean up my URLs in Google without creating a big mess.

Thanks in advance.
QuinnDex

I would use sitemaps to pass the URLs you want indexed to Google, leaving out the URLs you consider to be duplicates. This may not stop Google from indexing them, though.

https://support.google.com/webmasters/answer/156184?hl=en

You could use rel="nofollow" on links to that page, but if that page is linked to from another site, Google could still index it.

https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3

The only way to stop the page from being indexed is a meta tag telling Google not to index it:

<meta name="robots" content="noindex, nofollow" />
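
A minimal sketch of how the tag could be emitted only on the duplicate URLs, so the SEO version stays indexable. It assumes every duplicate is served by viewtopic.php, as described in the question; the function name `robots_meta_for` is made up for illustration, and a real forum would do this inside its page template.

```python
from urllib.parse import urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow" />'


def robots_meta_for(url: str) -> str:
    """Return the noindex tag for viewtopic.php URLs, or '' for any other page."""
    path = urlsplit(url).path
    # Assumption: all duplicate pages are served by viewtopic.php.
    if path.rsplit("/", 1)[-1] == "viewtopic.php":
        return NOINDEX_TAG
    return ""
```

With this, `robots_meta_for("/viewtopic.php?t=42")` returns the tag, while `robots_meta_for("/this_is_example_topic.html")` returns an empty string.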

ASKER

@QuinnDex: Hi, but this is not at all what I asked. I know how to stop Google from indexing. My question is how I can clean up without losing traffic now. Please read my question again carefully. I also guess that a sitemap won't help, since we are talking about 5 million pages (a little bit too much for a sitemap, I guess) and thousands of new posts every day.
You can't clean up yourself; Google will do that over time. As Google tries to re-index the duplicate pages, it will see the noindex meta tag and drop those pages from the index.

You can go into your Google Webmaster Tools account and remove URLs manually, but that isn't practical for 500,000 pages.

"5 million pages (a little bit too much for a sitemap, I guess)"

That's not too much at all. You can produce sitemaps dynamically on a daily basis. This is how I used to do it on my site; it only had 2 million pages, but once you get into those figures it doesn't make much difference.
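
One way a dynamic sitemap for millions of pages could be put together, as a rough sketch only. The sitemaps.org protocol limits a single sitemap file to 50,000 URLs, so a large site splits its URLs across several files and lists them in a sitemap index. The function name `build_sitemaps`, the file names, and the idea that the SEO URLs arrive as a plain list are all assumptions for illustration, not how the poster actually did it.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # per-file limit from the sitemaps.org protocol


def build_sitemaps(urls, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml_text} for the sitemap files plus a sitemap index."""
    ns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
    header = '<?xml version="1.0" encoding="UTF-8"?>\n'
    files = {}
    chunk_names = []
    # Split the URL list into chunks of at most max_urls entries.
    for start in range(0, len(urls), max_urls):
        name = f"sitemap-{len(chunk_names) + 1}.xml"
        entries = "".join(
            f"  <url><loc>{escape(u)}</loc></url>\n"
            for u in urls[start:start + max_urls]
        )
        files[name] = f"{header}<urlset {ns}>\n{entries}</urlset>\n"
        chunk_names.append(name)
    # The index file points at every chunk file.
    index_entries = "".join(
        f"  <sitemap><loc>{escape(base_url + '/' + n)}</loc></sitemap>\n"
        for n in chunk_names
    )
    files["sitemap-index.xml"] = (
        f"{header}<sitemapindex {ns}>\n{index_entries}</sitemapindex>\n"
    )
    return files
```

At 50,000 URLs per file, 5 million SEO URLs would come out as 100 sitemap files plus one index. Only the SEO URLs would go into the list; the viewtopic.php duplicates are simply left out.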
Thanks for your feedback again.

I was checking around, and it looks like Google always has both versions in its index: the SEO URL version and the normal viewtopic version. If I now simply block the viewtopic version in my robots.txt, Google will get blocked the next time it tries to rescan a viewtopic URL.

So far so good... but here comes the main question.

Is this going to be good or bad in the end?

What I mean is: which will have the higher value for me in the end and bring more traffic?

Option 1:
I keep the SEO URL version of the topics as well as the viewtopic.php version. My site is then in Google's index twice and appears in search results more often, for different queries.

Option 2:
I block the viewtopic version and drop half a million pages out of the index, but have only the clean SEO version indexed. Far fewer pages indexed, but no duplicate content.

Which will be better for me in the end?

Interesting info about the dynamic sitemap. I did not know that I can have a sitemap with millions of pages. Isn't this going to be a massive file? Any tip on how you did the dynamic sitemap in a case like this?
Is your SEO URL a rewritten URL?
If it is, I would redirect all traffic from the .php pages to the SEO version with a 301 redirect. This will update Google at the same time: Google will drop all the .php pages from its index over time and shift their weight towards the SEO URLs, so they will rank higher. Fewer URLs, but better ranked.
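
A rough sketch of the redirect decision, assuming the forum can look up the SEO slug for a topic id, and the topic id for a post id. The two dictionaries below are hypothetical stand-ins for that database lookup, and `redirect_for` is an illustrative name; the real forum is PHP and would do the equivalent in its own code or rewrite rules.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical lookup tables; a real forum would query its database.
TOPIC_SLUGS = {42: "this_is_example_topic"}
POST_TO_TOPIC = {1001: 42}


def redirect_for(url):
    """Return (301, location) for a viewtopic.php URL with a known SEO URL, else None."""
    parts = urlsplit(url)
    if parts.path.rsplit("/", 1)[-1] != "viewtopic.php":
        return None  # already an SEO URL (or some other page): no redirect
    query = parse_qs(parts.query)
    topic_id = None
    try:
        if "t" in query:
            topic_id = int(query["t"][0])
        elif "p" in query:
            # A post URL redirects to the topic that contains the post.
            topic_id = POST_TO_TOPIC.get(int(query["p"][0]))
    except ValueError:
        return None  # non-numeric id: leave the request alone
    slug = TOPIC_SLUGS.get(topic_id)
    if slug is None:
        return None  # unknown topic: let the forum answer as usual
    return 301, f"/{slug}.html"
```

Both `viewtopic.php?t=42` and `viewtopic.php?p=1001` map to `/this_is_example_topic.html` here, so the three URL variants from the question collapse into one.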
You wrote the key point for me: "fewer URLs but better ranked". Is this a fact? So you would say that fewer pages indexed, but with the remaining ones better ranked, is better than more pages indexed but with the same content and therefore lower ranked?

Of course I realize this doesn't happen overnight, but in the long run... 6-8 weeks.
In my case I don't even need to redirect much; I just need to block the duplicate URLs in my robots.txt. That would already do the job of making sure Google indexes only the SEO URLs.
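
Such a robots.txt rule can be checked before it goes live. The sketch below uses Python's standard `urllib.robotparser`; the two-line robots.txt is an assumption about what the rule might look like, and `is_crawlable` is an illustrative helper name. Note that a URL blocked this way is not fetched at all, so a crawler would never see a noindex tag or a 301 served on it.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: block every URL whose path starts with /viewtopic.php.
ROBOTS_TXT = """\
User-agent: *
Disallow: /viewtopic.php
"""


def is_crawlable(url, user_agent="Googlebot", robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under this rule the viewtopic.php URLs are reported as not crawlable, while `/this_is_example_topic.html` still is.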

My key doubt was always the same: are unique URLs with exclusive content but better ranking more important than duplicate URLs with the same content but 1,000,000 more pages indexed?

I am not sure how to create a dynamic sitemap of my entire forum :( but I will look into this.

To sum up: you guys are clearly telling me to throw out the duplicate content and prefer quality URLs over quantity indexed?
Duplicate content will not just get ranked lower; it will be de-indexed, and if Google finds a lot of duplicate content they may remove all versions of the content from the index. The most extreme penalty, which only happens when Google thinks you are intentionally trying to game the search, is blacklisting of the site. I don't see anything you are doing that should trigger the serious penalties, but the rank will probably not go up without the elimination of what appears to be duplicate content. It can take 3 to 6 months to recover from minor Google penalties, and years to fully recover from blacklisting.

Cd&
My site is one of the biggest in Brazil, with 5 million users and more than 13 years online. I don't think my site gets any bad attention from Google, since I never tried anything bad or tricky. I have honestly built my traffic with articles etc. over many years, and I pay ghostwriters etc. so that all my content is unique. The problem is caused by my forum, which grew a lot over the years. Years ago I did some SEO work with the better URLs but never got 100% rid of the old URLs. I was always afraid to touch this subject, so Google has indexed both versions for years. But I know and see that I do have duplicate and even triplicate content in some cases, under different URLs.

What I am getting out of all this is that it might take some time, but in the long run it is clearly better to have the duplicate content out and the really important URLs indexed more cleanly.

I am going to kick out all the wrong URLs based on what we discussed here and what I read in all kinds of SEO tips and sites. I am pretty sure that I got pulled down by the latest Panda/Penguin updates, since I dropped 20-25% in the last 6 weeks without any technical reason. But again, even if this was not the case and something else plays a role, it is better to clean up the URLs.

Thanks for your support, guys. I appreciate the input and help.
As BMW found out when they got blacklisted.

Blocking those pages in robots.txt will help, but a 301 redirect will correct the issue faster. Google picks up on patterns and will quickly catch on to the fact that you're redirecting all the .php pages with a 301, and will start crawling your site.

Also, in Webmaster Tools you can remove pages by parameter. I'm not sure whether it will accept a .php path as a parameter, but if so you could remove all your duplicates within a week.
A drop of 20-25% is typical of what we have seen from others coming to EE with duplicate content problems caused by Panda/Penguin, so I would say you will probably recover quickly once the duplicates are removed. Doing the cleanup will also protect your reputation with Google, so there is no long-term permanent damage.

Cd&
Thanks for the tips and discussion. Since there is no final solution, I am awarding you both equally for the time and tips you gave me. Thank you.