Duplicate content in Google

Hi experts,

I have a forum with over 2 million pages indexed in Google. However, I know that a lot of these pages are duplicate content, because the same forum post opens under 3 different URLs.

For example, the same topic opens under the URLs:
1. viewtopic.php?p=xxxx (view a single post)
2. viewtopic.php?t=xxxx (view the entire topic, which of course includes the post above)
3. this_is_example_topic.html (SEO-friendly version of the same topic as above)

I normally prefer the SEO version of the URL (no. 3), but unfortunately Google has also indexed the viewtopic versions more than 500,000 times.

I thought about simply blocking the Google indexer in my robots.txt, denying access to all viewtopic URLs, which would filter out everything except the SEO URL version. BUT I fear this is going to hit me negatively and do more harm than good, because it means Google is going to kick hundreds of thousands of URLs out of the index. The content might not be 100% indexed under the SEO URLs, and once I kick the viewtopic URLs out I might lose traffic.
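For reference, the blocking itself is only a two-line robots.txt rule; the open question in this thread is whether applying it is wise, not how to write it. A sketch, assuming the forum lives at the site root:

```
# Block crawling of the duplicate viewtopic URLs; the SEO
# *.html versions stay crawlable.  Note this only stops
# crawling -- already-indexed URLs may linger in the index.
User-agent: *
Disallow: /viewtopic.php
```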

I would be happy for some suggestions or ideas on how to clean up my URLs in Google without making a big mess.

thanks in advance
Oliver2000Asked:

QuinnDexCommented:
I would use sitemaps to pass the URLs you want indexed to Google, leaving out the URLs you consider duplicates. This may not stop Google from indexing them, though.

https://support.google.com/webmasters/answer/156184?hl=en

You could use rel="nofollow" on links to those pages, but if a page is linked to from another site, Google could still index it.

https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3

The only way to stop a page from being indexed is a meta tag telling Google not to index it:

<meta name="robots" content="noindex, nofollow" />
Oliver2000Author Commented:
@QuinnDex: Hi, but this is not at all what I asked. I know how I can stop Google from indexing; my question is how I can clean up without losing traffic. Please read my question carefully again. I also guess a sitemap won't help, since we are talking about 5 million pages (a little too much for a sitemap, I guess) and thousands of new posts every day.
QuinnDexCommented:
You can't clean up yourself; Google will do that over time. As Google tries to re-index the duplicate pages, it will see the noindex meta tag and drop them from the index.

You can go into your Google Webmaster Tools account and remove URLs manually, but that isn't practical for 500,000 pages.

QuinnDexCommented:
"5 million pages (a little bit too much for a sitemap, I guess)"

Not too much at all. You can produce sitemaps dynamically on a daily basis; this is how I used to do it on my site. Mine only had 2 million pages, but once you get into those figures it doesn't make much difference.
Oliver2000Author Commented:
Thanks for your feedback again.

I was checking around, and it looks like Google always has both versions in the index: the SEO URL version and the normal viewtopic version. If I now simply block the viewtopic version in my robots.txt, Google will try to rescan it next time and get blocked for the viewtopic URL.

So far so good... but here comes the main question: is this going to be good or bad in the end? What I mean is: which option will have the higher value for me and bring more traffic?

Option 1:
I leave both the SEO URL version of the topics and the viewtopic.php version. My site is then indexed twice in Google and appears more often in search results.

Option 2:
I block out the viewtopic version and drop half a million pages from my index, but have only the clean SEO version indexed. Far fewer pages indexed, but no duplicate content.

Which will be better for me in the end?

Interesting info about the dynamic sitemap. I did not know that I could have a sitemap with millions of pages; isn't that going to be a massive file? Any tip on how you did the dynamic sitemap in a case like this?
QuinnDexCommented:
Is your SEO URL a rewritten URL?
QuinnDexCommented:
If it is, I would redirect all traffic from the .php pages to the SEO version with a 301 redirect. This updates Google at the same time: Google will drop all the .php pages from its index over time but will shift weight towards the SEO URLs, so they will rank higher. Fewer URLs, but better ranked.
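On an Apache-hosted forum, the 301 redirect suggested above could be sketched in .htaccess roughly like this. It assumes the SEO URL can be rebuilt from the topic id alone (the `topic-<id>.html` pattern here is a placeholder, not the real URL scheme); if the slug has to come from the database, the redirect belongs in PHP instead:

```apache
# Hypothetical sketch: 301 the old topic URLs to the SEO version.
RewriteEngine On
# capture the topic id from the query string (?t=12345)
RewriteCond %{QUERY_STRING} (^|&)t=([0-9]+)
# the trailing "?" drops the old query string from the target
RewriteRule ^viewtopic\.php$ /topic-%2.html? [R=301,L]
```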
QuinnDexCommented:
To build a sitemap, you build it directly from the DB. I had MS SQL build mine in a stored procedure that was scheduled to run at 1 a.m. each day.

You build the maps with a maximum of 50,000 URLs in each file, plus an index file listing the sitemap files. The file size is also limited, so depending on how much info you want to include, reduce the number of URLs per file; I used to include 40k URLs so the file size never exceeded the limit.

Details are in the sitemap link I gave above.
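The chunked-sitemap approach described above can be sketched in a few lines of Python. This is a minimal illustration, not QuinnDex's actual stored procedure: the `/topic-<id>.html` pattern and `example.com` base are placeholders, and a real build would pull the IDs from the forum database rather than a `range()`.

```python
# Sketch: split a URL list into sitemap files of at most 40,000
# URLs each, plus one sitemap index file referencing them all.
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 40_000  # stay under the 50,000-URL protocol limit

def build_sitemaps(urls, base="https://www.example.com"):
    """Return {filename: xml_text} for the sitemap files and index."""
    files = {}
    chunks = [urls[i:i + MAX_URLS_PER_FILE]
              for i in range(0, len(urls), MAX_URLS_PER_FILE)]
    for n, chunk in enumerate(chunks, start=1):
        body = "\n".join(
            f"  <url><loc>{escape(base + u)}</loc></url>" for u in chunk)
        files[f"sitemap-{n}.xml"] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + body + "\n</urlset>")
    index = "\n".join(
        f"  <sitemap><loc>{base}/sitemap-{n}.xml</loc></sitemap>"
        for n in range(1, len(chunks) + 1))
    files["sitemap-index.xml"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + index + "\n</sitemapindex>")
    return files

# e.g. 100,000 topic URLs -> 3 sitemap files plus one index file
files = build_sitemaps([f"/topic-{i}.html" for i in range(100_000)])
print(sorted(files))
```

Only the index file's URL needs to be submitted to Google; scheduling this nightly (cron, etc.) keeps it current as new posts arrive.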
COBOLdinosaurCommented:
Not for points, just a bit of cleanup on the question of whether having multiple links to the same content will hurt.

The short answer is yes. Google's Panda updates heavily penalize duplicate content. The Penguin updates are ruthless in rating content quality, and duplicate content carries some weight in that assessment as well.

We have seen some sites drop as much as 40% in search-engine traffic over the last few months as Google redefines "relevant content" in the SERPs, and sloppy URL management is one of the things killing off some sites.

The best protection is a sitemap; then use the meta tags and robots.txt to limit access. The only (SEO-)safe ways to redirect are 301 and 302.

Cd&
Oliver2000Author Commented:
You wrote the key question for me: "fewer URLs but better ranked". Is this a fact? So you would say that having fewer pages indexed, with the remaining ones ranked better, beats having more pages indexed with the same content and therefore ranked lower?

Of course I realize this doesn't happen overnight, but in the long run... 6-8 weeks.
Oliver2000Author Commented:
In my case I don't even need to redirect much; I just need to block out the duplicate URLs in my robots.txt. That would already do the job of making sure Google indexes only the SEO URLs.

My key doubt was always the same: are single URLs with exclusive content, but better ranked, worth more than duplicate URLs with the same content but 1,000,000 more pages indexed?

I am not sure how to create a dynamic sitemap of my entire forum :( but I will have a look into this.

To sum up: you are telling me clearly to throw out the duplicate content and prefer quality URLs over quantity indexed?
COBOLdinosaurCommented:
Duplicate content will not just be ranked lower; it will be de-indexed, and if Google finds a lot of duplicate content it may remove all versions of the content from the index. The most extreme penalty, which only happens when Google thinks you are intentionally trying to game the search, is blacklisting of the site. I don't see anything you are doing that should trigger the serious penalties, but the rank will probably not go up without the elimination of what appears to be duplicate content. It can take 3 to 6 months to recover from minor Google penalties, and years to fully recover from blacklisting.

Cd&
Oliver2000Author Commented:
My site is one of the biggest in Brazil, with 5 million users and more than 13 years online. I don't think my site gets any bad attention from Google, since I never tried anything tricky. I have honestly built my traffic with articles over many years and pay ghostwriters so that all my content is unique. The problem is caused by my forum, which grew a lot over the years. I did some SEO work years ago with the better URLs but never got 100% rid of the old ones. I was always afraid to touch this subject, so Google has indexed both versions for years, and I know I have double and even triple content in some cases under different URLs.

What I am getting out of all this is clear: it might take some time, but in the long run it is better to have the duplicate content out and only the really important URLs cleanly indexed.

I am going to kick out all the wrong URLs based on what we discussed here and what I read on all kinds of SEO tip sites. I am pretty sure I got pulled down by the latest Panda/Penguin updates, since I dropped 20-25% in the last 6 weeks without any technical reason. But even if that were not the case and something else plays a role, it is still better to clean up the URLs.

Thanks for your support, guys. I appreciate the input and help.
QuinnDexCommented:
As BMW found out when they got blacklisted.

Blocking those pages in robots.txt will help, but a 301 redirect will correct the issue faster. Google picks up on patterns and will quickly catch on to the fact that you are 301-redirecting all the .php pages, and will start crawling your site.

Also, in Webmaster Tools you can remove pages by parameter. I am not sure if it will accept .php as a parameter, but if so, you can remove all your duplicates within a week.
COBOLdinosaurCommented:
A drop of 20-25% is typical of what we have seen from others coming to EE with duplicate-content problems caused by Panda/Penguin, so I would say you will probably recover quickly once the duplicates are removed. Doing the cleanup will also protect your reputation with Google, so there is no long-term permanent damage.

Cd&
Oliver2000Author Commented:
Thanks for the tips and discussion. Since there is no final solution, I award you both equally for the time and tips you gave me. Thank you.