Solved

Collation for a new MS SQL Server

Posted on 2013-01-09
7
1,220 Views
Last Modified: 2013-01-12
Hi,
We're about to install a new MS SQL Server which would hold all our new DW and OLAP databases for the company.
We intend to insert into it data from legacy DBs, which, among other things, hold URLs with unicode characters.
In general we would like to be able to support many kinds of non-English languages, but we're not sure which collation to choose for the server setup and for the instances.
Our main candidates are:
1) utf8_general_ci - which would have been the best option maybe, but it seems that you cannot assign this collation at server-level! Problem is, if we would define it at instance level, what do we assign to the server-level, and what is the affect of having them not the same?
2) Latin1_General_CI_AS - the fallback of the above maybe, since it can be defined at both the server and instance levels, but what do we do with our unicode requirements?
3) Finnish_Swedish_CI_AS - since we have some swedish and finnish sites, some people refer to this collation is maybe better against the "latin" one above.

So the end question is: what collation should we choose at server-level and on instance-level?
Thank you in advance!!!
0
Comment
Question by:webpals
7 Comments
 
LVL 21

Expert Comment

by:Alpesh Patel
ID: 38761791
Pls use SQL_General_Latin1_CI_AS  it's best option for all languages
0
 
LVL 20

Assisted Solution

by:Marten Rune
Marten Rune earned 100 total points
ID: 38761831
Youre in trouble now.

If a legacy product demands finnish_swedish_CI_AS, and another SQL_General_Latin_CI_AS chances are things wont work, and you'll be running on unsupported configuration for one of the legacy products.

This is because bad coding matches var, nvar or similar columns with a operation in TempDB. Now TempDB will have the same collation as the server collation, not the database collation.

SQL Server itself is designed to allow any and all collations, and you should be able to support both legacy products if coders could only remember one small phrace when they create a tempory table, and that is "collate database_default" if they would only use this magic phrace the SQL could support it all on one instance.

Look at:
http://blog.sqlauthority.com/2007/06/11/sql-server-cannot-resolve-collation-conflict-for-equal-to-operation/
and
http://msdn.microsoft.com/en-us/library/aa258237(v=sql.80).aspx
or
http://www.sqlservercentral.com/Forums/Topic813941-338-1.aspx

It's been around since SQL2000. So you'd expect the programmers of legacy product to pick up on it, for like a decade ago. But this looks like it will sadly never happen.

There is always a new rookie programmer in every legacy products coding teem it seems.

Your solution will likely be the same as many of us out there, yet another instance of SQL, on the same host (for licensing reason), and you'll miss out on resource govenor (wont work between instances) and have a hard time figuring out wich instance is smuthering the others when it comes to performance tuning.

Regards Marten
0
 

Author Comment

by:webpals
ID: 38761879
Dear Marten,
Thank you for the long answer and details.

What I'm not yet clear about is: since the server would be running all its TempDB tables using the server collation SQL_General_Latin_CI_AS, do you recommend to:
1) Use the same collation for the DB instances too so to have no collisions. Thus the question is: how do we save and use our needed "unicode" data?
or
2) Use a different collation for the DB instance, like utf8_general_ci

Second, why is the recommendation to use SQL_General_Latin_CI_AS, which is stated at MS as an "old" SQL collation, rather than the "new" General_Latin_CI_AS?

Thanks
0
Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

 
LVL 51

Accepted Solution

by:
Mark Wills earned 400 total points
ID: 38761986
Well, your unicode data is readily handled with NVARCHAR() and NCHAR() instead of VARCHAR

Given the diversity of languages, and the disposition of SQL to install according to the code page (windows collation), I would suggest you chose the easiest to live with in general terms.

If you really do have very specific language dependancies, then maybe you need to consider some databases being swedish (or finnish or whatever).

But it is often considered that collation must be the same as language, and that isnt entirely the case. The collation must be able to support the choice of language and typically the unicode data types will certainly do that...

Also, you need to (maybe) consider the newer collations to support windows collations - they have a 100 in there...

Generally speaking, I wouldnt use SQL collation unless there was some kind of heritage or prior dependancy. I would default to Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive. And for language dependant columns which require unicode then use nvarchar().

Have a read of :
http://msdn.microsoft.com/en-US/library/ms143726(v=SQL.110).aspx

Probably the more important and definitely language dependant aspect are things like Month names and Day names but that is another story altogether...

One last thing to consider is you can really impact (big time) any server performance if you need to use the COLLATE clause in your selects. If the collation needs to convert a sort sequence then there can be a fairly significant penalty.

There was a quizz on Beyond Relational which shows some of the possible impact : http://beyondrelational.com/quiz/sqlserver/general/2010/questions/1/sqlserver-quiz-general-2010-erland-sommarskog-a-collation-puzzle.aspx and a general discussion (with examples)  http://beyondrelational.com/modules/24/syndicated/842/Posts/17777/revised-difference-between-collation-sqllatin1generalcp1cias-and-latin1generalcias.aspx
0
 

Author Comment

by:webpals
ID: 38762844
Mark, thank you!
So I'm now finally convinced that using the non-unicode Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive option, plus using NVARCHAR() and NCHAR() for unicode fields is the best option for us.

Just to confirm what you mean by "Generally speaking, I wouldnt use SQL collation unless there was some kind of heritage or prior dependancy"

I read: http://beyondrelational.com/modules/24/syndicated/842/Posts/17777/revised-difference-between-collation-sqllatin1generalcp1cias-and-latin1generalcias.aspx

And from it's conclusion, it looks like Latin1_General_CI_AS is a better option than the SQL counterparts like: SQL_General_Latin1_CI_AS or SQL_Latin1_General_CP1_CI_AS suggested on this thread.

Is that true?
0
 
LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 400 total points
ID: 38763645
Yes, it is true.

And if using the newer collations (the _100 to correspond with Win Server 2008) , there is a whitepaper worth reading first : http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/Pre-SQLServer2008_NativeDataProviders_and_NewCollations.docx
0
 

Author Comment

by:webpals
ID: 38771280
Thank you, Mark and thank you Marten. It's clear to me now, we'll go for the new collations (non SQL_ ones) either Latin1_General_CI_AS or the newer ones suggested (the _100) based on the server + client we use.
0

Featured Post

Backup Your Microsoft Windows Server®

Backup all your Microsoft Windows Server – on-premises, in remote locations, in private and hybrid clouds. Your entire Windows Server will be backed up in one easy step with patented, block-level disk imaging. We achieve RTOs (recovery time objectives) as low as 15 seconds.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
Via a live example, show how to set up a backup for SQL Server using a Maintenance Plan and how to schedule the job into SQL Server Agent.
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.

948 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now