Solved

SQL Server composite keys vs GUID

Posted on 2009-06-29
489 Views
Last Modified: 2012-05-07
We are developing an application with .NET and SQL Server. Part of the application will allow users to enter data and later sync with the server database. We are considering using either a GUID or a composite key to sync the tables.

I know GUIDs carry a performance hit, but I am not sure I like having to join on two columns in every query with a composite key.

Does anybody have experience with this type of application? Suggestions?
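For anyone weighing the same choice, here is roughly what the two options look like in T-SQL (table and column names are invented for illustration):

```sql
-- Option 1: composite key -- a client id plus a client-local id together
-- identify the row, and every child table and join must carry both columns.
CREATE TABLE OrdersComposite (
    ClientId     INT      NOT NULL,
    LocalOrderId INT      NOT NULL,
    OrderDate    DATETIME NOT NULL,
    CONSTRAINT PK_OrdersComposite PRIMARY KEY (ClientId, LocalOrderId)
);
-- Typical join:
--   JOIN OrderLines l ON l.ClientId = o.ClientId
--                    AND l.LocalOrderId = o.LocalOrderId

-- Option 2: GUID key -- any node (client or server) can generate the key,
-- and joins need only one column.
CREATE TABLE OrdersGuid (
    OrderId   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    OrderDate DATETIME         NOT NULL,
    CONSTRAINT PK_OrdersGuid PRIMARY KEY (OrderId)
);
-- Typical join:  JOIN OrderLines l ON l.OrderId = o.OrderId
```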

Question by:JonMny
13 Comments
 
LVL 23

Accepted Solution

by:
Racim BOUDJAKDJI earned 250 total points
<<Suggestions?>>
Forget GUIDs: not only are they performance crunchers, but they are also relationally useless for the purpose of distinguishing tuples. Column combinations, on the other hand, do not have any of these problems...
 
LVL 40

Assisted Solution

by:mrjoltcola
mrjoltcola earned 250 total points
Yes. I prefer GUIDs almost exclusively; they make many things simpler.

Having a surrogate key simplifies various operations, and this is true of nearly all uses of surrogate keys, GUID or not:
1) Simpler SQL, since fewer key columns is better
2) Simpler foreign keys, since only one column is required
3) Surrogate keys are good practice because they allow domain-specific data to change, data which might otherwise be used for the primary key

The most important benefit I have found using GUIDs is their elegance in implementing distributed databases. Many of my applications employ remote hand-held PDAs or laptops to which we replicate.

GUIDs help in writing simpler replication rules in most replication middlewares I've used.
GUIDs can be generated by multiple nodes, not just the master server, so each remote database can insert happily without worry of conflict.

The only performance penalty is on INSERT, but in my opinion that is not enough of a reason to avoid them. GUIDs can even be pre-generated / cached, theoretically, but I've seen no need to in practice.
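If the INSERT penalty does become a concern, SQL Server 2005 and later offer NEWSEQUENTIALID(), which generates ever-increasing GUIDs so inserts append to the end of the clustered index instead of splitting pages at random points. A sketch (hypothetical table name; note that NEWSEQUENTIALID() is only valid as a column DEFAULT, and GUIDs generated on disconnected clients with NEWID() will still arrive out of order):

```sql
CREATE TABLE SyncedItem (
    -- sequential GUIDs keep the clustered index appending, not fragmenting
    ItemId  UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_SyncedItem_ItemId DEFAULT NEWSEQUENTIALID(),
    Payload NVARCHAR(200) NOT NULL,
    CONSTRAINT PK_SyncedItem PRIMARY KEY CLUSTERED (ItemId)
);
```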



 
LVL 40

Expert Comment

by:mrjoltcola
>>Yes. I prefer GUIDs almost exclusively, they make many things simpler.

I would like to clarify my statement. I prefer them in the context of replication, which you stated in the question.

If replication is not involved, I still prefer surrogate keys in general (identity columns or sequences), but not GUIDs specifically.
 
LVL 2

Expert Comment

by:Sguzek
In my opinion, use natural keys whenever possible. You will have *much* less headache with data quality with natural keys.

If your primary concern is good data quality, go with natural keys. If you care less about the data and want a simpler solution for replication, GUIDs are perfect.

When it comes to performance, a surrogate key *may* be faster in some applications. It depends on too many factors to give you a correct 'rule of thumb' answer.
 
LVL 40

Expert Comment

by:mrjoltcola
>>It depends on too many factors to give you correct 'rule of thumb' answer.

In this case replication is in the context, so GUIDs are appropriate.

Natural keys do not work well in distributed systems. Performance is secondary to maintainability.


The other topic in this discussion is not so much a GUID issue as a natural vs. surrogate key debate. I don't disagree with you about natural keys, but unless you can foresee the future, your requirements may change at some point, and natural keys sometimes do have the problem of needing to change the key value.
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
<<In my opinion, use natural keys whenever possible. You will have *much* less headache with data quality with natural keys.>>
Agreed.

Actually, there is no such thing as data integrity in surrogate-key-only schemas, because there is no mechanism to ensure one tuple can be distinguished from another. Duplicates are synonymous with poor data integrity. The only way a surrogate key has logical meaning in a database is if it is created alongside a unique constraint on the natural key, which basically comes down to the same level of resource consumption as using composite keys, EXCEPT when those composite keys are too long.

As far as replication is concerned, if there is no logical distinguishability between rows, that is just an open door to replication anomalies occurring all the time due to duplicates.

A good rule of thumb:

> When a composite key is short in total length, it is almost always better to use the natural key, be it composite or not.
> When a composite key gets too long (say > 50, to be safe), it is advisable to implement a surrogate key, but *only* if it is associated with a unique constraint on the natural key.
> A design trading data integrity for performance is simply a poor design. Logical integrity is as important as performance.

Regards...
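The compromise described in the second bullet might look like this in T-SQL (names invented for illustration): the surrogate GUID serves as the primary key for joins and replication, while the unique constraint keeps the natural key enforced so duplicates remain impossible.

```sql
CREATE TABLE PatientVisit (
    VisitId   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    PatientNo VARCHAR(20) NOT NULL,
    VisitDate DATETIME    NOT NULL,
    Clinic    VARCHAR(10) NOT NULL,
    CONSTRAINT PK_PatientVisit PRIMARY KEY (VisitId),
    -- the natural key stays enforced, so tuples remain distinguishable
    CONSTRAINT UQ_PatientVisit_Natural UNIQUE (PatientNo, VisitDate, Clinic)
);
```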
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
<<In this case replication is in the context, so GUIDs are appropriate.>>
Replication will cause tons of issues if you don't have logical data integrity on each side of the replication (especially merge replication). Trading one for the other is a *bad* idea. GUIDs are the ultimate data-integrity-killing tool. They can, however, be used *if* the composite key is too long AND there is a unique constraint on the natural key.

<<Natural keys do not work well in distributed systems. Performance is secondary to maintainable.>>
I work mainly with natural keys and normalized schemas, and I don't have any particular issue with that.

In distributed systems, I have seen many more people struggling with meaningless counters and the consequences of lost data integrity (duplicates) than people who have a proper logical design in place. Performance is to be dealt with at the physical level (partitioning, file strategy, hardware, etc.), not at the logical level. That is the rule of relational design.
 
LVL 40

Expert Comment

by:mrjoltcola
>>Replication will cause tons of issue if you don't have logical data integrity on each side of the replication

The replication technology I deal with works based on A KEY. If the key is a proper key, it makes no difference whatsoever whether it is a surrogate key or a natural tuple.

I've repeatedly worked with the following:

Oracle Replication
DB2 Replication
Sybase Mobilink
Oracle Lite

Oracle's replication has historically used ROWID replication, and ROWID is simply a database-provided surrogate key for a record. It does not matter what is in the record; if I have a duplicate on tier 1, I BETTER replicate that duplicate on up.

I'm curious what replication technology you are working with that has problems identifying rows with a surrogate key!
 
LVL 40

Expert Comment

by:mrjoltcola
It is academic knowledge that GUIDs and surrogate-key systems are very well suited to implementing distributed systems. We are not simply talking performance here, and no one said "trade data integrity for performance".

You guys are using the "tuple" case as a red herring. There are MANY uses of surrogate keys that are perfectly capable of ensuring data integrity, given the problem domain.

A surrogate GUID is a perfectly acceptable, capable mechanism to:
1) uniquely identify an entity to the distributed system
2) guarantee the integrity of the relationship between the entity and all objects it owns

No one is arguing that you cannot also use UNIQUE constraints in addition!

However, with models that house data from multiple clients (the Application Service Provider model, for example), one customer's data might have different constraints than the next. We can design a specific model for each of our few thousand clients, or we can instead take a more practical approach by NOT limiting ourselves to relational purity with model-enforced constraints, and instead employ a mix of surrogate keys and application-enforced constraints. Note I said a MIX! In NO way am I arguing that the application should maintain integrity for the database system -- only logical integrity from the application side. If you are familiar with the common ASP model, you know this issue by now.

The other problem in real systems is that there may be so many attributes in the tuple that it becomes a performance detriment and a maintenance overhead. It is not uncommon in the medical field to have databases with a PATIENT table, 30-40 child tables, and hierarchies 4-5 levels deep. A key may in some cases comprise 6 or more attributes where the actual data has only a single attribute. Then guess what happens when I have a table that is dependent on 2-3 parent tables with large foreign keys. Ever tried to support such an application? I know I have! It is poor design not to make use of surrogate keys when they fit the problem. I repeat, "when they fit the problem." When we are replicating this data to memory-constrained devices, such as BlackBerries and PDAs, there are real problems with these large foreign keys. I've done actual tests, and in some models the data footprint can be nearly double.
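The key-growth problem described above can be sketched like this (invented medical-style names): with pure natural keys, each level of the hierarchy inherits its parent's full key, so foreign keys widen at every level.

```sql
CREATE TABLE Patient (
    PatientNo VARCHAR(20) NOT NULL PRIMARY KEY
);
CREATE TABLE Visit (
    PatientNo VARCHAR(20) NOT NULL,
    VisitDate DATETIME    NOT NULL,
    CONSTRAINT PK_Visit PRIMARY KEY (PatientNo, VisitDate),
    CONSTRAINT FK_Visit_Patient FOREIGN KEY (PatientNo)
        REFERENCES Patient (PatientNo)
);
CREATE TABLE VisitNote (
    PatientNo VARCHAR(20)   NOT NULL,
    VisitDate DATETIME      NOT NULL,
    NoteSeq   INT           NOT NULL,
    NoteText  NVARCHAR(400) NOT NULL,
    -- three key columns already, and we are only two levels down;
    -- with a surrogate key each foreign key would be a single column
    CONSTRAINT PK_VisitNote PRIMARY KEY (PatientNo, VisitDate, NoteSeq),
    CONSTRAINT FK_VisitNote_Visit FOREIGN KEY (PatientNo, VisitDate)
        REFERENCES Visit (PatientNo, VisitDate)
);
```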


>>As far as replication is concerned, if there is no logical distinguishability between rows, that is just an open door to replication anomalies occurring all the time due to duplicates.

This is simply not true. I've implemented quite a few commercial replication designs with near-unlimited budgets, using at least half a dozen middlewares. We had the time and money to try 2-3 alternatives. Replication does not change the issue you stated. Regardless of whether a row is identified by a surrogate key or a natural tuple, it is identified nonetheless, provided the replication preserves all attributes (I'm not aware of any that doesn't). The constraint exists on all nodes in the system. If the replication middleware correctly maintains the keys, there will be no "replication anomalies" attributable to natural keys or surrogate keys. Any anomalies will be intrinsic to the data model, not to the fact that it is a distributed model.


>>I work mainly with natural keys and normalized schemas, and I don't have any particular issue with that.

That's great, and I've also been designing and maintaining large-scale distributed databases supporting hundreds of remote replicated clients for enough years to know what works. I'm not in the minority in this field.

GUIDs and surrogate keys work very well, and business is done on these models with 100% success. You can argue all the theory you want, but relational-model purity is not a panacea and not an end in itself. I'll use what is proven.

 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
<<Replication technology that I deal with works based on A KEY>>
Who claimed otherwise? A key is just a physical concept, different from the logical concept of a unique tuple identifier.

<<it makes no difference whatsoever whether it is a surrogate key or a natural tuple.>>
You are wrong. Replication that works with *no* proper logical data integrity will have the following side effects:

--> On the source, all the problems linked to the lack of a logical unique identifier
--> Additional bandwidth consumption due to replicating duplicates
--> Replication anomalies (in merge replication): if you don't have a proper logical mechanism to ensure distinguishability between *tuples*, how can you know which tuples ought to be synchronized?

<<if I have a duplicate on tier 1 I BETTER replicate that duplicate on up.>>
Well, if you consider replicating the problems linked to the lack of a proper *logical* identifier a good thing, who am I to argue with that.

<<I'm curious what replication technology you are working with that has problems identifying rows with a surrogate key!>>
Oh a few in the last 20 years.

I am not against what you are saying, just placing some safeguards. Replication is just a physical mechanism to synchronize *designated* rows. If these designated rows are not logically distinguishable, then the entire replication scheme is wrong; that does not mean it does not work.
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
<<It is academic knowledge that GUIDs and surrogate-key systems are very well suited to implementing distributed systems. We are not simply talking performance here, and no one said "trade data integrity for performance">>
Would you care to provide some serious academic pointers?

<<You guys are using the "tuple" case as a red herring. There are MANY uses of surrogate keys that are perfectly capable of ensuring data integrity, given the problem domain.

A surrogate GUID is a perfectly acceptable, capable mechanism to:
1) uniquely identify an entity to the distributed system
2) guarantee the integrity of the relationship between the entity and all objects it owns>>
Please reread carefully what I wrote.  Nobody claimed here that surrogate keys are *never* to be used.
 

--When a composite key gets too long (say > 50, to be safe), it is *advised* to implement a surrogate key, but *only* if it is associated with a unique constraint on the natural key.

<<However, with models that house data from multiple clients (Application Service Provider model for example) one customer's data might have different constraints than the next. We can design a specific model for each of our few thousand clients, or we can instead take a more practical approach by NOT limiting ourselves to relational purity with model enforced constraints and instead employ a mix of surrogate keys and application enforced constraints. Note I said a MIX! In NO way am I arguing the application should maintain integrity for the database system -- but logical integrity from the application side. If you are familiar with the common ASP model,  you know this issue by now.>>
Sorry, I have no idea what you are referring to. The relational model has been around for 30 years and is still the reference. I stick to it.

<<The other problem in real systems is there may be so many attributes in the tuple that it becomes a performance detriment and a maintenance overhead. It is not uncommon in the medical field to have databases with a PATIENT table, 30-40 child tables, and hierarchies 4-5 levels deep. A key may in some cases comprise 6 or more attributes where the actual data has only a single attribute. Then guess what happens when I have a table that is dependent on 2-3 parent tables with large foreign keys. Ever tried to support such an application? I know I have! It is poor design not to make use of surrogate keys when they fit the problem. I repeat, "when they fit the problem." When we are replicating this data to memory-constrained devices, such as BlackBerries and PDAs, there are real problems with these large foreign keys. I've done actual tests, and in some models the data footprint can be nearly double.>>
I have had a different experience. I believe, on the contrary, that anything that moves away from relational modeling is under-optimized. Representing hierarchies in relational modeling is child's play and keeps space consumption to a minimum.

<<This is simply not true. I've implemented quite a few commercial replication designs with near-unlimited budgets, using at least half a dozen middlewares. We had the time and money to try 2-3 alternatives. Replication does not change the issue you stated. Regardless of whether a row is identified by a surrogate key or a natural tuple, it is identified nonetheless, provided the replication preserves all attributes (I'm not aware of any that doesn't). The constraint exists on all nodes in the system. If the replication middleware correctly maintains the keys, there will be no "replication anomalies" attributable to natural keys or surrogate keys. Any anomalies will be intrinsic to the data model, not to the fact that it is a distributed model.>>
Sorry, you lost me again... I do not know what a "distributive model" is.

<<Thats great, and I've also been designing and maintaining large scale distributed databases supporting hundreds of remote replicated clients for enough years to know what works. I'm not in the minority in this field.>>
Good for you...

<<GUIDs and surrogate keys work very well and business is done on these models with 100% success. You can argue all the theory you want, but relational model purity is not a panacea and not a means to an end. I'll use what is proven.>>
The concept of a unique identifier has been working for the last 30 years for the majority of people.
You may have had experiences that tell you otherwise, but that does not mean you can ignore all the previous work on the subject.
 
LVL 9

Author Comment

by:JonMny
Thanks for all the comments. Without getting into the debate over natural keys etc., I have decided to use a GUID for the sync records. Currently we have 4 tables that will be involved in a sync where the client will have records that do not exist on the server. For the other 19, the server will be the only place where inserts occur. Does it make sense to use a GUID on all the tables for "consistency", or should I just use it on the ones that need it? I know the argument I will get is that things could change, and then we would have to change the database, code, etc.
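One practical note on that decision (an observation about SQL Server's own machinery, not a recommendation either way): if the sync ends up using SQL Server merge replication, every published table must have a uniqueidentifier column marked ROWGUIDCOL, and the publication setup adds one automatically where it is missing. Adding it up front, sketched below with a hypothetical table name, at least keeps the schema change under your control:

```sql
-- merge replication requires a ROWGUIDCOL uniqueidentifier column;
-- adding it explicitly avoids a surprise schema change at publish time
ALTER TABLE ClientOrders
    ADD rowguid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL
        CONSTRAINT DF_ClientOrders_rowguid DEFAULT NEWSEQUENTIALID();
```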
 
 
 
LVL 2

Expert Comment

by:Sguzek
I would use the KISS rule (Keep It Simple, Stupid - no offence).

If you don't need GUIDs in the remaining tables, just don't add them. Look at the 'value' you're adding: it gives you nothing but costs you time (resources, money, etc.).

In other words, would you spend your own money on something unusable just because your neighbour has it?

Regards,
Slawek

