Solved

Natural Keys vs. Surrogate Keys

Posted on 2008-10-08
15
2,115 Views
Last Modified: 2008-10-14
( my apologies to those whom I have inadvertently quoted with attributing the quote ... )

Use of surrogate keys versus natural keys is a recurring debate in the world of database design. There are advocates on both sides that are fanatically religious in their preference. Personally, I've tried both ways (for several years each), and my preference has switched multiple times. I've read many people's opinions and they've not been definitively convincing.

As far as I can see, surrogate keys have four major advantages.

First, you can change the natural key, if you wish.

The second advantage is that surrogate keys often take less space in the database than natural keys do.

The third advantage is that one surrogate key can take the place of a multi-field (composite) natural key.

The last advantage is that surrogate keys generally take less thought when designing the database. ( I wish this were not the case. But, in almost all the database design sessions in which I have participated, if surrogate keys were chosen, it has been true. It was chosen because it seemed "easier".)

On the flip-side, though, I've found that the use of surrogate keys typically requires significantly more joins in order to retrieve meaningful information.

Another problem with surrogate keys is that a natural key might be duplicated (although avoiding duplicate natural keys is relatively easily by assigning a unique constraint to the natural key).

( see: http://www.bcarter.com/intsurr1.htm )

So, without starting a religious flame war, can you tell me which one you use and why?  What situations would naturally lend itself to one way or the other?

-- DaveSlash
0
Comment
Question by:daveslash
  • 7
  • 4
  • 3
  • +1
15 Comments
 
LVL 22

Expert Comment

by:dportas
Comment Utility
The way to answer the last part of your question is to understand data modelling properly so that you'll be able to make your own informed decision. A potted summary in this forum isn't likely to be a good way to learn. Good books or courses are better ways.

Responses to your 4 "major advantages":

1. This is nothing to do with surrogate keys. Natural key values can be changed without the presence of a surrogate. (Formally speaking you should say that one set of tuples is replaced by another - the notion of a key being "changed" doesn't make much sense in the relational model - but the situation is no different with or without a surrogate).

2. A table with a surrogate key self-evidently has MORE data than the same data without the surrogate. Doubly so if both surrogate and natural key are indexed.

3. Why is that an advantage?

4. You believe that thinking less is an advantage?
0
 
LVL 42

Accepted Solution

by:
dqmq earned 167 total points
Comment Utility
Oh boy, this is going to be fun...

First, let me address the pros/cons that you identified:

>First, you can change the natural key, if you wish.

If you have a need to change the natural key very much, then it is a poor choice for a primary key.  One good reason to use a surrogate key is when a good natural key is not available.  However, there is a HUGE risk in that case, that the surrogate key will acquire business meaning over time.  In other words, it ceases to be a surrogate key which, by definition, has no business meaning.  Really, what we've done is introduced a new natural key that the business didn't recognize before.

>The second advantage is that surrogate keys often take less space in the database than natural keys do.

Well, hmmm...the surrogate key assumes MORE space in the table where it serves as the primary key because the natural key does not go away.  However, it is true that a surrogate key can often be shorter than the primary key and then take up less room in the FK references to it.  I believe the width of the key is usually more important from a performance standpoint than it is from a disk space standpoint.

>The third advantage is that one surrogate key can take the place of a multi-field (composite) natural key.

Yes, and that makes joins a lot simpler for the programmer and faster for the database engine to resolve.  I give surrogate keys stronger consideration when the number of columns in the key exceeds three.
 
>The last advantage is that surrogate keys generally take less thought when designing the database.

That's certainly a double-edged sword, isn't it?

>I've found that the use of surrogate keys typically requires significantly more joins in order to retrieve meaningful information.

More important than that, it weakens the database's ability to enforce some of the business rules.  

>Another problem with surrogate keys is that a natural key might be duplicated

As you've pointed out, we have an easy solution for that and it should ALWAYS be done to protect integrity of the data.  But the correllary to "less thought" that you mentioned above, is that, more often than not, that step gets skipped.

To summarize, it's a grave oversight and sells our profession short to adopt the overly-simplistic approach of putting a surrogate key on every table.

I always do a logical design which is based on business rules and therefore has no surrogate keys.

I stick with the natural keys unless a have a defensible reason to do otherwise.  My philosophy is that data integrity is the "prime directive" of a database system and natural keys are better at it.

Before introducing a surrogate key, I evaluate whether the business really needs a better natural key.  In that case, I'd be inclined to give the business some choice in the form of that key (rather than the next-sequential number approach).  

I consider using a surrogate key if the natural key is too wide.  By that I mean, more than about 3 columns or 12 bytes.  I do not adhere to a hard-and-fast threshold... but the wider the key, the more I'm inclined to use a surrogate.

I consider using a surrogate keys to overcome certain compatibility issues that come about as a result of conversions, consolidations, changing semantics, and those sorts of things.  A data warehouse has lots of surrogate keys for these reasons.

I consider using a surrogate key when performance or disk space demands it.

I avoid a surrogate key when the almost-natural key needs a tie breaker.  I very well may use an identity, autonumber, sequence or whatever you want to call it, but I would not assert uniqueness on that singular column and I would not evevate it to the PK.

I take GREAT care to avoid a surrogate key when it compromises a business rule that is enforceable through a natural key.

If I do use a surrogate key, I always declare a unique constraint on the natural key.


I













0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
In response to dportas whose opionion I greatly respect.

>This is nothing to do with surrogate keys.
Spot on.  It has to do with lazy or poorly trained programmers and compensating for a poor choice of primary key in the first place.

2. A table with a surrogate key self-evidently has MORE data than the same data without the surrogate. Doubly so if both surrogate and natural key are indexed.
True.  But, I think the argument is that the surrogate key is narrower than the natural key and therefore takes up less room in child tables and their indexes.  Of course, we both know, that's not always the case.

3. Why is that [single column key] an advantage?
Those lazy programmers again.  But, in their defense, SQL is less prone to error when the join conditions are simple.  And some SQL expressions are vastly simpler with a single-column identity.

4. You believe that thinking less is an advantage?
Less bad thinking is an advantage.  :>)  Those who advocate skimping in the planning/design phases exemplify that.
 
0
 
LVL 47

Assisted Solution

by:schwertner
schwertner earned 167 total points
Comment Utility
Nowadays in WEB development it is almost not possible to use natural keys.
There are also big issues using surrogate keys in this environment, because
after creating the primary key the application needs it to create some referential foreign
keys. This takes time, because the application runs in other environment.
So the practitioners deliver monster surrogate key in 128 or 256 bytes columns,
what is horror for the DBAs like me.

Also the usage of natural keys is limited, because except some academic examples
the natural keys (or their components) are classified data - SSNs, etc. and are kept
in the DB encrypted. It is not possible to use encrypted data as key at all.

I do not worry about joins. I worry about 2 things:
1. The length of the key
2. The size of the underlying index (keys without indexes do not work, exceprt
in the University Labs).
0
 
LVL 47

Expert Comment

by:schwertner
Comment Utility
And why your question is placed in the Oracle thread?
I have no intention to take part in the DB2 thread.
0
 
LVL 47

Expert Comment

by:schwertner
Comment Utility
0
 
LVL 22

Assisted Solution

by:dportas
dportas earned 166 total points
Comment Utility
That techrepublic article bases its discussion on some confused and misunderstood ideas. Nothing in the relational model requires that key attributes must be "stable" or that the key attributes can't be changed (stability is often a desirable property for practical reasons but it is not a fundamental requirement).

Unfortunately these discussions are far too often based on some appaling misconceptions about basic database principles and so they end up being futile and uninformative. The tragedy is that some people actually try to learn from rubbish of that kind.

0
Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

 
LVL 47

Expert Comment

by:schwertner
Comment Utility
dportas:

This what you say (10.09.2008 at 01:13PM EEST) is not true.
Primary key needs stability (structure and type).
This comes not from the Relational Model (it is theory!),
this comes from development and DBA practice.
If you change the type of the Primry Key then all joins
have to be rewriten and all foreign keys should be redifined.
If you change the values you have to change also the foreign
key values. This is big issue in a DB.
0
 
LVL 47

Expert Comment

by:schwertner
Comment Utility
Also even in my lecture today I stated (according Oracle recommendations):
" Primary keys generally cannot be changed.

The Relational Theory is good for Academics, but Practitioners have
to deal with many other circumstances, that make the life hard.
##### Excertpt ##########
Introduction to Oracle9i: SQL I-19

Guidelines for Primary Keys and Foreign Keys
" You can not use duplicate values in a primary key.
" Primary keys generally cannot be changed.
" Foreign keys are based on data values and are purely logical, not physical, pointers.
" A foreign key value must match an existing primary key value or unique key value, or else be null.
" A foreign key must reference either a primary key or unique key column.
0
 
LVL 22

Expert Comment

by:dportas
Comment Utility
Schwertner. Not sure I follow. You say that what I've said is "not true" but then go on to repeat my points as though you agree with them. In a nutshell my point (which you seem to agree with) is that RM does not require a key to be stable or unchanging. Changing keys is therefore a physical implementation problem, not a logical one. Which part of my reply did you disagree with?
0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
We all agree that a primary key must uniquely identify the row.  With a surrogate key we identify the row with a value that has no business meaning.  In so doing, we have just lost the an important benefit of the PK.  Namely, to assure that the business entity is recorded only once. In the student table, the same student (as defined by the business) can be stored twice because the business identiity of that student is not the primary key.   Or said another way, given any two records, we have no way to know if they are for the same student or not.  Suppose the records are identical except for the surrogate key. Same student or not?  Suppose two records differ only by student name.  Is that two different students or the same student before and after a name change?

If we isolate the PK from business meaning, it becomes a physical artifact with no business value.  

I have more points to make, but need to get back to it later.

0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
I'm back...

If we isolate the PK from business meaning, it becomes a physical artifact with no business value and no way to promote integrity of the business relationship between tables.   If it takes on business value overtime, then it ceases to be a surrogate key.

Consider a School table and a Student table each with surrogate keys and a foreign key relationship between them such that one school may have many students.  For example, Dennis is a student at Northwestern University.   That represents a business fact.  Now change the name to Northwest College and we've just corrupted the business fact because Dennis (and no doubt thousands like him) did not really change schools.   Had we used the school name for the primary key (cascade update, notwithstanding), that would have been prevented.   Said another way, if the business agrees that the school name is the official way to identify a school, then you cannot change the name without identifying a different school.  Using a natural key prevents that from happening when there are relationships that depend on it.

Let's follow this a bit further and consider a similar relationship between School and Class, implemented with surrogate keys.   Dennis is a student at  Northwestern University and Bill is a student at Northwest College.  Because of referential integrity we can be assured that Dennis and Bill are students at some college.  But with surrogate keys, Student does not inherit the business semantics of the School key (rather, being a surrogate key, it has no business semantics).  That means Dennis and Bill can readily be assigned to the same Class even though they go to different schools. For that matter, either can be assigned to a Class that is not taught at the school they are attending.   Thoughtful use of natural keys permits the database to enforce referential integrity in business terms, not just physical terms.

All the cautions about stability of the natural key is kind of a red herring.  First, because neither SQL databases nor the relational model dissuade you from changing it.  Second, because changing a natural key has more significance to the business than changing any other column.  Surrogate keys do facilitate changing the otherwise natural keys.  However,  at the same time, that practice diminishes the significance and business implication of doing it.  I think it's a great benefit of natural keys, that we are forced into deliberately thinking about what it means to make such a change and how we go about doing it.

The requirement to know the value of the key before a record can be inserted was cited as another disadvantage of Natural Keys.  What? Huh?  Given that, by definition, a PK represents the identity of a record, does this not seem absurd:  "here database, is an assertion with no means of identification. Please keep track of it for me".   On the contrary, .NET apps have a huge difficulty with surrogate keys in this regard.  They often want to know the parent key BEFORE it is inserted in order to construct child rows for the same trip to the database.    

To my knowledge, all the benefits of surrogate keys are either for performance or ease of development.  For this you sacrifice the ability of the database design to enforce business rules and you burden the application with more of the responsibility for data integrity.

'nuff said





 




In the world of natural keys, the class table could


 



 



     
0
 
LVL 47

Expert Comment

by:schwertner
Comment Utility
Hello Dportas,

Keys (primary, foreign, unique) are not physical objects.
They are logical objects. Only the value of the key in the table
has physical  nature, but the the link via these keys is fully logical.

This is the revolutionary invent of Dr. E.F.Codd.
Before Codd (network and Hierarchical  models)
there were   hardcoded physical links between the items.
Dr. Codd breaks this and this is the main reason for the
acceptance and success of the Relational Models.

From technical point of view there are many traps   with natural
and surrogate keys.

If one choses to use natural keys as string there are two questions:
1. How long should be this string and what will happen if there is a
need for longer value
2. Experienced DBA knows that DBs migtare to other installations.
National Language Support and character set support requires
string code changes (example from ASCII to EBCDIC, from USACII7
to Unicode like AL32UTF8). We, the DBAs, have fear that this translation
will hurt the links
3. Real Life example: In our company the security officers asked us to encrypt
every meaningful column as law requirement. What will happen if  you encrypt
the natural keys?

From the other side surrogate keys have also disadvantages. In WEB applications
the implementation works in other tier and if it needs parent key and child keys - they should be delivered
via the network. The context change causes delay and loads the traffic.
 
0
 
LVL 18

Author Comment

by:daveslash
Comment Utility

dqmq:  
> I give surrogate keys stronger consideration when the number of columns in the key exceeds three.
> ...
> To summarize, it's a grave oversight and sells our profession short to adopt the
> overly-simplistic approach of putting a surrogate key on every table.

I agree completely.

schwertner:
> natural keys are classified data ... and are kept in the DB encrypted

That would certainly be a problem in using a natural key as a PK.

schwertner:
> And why your question is placed in the Oracle thread?
> I have no intention to take part in the DB2 thread.

I placed the question in both the Oracle zone and the DB2 zone because those are the two "biggest" databases in the world. I've observed that the experts in those two areas GENERALLY have more experience and expertise.  ( That is, of course, another sweeping generalization. Therefore, it will not ALWAYS be the case. )

Thanks to all of you who put so much thought and time into this question. I realize that there is no "definitive" solution that works in every single case. But, I do have a better idea of when to use each method.

Thanks again!
DaveSlash
0
 
LVL 47

Expert Comment

by:schwertner
Comment Utility
Also in Oracle Magazine (July/August 2008) there is a section with  remark about this topic by Tom Kyte.
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

This post first appeared at Oracleinaction  (http://oracleinaction.com/undo-and-redo-in-oracle/)by Anju Garg (Myself). I  will demonstrate that undo for DML’s is stored both in undo tablespace and online redo logs. Then, we will analyze the reaso…
This article explains all about SQL Server Piecemeal Restore with examples in step by step manner.
This video explains at a high level about the four available data types in Oracle and how dates can be manipulated by the user to get data into and out of the database.
This video shows how to recover a database from a user managed backup

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now