Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
?
Solved

Primary keys: best practice

Posted on 2006-03-24
12
Medium Priority
?
1,098 Views
Last Modified: 2008-02-01
I was not the most attentative student in my Database lectures but I did pick up enough on relational database theory to understand the process of defining primary keys for a table. However, I have found myself in a situation where I have designed a database and used "natural primary keys" where my choice of keys has been challanged in favour of a unique record ID (IDENTITY) field. To explain - take the tables

create table tblTest(
  userid INT IDENTITY(1,1) NOT NULL,
  Name VARCHAR(63),
  Surname VARCHAR(63),
  PassportNumber VARCHAR(20) PRIMARY KEY
)

And

create table tblTest2 (
   recid INT IDENTITY(1,1) NOT NULL,
   userid_id INT NOT NULL,
   product_id INT NOT NULL,
   ... some other fields
)

ALTER TABLE [dbo].[tblTest2] WITH NOCHECK ADD
      CONSTRAINT [PK_tblTest2tblTest] PRIMARY KEY  CLUSTERED
      (
            [userid_id],
            [product_id]
      )  ON [PRIMARY]

In both cases the primary key selected is the natural primary key for the table and is guaranteed unique - no problem.

A "discussion" has arisen because the implementor wants to remove the primary keys created and make the record id's the primary keys (the identity fields) because it will make some part of the coding easier. When I challanged this I realised I couldn't give a practical reason why the original method was the way to go - other than that from a theoretical perspective it was the correct way of doing things. Obviously the change would include creating a unique constraint on the fields previoulsy part of the primary key along with indices where required.

I would like to hear some opinions on this - does it really matter and if so why?
0
Comment
Question by:Julian Hansen
12 Comments
 
LVL 2

Accepted Solution

by:
Favago earned 500 total points
ID: 16279216
Hi!

On tableTest 1:

The main reason is speed: Numeric types can be indexed better, so accessing it will be faster. This is important when you join tables, or when a prevoius or sub-select already returned the ID..

If you created the identity column just to be able to store the reference pointer on smaller space (doesn't have to spend valuable varchar(x) bytes on a Foreign Key reference), so it is unique, why wouldn't you make it the PK ?

An other reason: This is somewhat abstract. The numeric identity doesn't directly connects with the entity you model. Thus, it can be unique in for  each entity. Let's say you choose one of the entities attributes, what you think to be unique. What if this turns out not to be unique later? For example Ethernet-card MAC-addresses. They should be all unique, but sometimes assembly line errors or something else renders them not to be. What if someone looses his passport, and receives a new one with a new number?
By using a non-real-world identifier, you can barrier yourself from the errors of the real world (maybe).

On tableTest 2:

Having to create a rec_id, or using (prod_id, user_id) depends on the usage. If multiple connections between the A1, B1 entities of the entity-sets  A, B is possible, then you may get on well with creating rec_id, being able to distinct between these connections (or if they could be distincted by other non-numeric attributes, you'll be able to distinct faster).

If multiple A1 - B1 connections are not possible, the rec_id may not be required. But what if it later turns out, that multiple connections are do possible? You have to consider this possibility yourself. This is quite case-sensitive, because this table is not a table of an entity-set, but rather a table of a connection (thinking in ER-scheme).

I'm interested in other point of views too!
0
 
LVL 61

Author Comment

by:Julian Hansen
ID: 16279881
Favago,

Thanks for the feedback. My problem is that there does not seem to be consensus of opinion on this. In my research their seem to be two camps and neither seems to have the upper hand. The following two articles demonstrate the point

http://www.aspfaq.com/show.asp?id=2504
http://www.sqlteam.com/item.asp?ItemID=2599

I can see, to a certain extent, what the proponents of the natural key argument are saying but from a practical point of few maintaing parent child relationships through potentially complex composite natural keys seems like more work than is necessary especially when the natural keys contain data that is difficult to index. Both sides of the argument seem to say that their approach guarantees uniqueness.

What I am looking for is a good argument (either way) of what to use or the criteria by which to make a decision on when to use which method.

0
 
LVL 2

Expert Comment

by:zeus40
ID: 16281621
I totally agree with Favago.

I have designed systems where the natural primary keys were used, and lo and behold, when the system was used by other sections, that key no longer held true.

Argument against using PassportNumber as a primary key.
In your design, you are assuming that PassportNumbers are unique.  Maybe they are for the local government that you are dealing with.  But you are binding a record's identity to an assumption that may not hold true when other governments passports are used.

Argument against using UserID and ProductID as a primary key.
UserID's can change as can ProductIDs.  Cascading Updates and Deletes may cause unwanted results (if you chose cascading as a design style [which I never do]).  Place an ID that uniquely identifies the column and then make userID ProductID attributes of that ID.

On a side note.  The term CLUSTERED does not conote a dual key primary key index.  That is called a composite key.  CLUSTERED means that the DBMS system is physically going to store the records in an order using that composite key.
For example if the records are inserted as follows:
UserID ProductID
1         1
2         2
4         3
5         4
3         5

The DBMS will reorder and (potentially) store them as:
1         1
2         2
3         5
4         3
5         4

Note:  I am not criticizing your design choices.  I am providing you with an opinion.   Your design may function fine until the end of time.
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 61

Author Comment

by:Julian Hansen
ID: 16281803
Favago - thanks for the inpu.

>> I have designed systems where the natural primary keys were used, and lo and behold, when the system was used by other sections, that key no longer held true.

Then by definition the system was not using a proper natural primary key which can only be called a primary key if it is unique.

Essentially this is an argument to circumvent potential design issues and not as a genuine argument against the use of natural keys.

>> Argument against using PassportNumber as a primary key.

I am not using PassportNumbers - the tables present a hypothetical situation I was using to describe the context of the question and not an actual system. For the purpose of this exercise assume that the primary keys (as defined in the example tables above) are in fact unique.

>>On a side note.  The term CLUSTERED does not conote a dual key primary key index.  That is called a composite key.  CLUSTERED means that the DBMS system
>> is physically going to store the records in an order using that composite key.

Not sure what you are saying here. The definition was of a primary key - there is a choice of CLUSTERED or NON-CLUSTERED - you have to choose one. For the purpose of this discussion the actual choice is irrelevant. Just to reiterate - this is not an actual system I am trying to design - it is an example to help with the discussion. My interest is in the use of natural primary keys for relationships between table vs meaningless surrogate keys that are not related to the data but simply used as a convenience. I don't want the thread to deviate into the merits (or demerits) of the design of the sample presented as that is not what I would like to discusss.

0
 
LVL 2

Expert Comment

by:zeus40
ID: 16282265
Your statement,
Then by definition the system was not using a proper natural primary key which can only be called a primary key if it is unique.
Essentially this is an argument to circumvent potential design issues and not as a genuine argument against the use of natural keys.

I think the term Natural Primary Key is a misnomer.  The appropriate term would be "Assumed Natural Primary Key" or "Theoretical Natural Primary Key".  Natural Keys are Natural Keys until they are proven not to be Natural Keys.
Also, Primary Keys are Primary Keys until they are proven not to be Primary Keys.

Your statement
I am not using PassportNumbers - the tables present a hypothetical situation I was using to describe the context of the question and not an actual system. For the purpose of this exercise assume that the primary keys (as defined in the example tables above) are in fact unique.

I would have to disagree.  In fact you are using Passport Numbers.  I say this because your DDL statement:
create table tblTest(
  userid INT IDENTITY(1,1) NOT NULL,
  Name VARCHAR(63),
  Surname VARCHAR(63),
  PassportNumber VARCHAR(20) PRIMARY KEY
)
Explicitly states that PassportNumber is being used as the Primary Key.  So, your hypothetical situation is actually the case for which you are basing your "Request for Opinions" on whether to not use (assumed) Natural Keys or Identity Keys.  

Basically, you are intent on discarding everything you have set forth in your case, for reasons I don't know why.  

Rule:  In DB Design, have the system provide Identity Columns as a primary key.  Or, enjoy rewritting a lot of code when your company gets bought out, peoples SSN's get changed, or product IDs change because Marketing had another "great" idea.


0
 
LVL 42

Assisted Solution

by:dqmq
dqmq earned 500 total points
ID: 16285886
In relational theory there's hardly a dispute that natural keys are better.  In that perfect world, a primary key uniquely defines the entity and is immutable.  So, if we have such a thing, what is to be gained by arbitrarily assigning an identity column with the same properties?  Theoretically, it's an unnecessary complication and only serves to confuse. With two ways to identify something, which one do we use?  Who wants a different RFID transmitter embedded in each arm?

For the most part, the argument that identity columns produce more efficient indexes or save index space are largely bogus.  If you have anything that resembles a natural key, you will want it indexed anyway, even if it is not the PK. That is how the business users will search the data.

However, in the real world things are not ideal and artificial keys often do make a lot of practical sense. Personally, I prefer natural keys, but find that all-too-often artificial keys are justified. Often times, what we tend to call a natural key does not survive all the true tests of a proper PK.  However, I do NOT subscribe to the practice of unconditionally using identity columns on every table.  Furthermore, I do not always used identity for an artificial key.  For example, I might use two-character abbreviations for the aritificial key to a state table.  Or one character codes for the artificial key to a status table.

In my opinion, there are sometimes significant practical advantages to natural keys, too. For one that I already alluded to, the world is a simpler place when everybody refers to an identity the same way.   I work in an electric utility and we have two ways to identify a meter--the meterID which is and identity column (assigned internally by the computer) and meterNo (actually stamped on the meter) which is used by people.  I cannot tell you how many mistakes have been made because two parties have used different identifiers unbeknownst to each other.  I cannot tell you how many times I've seen eyes gloss over or gotten that "here we go again" shrug when I've tried to explain that the computer has a special numbering system that you don't see.

Two, in parent-child relationships, the PK propagates downward. That's a valuable quality.  It's often easier to retrieve that important value (eliminates joins) and it also facilitates data integrity. My meter table has a child table for meter register (each register measures a different thing) and that table has a child for readings (the measurements). So the model is like this Meter --< MeterRegister --<  MeterReading.  Well, guess what--anything that accesses MeterReading invariably does so by MeterNo. Had we used that as the PK, it would propagate down to the reading table. But since we did not, we always needed to join upwards to get it. We have numerous cases of having to join across 4, 5, or 6 tables for the sole purpose of retrieving the user-facing identifier which is not in the PK.

To make exerbate things, we have another system which imports readings by meterNo.  We cannot enforce RI against those tables because meterNo is not the PK of the parent. The other data integrity exposure occurs if a change occurs to either the one or the other candidate keys. Now, that's not supposed to be allowed and for a PK it's relatively easy to enforce, but I've yet to see that kind of constraint enforced on a natural key other than the PK.

On the other hand, it's fairly easy to think you have a natural key when you really do not.  In my meter example, one is tempted to use meterNo for the PK and I have made several arguments why that would make life simpler. But what if we purchase a crate of used meters that have overlapping meterNo's with our existing inventory.  We're stuck.  So meterNo is not a proper natural key unless the business is confined to a known set of meters.

I do agree that most of the arguments against using natural PK's are ultimately based on the reality that what we thought was a proper natural key, eventually turns out not to be so.  

So, my rule-of-thumb is to tend toward a natural key unless:
1. It contains too many columns
    (a bit of a judgement call, but 4 starts to get unwieldy in my book).
2. It's necessary to add an artificial "tie-breaker" column to the otherwise natural key
    (beware of timestamps used that way).
3. There is the SLIGHTEST hint that ANY column of the natural key will EVER change
    (error on the side of "it will" if in-doubt)
4. There is an identity column for some reason besides needing an artificial key
    (suggests the natural key is not that resilient)
5. There are "special values" in some PK's to circumvent the not null requrement
6. Any column of the natural key is overly complex


Examples:
PostalZip--Natural
Account--Natural
PurchaseOrder--Natural

Customer--Identity
Address--Identity
Product--Identity
Employee--Identity
NetworkAdapter--Identity

State--Artificial,non-identity
AccountStatus--Artificial, non-identity
0
 
LVL 61

Author Comment

by:Julian Hansen
ID: 16287078
Zeus40 you need to take a step back, you are doing exactly what I was hoping would not happen - focusing on the detail of the example instead of addressing thie issue at hand. I do appreciate your comments but they do not help because they address an example I made up in my head and which has no meaning for me other than as a tool in this debate. If PassPort number is not a suitable PK for the example then change it to

TheMostUniqueFieldPossible_ImpossibleToDuplicateEver VARCHAR(20) PRIMARY KEY

Which is not a system generated value and has meaning for the entity we are trying to model and forget about passport number.

The reason for my post was that in my research I have found a lot of articles arguing convincingly for both approaches. Rather than take them at face value I thought I would start my own debate to see if I could learn for myself what the best approach was. Lets forget the example completely and look only at Natural Primary Keys vs Surrogate Primary Keys.
0
 
LVL 61

Author Comment

by:Julian Hansen
ID: 16287111
dqmq,

Thanks for your comments.

Here are some points I would like to specifically deal with. Firstly, let us assume that our choice of Primary Key is perfect. I accept the argument that in the real world this may be asking a bit much but for the purposes of this discussion assume that we have a perfect Primary Key.

Then here are some issues as I see them with both approaches

Arguments Against PK

1. PK is long (many fields) as addressed by dqmq. For arguments sake (remember we assumed perfect primary keys) say that Field1 CHAR(60), Field2 CHAR(20), Field3 CHAR(40) and Field4 CHAR(50) form a perfect primary key. Granted we would have an index on this key anyway but in terms of child records each record would have to include 170 bytes of extra space for the primary key vs 4 bytes for an identity (or 16 bytes if you use a GUID). In the days of terrabyte storage maybe this is not an issue but we are also having to deal with millions of records and millions X 170 works out to a lot of space.

2. In situations where you have to create a join, then doing so on a single field is a lot neater and easier (in terms of typing and potential typo's) than a compound key.

Arguments for PK

1. Primary key in child records is meaningful and does not always require a join to derive meaning.

2. When copying records from one database to another if you have ever had to do this on fields joined by system generated identity columns you will understand the involved process this is. Each parent record has to be created first, its identity recorded and this propogated to child records. If natural PK (non-system generated) are used then a simple insert select can be used to propogate the data.

This, however, can be solved by using GUID fields - which in themselves provided benefit when merging two databases.

So far I am not sold on either approach yet - maybe someone can add to the arguments above to provide more pro's and cons for each.

Thanks to all who have responded so far - all comments have been beneficial.
0
 
LVL 9

Assisted Solution

by:gabeso
gabeso earned 500 total points
ID: 16288170
This is a very interesting debate.

To add another opinion ...

Database design should be managed by tools and then other tools use these designs to generate other things such as classes that read/write the data. Database design is something that you should be done with layers and meta-level thinking.

So a general way to design any database is to define each entity as having an 'id' field of some integer type (32-bit, 64-bit, or even a 128-bit guid). The relationship between entities is then simply a set of integer values. You could go so far as to only have many-many relations between entities as these are the most general relationship - which handles the case when a revision of the schema demands promotion of the relationship from one-one, to one-many or many-one. This enables a more stable schema: Policy can then be controlled by indexing the fields in the relationship to force one-one, one-many or many-one - this is simple to change.

This gives a lower framework layer of entities and relationships that have no properties yet. This can be managed by tools because everything is known and entities and relationships have a defined structure that can be navigated.

You can then build higher layers on top of this skeleton containing sets of properties (perhaps analogous to implementing interfaces) and so on as the design evolves. The properties are just added to the entity or the relationship as required.

The basic point is that with 'natural' keys you could not do this.

Having thought about it I think a deeper reason for this approach is that you have to start somewhere and your design will evolve so you have to handle changing requirements: You are using a tool and want to be very general about it.  Perhaps this is more important when you have many large schemas to generate and don't want to get bogged down in detail or specifics relating to the idiosyncracies of any particular design.

0
 
LVL 2

Assisted Solution

by:zeus40
zeus40 earned 500 total points
ID: 16289548
*after I have taken a step back* I realize that your question has no answer.

Also, on further thought, I believe the forums are desinged to ask specific questions, instead of theoretical ponderances.  Else you will argue every counterpoint and have the largest thread on the site.

Feel free to award points, I am going to request that the thread be shut down in 4 days.

Thanks,



0
 
LVL 61

Author Comment

by:Julian Hansen
ID: 16292746
Gabeso - thanks for the input

Zeus40

>> *after I have taken a step back* I realize that your question has no answer.

That in itself is an answer ;)

This was not meant to be a general discussion thread. My purpose was to derive some sort of answer from it and on one level I have - so no need to shut the thread down I will close it with points awarded within the next couple of days.

When I started this I was on the back foot. I had just come out of a design meeting where a discussion had taken place about the merits of the two approaches. There was one stauch supporter of surrogate keys who claimed there were no benefits associated with the natural PK approach. At my current level of DB knowelege I suspected that his claims where not 100% true so I decided to find out. I did some googling and posted a question here to see what transpired.

I feel that I am now in a far better position to handle this debate than I was 3 days ago - thanks largley in part to the comments posted here.
0
 
LVL 61

Author Comment

by:Julian Hansen
ID: 16321190
There have been no new comments posted in the last two days. Therefore I am closing the thread and awarding points. I have a better understanding of the topic under discussion so thanks to all who responded.
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this blog post, we’ll look at how ClickHouse performs in a general analytical workload using the star schema benchmark test.
Backups and Disaster RecoveryIn this post, we’ll look at strategies for backups and disaster recovery.
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
With just a little bit of  SQL and VBA, many doors open to cool things like synchronize a list box to display data relevant to other information on a form.  If you have never written code or looked at an SQL statement before, no problem! ...  give i…
Suggested Courses

572 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question