Link to home
Start Free TrialLog in
Avatar of Julian Hansen
Julian HansenFlag for South Africa

asked on

Primary keys: best practice

I was not the most attentative student in my Database lectures but I did pick up enough on relational database theory to understand the process of defining primary keys for a table. However, I have found myself in a situation where I have designed a database and used "natural primary keys" where my choice of keys has been challanged in favour of a unique record ID (IDENTITY) field. To explain - take the tables

create table tblTest(
  userid INT IDENTITY(1,1) NOT NULL,
  Name VARCHAR(63),
  Surname VARCHAR(63),
  PassportNumber VARCHAR(20) PRIMARY KEY
)

And

create table tblTest2 (
   recid INT IDENTITY(1,1) NOT NULL,
   userid_id INT NOT NULL,
   product_id INT NOT NULL,
   ... some other fields
)

ALTER TABLE [dbo].[tblTest2] WITH NOCHECK ADD
      CONSTRAINT [PK_tblTest2tblTest] PRIMARY KEY  CLUSTERED
      (
            [userid_id],
            [product_id]
      )  ON [PRIMARY]

In both cases the primary key selected is the natural primary key for the table and is guaranteed unique - no problem.

A "discussion" has arisen because the implementor wants to remove the primary keys created and make the record id's the primary keys (the identity fields) because it will make some part of the coding easier. When I challanged this I realised I couldn't give a practical reason why the original method was the way to go - other than that from a theoretical perspective it was the correct way of doing things. Obviously the change would include creating a unique constraint on the fields previoulsy part of the primary key along with indices where required.

I would like to hear some opinions on this - does it really matter and if so why?
ASKER CERTIFIED SOLUTION
Avatar of Favago
Favago

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Julian Hansen

ASKER

Favago,

Thanks for the feedback. My problem is that there does not seem to be consensus of opinion on this. In my research their seem to be two camps and neither seems to have the upper hand. The following two articles demonstrate the point

http://www.aspfaq.com/show.asp?id=2504
http://www.sqlteam.com/item.asp?ItemID=2599

I can see, to a certain extent, what the proponents of the natural key argument are saying but from a practical point of few maintaing parent child relationships through potentially complex composite natural keys seems like more work than is necessary especially when the natural keys contain data that is difficult to index. Both sides of the argument seem to say that their approach guarantees uniqueness.

What I am looking for is a good argument (either way) of what to use or the criteria by which to make a decision on when to use which method.

Avatar of zeus40
zeus40

I totally agree with Favago.

I have designed systems where the natural primary keys were used, and lo and behold, when the system was used by other sections, that key no longer held true.

Argument against using PassportNumber as a primary key.
In your design, you are assuming that PassportNumbers are unique.  Maybe they are for the local government that you are dealing with.  But you are binding a record's identity to an assumption that may not hold true when other governments passports are used.

Argument against using UserID and ProductID as a primary key.
UserID's can change as can ProductIDs.  Cascading Updates and Deletes may cause unwanted results (if you chose cascading as a design style [which I never do]).  Place an ID that uniquely identifies the column and then make userID ProductID attributes of that ID.

On a side note.  The term CLUSTERED does not conote a dual key primary key index.  That is called a composite key.  CLUSTERED means that the DBMS system is physically going to store the records in an order using that composite key.
For example if the records are inserted as follows:
UserID ProductID
1         1
2         2
4         3
5         4
3         5

The DBMS will reorder and (potentially) store them as:
1         1
2         2
3         5
4         3
5         4

Note:  I am not criticizing your design choices.  I am providing you with an opinion.   Your design may function fine until the end of time.
Favago - thanks for the inpu.

>> I have designed systems where the natural primary keys were used, and lo and behold, when the system was used by other sections, that key no longer held true.

Then by definition the system was not using a proper natural primary key which can only be called a primary key if it is unique.

Essentially this is an argument to circumvent potential design issues and not as a genuine argument against the use of natural keys.

>> Argument against using PassportNumber as a primary key.

I am not using PassportNumbers - the tables present a hypothetical situation I was using to describe the context of the question and not an actual system. For the purpose of this exercise assume that the primary keys (as defined in the example tables above) are in fact unique.

>>On a side note.  The term CLUSTERED does not conote a dual key primary key index.  That is called a composite key.  CLUSTERED means that the DBMS system
>> is physically going to store the records in an order using that composite key.

Not sure what you are saying here. The definition was of a primary key - there is a choice of CLUSTERED or NON-CLUSTERED - you have to choose one. For the purpose of this discussion the actual choice is irrelevant. Just to reiterate - this is not an actual system I am trying to design - it is an example to help with the discussion. My interest is in the use of natural primary keys for relationships between table vs meaningless surrogate keys that are not related to the data but simply used as a convenience. I don't want the thread to deviate into the merits (or demerits) of the design of the sample presented as that is not what I would like to discusss.

Your statement,
Then by definition the system was not using a proper natural primary key which can only be called a primary key if it is unique.
Essentially this is an argument to circumvent potential design issues and not as a genuine argument against the use of natural keys.

I think the term Natural Primary Key is a misnomer.  The appropriate term would be "Assumed Natural Primary Key" or "Theoretical Natural Primary Key".  Natural Keys are Natural Keys until they are proven not to be Natural Keys.
Also, Primary Keys are Primary Keys until they are proven not to be Primary Keys.

Your statement
I am not using PassportNumbers - the tables present a hypothetical situation I was using to describe the context of the question and not an actual system. For the purpose of this exercise assume that the primary keys (as defined in the example tables above) are in fact unique.

I would have to disagree.  In fact you are using Passport Numbers.  I say this because your DDL statement:
create table tblTest(
  userid INT IDENTITY(1,1) NOT NULL,
  Name VARCHAR(63),
  Surname VARCHAR(63),
  PassportNumber VARCHAR(20) PRIMARY KEY
)
Explicitly states that PassportNumber is being used as the Primary Key.  So, your hypothetical situation is actually the case for which you are basing your "Request for Opinions" on whether to not use (assumed) Natural Keys or Identity Keys.  

Basically, you are intent on discarding everything you have set forth in your case, for reasons I don't know why.  

Rule:  In DB Design, have the system provide Identity Columns as a primary key.  Or, enjoy rewritting a lot of code when your company gets bought out, peoples SSN's get changed, or product IDs change because Marketing had another "great" idea.


SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Zeus40 you need to take a step back, you are doing exactly what I was hoping would not happen - focusing on the detail of the example instead of addressing thie issue at hand. I do appreciate your comments but they do not help because they address an example I made up in my head and which has no meaning for me other than as a tool in this debate. If PassPort number is not a suitable PK for the example then change it to

TheMostUniqueFieldPossible_ImpossibleToDuplicateEver VARCHAR(20) PRIMARY KEY

Which is not a system generated value and has meaning for the entity we are trying to model and forget about passport number.

The reason for my post was that in my research I have found a lot of articles arguing convincingly for both approaches. Rather than take them at face value I thought I would start my own debate to see if I could learn for myself what the best approach was. Lets forget the example completely and look only at Natural Primary Keys vs Surrogate Primary Keys.
dqmq,

Thanks for your comments.

Here are some points I would like to specifically deal with. Firstly, let us assume that our choice of Primary Key is perfect. I accept the argument that in the real world this may be asking a bit much but for the purposes of this discussion assume that we have a perfect Primary Key.

Then here are some issues as I see them with both approaches

Arguments Against PK

1. PK is long (many fields) as addressed by dqmq. For arguments sake (remember we assumed perfect primary keys) say that Field1 CHAR(60), Field2 CHAR(20), Field3 CHAR(40) and Field4 CHAR(50) form a perfect primary key. Granted we would have an index on this key anyway but in terms of child records each record would have to include 170 bytes of extra space for the primary key vs 4 bytes for an identity (or 16 bytes if you use a GUID). In the days of terrabyte storage maybe this is not an issue but we are also having to deal with millions of records and millions X 170 works out to a lot of space.

2. In situations where you have to create a join, then doing so on a single field is a lot neater and easier (in terms of typing and potential typo's) than a compound key.

Arguments for PK

1. Primary key in child records is meaningful and does not always require a join to derive meaning.

2. When copying records from one database to another if you have ever had to do this on fields joined by system generated identity columns you will understand the involved process this is. Each parent record has to be created first, its identity recorded and this propogated to child records. If natural PK (non-system generated) are used then a simple insert select can be used to propogate the data.

This, however, can be solved by using GUID fields - which in themselves provided benefit when merging two databases.

So far I am not sold on either approach yet - maybe someone can add to the arguments above to provide more pro's and cons for each.

Thanks to all who have responded so far - all comments have been beneficial.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Gabeso - thanks for the input

Zeus40

>> *after I have taken a step back* I realize that your question has no answer.

That in itself is an answer ;)

This was not meant to be a general discussion thread. My purpose was to derive some sort of answer from it and on one level I have - so no need to shut the thread down I will close it with points awarded within the next couple of days.

When I started this I was on the back foot. I had just come out of a design meeting where a discussion had taken place about the merits of the two approaches. There was one stauch supporter of surrogate keys who claimed there were no benefits associated with the natural PK approach. At my current level of DB knowelege I suspected that his claims where not 100% true so I decided to find out. I did some googling and posted a question here to see what transpired.

I feel that I am now in a far better position to handle this debate than I was 3 days ago - thanks largley in part to the comments posted here.
There have been no new comments posted in the last two days. Therefore I am closing the thread and awarding points. I have a better understanding of the topic under discussion so thanks to all who responded.