Community Pick: Many members of our community have endorsed this article.
Editor's Choice: This article has been selected by our editors as an exceptional contribution.

The great PK debate: Natural Keys vs. Surrogates (again)

Jim Dettman (EE MVE)Volunteer
CERTIFIED EXPERT
Independent consultant specializing in the writing of custom packages for businesses.
Published:
Updated:
I titled this "The great PK debate: Natural Keys vs. Surrogates (again)", because on almost any developer list or forum that deals with databases, this topic always comes up  at one point or another and usually with very strong opinions on both sides.  What started me on this article was that several days ago I tripped over an article on primary key's that was just flat out wrong.  

After poking around a bit on the net I realized that over the years, a lot of myths and misconceptions have grown up around this topic.  It seems that as the years go by, more and more gets published and discussed on this topic, but things only get cloudier.  For example, are auto number fields really a surrogate key?  Can they function as a primary key?   I hope that through this article, I will be able to clarify and explain fully enough the answers to questions such as those and hopefully, another great PK debate will not ensue (or at least if it does, you'll have plenty of ammo for the debate<g>).

When I was looking at articles out on the net, one thing that struck me about almost everything I read is that not many started off on the right foot. So the first thing to clear up is what relational theory actually is. Relational theory (the big "R" as some call it), was developed by E.F. (Tedd) Codd while working at IBM.  His first paper "Derivability, Redundancy, and Consistency of Relations" was published in 1969 as a research paper. While Codd was working on ways to store data in a data bank, what he actually did is in a very scientific and mathematical way describe a theory of data organization, based on a branch of mathematics that deals with sets of data.  In a nutshell, it was the overall concept that data could be modelled in a very logical and rigorous way.

So R theory is not concerned with how data is physically stored nor does it only apply to computer systems. What R theory is concerned with and only with is the logical organization of data in order to provide information. To make that clearer, I can apply relational theory to data kept on a chalk board, with sticky notes on a wall, or even with pen and paper. With R theory, it is the meaning of the data and how it is organized that is important and nothing else.  As you read on, it is imperative that you keep this distinction in mind.

That brings us to Misconception #1; relational theory does not exist because of databases; it is something that is applied to them. And, Misconception #2; R theory deals with the relationships between your tables.  

To a certain extent R Theory is applied to the database to deal with the relationships between the tables, well, when you discuss normalization, which we won't get into here. More fundamentally it applies to something else and is talking about something else, which is a relation, and that is why most articles start off on the wrong foot, failing to recognise the relation.

A relation represents a set of data where the given set all pertains to the same type of "thing" or entity.  In a relation, the data is laid out in rows and columns.  The columns, each which represent an attribute of the entity (describes it in some way), and the rows (which are called tupples) represent a specific instance of a set of attributes in the relation.  Now if you stripped away all the mumbo jumbo technical terms and use database terms instead, you would see that I have just described a table.  The attributes are the fields and the tupples are the records.

So let's start now with where most other articles pick up and that is with the classic example of a customer list.  Since our relation is about customers and for the sake of simplicity, we'll stick to the following attributes in the relation:

Name
Address
City
State
Zip
Phone Number

One of the fundamental things I didn't mention above about forming a relation is that each tupple needs to be unique. If by combining all the attributes, you cannot uniquely identify a given tupple, then you don't have enough attributes in your relation. It is the combination of one or more attributes, which will uniquely identify a tupple that will become the primary key. Can you have more than one primary key? No!  Can you have more than one possible primary key?  Certainly and these are called candidate keys, one of which will be designated as the primary key.

Each grouping of one or more attributes that can uniquely identify a tupple is called a super key.  There are a multitude of super keys in the above and going to the extreme, would be combining every attribute to form a key.

But when you work with data, there are a few attributes that a primary key should have:

1. It should be as stable as possible.
2. It should be as minimal as possible.
3. It should be as familiar as possible.

Given that, we certainly would not want to use most of the super keys in a relation.  Keeping the above in mind, let's take a look at our example.  Would name alone suffice to uniquely identify each row?  Probably not as you could easily have "ABC Company" in two different cities.  How about Name and Address?  Well that would be better, but might still not be unique enough.  You might have two "ABC Company" on "33 Main street", but in different cities.  What about combining Name, Address, City, and State?  That might work.  How about Name, Address, and Zip?   Possibly.  What about the phone number by itself?  Yes, that would probably work too.

So we have three candidate keys, but which one should be the primary key?  Again, looking at attributes a good primary key should have, phone number is probably the best bet.  Phone numbers typically don't change unless you move so it would be fairly stable and it certainly is shorter then using either of the other candidate two keys.  Last, it should be familiar to anyone that works there. But cannot the phone number change?   Sure and most would say that this means that it cannot be a primary key!  

Which of course brings us to Misconception #3; Primary keys can never change. Primary keys can and will change; companies move, phone numbers can change, etc.  There is nothing within relational theory that says a primary key cannot change.  Again, keep in mind that still we are talking about the logical representation of data.  We have not moved on to how it is physically stored or issues with that.

Up to this point, we've been discussing what are commonly referred to as Natural Keys.  That is the key is derived from the existing attributes.

Early database designs used the method above to choose a key for a table.  But rather quickly, it was discovered that the computer systems we had could not keep up performance wise as the keys had a tendency to become fairly long (when forming joins between relations).  This is the point where surrogate keys came into use.   Many believe that a surrogate key should be meaningless and have no connection with the data in the row.  This partly came about because of data warehousing.   When warehousing data, it becomes imperative that a key assigned to a row never changes. Some database designers even go to the point of saying that a meaningless surrogate should never be displayed to the user.  But does a meaningless number used as a surrogate work as a primary key?  Is it really a surrogate?  Let's look at an example where the company has a customer table that looks like this:

tblCustomers
CustID - Identity  - Primary Key
Name - Text
Address - Text
City - Text
State - Text
Zip - Text
Phone Number - Text

So our CustID attribute is now just a number that goes up by one for each new record.  It's never given to the customer and it does uniquely identify each row in the table.   Now imagine that I'm a customer calling to place an order:

  Customer:   Hi, I'm calling to place an order.
  Sales Rep:   Have you done business with us before?
  Customer:   Yes
  Sales Rep:     Great.  I can just pick you from the list..uh, what's the company name?
  Customer:      ABC Company
  Sales Rep:     Oh...well I have six companies listed with that name...where are you located?
  Customer:   New York City, NY
  Sales Rep:     Wow! Believe it or not, there are three ABC Companies in NYC!...what's the address?
  Customer:   7th avenue and 28th street
  Sales Rep:   OK great...this must be you... is your phone number 210-699-9999?
  Customer:   Yes
  Sales Rep:     OK, what did you want to order?

You've just seen in action the difference between a true primary key and one that is not even though it is labelled as a "primary key" in the table design.  While CustID does serve to identify the record uniquely within the relation physically, it does not serve to identify a specific customer within the relation logically.  In order to do that, the sales rep used the natural attributes of customers to ensure that the correct customer was being chosen.  The attributes he used would have been a super key, possibly a candidate key, the real primary key, but it was not the "primary key" key that is currently in the table design.

"Surrogate" means "to take the place of", but as we have just seen, a meaningless key does not take the place of a primary key from a logical point of view.  Yes, it does uniquely identify a row in a table, but only in a physical sense, not in terms of the data.  And how could it, since it has no connection with the data in that row.

Which brings us to Misconception #4; A meaningless identity or auto number column can serve as a primary key.   A meaningless key is simply a tag or pointer, but in regards to relational theory, it cannot be a primary key.

So is there anything that could be called a true surrogate key?  Well what if we took that meaningless CustID and handed it to the customer when they first started doing business with us?  In doing that, we would give it meaning.  It is now known to us and the customer and would never be given to another customer.  We could now use it to identify a customer uniquely in a logical context, so yes, it would serve as a primary key even though it is an artificial (non-natural) attribute.  A subtle difference to be sure, but it does make a difference.

I can see the question now; Ok the above is all well and good and I now understand the differences, but where does that leave me in developing applications?

Well first, like 99% of the developers out there at this point (including myself), you're going to use meaningless keys (note that I did not call it a surrogate though<g>) in your databases.   But with understanding the above, you also now realize that it might mean:

1. You might need to maintain additional indexes and/or code to form a constraint based on a primary key - how does your database ensure that you don't have a customer entered twice?
2. You might want to make changes in your user interface - How can I present data to the user as efficiently as possible in order to uniquely identify a specific instance of something?
3. Are there places where a surrogate key can be used?
4. Are there places where I might not want to use a meaningless key?

By asking yourself these questions and with the understanding gained from the above, you now should be able to answer them.  For example, if we take number four, there is one place where I think it is just down right silly to add a meaningless key; that is in a many to many linking table.

Given:

tblBooks
BookID - Identity - PK
BookTitle - Text
ISBNNumber - Text

tblAuthors
AuthorID - Identity - PK
LastName - Text
FirstName - Text

And that many authors can author more then one book and a book can have one or more authors, then you use a linking table to form the many to many relationship like this:

tblAuthorsAndBooks - One record per Author / Book combination
AuthorBookID - Identity - PK
AuthorID - Long - CK1-A
BookID - Long - CK1-B

"CK" stands for candidate key.  Looking at this, including AuthorBookID just for the sake of having a meaningless key is a waste.  The AuthorID/BookID combination must be unique, so we need a constraint (usually done with an index) on it anyway.   The table should look like this:

tblAuthorsAndBooks - One record per Author / Book combination
AuthorID - Long - PK-A
BookID - Long - PK-B

Another place I think it is a waste to add a meaningless key is in a simple lookup table which has a code and a description as the only attributes:

tblCreditCodes
Code - Text - PK
Description - Text

The Code in of itself is the primary key and can be used as the key.  So why would you want to do this:

tblCreditCodes
CreditID - Identity - PK
Code - Text - CK1
Description - Text

Well one reason mentioned for using surrogates or meaningless keys was for performance.  If this was going to be a large table and the Code attribute was bigger then a long (4 bytes), then it might make sense to use an identity field in the table.    But given that, then maybe I should change my user interface and just present a Description and have a table like this:

tblCreditCodes
CreditID - Identity - PK
Description - Text

Again though realizing I will need to maintain an additional index on Description so that I can't enter the same thing twice.  As you can see, the important point about all this is that by understanding the underlying concepts vs the real world challenges that we all face can in the end help you design and develop better databases and applications.

In conclusion, I hope you found this article enlightening or at the very least given the subject matter and your judgment of my points, entertaining.  Comments are more then welcome.

Jim Dettman
11
6,538 Views
Jim Dettman (EE MVE)Volunteer
CERTIFIED EXPERT
Independent consultant specializing in the writing of custom packages for businesses.

Comments (7)

Commented:
>>  it is the candidate with the best overall qualifications (minimal length, stability, and familiarity)
>> in providing a unique key, and as such, is denoted as the "primary" key.

That is however a subjective choice and is of no practical significance so there is no sound basis to argue one way or the other. I would humbly suggest that there is no pressing need for articles about selecting a primary key. Writing such an article will only perpetuate the myth that this is somehow an important issue (it isn't). Better to write articles about how to use ALL keys correctly.

Apologies, I misread your remarks about changing primary keys. You do correctly say that primary keys are changeable.
I enjoyed your article. I liked the way you present the misconceptions in a logical and progressive way. As database analyst, I often had to explain the difference between "natural" and "surrogate" keys, and ended up using the expressions "exposed key" and "hidden key".

Examples of "exposed keys" are licence plates, social security numbers, atomic codes from the periodic table, ISO country and currency codes, scientific names of species, and yes, telephone numbers, domain names and e-mail addresses. They are all human attempts to describe or "reach" something in the most concise way. They are, as stable and minimal as possible, and tend to also become familiar. Employees of a company tend to recognize immediately their cost centre codes, for example, as being "exposed keys".

Hidden keys are those normally not displayed to the user, for example transaction numbers. This is actually an important question for many tables. "I will number your categories, do you want to expose these new numbers?". If yes, it should be explained that the "keys" will tend to spill out of the database, and be used in communication. Soon, it will no longer be practical to change the surrogate key, it will have become a natural key. If it is exposed, there is the choice of a numeric or alphanumeric key, and the decision on how keys are generated.

I found that distinction very useful. Simply because exposed keys need an authority, besides the database developer, they become part of the "user perceived" data structure, and as such someone has to decide how to uniquely identify "things". My responsibility is to choose a *technically* efficient key by using an existing or newly created exposed key, or by numbering my records internally. I can even choose to do so when the user-perceived key is different, for optimisation (replacing 50 character unique category names by an integer in joins). That is, as you point out, irrelevant to the user and to the fundamental nature of the data: the database's primary key is a number, the user's primary key is the category name.

In the end, the six fields you include in your table 'customers' are already surrogate keys because they are words and codes, instead of natural attributes (measures, coordinates, descriptions, properties): the company name has been chosen to be different from other nearby companies, the street names to be unique in a city, the city names to be unique in a state. The state is probably the official state abbreviation, the zip code is clearly a surrogate key for places within a country, and the phone number is of course unique... There is nothing "natural" about this information. Every element is made of words specifically designed to discriminate the objects they name from other similar objects...

All I'm really saying is the the nuances between "natural", "primary", and "surrogate" are less useful than the simple "exposed" versus "hidden" paradigm. Not in relational theory, of course, but in actual database development.

This isn't meant to diminish what you wrote in any way, on the contrary. Perhaps the debates you mention were caused by semantic misalignment rather than by actual differences in views (one side using "surrogate key" to mean "hidden key", for example).

(°v°)
CERTIFIED EXPERT
Fellow
Most Valuable Expert 2017

Author

Commented:
Thanks!

  Now if I could only find the time to write a few more; I've got a sticky with about 12 ideas for articles on my montior and 2 or 3 that have been half written for the past four months.

JimD.
Jim thanks so much for taking the time to write this article!! It has really helped me a lot and your writing style has transformed something abstract and difficult to understand for a beginner like me to some easy to digest and logical!! Thanks once again and your efforts are greatly appreciated by me and all the rest!! = ))
BitsqueezerDatabase Developer
CERTIFIED EXPERT

Commented:
Hi,

good article about the difference between what "internal" and "external" keys are in the sense of "exposed" or "hidden" keys like harfang said above.

But I completely disagree about the term "PKs are not 'special'". That's in most cases wrong. The big difference between PKs and other indices made unique (where logically is no difference, you can apply a PK on an autonumber field and also a unique key) is the way they influence the physical order of records in a database.

In most relational databases including Access and SQL Server, the PK is a "clustered index" (in SQL Server you can create PKs without clustering, as far as I remember, but in Access you can't and in SQL Server it is the default that a PK is always a clustered index).

That means: If you have an ID field with "1,2,3" as records the physical order of the records will be "1,2,3". If the ID field used for the PK is not an autonumber field so that you can change it manually and you change the first record to "4" then the physical order of all records will be changed, after changing the ID it will be "2,3,4" and not, in case of a simple unique index, "4,2,3" (means: the physical order has not been changed).

This is a VERY important speciality of PKs which is often forgotten in such discussions as it makes a big difference in performance. Using a "natural" key with informations like a phone number or a currency code means: If it will be changed, the complete table will be reordered physically on the data storage volume. So this is the most often reason why developers mostly uses a PK with an autonumber field: This guarantees that the physical order will never change as the next saved value is of course higher than the last. For the same reason it is not a very good idea in common to use a GUID as PK: It is a very random value, very long, text based and useless for communication (can seldom be used as for example customer ID). Used as PK that means it is guaranteed that it will reorder the physical structure of a table each time you write a new record.
(Access has not the same problem here so much: It reorders the records physically only if you use "compact & repair", but a database server does this permanently (of course with optimization about the point in time).)

The physical sorting of a table in the order of a PK is not "just for fun" - it could have been created like a normal unique key without clustering, but the idea of the developers of the PK idea is to make this one the fastest index of all other, this is the reason why it IS the Primary Key, not because it is the "most often used" unique key or the key with the highest importance. If that would be the case the PK would have no advantage against other unique keys, you would create i.e. more than one unique key and use one time this and other time that, you would have no difference. In that case the statement would be true that you can use a combination of two ID fields in an m:n table as PK without any difference.
The reason why it normally should NEVER be a combined index is exactly the physical sorting of the table. If you have m:n-tables they often changes its contents, and each time you change or add data to such tables it would be for example (in order of creating) "2,3 / 4,2 / 1,3" (each one a pair of IDs in an m:n table). So if you use a PK on both columns the table will be resorted as "1,3 / 2,3 / 4,2" which is not needed for any reason - no one ever looks into such a table and wants such sort order. But if you add an additional PK (although the PK itself is mostly never used for an own relation) as autonumber you have a PK which never changes , the only exception would be a deletion of a row.

Another reason why combined columns as PK are mostly a bad idea is that you always would need both columns (in an 1:n relation) to uniquely identify the wanted row in the other table, that's unnecessary overhead.

In real life scenarios it most often is the case: If you do not have a single column with a PK that never changes and has in most cases no "natural" meaning (exception would be i.e. a CustomerID) you will always add one later in development of an application/database. It's even often the case that you would create an artificial additional ID with a meaning although you already have an autonumber ID in cases where you must make sure that the created unique number has no gap in the numbering like invoice numbers where a finance office would not be satisfied if you have a gap in the number and cannot explain why this number never was an invoice number because someone started to add a record, the ID was created and he has undone that - the ID will not be used again in an autonumber field.

I also sometimes thought about using a meaningful attribute as PK when it is sure to be unique, but in most cases I regret that later because of the stated reasons and so I nowadays don't think much about PKs and always create a long integer autonumber field for each and any table and have never problems doing so.

Cheers,

Christian

View More

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.