I titled this "The great PK debate: Natural Keys vs. Surrogates (again)", because on almost any developer list or forum that deals with databases, this topic always comes up at one point or another and usually with very strong opinions on both sides. What started me on this article was that several days ago I tripped over an article on primary key's that was just flat out wrong.
After poking around a bit on the net I realized that over the years, a lot of myths and misconceptions have grown up around this topic. It seems that as the years go by, more and more gets published and discussed on this topic, but things only get cloudier. For example, are auto number fields really a surrogate key? Can they function as a primary key? I hope that through this article, I will be able to clarify and explain fully enough the answers to questions such as those and hopefully, another great PK debate will not ensue (or at least if it does, you'll have plenty of ammo for the debate<g>).
When I was looking at articles out on the net, one thing that struck me about almost everything I read is that not many started off on the right foot. So the first thing to clear up is what relational theory actually is. Relational theory (the big "R" as some call it), was developed by E.F. (Tedd) Codd while working at IBM. His first paper "Derivability, Redundancy, and Consistency of Relations" was published in 1969 as a research paper. While Codd was working on ways to store data in a data bank, what he actually did is in a very scientific and mathematical way describe a theory of data organization, based on a branch of mathematics that deals with sets of data. In a nutshell, it was the overall concept that data could be modelled in a very logical and rigorous way.
So R theory is not concerned with how data is physically stored nor does it only apply to computer systems. What R theory is concerned with and only with is the logical organization of data in order to provide information. To make that clearer, I can apply relational theory to data kept on a chalk board, with sticky notes on a wall, or even with pen and paper. With R theory, it is the meaning of the data and how it is organized that is important and nothing else. As you read on, it is imperative that you keep this distinction in mind.
That brings us to Misconception #1; relational theory does not exist because of databases; it is something that is applied to them. And, Misconception #2; R theory deals with the relationships between your tables.
To a certain extent R Theory is applied to the database to deal with the relationships between the tables, well, when you discuss normalization, which we won't get into here. More fundamentally it applies to something else and is talking about something else, which is a relation, and that is why most articles start off on the wrong foot, failing to recognise the relation.
A relation represents a set of data where the given set all pertains to the same type of "thing" or entity. In a relation, the data is laid out in rows and columns. The columns, each which represent an attribute of the entity (describes it in some way), and the rows (which are called tupples) represent a specific instance of a set of attributes in the relation. Now if you stripped away all the mumbo jumbo technical terms and use database terms instead, you would see that I have just described a table. The attributes are the fields and the tupples are the records.
So let's start now with where most other articles pick up and that is with the classic example of a customer list. Since our relation is about customers and for the sake of simplicity, we'll stick to the following attributes in the relation:
One of the fundamental things I didn't mention above about forming a relation is that each tupple needs to be unique. If by combining all the attributes, you cannot uniquely identify a given tupple, then you don't have enough attributes in your relation. It is the combination of one or more attributes, which will uniquely identify a tupple that will become the primary key. Can you have more than one primary key? No! Can you have more than one possible primary key? Certainly and these are called candidate keys, one of which will be designated as the primary key.
Each grouping of one or more attributes that can uniquely identify a tupple is called a super key. There are a multitude of super keys in the above and going to the extreme, would be combining every attribute to form a key.
But when you work with data, there are a few attributes that a primary key should have:
1. It should be as stable as possible.
2. It should be as minimal as possible.
3. It should be as familiar as possible.
Given that, we certainly would not want to use most of the super keys in a relation. Keeping the above in mind, let's take a look at our example. Would name alone suffice to uniquely identify each row? Probably not as you could easily have "ABC Company" in two different cities. How about Name and Address? Well that would be better, but might still not be unique enough. You might have two "ABC Company" on "33 Main street", but in different cities. What about combining Name, Address, City, and State? That might work. How about Name, Address, and Zip? Possibly. What about the phone number by itself? Yes, that would probably work too.
So we have three candidate keys, but which one should be the primary key? Again, looking at attributes a good primary key should have, phone number is probably the best bet. Phone numbers typically don't change unless you move so it would be fairly stable and it certainly is shorter then using either of the other candidate two keys. Last, it should be familiar to anyone that works there. But cannot the phone number change? Sure and most would say that this means that it cannot be a primary key!
Which of course brings us to Misconception #3; Primary keys can never change. Primary keys can and will change; companies move, phone numbers can change, etc. There is nothing within relational theory that says a primary key cannot change. Again, keep in mind that still we are talking about the logical representation of data. We have not moved on to how it is physically stored or issues with that.
Up to this point, we've been discussing what are commonly referred to as Natural Keys. That is the key is derived from the existing attributes.
Early database designs used the method above to choose a key for a table. But rather quickly, it was discovered that the computer systems we had could not keep up performance wise as the keys had a tendency to become fairly long (when forming joins between relations). This is the point where surrogate keys came into use. Many believe that a surrogate key should be meaningless and have no connection with the data in the row. This partly came about because of data warehousing. When warehousing data, it becomes imperative that a key assigned to a row never changes. Some database designers even go to the point of saying that a meaningless surrogate should never be displayed to the user. But does a meaningless number used as a surrogate work as a primary key? Is it really a surrogate? Let's look at an example where the company has a customer table that looks like this:
CustID - Identity - Primary Key
Name - Text
Address - Text
City - Text
State - Text
Zip - Text
Phone Number - Text
So our CustID attribute is now just a number that goes up by one for each new record. It's never given to the customer and it does uniquely identify each row in the table. Now imagine that I'm a customer calling to place an order:
Customer: Hi, I'm calling to place an order.
Sales Rep: Have you done business with us before?
Sales Rep: Great. I can just pick you from the list..uh, what's the company name?
Customer: ABC Company
Sales Rep: Oh...well I have six companies listed with that name...where are you located?
Customer: New York City, NY
Sales Rep: Wow! Believe it or not, there are three ABC Companies in NYC!...what's the address?
Customer: 7th avenue and 28th street
Sales Rep: OK great...this must be you... is your phone number 210-699-9999?
Sales Rep: OK, what did you want to order?
You've just seen in action the difference between a true primary key and one that is not even though it is labelled as a "primary key" in the table design. While CustID does serve to identify the record uniquely within the relation physically, it does not serve to identify a specific customer within the relation logically. In order to do that, the sales rep used the natural attributes of customers to ensure that the correct customer was being chosen. The attributes he used would have been a super key, possibly a candidate key, the real primary key, but it was not the "primary key" key that is currently in the table design.
"Surrogate" means "to take the place of", but as we have just seen, a meaningless key does not take the place of a primary key from a logical point of view. Yes, it does uniquely identify a row in a table, but only in a physical sense, not in terms of the data. And how could it, since it has no connection with the data in that row.
Which brings us to Misconception #4; A meaningless identity or auto number column can serve as a primary key. A meaningless key is simply a tag or pointer, but in regards to relational theory, it cannot be a primary key.
So is there anything that could be called a true surrogate key? Well what if we took that meaningless CustID and handed it to the customer when they first started doing business with us? In doing that, we would give it meaning. It is now known to us and the customer and would never be given to another customer. We could now use it to identify a customer uniquely in a logical context, so yes, it would serve as a primary key even though it is an artificial (non-natural) attribute. A subtle difference to be sure, but it does make a difference.
I can see the question now; Ok the above is all well and good and I now understand the differences, but where does that leave me in developing applications?
Well first, like 99% of the developers out there at this point (including myself), you're going to use meaningless keys (note that I did not call it a surrogate though<g>) in your databases. But with understanding the above, you also now realize that it might mean:
1. You might need to maintain additional indexes and/or code to form a constraint based on a primary key - how does your database ensure that you don't have a customer entered twice?
2. You might want to make changes in your user interface - How can I present data to the user as efficiently as possible in order to uniquely identify a specific instance of something?
3. Are there places where a surrogate key can be used?
4. Are there places where I might not want to use a meaningless key?
By asking yourself these questions and with the understanding gained from the above, you now should be able to answer them. For example, if we take number four, there is one place where I think it is just down right silly to add a meaningless key; that is in a many to many linking table.
BookID - Identity - PK
BookTitle - Text
ISBNNumber - Text
AuthorID - Identity - PK
LastName - Text
FirstName - Text
And that many authors can author more then one book and a book can have one or more authors, then you use a linking table to form the many to many relationship like this:
tblAuthorsAndBooks - One record per Author / Book combination
AuthorBookID - Identity - PK
AuthorID - Long - CK1-A
BookID - Long - CK1-B
"CK" stands for candidate key. Looking at this, including AuthorBookID just for the sake of having a meaningless key is a waste. The AuthorID/BookID combination must be unique, so we need a constraint (usually done with an index) on it anyway. The table should look like this:
tblAuthorsAndBooks - One record per Author / Book combination
AuthorID - Long - PK-A
BookID - Long - PK-B
Another place I think it is a waste to add a meaningless key is in a simple lookup table which has a code and a description as the only attributes:
Code - Text - PK
Description - Text
The Code in of itself is the primary key and can be used as the key. So why would you want to do this:
CreditID - Identity - PK
Code - Text - CK1
Description - Text
Well one reason mentioned for using surrogates or meaningless keys was for performance. If this was going to be a large table and the Code attribute was bigger then a long (4 bytes), then it might make sense to use an identity field in the table. But given that, then maybe I should change my user interface and just present a Description and have a table like this:
CreditID - Identity - PK
Description - Text
Again though realizing I will need to maintain an additional index on Description so that I can't enter the same thing twice. As you can see, the important point about all this is that by understanding the underlying concepts vs the real world challenges that we all face can in the end help you design and develop better databases and applications.
In conclusion, I hope you found this article enlightening or at the very least given the subject matter and your judgment of my points, entertaining. Comments are more then welcome.
If you have a question about something within an article, you can receive help directly from the article author. Experts Exchange article authors are available to answer questions and further the discussion.