Nature key or surrogate key (country table)?

This has probably been discuss a number of times "nature keys vs surrogate" etc

I have an original database design where the previous architecture used a mix of both nature keys and surrogate keys

The database is currently being redesigned, all tables have been converted to "surrogate" keys
Except for the country, currency, and state table these 3 tables are still using nature keys (makes life simple)

However i'm thinking we "should" convert them all to surrograte keys aswell
but the real question is, is there really a point?
when does
-a country change name?
-a currency change currency code?
-a state change name?

It can happen! and has! (hence the reason considering having surrogate keys in all tables).
Who is Participating?
Mark WillsTopic AdvisorCommented:
Well, always an interesting question...  And you are correct, there is a previous and fairly long thread here in EE somewhere on this same topic...

This is for the Country Table ? And Currency Table ? And State ? - What about Postcode - that is even more of a topical discussion point :)

In cases where the data is going to be unique, is not going to change, is reasonable small in nature has some kind of in built sort, is not prone to a lot of inserts and deletes, then do you really need a surrogate - given the primary access method is likely to be the country code - or even the name ?

E.g. Think most countries are known by a three or four character abbreviation, certainly currencies tend to be so, and with a finite number (meaning unlikely to be heavily adding and removing countries) it is the type of table that can use it's abbreviation as a key.

In fact, would go one step further and postulate that the country name could in fact be a self describing code and name and key. Because these types of tables are fairly static, and normally chosen from a drop down list, then it could easily be a self descriptive code.

But (and there is always one), it is not just that table in isolation - you often have states, districts, regions hanging off that table, and that is what makes all the difference. That is when a surrogate - and by that I do mean an Identity column - does make the world of difference in terms of "slave" tables hanging off that master, and being an indentity column, does make index searching very much quicker when used to look up a state, or a suburb etc.

And that is the real reason why we use surrogates, it is not so much the masterfile table itself, it is everything else that uses them / captures them and what happens in those tables in terms of possible indexing, sorting etc (integers do a great job as key values), and of course, being unaffected by typing errors (e.g. can easily change the name without affecting any previous entries).

So, change them all, each and every one of them and keep to a very good naming convention so that identities / keys are readily identifiable.

There is also a similar / maybe associated question about the use of a generic reference table for name/value pairs. Certainly Country and Currency descriptions have been held in those type of structures before (and phone types too for that matter).
First, I do not equate surrogate keys to an identity column.  A surrogate key is simply an artificial key with no business meaning.  Often surrogates are identity, but it is not necessarily.  Further, an identity column is NOT always a surrogate key.

Logical data models do not have surrogate keys.   All tables should have natural keys with business meaning.  The option to assign a surrogate key is part of the physical database design.  As such, valid reasons for doing so depend on the particular database.  Common reasons are:
   Limits to the number of columns allowed in the primary key
   Performance advantages of narrow, integer keys
   Simplifies some complex SQL

If you do opt for a surrogate key, be sure to preserve the natural key with a unique key constraint.  If that is not available in your databse, then use an unique index.  

In the case of the tables you mentioned, I see no reason for a surrogate key.  

I'll chime in on this topic but in a far less detailed manner.  

Let's say that for your Country and State names, you have an nvarchar(100) column.  Keeping in mind, that the USA is not the only nation that has states.  Now let's say that you have 1 million addresses in your system in a mix of Home, Billing and Work.  Each of those columns will carry the the nvarchar(100) column for each of the values.  It could be argued that that the State alone could be used to identify an address so carrying both, but if a nation decides to rename a state to Mississippi you'd have problems.  So let's continue to presume that you have 2 nvarchar(100) columns.  Your tables will require far more data than if you represent them using surrogate keys.  This will also help to ensure that your data can be properly indexes since SQL Server limits indexes to 900 bytes.  And those two nvarchar(100) columns would chew up 400 bytes if you used them in some sort of a composite index.

Anyway, like I said.  My speech may not be as divine and detailed as MW and dqmq, but they are valid things to consider.
Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Mark WillsTopic AdvisorCommented:
Opening paragraph, quite true, but, in all reality how often do you see  "id int identity clustered primary key" which ends up being a perfect surrogate - although technically you are correct and certainly does not have to be that way. But in my books, if I am talking surrogates then I will associate the innocuous identity to fit that purpose, lest we then need to build some kind of control number system, and adding a great deal of complexity around an already available construct.

Data designers tend to be biased or demonstrate a "personal preference" for either natural or surrogates, and can argue and quote the relative benefits and merits of their preferred approach. Data Warehouse designers do tend more toward surrogates because it more abley supports the transformation process where several "natural" keys can refer to a similar or collective business view of the data. I certainly was an avid fan of the natural key, and since writing a data warehouse system - mainly the ELT processes almost from the ground up, grew to love the surrogate and then asked why it wasn't that way from the get go - in fact 30 plus years ago, it was certainly the case to use key codes to save on space and introduce speed where sort and seek algoritms were primitive and such key codes were really shorthand access to the real name, as speed and flexibility in indexing evolved, then those codes also evolved and in doing so became meaningful interpretations or abbreviations and really started the whole natural key debate... So, it is nothing new, and the surrogate is ages old. In fact I want to see the day where systems and algorithms are so bloody good, that for name / value pairs we can use self serving data so that the business model can be finally divorced totally from the data model.

There was certainly a more recent period of time, where some "database designers" simply threw a magic number at the problem and hoping it will go away and did a huge disservice to the integrity of professional data designers and data modellers and is possibly where some of the taint has come. There was a period there where building a database was almost a desktop job and anyone can do it. Well it has kind of gone full circle again where magic numbers were so much easier to store and sort on with speed, performance and indeed internationalisation has become more important. In such cases supporting language thesaurus, different native languages, unicode, along with some of the performance reasons mentioned earlier, we will see growing popularity or resurgance of the surrogate, and hopefully for all the right reasons with all the lessons learned in use of natural keys in terms of the need for business oriented referential integrity.

Off that little soapbox for now...

There is always a reasonable demand for a unique constraint on the business data in a "master" type situation - after all there should only be one entry "Australia" in the Country Master, but still doesn't stop "Australia" and "Australya" from co-existing, or indeed "USA","U.S.A","United States","America". Indeed if using a natural key, implying the use of a code or other representation of business data (common names in this scenario) does not stop "AU","Australia"; "AUS","Australia"; "AUD","Australia" - still leaving a unique name choice of "Australia" from a drop down.  That previous example is the biggest myth about natural keys implying business oriented data integrity or exclusivity of use. The other big myth is that natural keys impose some kind of business logic above and beyond that which can be achieved via surrogates. Keys also need to be able to withstand the test of time, a unique value must retain the unique meaning at that point in time. In an ETL experience, have seen a code "WD" which used to mean Warehouse Dispatch become Warehouse Delivery - with a change in meaning from "Sending Out" to "Picking Up" ie opposite. Which is another reason why it is attractive to divorce the architecture of primary keys from the business requirement of natural keys.

Could even argue that a need for "natural keys" does not really exist as opposed to unique contraints - if you regard constructs such as primary keys and indexes as performance as architectural objects, in which case a column known as "ID" can just as easily provide that service, and fulfill it's role as a relationship mapper.

However, there is always a need for constructs like unique constraints, foereign keys and indexes on the data model components and the way SQL works, having key values as row pointers to data will make the other index options considerably more efficient - and what better way to get that than with a relatively small data element such as an identity as a clustered primary key - being permitted just one identity and one clustered index, the identity as primary key does fulfill both requirements of uniqueness and key value quite nicely. And because it is used as a pointer to row data from other indexes, and as a referential integrity pointer, then it does in essence become an integral part of the data set.

In my books "natural keys" are a business term, "surrogate keys" is an architectural term. The business requirement will help determine the architectural requirement which I believe is best achieved through surrogate keys, and leave the business use at a higher level.
Mark WillsTopic AdvisorCommented:
seeing Brandon's last post, wish I had got off the soapbox earlier, and with all that diatribe, did not even get started with "composite keys", "index fragmentation", "query plan optimisation" and such like. Oh well...
robertkcAuthor Commented:
"This is for the Country Table ? And Currency Table ? And State ? - What about Postcode - that is even more of a topical discussion point :)"

In the actual database tables were used

"First, I do not equate surrogate keys to an identity column.  A surrogate key is simply an artificial key with no business meaning. "

The reason we have replaced all natural keys in previous design with surrogate keys is because surrogate keys are more efficient and easier to manage than natural keys.
Some experts may disagree.

@Mark Wills, maybe we aussies just think alike? :)

Mark WillsTopic AdvisorCommented:
Could be... I prefer to think that we are simply ahead in time. And just to be contrary, a most interesting example of where a natural key might be useful is associated with time ie utc date - regionally generic, globally applicable, business oriented, data significant, and unique (as a time measure). There are even date functions to assist the special significance of date.

select getutcdate() as utc_date,datediff(n, getutcdate(),getdate()) as ahead_of_time

660 minutes, if you can beat that, then you must be one of us "ahead of time" - anyone biting yet ?

>The reason we have replaced all natural keys in previous design with surrogate keys is because...

I doubt you "replaced" the natural keys; you just disregarded them.  When you disregard a natural key, you do so at the peril of losing the business integrity it represents. Did you remove the natural key columns from the table?  No.  They are still there; you just removed the constraints that enforce integrity of the natural key.  
>The other big myth is that natural keys impose some kind of business logic above and beyond that which can be achieved via surrogates.

Surrogate keys, by definition, have no business meaning and cannot impose
ANY business integrity whatsoever.  Natural keys, on the other hand, are very much about imposing business integrity.  

>...surrogate keys are more efficient and easier to manage than natural keys.
I agree surrogate keys are easier to manage.  That's because they have no business integrity.  If you prefer to overlook the integrity of your database, then go for it.
Mark WillsTopic AdvisorCommented:

think we have had the business integrity discussion before... data relationships, referential integrity, constraints can readily be achieved with surrogates and at least  accommodates the foundations for business integrity, how are natural keys (instead of surrogates) any more capable (he says with a 'gulp' ) ? because I think it is probably the only really contentious issue in the whole surrogate v natural debate... Your thoughts (I do find it interesting, and not trying to test your patience) ?

Agree that surrogates do make it easier, in fact, "throwing the magic number" has sometimes introduced questionable habits. With surrogates what can tend to happen (typically with data modelling tools) is the over use of reference / map tables e.g.

contacts  (contactid int identity primary key clustered)
contacthistorylink (contactid int, contacthistoryid int)
contacthistory (contacthistoryid int identity primary key clustered, date datetime)

and really, should be more like :

contacts  (contactid int identity primary key clustered)
contacthistory (contacthistoryid int identity primary key clustered, contactid int references contacts(contactid), date datetime)

get rid of the link table because it is a compound key, adds complexity in retrieving data, disassociates real ownership of data at the "detail" level.

But that is more a discipline about surrogates rather than an advantage of using natural keys...
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.