Solved

Composite Integer Primary Key - Need Advice

Posted on 2009-03-30
9
372 Views
Last Modified: 2012-05-06
I am trying to determine the best way to create a primary key and assign values to the key fields for three database tables.  The three tables are as follows (I am only including the fields here relevant to the question):

Client (ClientId INT, Name varchar(50))
Project (ClientId INT, ProjectId INT, Name varchar(50))
Project_Task (ClientId INT, ProjectId INT, TaskId INT, Name varchar(50))

The primary key of each table is as follows (each PK is setup as a CLUSTERED INDEX):

Client: ClientId
Project: ClientId, ProjectId
Project_Task: ClientId, ProjectId, TaskId

I am designing this for a multi-tenant SaaS application, and I want to design for the fact that these tables may become very large.  Right now, the ClientId in the Client table, the ProjectId in the Project table, and the TaskId in the Project_Task table are IDENTITY fields (auto-generated integers).  This makes it easy to insert new records into these tables (id assigned by the database), but it also makes the ProjectId and the TaskId unique across all clients, which is unnecessary.  Example:

Row in Client table: ClientId=1, Name=ABC
Row in Client table: ClientId=2, Name=XYZ
Row in Project table: ClientId=1, ProjectId=1, Name=ABC Project
Row in Project table: ClientId=2, ProjectId=2, Name=XYZ Project
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=ABC Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=ABC Task 2
Row in Project_Task table: ClientId=1, ProjectId=2, TaskId=3, Name=XYZ Task 1
Row in Project_Task table: ClientId=1, ProjectId=2, TaskId=4, Name=XYZ Task 2

I am concerned that the TaskId may exceed the limits of the INT using this approach.  What I would really like is the ProjectId and the TaskId to be unique within each client and each project.  Example:

Row in Client table: ClientId=1, Name=ABC
Row in Client table: ClientId=2, Name=XYZ
Row in Project table: ClientId=1, ProjectId=1, Name=ABC Project
Row in Project table: ClientId=2, ProjectId=1, Name=XYZ Project
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=ABC Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=ABC Task 2
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=XYZ Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=XYZ Task 2

The only way I can think of achieving this is to manually assign the ProjectId and TaskId by querying the table for the MAX integer for that Client (or Project) and assign the next highest integer (MAX+1).  Then, I could make the TaskId a SMALLINT instead of an INT (there is no way a project will have more than 32,000 tasks assigned to it).

Is the MAX+1 option the only way to do this, or is there a better way?  If I go with the MAX+1 option, what disadvantages/concerns would there be?  What precautions would you have to take to make sure that MAX+1 is always unique in a multi-user environment?

FYI: I am torn between an alternative design where the primary key would be just a single identity column (i.e. integer or possibly GUID surrogate key), and then I would create a UNIQUE CONSTRAINT on my original primary key (i.e. Project Table: ClientId,ProjectId).  I would also then create the CLUSTERED INDEX on the UNIQUE CONSTRAINT, rather than the primary key.  If I do this, I am assuming it would make joins easier (only need to join on one field), but now I am maintaining two indexes on every table instead of one.  My major concern is performance, both on querying data as well as on inserts.  Opinions on this alternative approach would be greatly appreciated, especially using an integer vs. GUID for the surrogate key.  (I was thinking of using a GUID for external access to objects via Web Services).
0
Comment
Question by:ardolino
  • 5
  • 4
9 Comments
 
LVL 21

Accepted Solution

by:
JestersGrind earned 500 total points
ID: 24020241
I think you are taking the right approach with ClientID, ProjectID and TaskID being IDENTITY fields.  Yes, it makes each one unique across all clients.  It is unnecessary, but it's also irrelevant.  There's no advantage that I can think of to somehow reseed the key per client.  In the project table, ProjectID is the primary key and ClientID is an attibute of the project.  I don't think there is a need to make it part of the primary key.  By making it part of the primary key, you are saying that the same ProjectID could apply to multiple ClientIDs which I don't think is correct..  Same goes with the Tasks table.  It should have a TaskID as the primary key and ProjectID is an attribute.  ClientID doesn't need to be in the table at all because it can be found through the Projects table.  By having it in both tables you risk having data anomolies if the data somehow becomes diffferent between the two.

The main reason for using a guid over an integer is if you would ever have to merge multiple databases together, it's a lot easier with a guid than an integer because you virtually be assured there won't be any duplicates.  If that's not the case, I recommend going with integer.  

Greg


0
 

Author Comment

by:ardolino
ID: 24020900
Ok, so if I use this approach, where would I put the CLUSTERED INDEX, on the primary key, or create a separate UNIQUE CONSTRAINT?

What I am thinking is that the Project Task table will almost always be queried by Client (i.e. identify all tasks for all projects for a particular client).  In your design, I will always have to use a join between the Project Task table and the Project table to perform this query.  In my design, the ClientId is already a field on the Project Task table.  Which way is better, and why?
0
 
LVL 21

Expert Comment

by:JestersGrind
ID: 24021030
The clustered indexes should go on the primary keys.  A non-clustered index should go on the foreign key columns.  I don't think a unique constraint is necessary because you can only have one client per project and one project per task.  The reason for designing it that way is to ensure data integrity.  As far as performance when you're joining indexed integer columns, performance should be fast.  The issue with storing ClientID in more than one location is that you introduce potential data anomalies.  It's a data integrity issue.  With the same data being stored in two location, there is a possibility that they could be different.  

Greg


0
 

Author Comment

by:ardolino
ID: 24021271
Ok, I see your point.  Makes a lot of sense.

One last question - I know its redundant, but would there be any benefit in including the ClientId in the primary key in a multi-tenant environment (SaaS application)?  What I am thinking is that if the ClientId is part of the primary key and it is a CLUSTERED INDEX, and since every query will have the ClientId in it anyway, would it help in terms of performance (i.e. all Client data is stored together instead of throughout the table)?  My understanding is that a CLUSTERED INDEX determines the order in which data is stored in the database.  And, if the ClientId always part of the primary key, referential integrity will be enforced (preventing data anomolies).  What do you think?
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 21

Expert Comment

by:JestersGrind
ID: 24021977
I don't see any benefit in it.  Yes, it's an extra join to get the client information, but we're talking about (I assume) a relatively small table.  Even if there are thousands of clients, that's small as far as table sizes go.  The join between the clients table and the table with the ClientID as a foreign key is minuscule.  You could even avoid that join by caching the ClientID and then passing it to each stored procedure that needs it.

Greg


0
 

Author Comment

by:ardolino
ID: 24023444
Sorry, I did not clarify.  My concern is not so much with the Client table, but other tables in a multi-tenant database.  I did not mention this before, but:

Project has a one-to-many relationship with a System.
System has a one-to-many relationship with an Object.
Object has a one-to-many relationship with a Property.

If I follow your recommendation, then I would assume these tables would look like this:

Project (PK ProjectId, FK ClientId)
System (PK SystemId, FK ProjectId)
Object (PK ObjectId, FK SystemId)
Property (PK PropertyId, FK ObjectId)

CLUSTERED INDEX on PK, non-CLUSTERED INDEX on FK for all tables

So now lets say I have a 1000 clients (1000 records), each having a 100 projects (100,000 records), each project having 2 systems (200,000 records), each system having 10 objects (2,000,000 records), and each object having 50 properties (100,000,000 records).  These numbers are certainly possible with 1000 clients.

Here's my concern (which I also did not mention before, so I apologize).  I don't want the user to execute a query for any item that does not belong to that client.  So I was thinking of adding the ClientId to every query.  If the ClientId is on the Project table and I'm querying the Property table, I'd have to join 4 tables.  Given the number of possible records, will this become a performance issue?
0
 
LVL 21

Expert Comment

by:JestersGrind
ID: 24028004
No, I don't think performance should be an issue.  When you are joining on clustered index, integer, primary key fields, queries should return data very fast, even with millions of records.  We have tables like that where I work and performance has never been an issue.

Greg


0
 

Author Comment

by:ardolino
ID: 24029179
Thank you so much for all of your help.
0
 

Author Closing Comment

by:ardolino
ID: 31564367
Thank you for all of your help.
0

Featured Post

What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

Join & Write a Comment

In this article we will get to know that how can we recover deleted data if it happens accidently. We really can recover deleted rows if we know the time when data is deleted by using the transaction log.
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Familiarize people with the process of utilizing SQL Server functions from within Microsoft Access. Microsoft Access is a very powerful client/server development tool. One of the SQL Server objects that you can interact with from within Microsoft Ac…
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now