Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Composite Integer Primary Key - Need Advice

Posted on 2009-03-30
9
Medium Priority
?
405 Views
Last Modified: 2012-05-06
I am trying to determine the best way to create a primary key and assign values to the key fields for three database tables.  The three tables are as follows (I am only including the fields here relevant to the question):

Client (ClientId INT, Name varchar(50))
Project (ClientId INT, ProjectId INT, Name varchar(50))
Project_Task (ClientId INT, ProjectId INT, TaskId INT, Name varchar(50))

The primary key of each table is as follows (each PK is setup as a CLUSTERED INDEX):

Client: ClientId
Project: ClientId, ProjectId
Project_Task: ClientId, ProjectId, TaskId

I am designing this for a multi-tenant SaaS application, and I want to design for the fact that these tables may become very large.  Right now, the ClientId in the Client table, the ProjectId in the Project table, and the TaskId in the Project_Task table are IDENTITY fields (auto-generated integers).  This makes it easy to insert new records into these tables (id assigned by the database), but it also makes the ProjectId and the TaskId unique across all clients, which is unnecessary.  Example:

Row in Client table: ClientId=1, Name=ABC
Row in Client table: ClientId=2, Name=XYZ
Row in Project table: ClientId=1, ProjectId=1, Name=ABC Project
Row in Project table: ClientId=2, ProjectId=2, Name=XYZ Project
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=ABC Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=ABC Task 2
Row in Project_Task table: ClientId=1, ProjectId=2, TaskId=3, Name=XYZ Task 1
Row in Project_Task table: ClientId=1, ProjectId=2, TaskId=4, Name=XYZ Task 2

I am concerned that the TaskId may exceed the limits of the INT using this approach.  What I would really like is the ProjectId and the TaskId to be unique within each client and each project.  Example:

Row in Client table: ClientId=1, Name=ABC
Row in Client table: ClientId=2, Name=XYZ
Row in Project table: ClientId=1, ProjectId=1, Name=ABC Project
Row in Project table: ClientId=2, ProjectId=1, Name=XYZ Project
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=ABC Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=ABC Task 2
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=XYZ Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=XYZ Task 2

The only way I can think of achieving this is to manually assign the ProjectId and TaskId by querying the table for the MAX integer for that Client (or Project) and assign the next highest integer (MAX+1).  Then, I could make the TaskId a SMALLINT instead of an INT (there is no way a project will have more than 32,000 tasks assigned to it).

Is the MAX+1 option the only way to do this, or is there a better way?  If I go with the MAX+1 option, what disadvantages/concerns would there be?  What precautions would you have to take to make sure that MAX+1 is always unique in a multi-user environment?

FYI: I am torn between an alternative design where the primary key would be just a single identity column (i.e. integer or possibly GUID surrogate key), and then I would create a UNIQUE CONSTRAINT on my original primary key (i.e. Project Table: ClientId,ProjectId).  I would also then create the CLUSTERED INDEX on the UNIQUE CONSTRAINT, rather than the primary key.  If I do this, I am assuming it would make joins easier (only need to join on one field), but now I am maintaining two indexes on every table instead of one.  My major concern is performance, both on querying data as well as on inserts.  Opinions on this alternative approach would be greatly appreciated, especially using an integer vs. GUID for the surrogate key.  (I was thinking of using a GUID for external access to objects via Web Services).
0
Comment
Question by:ardolino
  • 5
  • 4
9 Comments
 
LVL 21

Accepted Solution

by:
JestersGrind earned 2000 total points
ID: 24020241
I think you are taking the right approach with ClientID, ProjectID and TaskID being IDENTITY fields.  Yes, it makes each one unique across all clients.  It is unnecessary, but it's also irrelevant.  There's no advantage that I can think of to somehow reseed the key per client.  In the project table, ProjectID is the primary key and ClientID is an attibute of the project.  I don't think there is a need to make it part of the primary key.  By making it part of the primary key, you are saying that the same ProjectID could apply to multiple ClientIDs which I don't think is correct..  Same goes with the Tasks table.  It should have a TaskID as the primary key and ProjectID is an attribute.  ClientID doesn't need to be in the table at all because it can be found through the Projects table.  By having it in both tables you risk having data anomolies if the data somehow becomes diffferent between the two.

The main reason for using a guid over an integer is if you would ever have to merge multiple databases together, it's a lot easier with a guid than an integer because you virtually be assured there won't be any duplicates.  If that's not the case, I recommend going with integer.  

Greg


0
 

Author Comment

by:ardolino
ID: 24020900
Ok, so if I use this approach, where would I put the CLUSTERED INDEX, on the primary key, or create a separate UNIQUE CONSTRAINT?

What I am thinking is that the Project Task table will almost always be queried by Client (i.e. identify all tasks for all projects for a particular client).  In your design, I will always have to use a join between the Project Task table and the Project table to perform this query.  In my design, the ClientId is already a field on the Project Task table.  Which way is better, and why?
0
 
LVL 21

Expert Comment

by:JestersGrind
ID: 24021030
The clustered indexes should go on the primary keys.  A non-clustered index should go on the foreign key columns.  I don't think a unique constraint is necessary because you can only have one client per project and one project per task.  The reason for designing it that way is to ensure data integrity.  As far as performance when you're joining indexed integer columns, performance should be fast.  The issue with storing ClientID in more than one location is that you introduce potential data anomalies.  It's a data integrity issue.  With the same data being stored in two location, there is a possibility that they could be different.  

Greg


0
Learn Veeam advantages over legacy backup

Every day, more and more legacy backup customers switch to Veeam. Technologies designed for the client-server era cannot restore any IT service running in the hybrid cloud within seconds. Learn top Veeam advantages over legacy backup and get Veeam for the price of your renewal

 

Author Comment

by:ardolino
ID: 24021271
Ok, I see your point.  Makes a lot of sense.

One last question - I know its redundant, but would there be any benefit in including the ClientId in the primary key in a multi-tenant environment (SaaS application)?  What I am thinking is that if the ClientId is part of the primary key and it is a CLUSTERED INDEX, and since every query will have the ClientId in it anyway, would it help in terms of performance (i.e. all Client data is stored together instead of throughout the table)?  My understanding is that a CLUSTERED INDEX determines the order in which data is stored in the database.  And, if the ClientId always part of the primary key, referential integrity will be enforced (preventing data anomolies).  What do you think?
0
 
LVL 21

Expert Comment

by:JestersGrind
ID: 24021977
I don't see any benefit in it.  Yes, it's an extra join to get the client information, but we're talking about (I assume) a relatively small table.  Even if there are thousands of clients, that's small as far as table sizes go.  The join between the clients table and the table with the ClientID as a foreign key is minuscule.  You could even avoid that join by caching the ClientID and then passing it to each stored procedure that needs it.

Greg


0
 

Author Comment

by:ardolino
ID: 24023444
Sorry, I did not clarify.  My concern is not so much with the Client table, but other tables in a multi-tenant database.  I did not mention this before, but:

Project has a one-to-many relationship with a System.
System has a one-to-many relationship with an Object.
Object has a one-to-many relationship with a Property.

If I follow your recommendation, then I would assume these tables would look like this:

Project (PK ProjectId, FK ClientId)
System (PK SystemId, FK ProjectId)
Object (PK ObjectId, FK SystemId)
Property (PK PropertyId, FK ObjectId)

CLUSTERED INDEX on PK, non-CLUSTERED INDEX on FK for all tables

So now lets say I have a 1000 clients (1000 records), each having a 100 projects (100,000 records), each project having 2 systems (200,000 records), each system having 10 objects (2,000,000 records), and each object having 50 properties (100,000,000 records).  These numbers are certainly possible with 1000 clients.

Here's my concern (which I also did not mention before, so I apologize).  I don't want the user to execute a query for any item that does not belong to that client.  So I was thinking of adding the ClientId to every query.  If the ClientId is on the Project table and I'm querying the Property table, I'd have to join 4 tables.  Given the number of possible records, will this become a performance issue?
0
 
LVL 21

Expert Comment

by:JestersGrind
ID: 24028004
No, I don't think performance should be an issue.  When you are joining on clustered index, integer, primary key fields, queries should return data very fast, even with millions of records.  We have tables like that where I work and performance has never been an issue.

Greg


0
 

Author Comment

by:ardolino
ID: 24029179
Thank you so much for all of your help.
0
 

Author Closing Comment

by:ardolino
ID: 31564367
Thank you for all of your help.
0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In part one, we reviewed the prerequisites required for installing SQL Server vNext. In this part we will explore how to install Microsoft's SQL Server on Ubuntu 16.04.
An alternative to the "For XML" way of pivoting and concatenating result sets into strings, and an easy introduction to "common table expressions" (CTEs). Being someone who is always looking for alternatives to "work your data", I came across this …
Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

782 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question