Link to home
Start Free TrialLog in
Avatar of ardolino
ardolino

asked on

Composite Integer Primary Key - Need Advice

I am trying to determine the best way to create a primary key and assign values to the key fields for three database tables.  The three tables are as follows (I am only including the fields here relevant to the question):

Client (ClientId INT, Name varchar(50))
Project (ClientId INT, ProjectId INT, Name varchar(50))
Project_Task (ClientId INT, ProjectId INT, TaskId INT, Name varchar(50))

The primary key of each table is as follows (each PK is setup as a CLUSTERED INDEX):

Client: ClientId
Project: ClientId, ProjectId
Project_Task: ClientId, ProjectId, TaskId

I am designing this for a multi-tenant SaaS application, and I want to design for the fact that these tables may become very large.  Right now, the ClientId in the Client table, the ProjectId in the Project table, and the TaskId in the Project_Task table are IDENTITY fields (auto-generated integers).  This makes it easy to insert new records into these tables (id assigned by the database), but it also makes the ProjectId and the TaskId unique across all clients, which is unnecessary.  Example:

Row in Client table: ClientId=1, Name=ABC
Row in Client table: ClientId=2, Name=XYZ
Row in Project table: ClientId=1, ProjectId=1, Name=ABC Project
Row in Project table: ClientId=2, ProjectId=2, Name=XYZ Project
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=ABC Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=ABC Task 2
Row in Project_Task table: ClientId=1, ProjectId=2, TaskId=3, Name=XYZ Task 1
Row in Project_Task table: ClientId=1, ProjectId=2, TaskId=4, Name=XYZ Task 2

I am concerned that the TaskId may exceed the limits of the INT using this approach.  What I would really like is the ProjectId and the TaskId to be unique within each client and each project.  Example:

Row in Client table: ClientId=1, Name=ABC
Row in Client table: ClientId=2, Name=XYZ
Row in Project table: ClientId=1, ProjectId=1, Name=ABC Project
Row in Project table: ClientId=2, ProjectId=1, Name=XYZ Project
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=ABC Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=ABC Task 2
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=1, Name=XYZ Task 1
Row in Project_Task table: ClientId=1, ProjectId=1, TaskId=2, Name=XYZ Task 2

The only way I can think of achieving this is to manually assign the ProjectId and TaskId by querying the table for the MAX integer for that Client (or Project) and assign the next highest integer (MAX+1).  Then, I could make the TaskId a SMALLINT instead of an INT (there is no way a project will have more than 32,000 tasks assigned to it).

Is the MAX+1 option the only way to do this, or is there a better way?  If I go with the MAX+1 option, what disadvantages/concerns would there be?  What precautions would you have to take to make sure that MAX+1 is always unique in a multi-user environment?

FYI: I am torn between an alternative design where the primary key would be just a single identity column (i.e. integer or possibly GUID surrogate key), and then I would create a UNIQUE CONSTRAINT on my original primary key (i.e. Project Table: ClientId,ProjectId).  I would also then create the CLUSTERED INDEX on the UNIQUE CONSTRAINT, rather than the primary key.  If I do this, I am assuming it would make joins easier (only need to join on one field), but now I am maintaining two indexes on every table instead of one.  My major concern is performance, both on querying data as well as on inserts.  Opinions on this alternative approach would be greatly appreciated, especially using an integer vs. GUID for the surrogate key.  (I was thinking of using a GUID for external access to objects via Web Services).
ASKER CERTIFIED SOLUTION
Avatar of JestersGrind
JestersGrind
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of ardolino
ardolino

ASKER

Ok, so if I use this approach, where would I put the CLUSTERED INDEX, on the primary key, or create a separate UNIQUE CONSTRAINT?

What I am thinking is that the Project Task table will almost always be queried by Client (i.e. identify all tasks for all projects for a particular client).  In your design, I will always have to use a join between the Project Task table and the Project table to perform this query.  In my design, the ClientId is already a field on the Project Task table.  Which way is better, and why?
The clustered indexes should go on the primary keys.  A non-clustered index should go on the foreign key columns.  I don't think a unique constraint is necessary because you can only have one client per project and one project per task.  The reason for designing it that way is to ensure data integrity.  As far as performance when you're joining indexed integer columns, performance should be fast.  The issue with storing ClientID in more than one location is that you introduce potential data anomalies.  It's a data integrity issue.  With the same data being stored in two location, there is a possibility that they could be different.  

Greg


Ok, I see your point.  Makes a lot of sense.

One last question - I know its redundant, but would there be any benefit in including the ClientId in the primary key in a multi-tenant environment (SaaS application)?  What I am thinking is that if the ClientId is part of the primary key and it is a CLUSTERED INDEX, and since every query will have the ClientId in it anyway, would it help in terms of performance (i.e. all Client data is stored together instead of throughout the table)?  My understanding is that a CLUSTERED INDEX determines the order in which data is stored in the database.  And, if the ClientId always part of the primary key, referential integrity will be enforced (preventing data anomolies).  What do you think?
I don't see any benefit in it.  Yes, it's an extra join to get the client information, but we're talking about (I assume) a relatively small table.  Even if there are thousands of clients, that's small as far as table sizes go.  The join between the clients table and the table with the ClientID as a foreign key is minuscule.  You could even avoid that join by caching the ClientID and then passing it to each stored procedure that needs it.

Greg


Sorry, I did not clarify.  My concern is not so much with the Client table, but other tables in a multi-tenant database.  I did not mention this before, but:

Project has a one-to-many relationship with a System.
System has a one-to-many relationship with an Object.
Object has a one-to-many relationship with a Property.

If I follow your recommendation, then I would assume these tables would look like this:

Project (PK ProjectId, FK ClientId)
System (PK SystemId, FK ProjectId)
Object (PK ObjectId, FK SystemId)
Property (PK PropertyId, FK ObjectId)

CLUSTERED INDEX on PK, non-CLUSTERED INDEX on FK for all tables

So now lets say I have a 1000 clients (1000 records), each having a 100 projects (100,000 records), each project having 2 systems (200,000 records), each system having 10 objects (2,000,000 records), and each object having 50 properties (100,000,000 records).  These numbers are certainly possible with 1000 clients.

Here's my concern (which I also did not mention before, so I apologize).  I don't want the user to execute a query for any item that does not belong to that client.  So I was thinking of adding the ClientId to every query.  If the ClientId is on the Project table and I'm querying the Property table, I'd have to join 4 tables.  Given the number of possible records, will this become a performance issue?
No, I don't think performance should be an issue.  When you are joining on clustered index, integer, primary key fields, queries should return data very fast, even with millions of records.  We have tables like that where I work and performance has never been an issue.

Greg


Thank you so much for all of your help.
Thank you for all of your help.