• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 403

How do I figure out the most efficient SQL query for a table with a clustered composite primary key?


I'm taking a class in SQL and am a relative newbie at constructing high-performing SQL queries against large data sets. I'm currently looking at a problem and am having difficulty figuring out the best approach/query.

Consider the following:

There is a table defined as:

CREATE TABLE [dbo].[Foo] (
      [load_id]           [int]            NOT NULL,
      [master_id]         [varchar](20)    NOT NULL,
      [group_id]          [varchar](20)    NOT NULL,
      [section_id]        [varchar](20)    NOT NULL,
      [datetime_stamp]    [datetime]       NULL,
      [... other data columns ...],
      PRIMARY KEY CLUSTERED
(     [load_id] ASC,
      [master_id] ASC,
      [group_id] ASC,
      [section_id] ASC
)
)

This is a big table with N million records.
The table gets appended with new data multiple times a day.

Every time new records get appended to the table, approximately 32K records are added.
All 32K records have the same load_id.

The load_id is incremented by one each time we load a batch of 32K records
(the first 32K entries have load_id=1, the next 32K have load_id=2, etc.).

The datetime_stamp field shows the time at which the entries
were loaded and is the same for all 32K entries in a single load.  

Q: What's the most efficient way to return the first load of Foos for the current day?

For Example:
Today, records were loaded into this table at 9am, 12pm and 3pm.
At 5pm today we want to know which Foos were loaded at 9am, since that is the first load that occurred today.
For any given day, there can be a different number of loads, and the times at which the loads occur will vary.

So what I think I want to do here is query for the requested day and return all the records from the first load (the lowest load_id) for that day. I think the key to a good solution lies in the order in which the conditions appear in the query's WHERE clause.
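Just to make my thinking concrete, here is a rough sketch of the shape of query I have in mind (using the table as defined above; the @today variable is just for illustration):

DECLARE @today datetime
SET @today = DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0)  -- midnight today

SELECT *
FROM dbo.Foo
WHERE load_id = (SELECT MIN(load_id)
                 FROM dbo.Foo
                 WHERE datetime_stamp >= @today)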

I'm a little confused about the effects of a clustered composite key and how it would affect the performance of the query.
datetime_stamp is not indexed, so I believe the query will probably require a full table scan (?).
If it weren't a clustered index, I'd probably search on load_id first, followed by the date.
There's something I'm missing here. I've been scratching my head for a while.

What is the most efficient query here?

Thanks in advance,
2 Solutions
Ephraim Wangoya Commented:
If you are going to search on the datetime_stamp field as well, then you had better index it.

The query part should be simple enough

select *
from Foo
where load_id = xxxx
  and datetime_stamp >= yyyyy
  and datetime_stamp < zzzzz
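A minimal sketch of the index being suggested (the index name IX_Foo_datetime_stamp is just a placeholder):

CREATE NONCLUSTERED INDEX IX_Foo_datetime_stamp
    ON dbo.Foo (datetime_stamp)

Keep in mind that, as noted below, every row of a nonclustered index also carries the clustered key, so the wide composite PK makes this index larger than it strictly needs to be.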
Anuj (SQL Server DBA) Commented:
A few suggestions:

Your clustered index key is a little wide, around 64 bytes, so every non-clustered index key carries this 64-byte clustered key in addition to its own key columns. For example, if you create an index on the Datetime_Stamp column, each index row will be 8 bytes (datetime size) + 64 bytes = 72 bytes, which works out to roughly 112 rows per index page; for 1,000,000 rows that is about 8,929 pages. If the clustered index were instead a narrow key (a 4-byte integer), each index row would be 12 bytes, or about 671 rows per page, and a million records would fit in roughly 1,491 pages. A query doing a range scan or index scan then needs to touch only ~1,491 pages instead of ~8,929 — this is the advantage of using a small key as the clustered index.
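If you would rather measure than estimate, here is a sketch of how to check the actual record size and page count per index (it assumes the table is dbo.Foo and that you have permission to query the DMV):

SELECT  i.name                        AS index_name,
        ps.index_type_desc,
        ps.avg_record_size_in_bytes,
        ps.page_count
FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Foo'), NULL, NULL, 'DETAILED') AS ps
JOIN    sys.indexes AS i
        ON i.object_id = ps.object_id
       AND i.index_id  = ps.index_id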

If possible, convert this composite key to a UNIQUE index and add a new identity column as a surrogate primary key to serve as the clustered index.

To improve the query on datetime_stamp, create a non-clustered index on it.
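A rough sketch of that surrogate-key restructuring (the foo_id column and the constraint/index names are placeholders, and the existing wide clustered PK would have to be dropped first):

-- new narrow surrogate key
ALTER TABLE dbo.Foo ADD foo_id int IDENTITY(1, 1) NOT NULL

-- after dropping the existing clustered PK:
ALTER TABLE dbo.Foo ADD CONSTRAINT PK_Foo_foo_id PRIMARY KEY CLUSTERED (foo_id)

-- keep the old composite key, but as a unique non-clustered index
CREATE UNIQUE NONCLUSTERED INDEX UX_Foo_load_master_group_section
    ON dbo.Foo (load_id, master_id, group_id, section_id)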
You need to create a second table that contains one row per load:

CREATE TABLE [LoadMasterFoo](
      [datetime_stamp]    [datetime]    NOT NULL,
      [load_id]           [int]         NOT NULL
)

-- the index name is illustrative
CREATE UNIQUE CLUSTERED INDEX [UX_LoadMasterFoo]
    ON [LoadMasterFoo] ([datetime_stamp], [load_id])

This allows you to quickly retrieve the load_id; then you use that unique load_id to get to your main table. SQL Server is not going to like an index with 30 thousand identical keys. This technique would be much more efficient, and it would also allow you to add data unique to the batch when the users discover that they have additional data that JUST HAS to be retained at the load level....
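A minimal sketch of how the load process could populate that table once per batch (the @new_load_id / @load_time variables are placeholders for whatever the load job already uses):

-- one row per 32K-row batch, written as part of the load
INSERT INTO LoadMasterFoo (datetime_stamp, load_id)
VALUES (@load_time, @new_load_id)

-- or, derived from the main table right after the batch insert:
INSERT INTO LoadMasterFoo (datetime_stamp, load_id)
SELECT DISTINCT datetime_stamp, load_id
FROM dbo.Foo
WHERE load_id = @new_load_id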

? - Why is your datetime_stamp column nullable?

An alternative to building this second table in the load process is to create a view against the main table.

Define an index against the view and persist it (an indexed view). That causes SQL Server to physically build an underlying structure to support the index and maintain it automatically as the source table changes.
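A minimal sketch of that indexed-view approach, assuming the main table is dbo.Foo and that datetime_stamp is (or has been made) NOT NULL; the view and index names are placeholders, and SQL Server requires SCHEMABINDING plus COUNT_BIG(*) when the view uses GROUP BY:

CREATE VIEW dbo.vw_FooLoads
WITH SCHEMABINDING
AS
SELECT  load_id,
        datetime_stamp,
        COUNT_BIG(*) AS row_cnt   -- required for an indexed view with GROUP BY
FROM    dbo.Foo
GROUP BY load_id, datetime_stamp
GO

CREATE UNIQUE CLUSTERED INDEX UX_vw_FooLoads
    ON dbo.vw_FooLoads (datetime_stamp, load_id)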

But I think creating the master table in the load process is more straightforward, and unless the process were already completed and the application team resistant to any changes, I probably would not push the persisted-index-on-the-view concept. I'd keep it as a last resort if I saw queries underperforming in production and felt I had no other recourse.
Scott Pletcher (Senior DBA) Commented:
If you intend to use datetime_stamp to query against this table, then cluster by datetime_stamp.

The table should be clustered by your prime search criteria, esp. if the search will often be a range.

You can cluster without having to make it a PK. Or, add an identity column to the ***END*** of the key to ensure uniqueness; do ***NOT*** use an identity column by itself as the clustering key.

-- clustered index key (an identity column added at the end only to guarantee uniqueness):
(     [datetime_stamp] ASC,
      [ident] ASC
)

Change datetime_stamp to a smalldatetime if practical and you need to reduce the key size.

-- to find the *FIRST* load of a day, use code like below;
-- code will be extremely fast, since table is clus by datetime_stamp;

DECLARE @date datetime
SET @date = GETDATE()

SELECT *
FROM dbo.Foo
WHERE datetime_stamp = (SELECT MIN(datetime_stamp)
                        FROM dbo.Foo
                        WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
Scott Pletcher (Senior DBA) Commented:
>> Q: What's the most efficient way to return the first load of Foos for the current day? <<

If you have to answer that with the existing PK, then you need to take SJCFL's approach, creating a second table, containing only one row per (datetime_stamp, load_id) and clustered by datetime_stamp (you do not need the load_id in the key).

It looks like the current table was designed with the idea of using this second table to do the lookup.

>> SQL Server is not going to like an index with 30 thousand identical keys. <<

SQL won't care; it's just extremely inefficient. There's no reason to store 32K copies of identical data.
Scott Pletcher (Senior DBA) Commented:
-- with the second table, use code like below to find the first load of a given day

DECLARE @date datetime
SET @date = GETDATE()

SELECT *
FROM dbo.Foo
WHERE load_id = (SELECT MIN(load_id) FROM dbo.Foo2   -- Foo2 = the second, load-level table
                 WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
jxbma (Software Consultant, Author) Commented:
Awesome feedback folks!

Thanks so much,
Question has a verified solution.
