How do I figure out the most efficient SQL query for a table with a clustered composite primary key?

Hi:

I'm taking a class in SQL and am a relative newbie in the area of constructing high performing SQL queries against large data sets. I'm currently looking at a problem that I'm having difficulty figuring out the best approach/query.

Consider the following:

There is a table defined as:
======================
CREATE TABLE [Foo](
      [load_id]                         [int]             NOT NULL,
      [master_id]       [varchar](20)       NOT NULL,
      [group_id]       [varchar](20)       NOT NULL,
      [section_id]       [varchar](20)       NOT NULL,
      [datetime_stamp]    [datetime]       NULL
      [... other data columns ...]
)
 CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(      [load_id] ASC,
      [master_cd] ASC,
      [group_id] ASC,
      [section_id] ASC
)


This is a big table with N million records
The table gets appended with new data multiple times a day.

Every time new records get appended to the table, approximately 32K records are added.
All 32K records have the same load_id.

The load_id is incremented by one each time we load a batch of 32K records
(the 1st 32K entries have load_id=1, the next 32K has load_id=2, etc...).

The datetime_stamp field shows the time at which the entries
were loaded and is the same for all 32K entries in a single load.  

==========================================================================
==========================================================================
Q::What's the most efficient way to return the 1st of Foos for the current day?
==========================================================================
==========================================================================


For Example:
===========
Today, records were loaded into this table at 9am, 12pm and 3pm.  
At 5pm today we want to know what foos were loaded at 9am since that is the first load that occurred today.  
For any given day, there can be different number of loads and the times that the loads occur will vary.


So what I think I want to do here is query for the requested day and return all the records where the load_id == 1. I think the key to a good solution surrounds the order that the items occur in your query's where clause.

I'm a little confused about the effects of a clustered composite key and how that would effect the performance of the query.
Datetime_stamp is not indexed so I believe it will probably perform a full table scan (??)
If it wasn't a clustered index, I'd probably search for load_id then followed by date.
There's something that I'm missing here. I've been scratching my head for a while.

What is the most efficient query here?


Thanks in advance,
jxbma
LVL 1
jxbmaSoftware ConsultantAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Ephraim WangoyaCommented:
If you are going to do a search with the datetime_stamp field as well, then you better index it

The query part should be simple enough

select *
from Foo
where LoadID = xxxx
and datetime_stamp >= yyyyy
and datetime_stamp < zzzzz
AnujSQL Server DBACommented:
Few suggestions,

Your clustered index key little wide, its around 64bytes so your non-clustered index keys will have this 64byte key + the index key size,  example if you have an index on column "Datetime_Stamp", your index size will be 8 bytes(Datetime size) + 64 bytes = 72 bytes/row meaning ~112 rows per index page for 1,000,000 rows there will be 8929 pages. If your clusterd index is not a wide key (integer 4 byte) this will be 12 bytes/row = 671 rows per page! and for million records this will be ~1491 pages! so for queries with range scan or index scan needs to touch only 1491 pages compare to 8929 pages, this is the advantage of using smaller keys as clustered index.

If possible convert this to UNIQUE INDEX and add new Identity column as surrogate primary key as a clustered index.

To improve the query on datetime_stamp, create non-clustered index on it.
SJCFL-AdminCommented:
You need to create a second table that contains
CREATE Table [LoadMasterFoo](
      [datetime_stamp]          [datetime]       NOT NULL
      [load_id]                         [int]             NOT NULL


Create UNIQUE CLUSTERED INDEX on  [LoadMasterFoo](  [datetime_stamp]   ,    [load_id] )....

This allows you to quickly retrieve the load id.  then you use the unique load id to get to your table.  Sql Server is not going to like an index with 30 thousands of identical keys.  this technique would be much more efficient and would allow you to add data unique to the bath when the users discover that they have additional data that JUST HAS to be retained at the load level....

? - Why was your date Nullable?
Active Protection takes the fight to cryptojacking

While there were several headline-grabbing ransomware attacks during in 2017, another big threat started appearing at the same time that didn’t get the same coverage – illicit cryptomining.

SJCFL-AdminCommented:
An alternate to getting this table is to create a view againist the table.

Define an index against the view and make the index persistent.  (That causes Sql Server to physically build an underlying structure to support the index and maintain it automatically as the source table is changed.)

But I think your creating the master table in the load process is more straight forward and unless the process were already completed and the application team resistant to any changes, I probably would not push the persisted index on the view concept.  I'd keep it as a last resort if I saw queries underperforming in production and felt I had no other recourse.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Scott PletcherSenior DBACommented:
If you intend to use datetime to query against this table, then cluster by datetime.

The table should be clustered by your prime search criteria, esp. if the search will often be a range.

You can cluster without having to make it a PK.  Or, add an ident to the ***END*** of they key to insure uniqueness, do ***NOT*** use an ident as the clus key.



CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(     [datetime_stamp] ASC,
      [ident] ASC
)


Change datetime_stamp to a smalldatetime if practical and you need to reduce the key size.



-- to find the *FIRST* load of a day, use code like below;
-- code will be extremely fast, since table is clus by datetime_stamp;

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    datetime_stamp = (SELECT MIN(datetime_stamp) FROM dbo.Foo WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
Scott PletcherSenior DBACommented:
>> Q::What's the most efficient way to return the 1st of Foos for the current day? <<

If you have to answer that with the existing PK, then you need to take SJCFL's approach, creating a second table, containing only one row per (datetime_stamp, load_id) and clustered by datetime_stamp (you do not need the load_id in the key).

It looks like the current table was designed with the idea of using this second table to do the lookup.


>> Sql Server is not going to like an index with 30 thousands of identical keys. <<

Sql won't care, it's just extremely inefficient.  No reason to store 32K copies of identical data.
Scott PletcherSenior DBACommented:
-- with the second table, use code like below to find the first load of a given day

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    load_id = (SELECT MIN(load_id) FROM dbo.Foo2 WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
jxbmaSoftware ConsultantAuthor Commented:
Awesome feedback folks!

Thanks so much,
JohnB
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server

From novice to tech pro — start learning today.