Solved

How do I figure out the most efficient SQL query for a table with a clustered composite primary key?

Posted on 2012-04-04
8
359 Views
Last Modified: 2012-04-13
Hi:

I'm taking a class in SQL and am a relative newbie in the area of constructing high performing SQL queries against large data sets. I'm currently looking at a problem that I'm having difficulty figuring out the best approach/query.

Consider the following:

There is a table defined as:
======================
CREATE TABLE [Foo](
      [load_id]                         [int]             NOT NULL,
      [master_id]       [varchar](20)       NOT NULL,
      [group_id]       [varchar](20)       NOT NULL,
      [section_id]       [varchar](20)       NOT NULL,
      [datetime_stamp]    [datetime]       NULL
      [... other data columns ...]
)
 CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(      [load_id] ASC,
      [master_cd] ASC,
      [group_id] ASC,
      [section_id] ASC
)


This is a big table with N million records
The table gets appended with new data multiple times a day.

Every time new records get appended to the table, approximately 32K records are added.
All 32K records have the same load_id.

The load_id is incremented by one each time we load a batch of 32K records
(the 1st 32K entries have load_id=1, the next 32K has load_id=2, etc...).

The datetime_stamp field shows the time at which the entries
were loaded and is the same for all 32K entries in a single load.  

==========================================================================
==========================================================================
Q::What's the most efficient way to return the 1st of Foos for the current day?
==========================================================================
==========================================================================


For Example:
===========
Today, records were loaded into this table at 9am, 12pm and 3pm.  
At 5pm today we want to know what foos were loaded at 9am since that is the first load that occurred today.  
For any given day, there can be different number of loads and the times that the loads occur will vary.


So what I think I want to do here is query for the requested day and return all the records where the load_id == 1. I think the key to a good solution surrounds the order that the items occur in your query's where clause.

I'm a little confused about the effects of a clustered composite key and how that would effect the performance of the query.
Datetime_stamp is not indexed so I believe it will probably perform a full table scan (??)
If it wasn't a clustered index, I'd probably search for load_id then followed by date.
There's something that I'm missing here. I've been scratching my head for a while.

What is the most efficient query here?


Thanks in advance,
jxbma
0
Comment
Question by:jxbma
8 Comments
 
LVL 32

Expert Comment

by:ewangoya
ID: 37809697
If you are going to do a search with the datetime_stamp field as well, then you better index it

The query part should be simple enough

select *
from Foo
where LoadID = xxxx
and datetime_stamp >= yyyyy
and datetime_stamp < zzzzz
0
 
LVL 15

Assisted Solution

by:Anuj
Anuj earned 250 total points
ID: 37809744
Few suggestions,

Your clustered index key little wide, its around 64bytes so your non-clustered index keys will have this 64byte key + the index key size,  example if you have an index on column "Datetime_Stamp", your index size will be 8 bytes(Datetime size) + 64 bytes = 72 bytes/row meaning ~112 rows per index page for 1,000,000 rows there will be 8929 pages. If your clusterd index is not a wide key (integer 4 byte) this will be 12 bytes/row = 671 rows per page! and for million records this will be ~1491 pages! so for queries with range scan or index scan needs to touch only 1491 pages compare to 8929 pages, this is the advantage of using smaller keys as clustered index.

If possible convert this to UNIQUE INDEX and add new Identity column as surrogate primary key as a clustered index.

To improve the query on datetime_stamp, create non-clustered index on it.
0
 
LVL 6

Expert Comment

by:SJCFL-Admin
ID: 37811410
You need to create a second table that contains
CREATE Table [LoadMasterFoo](
      [datetime_stamp]          [datetime]       NOT NULL
      [load_id]                         [int]             NOT NULL


Create UNIQUE CLUSTERED INDEX on  [LoadMasterFoo](  [datetime_stamp]   ,    [load_id] )....

This allows you to quickly retrieve the load id.  then you use the unique load id to get to your table.  Sql Server is not going to like an index with 30 thousands of identical keys.  this technique would be much more efficient and would allow you to add data unique to the bath when the users discover that they have additional data that JUST HAS to be retained at the load level....

? - Why was your date Nullable?
0
 
LVL 6

Accepted Solution

by:
SJCFL-Admin earned 250 total points
ID: 37811548
An alternate to getting this table is to create a view againist the table.

Define an index against the view and make the index persistent.  (That causes Sql Server to physically build an underlying structure to support the index and maintain it automatically as the source table is changed.)

But I think your creating the master table in the load process is more straight forward and unless the process were already completed and the application team resistant to any changes, I probably would not push the persisted index on the view concept.  I'd keep it as a last resort if I saw queries underperforming in production and felt I had no other recourse.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 69

Expert Comment

by:ScottPletcher
ID: 37816398
If you intend to use datetime to query against this table, then cluster by datetime.

The table should be clustered by your prime search criteria, esp. if the search will often be a range.

You can cluster without having to make it a PK.  Or, add an ident to the ***END*** of they key to insure uniqueness, do ***NOT*** use an ident as the clus key.



CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(     [datetime_stamp] ASC,
      [ident] ASC
)


Change datetime_stamp to a smalldatetime if practical and you need to reduce the key size.



-- to find the *FIRST* load of a day, use code like below;
-- code will be extremely fast, since table is clus by datetime_stamp;

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    datetime_stamp = (SELECT MIN(datetime_stamp) FROM dbo.Foo WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 37816431
>> Q::What's the most efficient way to return the 1st of Foos for the current day? <<

If you have to answer that with the existing PK, then you need to take SJCFL's approach, creating a second table, containing only one row per (datetime_stamp, load_id) and clustered by datetime_stamp (you do not need the load_id in the key).

It looks like the current table was designed with the idea of using this second table to do the lookup.


>> Sql Server is not going to like an index with 30 thousands of identical keys. <<

Sql won't care, it's just extremely inefficient.  No reason to store 32K copies of identical data.
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 37816440
-- with the second table, use code like below to find the first load of a given day

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    load_id = (SELECT MIN(load_id) FROM dbo.Foo2 WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
0
 
LVL 1

Author Closing Comment

by:jxbma
ID: 37842507
Awesome feedback folks!

Thanks so much,
JohnB
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

If you have heard of RFC822 date formats, they can be quite a challenge in SQL Server. RFC822 is an Internet standard format for email message headers, including all dates within those headers. The RFC822 protocols are available in detail at:   ht…
Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now