Solved

How do I figure out the most efficient SQL query for a table with a clustered composite primary key?

Posted on 2012-04-04
8
383 Views
Last Modified: 2012-04-13
Hi:

I'm taking a class in SQL and am a relative newbie in the area of constructing high performing SQL queries against large data sets. I'm currently looking at a problem that I'm having difficulty figuring out the best approach/query.

Consider the following:

There is a table defined as:
======================
CREATE TABLE [Foo](
      [load_id]                         [int]             NOT NULL,
      [master_id]       [varchar](20)       NOT NULL,
      [group_id]       [varchar](20)       NOT NULL,
      [section_id]       [varchar](20)       NOT NULL,
      [datetime_stamp]    [datetime]       NULL
      [... other data columns ...]
)
 CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(      [load_id] ASC,
      [master_cd] ASC,
      [group_id] ASC,
      [section_id] ASC
)


This is a big table with N million records
The table gets appended with new data multiple times a day.

Every time new records get appended to the table, approximately 32K records are added.
All 32K records have the same load_id.

The load_id is incremented by one each time we load a batch of 32K records
(the 1st 32K entries have load_id=1, the next 32K has load_id=2, etc...).

The datetime_stamp field shows the time at which the entries
were loaded and is the same for all 32K entries in a single load.  

==========================================================================
==========================================================================
Q::What's the most efficient way to return the 1st of Foos for the current day?
==========================================================================
==========================================================================


For Example:
===========
Today, records were loaded into this table at 9am, 12pm and 3pm.  
At 5pm today we want to know what foos were loaded at 9am since that is the first load that occurred today.  
For any given day, there can be different number of loads and the times that the loads occur will vary.


So what I think I want to do here is query for the requested day and return all the records where the load_id == 1. I think the key to a good solution surrounds the order that the items occur in your query's where clause.

I'm a little confused about the effects of a clustered composite key and how that would effect the performance of the query.
Datetime_stamp is not indexed so I believe it will probably perform a full table scan (??)
If it wasn't a clustered index, I'd probably search for load_id then followed by date.
There's something that I'm missing here. I've been scratching my head for a while.

What is the most efficient query here?


Thanks in advance,
jxbma
0
Comment
Question by:jxbma
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
8 Comments
 
LVL 32

Expert Comment

by:Ephraim Wangoya
ID: 37809697
If you are going to do a search with the datetime_stamp field as well, then you better index it

The query part should be simple enough

select *
from Foo
where LoadID = xxxx
and datetime_stamp >= yyyyy
and datetime_stamp < zzzzz
0
 
LVL 15

Assisted Solution

by:Anuj
Anuj earned 250 total points
ID: 37809744
Few suggestions,

Your clustered index key little wide, its around 64bytes so your non-clustered index keys will have this 64byte key + the index key size,  example if you have an index on column "Datetime_Stamp", your index size will be 8 bytes(Datetime size) + 64 bytes = 72 bytes/row meaning ~112 rows per index page for 1,000,000 rows there will be 8929 pages. If your clusterd index is not a wide key (integer 4 byte) this will be 12 bytes/row = 671 rows per page! and for million records this will be ~1491 pages! so for queries with range scan or index scan needs to touch only 1491 pages compare to 8929 pages, this is the advantage of using smaller keys as clustered index.

If possible convert this to UNIQUE INDEX and add new Identity column as surrogate primary key as a clustered index.

To improve the query on datetime_stamp, create non-clustered index on it.
0
 
LVL 6

Expert Comment

by:SJCFL-Admin
ID: 37811410
You need to create a second table that contains
CREATE Table [LoadMasterFoo](
      [datetime_stamp]          [datetime]       NOT NULL
      [load_id]                         [int]             NOT NULL


Create UNIQUE CLUSTERED INDEX on  [LoadMasterFoo](  [datetime_stamp]   ,    [load_id] )....

This allows you to quickly retrieve the load id.  then you use the unique load id to get to your table.  Sql Server is not going to like an index with 30 thousands of identical keys.  this technique would be much more efficient and would allow you to add data unique to the bath when the users discover that they have additional data that JUST HAS to be retained at the load level....

? - Why was your date Nullable?
0
Get 15 Days FREE Full-Featured Trial

Benefit from a mission critical IT monitoring with Monitis Premium or get it FREE for your entry level monitoring needs.
-Over 200,000 users
-More than 300,000 websites monitored
-Used in 197 countries
-Recommended by 98% of users

 
LVL 6

Accepted Solution

by:
SJCFL-Admin earned 250 total points
ID: 37811548
An alternate to getting this table is to create a view againist the table.

Define an index against the view and make the index persistent.  (That causes Sql Server to physically build an underlying structure to support the index and maintain it automatically as the source table is changed.)

But I think your creating the master table in the load process is more straight forward and unless the process were already completed and the application team resistant to any changes, I probably would not push the persisted index on the view concept.  I'd keep it as a last resort if I saw queries underperforming in production and felt I had no other recourse.
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 37816398
If you intend to use datetime to query against this table, then cluster by datetime.

The table should be clustered by your prime search criteria, esp. if the search will often be a range.

You can cluster without having to make it a PK.  Or, add an ident to the ***END*** of they key to insure uniqueness, do ***NOT*** use an ident as the clus key.



CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(     [datetime_stamp] ASC,
      [ident] ASC
)


Change datetime_stamp to a smalldatetime if practical and you need to reduce the key size.



-- to find the *FIRST* load of a day, use code like below;
-- code will be extremely fast, since table is clus by datetime_stamp;

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    datetime_stamp = (SELECT MIN(datetime_stamp) FROM dbo.Foo WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 37816431
>> Q::What's the most efficient way to return the 1st of Foos for the current day? <<

If you have to answer that with the existing PK, then you need to take SJCFL's approach, creating a second table, containing only one row per (datetime_stamp, load_id) and clustered by datetime_stamp (you do not need the load_id in the key).

It looks like the current table was designed with the idea of using this second table to do the lookup.


>> Sql Server is not going to like an index with 30 thousands of identical keys. <<

Sql won't care, it's just extremely inefficient.  No reason to store 32K copies of identical data.
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 37816440
-- with the second table, use code like below to find the first load of a given day

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    load_id = (SELECT MIN(load_id) FROM dbo.Foo2 WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
0
 
LVL 1

Author Closing Comment

by:jxbma
ID: 37842507
Awesome feedback folks!

Thanks so much,
JohnB
0

Featured Post

Get 15 Days FREE Full-Featured Trial

Benefit from a mission critical IT monitoring with Monitis Premium or get it FREE for your entry level monitoring needs.
-Over 200,000 users
-More than 300,000 websites monitored
-Used in 197 countries
-Recommended by 98% of users

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this article we will learn how to fix  “Cannot install SQL Server 2014 Service Pack 2: Unable to install windows installer msi file” error ?
In the first part of this tutorial we will cover the prerequisites for installing SQL Server vNext on Linux.
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

688 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question