Solved

How do I figure out the most efficient SQL query for a table with a clustered composite primary key?

Posted on 2012-04-04
8
375 Views
Last Modified: 2012-04-13
Hi:

I'm taking a class in SQL and am a relative newbie in the area of constructing high performing SQL queries against large data sets. I'm currently looking at a problem that I'm having difficulty figuring out the best approach/query.

Consider the following:

There is a table defined as:
======================
CREATE TABLE [Foo](
      [load_id]                         [int]             NOT NULL,
      [master_id]       [varchar](20)       NOT NULL,
      [group_id]       [varchar](20)       NOT NULL,
      [section_id]       [varchar](20)       NOT NULL,
      [datetime_stamp]    [datetime]       NULL
      [... other data columns ...]
)
 CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(      [load_id] ASC,
      [master_cd] ASC,
      [group_id] ASC,
      [section_id] ASC
)


This is a big table with N million records
The table gets appended with new data multiple times a day.

Every time new records get appended to the table, approximately 32K records are added.
All 32K records have the same load_id.

The load_id is incremented by one each time we load a batch of 32K records
(the 1st 32K entries have load_id=1, the next 32K has load_id=2, etc...).

The datetime_stamp field shows the time at which the entries
were loaded and is the same for all 32K entries in a single load.  

==========================================================================
==========================================================================
Q::What's the most efficient way to return the 1st of Foos for the current day?
==========================================================================
==========================================================================


For Example:
===========
Today, records were loaded into this table at 9am, 12pm and 3pm.  
At 5pm today we want to know what foos were loaded at 9am since that is the first load that occurred today.  
For any given day, there can be different number of loads and the times that the loads occur will vary.


So what I think I want to do here is query for the requested day and return all the records where the load_id == 1. I think the key to a good solution surrounds the order that the items occur in your query's where clause.

I'm a little confused about the effects of a clustered composite key and how that would effect the performance of the query.
Datetime_stamp is not indexed so I believe it will probably perform a full table scan (??)
If it wasn't a clustered index, I'd probably search for load_id then followed by date.
There's something that I'm missing here. I've been scratching my head for a while.

What is the most efficient query here?


Thanks in advance,
jxbma
0
Comment
Question by:jxbma
8 Comments
 
LVL 32

Expert Comment

by:ewangoya
ID: 37809697
If you are going to do a search with the datetime_stamp field as well, then you better index it

The query part should be simple enough

select *
from Foo
where LoadID = xxxx
and datetime_stamp >= yyyyy
and datetime_stamp < zzzzz
0
 
LVL 15

Assisted Solution

by:Anuj
Anuj earned 250 total points
ID: 37809744
Few suggestions,

Your clustered index key little wide, its around 64bytes so your non-clustered index keys will have this 64byte key + the index key size,  example if you have an index on column "Datetime_Stamp", your index size will be 8 bytes(Datetime size) + 64 bytes = 72 bytes/row meaning ~112 rows per index page for 1,000,000 rows there will be 8929 pages. If your clusterd index is not a wide key (integer 4 byte) this will be 12 bytes/row = 671 rows per page! and for million records this will be ~1491 pages! so for queries with range scan or index scan needs to touch only 1491 pages compare to 8929 pages, this is the advantage of using smaller keys as clustered index.

If possible convert this to UNIQUE INDEX and add new Identity column as surrogate primary key as a clustered index.

To improve the query on datetime_stamp, create non-clustered index on it.
0
 
LVL 6

Expert Comment

by:SJCFL-Admin
ID: 37811410
You need to create a second table that contains
CREATE Table [LoadMasterFoo](
      [datetime_stamp]          [datetime]       NOT NULL
      [load_id]                         [int]             NOT NULL


Create UNIQUE CLUSTERED INDEX on  [LoadMasterFoo](  [datetime_stamp]   ,    [load_id] )....

This allows you to quickly retrieve the load id.  then you use the unique load id to get to your table.  Sql Server is not going to like an index with 30 thousands of identical keys.  this technique would be much more efficient and would allow you to add data unique to the bath when the users discover that they have additional data that JUST HAS to be retained at the load level....

? - Why was your date Nullable?
0
U.S. Department of Agriculture and Acronis Access

With the new era of mobile computing, smartphones and tablets, wireless communications and cloud services, the USDA sought to take advantage of a mobilized workforce and the blurring lines between personal and corporate computing resources.

 
LVL 6

Accepted Solution

by:
SJCFL-Admin earned 250 total points
ID: 37811548
An alternate to getting this table is to create a view againist the table.

Define an index against the view and make the index persistent.  (That causes Sql Server to physically build an underlying structure to support the index and maintain it automatically as the source table is changed.)

But I think your creating the master table in the load process is more straight forward and unless the process were already completed and the application team resistant to any changes, I probably would not push the persisted index on the view concept.  I'd keep it as a last resort if I saw queries underperforming in production and felt I had no other recourse.
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 37816398
If you intend to use datetime to query against this table, then cluster by datetime.

The table should be clustered by your prime search criteria, esp. if the search will often be a range.

You can cluster without having to make it a PK.  Or, add an ident to the ***END*** of they key to insure uniqueness, do ***NOT*** use an ident as the clus key.



CONSTRAINT [pk_Foo] PRIMARY KEY CLUSTERED
(     [datetime_stamp] ASC,
      [ident] ASC
)


Change datetime_stamp to a smalldatetime if practical and you need to reduce the key size.



-- to find the *FIRST* load of a day, use code like below;
-- code will be extremely fast, since table is clus by datetime_stamp;

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    datetime_stamp = (SELECT MIN(datetime_stamp) FROM dbo.Foo WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 37816431
>> Q::What's the most efficient way to return the 1st of Foos for the current day? <<

If you have to answer that with the existing PK, then you need to take SJCFL's approach, creating a second table, containing only one row per (datetime_stamp, load_id) and clustered by datetime_stamp (you do not need the load_id in the key).

It looks like the current table was designed with the idea of using this second table to do the lookup.


>> Sql Server is not going to like an index with 30 thousands of identical keys. <<

Sql won't care, it's just extremely inefficient.  No reason to store 32K copies of identical data.
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 37816440
-- with the second table, use code like below to find the first load of a given day

DECLARE @date datetime
SET @date = GETDATE()

SELECT ...
FROM dbo.Foo
WHERE
    load_id = (SELECT MIN(load_id) FROM dbo.Foo2 WHERE datetime_stamp >= DATEADD(DAY, DATEDIFF(DAY, 0, @date), 0))
0
 
LVL 1

Author Closing Comment

by:jxbma
ID: 37842507
Awesome feedback folks!

Thanks so much,
JohnB
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
SQL Insert to Begin if data exists 2 31
SSRS Deployment problem 5 64
SQL R 21 25
partitioning database after decade growth 8 24
In this article I will describe the Copy Database Wizard method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.
Using examples as well as descriptions, and references to Books Online, show the documentation available for datatypes, explain the available data types and show how data can be passed into and out of variables.

839 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question