Solved

SQL Server 2008 Index Optimization

Posted on 2011-09-03
443 Views
Last Modified: 2012-08-14
I am preparing for 70-432 and took a practice exam.  One of the questions (paraphrased) is:

CREATE TABLE Invoice_details (
Invoice_id int NOT NULL,
Customer_id int NOT NULL,
Invoice_date datetime DEFAULT GETDATE() NULL,
Amount_total int NULL,
Serial_num int IDENTITY (1,1) NOT NULL)

A clustered index exists on Serial_num.  A nonclustered index exists on Invoice_id/Customer_id (composite index).  Assuming this query:

SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

What best ensures optimal performance?

A) Create a clustered index on Customer_id
B) Create a nonclustered index on Customer_id
C) Create a nonclustered index on Invoice_date
D) Alter the nonclustered index to include Invoice_date 
E) Alter the nonclustered index to remove Customer_id
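For reference, option (D) amounts to rebuilding the existing composite index with an INCLUDE clause. A sketch in T-SQL, assuming the index is named ix_invoice_customer (the question does not name it):

```sql
-- Hypothetical name for the existing composite index; the question does not give one.
-- DROP_EXISTING = ON rebuilds the index in place with the new definition.
CREATE NONCLUSTERED INDEX ix_invoice_customer
    ON Invoice_details (Invoice_id, Customer_id)
    INCLUDE (Invoice_date)
    WITH (DROP_EXISTING = ON);
```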



I selected option (B) and got it wrong - the test claims option (D) was the correct answer.  I was unconvinced, so I set up a little test suite.  I recreated the table from the question, and then ran these statements:

create nonclustered index ix_invoice_customer on Invoice_details (Invoice_id,Customer_id)
create nonclustered index ix_invoice_customer_inc on Invoice_details (Invoice_id,Customer_id) include (Invoice_date)
create nonclustered index ix_customer_invoice on Invoice_details (Customer_id,Invoice_id)
create nonclustered index ix_customer_invoice_inc on Invoice_details (Customer_id,Invoice_id) include (Invoice_date)
create nonclustered index ix_customer on Invoice_details (Customer_id)
create nonclustered index ix_customer_inc on Invoice_details (Customer_id) include (Invoice_id,Invoice_date)
create nonclustered index ix_customer_inc1 on Invoice_details (Customer_id) include (Invoice_date)
go
DECLARE @R1 INT;
DECLARE @R2 INT;
DECLARE @R3 INT;
DECLARE @R4 INT;
DECLARE @x INT;
SET @R1=(9000*RAND())+1000;
SET @R2=(9000*RAND())+1000;
SET @R3=(9000*RAND())+1000;
SET @R4=(9000*RAND())+1000;
SET @x=1;
WHILE (@x<100000)
BEGIN
INSERT INTO Invoice_details (Invoice_id,Customer_id,Invoice_date,Amount_total) VALUES (@R1,@R2,DATEADD(second,@R3,GETDATE()),@R4)
SET @x=@x+1;
END
go



Next, I popped the following into a new query window and hit execute:

SET SHOWPLAN_ALL ON
GO

--query 1 (original index)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_invoice_customer))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

--query 2 (original, plus includes relevant field)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_invoice_customer_inc))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

--query 3 (reversed field order)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_customer_invoice))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

--query 4 (reversed field order, plus includes relevant field)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_customer_invoice_inc))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

--query 5 (single-field index)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_customer))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

--query 6 (single-field index, plus includes all relevant fields)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_customer_inc))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;

--query 7 (single-field index, plus includes one relevant field)
SELECT Customer_id, Invoice_id, Invoice_date 
FROM Invoice_details with (index(ix_customer_inc1))
WHERE Customer_id=1234
ORDER BY Invoice_date DESC;



The results (attached as an xls file) were surprising.  The rankings were pretty much as I expected, with my chosen solution being the 3rd best optimization possible overall, and the best optimization possible from the choices available.  I was a little surprised by how much better it ran, but whatever...I can live with trouncing the alternative.  :)  The truly surprising thing was the difference between the original and the "correct" solution - the original performed better, despite requiring a key lookup.

I have looked at this up, down, and sideways, and I cannot seem to wrap my head around it.  MSSQL reports that an index scan plus a clustered key lookup is less expensive than a single scan of an index that includes the needed column.  Can anyone shed some light on this?



optimize-test.xls
Question by:Steve Bink
10 Comments
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 20 total points
Haven't you just inserted 100,000 rows with the same data values?

I'm not clear how you can "optimally" select the best solution without knowing anything about the distribution statistics for the underlying data...

D) is probably the change with the least impact to your system, in that it basically adds the missing data item for the query without altering the basic system objects available (much), and it probably has the minimal storage impact.

I'm also not clear what your index hint forcing index usage actually does to the generated plan, as opposed to a plan generated internally with regard to just the available possibilities...

The best way to check for optimal indexing is to add/alter one index at a time and test each independently, then compare the tests later (making sure to clear buffers etc. between tests).
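On a test box, for example, the data and plan caches can be cleared between runs with something like:

```sql
-- Clear cached data pages and compiled plans between test runs.
-- Do NOT run these on a production server.
CHECKPOINT;                -- flush dirty pages so DROPCLEANBUFFERS empties everything
DBCC DROPCLEANBUFFERS;     -- empty the buffer pool, forcing cold-cache reads
DBCC FREEPROCCACHE;        -- discard all cached execution plans
GO
```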
 
LVL 50

Author Comment

by:Steve Bink
>>> haven't you just inserted 100,000 rows with the same data values?

My apologies, I copied the wrong query window.  :/  The actual process sets the RAND() values inside the loop:

DECLARE @R1 INT;
DECLARE @R2 INT;
DECLARE @R3 INT;
DECLARE @R4 INT;
DECLARE @x INT;
SET @x=1;
WHILE (@x<100000)
BEGIN
SET @R1=(9000*RAND())+1000;
SET @R2=(9000*RAND())+1000;
SET @R3=(9000*RAND())+1000;
SET @R4=(9000*RAND())+1000;
INSERT INTO Invoice_details (Invoice_id,Customer_id,Invoice_date,Amount_total) VALUES (@R1,@R2,DATEADD(second,@R3,GETDATE()),@R4)
SET @x=@x+1;
END
go



>>> i'm not clear how you can "optimally" select the best solution without knowing anything about the

The question was concerned only about the optimization of the query.  It did not mention any other statistical points (such as drive space).  I have learned over time that the questions on these certifications usually deal with scenarios where more than one option is valid, and some small piece of information in the question guides you to where they want you to go.  In this case, the question was phrased to be concerned solely with query performance.  Without an explicit concern for the disk space involved, I believe creating a new nonclustered index is the best optimization available - a belief borne out by the test results.  Otherwise, I agree with your assessment in the real world, but this is a hypothetical constructed solely for the purpose of the test question.  For the record, my ideal optimization would have been to reverse the field order on the existing nonclustered index, and include Invoice_date.

Given that, the distribution should not play much of a role, if any, on the performance difference between query 1 and query 2.  Both are using an index scan to identify the same number of rows.  The difference between them is down to one using a second table lookup based on the clustered key versus the other grabbing included data at the leaf it has already found.  Logic dictates that the additional disk access *should* decrease the performance of the key-based lookup, but the results show exactly the opposite.  That is the mystery I'm trying to address.

You do have a point regarding clearing the buffer, and allowing the internal optimizer to make a decision.  I did neither of those.  I will rerun the tests again once I have a few spare moments and post those results as well.
0
 
LVL 142

Assisted Solution

by:Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3] earned 30 total points
well:
A clustered index exists on Serial_num.  

this is already a bad choice. The clustered index should not be on that field merely because it is the primary key.
Better would be a clustered index on Customer_id, as that is the field where you usually want fast access for the interactive (user interface) queries: get all invoices per customer.
While this points to option A, that answer alone is wrong, because it would need to read: drop (replace) the clustered index with a nonclustered index, and then create the new clustered index - you cannot have 2 clustered indexes on 1 table.
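A sketch of that replacement, assuming the clustered index on Serial_num is a plain index named ix_serial (a hypothetical name; the question does not give one):

```sql
-- Hypothetical index names; the question does not specify them.
-- If the clustered index backs a PRIMARY KEY constraint, the constraint
-- would have to be dropped via ALTER TABLE instead.
DROP INDEX ix_serial ON Invoice_details;

-- Keep fast unique access to Serial_num, but as a nonclustered index.
CREATE UNIQUE NONCLUSTERED INDEX ix_serial
    ON Invoice_details (Serial_num);

-- Cluster the table on Customer_id instead (non-unique is allowed;
-- SQL Server adds an internal uniquifier to duplicate keys).
CREATE CLUSTERED INDEX ix_customer_cl
    ON Invoice_details (Customer_id);
```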
 

Other nonclustered indexes would eventually go on Invoice_id and Invoice_date, and of course one on Serial_num.
Those indexes might need to have other columns included to become "covering" indexes, but that depends on the queries you run against the table.

For the exam prep question: you are right, your choice would be best. This is one of those questions that, knowing the exams, is either simply wrong or a case of "you have to answer by the book, and not by practice/real-life experience."
 
LVL 42

Assisted Solution

by:dqmq
dqmq earned 450 total points
Answer D) is correct; the question is wrong :>).  

The question is trying to touch on several (perhaps too many) concepts in one fell swoop: clustered/non-clustered indexes, covering indexes, multi-column indexes, etc.  As such--and with indexing strategies in general--the subtleties are EVERYTHING.  I expect something got lost in translation, either by you or by whoever originally "remembered" the test question.  For example, the question does not give the precedence of the two columns in the composite nonclustered index, which makes a big difference.  Nor does the question offer the data distribution or statistics information, also huge.  For example, if the table is small or customer_id is predominant, then "D" may very well be best.

In essence, the question really does not have enough to go on.

Given the question, as written, I would have answered "B".  But I sort of remember a similar question and my answer at the time was "D".  Back to lost in translation.
 

>For the record, my ideal optimization would have been to reverse the field order on the existing nonclustered index, and include Invoice_date.

Since we're at liberty to invent the ideal answer for the record, what would you think about changing the clustered index to non-clustered and recreating another clustered index on customer_id?
   
 
LVL 50

Author Comment

by:Steve Bink
>>> In essence, the question really does not have enough to go on.

Most definitely.  Optimizing indexes depends heavily on the common queries in the workload, and that point is not addressed at all.  In this scenario, absent other information, it is very likely that using a clustered index on Customer_id is going to be a better optimization.  At least, it seems that way given the common queries I think I'd be running.  I've read a lot of material advocating clustered PK identity fields, but I'm not willing to accept that position blindly.  It certainly makes sense in the context of join optimization (a clustered unique will be faster) and insert optimization (urgh..page split bad), but then we're back to "more information".  So again, everyone here is correct when pointing out the dearth of information in the question.  Without knowing more of the particulars, we are just spitting upwind trying to guess what might be best.

But as much as I'm enjoying the discussion (and I am...no snarkiness there), the primary question I'm posing is getting left behind.  Why would query 1 perform better than query 2?  In theory, including a field on the leaf should show better performance than looking up that same field from the clustered key.  The disk access to read it has already been done via the covering index, and receiving it from memory should be faster than receiving it from a second disk read to the clustered leaf.  I can also see where slight modifications to the question would have led me to answer "D" originally, but the results are contrary to the expectations.  


 
LVL 42

Expert Comment

by:dqmq
Because the query 2 index is twice as big as the query 1 index.  The additional time required for query 1 to access the clustered leaf is trivial compared to the additional time required by query 2 to scan the larger index.  
 
LVL 50

Author Comment

by:Steve Bink
But hasn't the larger index already been scanned and read into memory?  Isn't that the point of a covering index?
 
LVL 50

Author Comment

by:Steve Bink
I wanted to explain my question further.

In the execution plan, SQL Server is going to read the nonclustered index (an index scan) to find a match to the query's predicate, Customer_id=1234.  When it finds a matching index entry, that leaf is read into memory.  The leaf node will include all of the index columns, plus the clustered key acting as a row locator pointing to a clustered index entry.  On the normal index (query 1), the leaf node should contain Invoice_id and Customer_id.  On the covering index (query 2), the leaf node should contain Invoice_date in addition to the key fields.  

So when the query executes, both queries initially execute a scan of the nonclustered index.  Both of these scans are going to return the same index rows.  At this point in the query execution, query 1 *must* execute a clustered lookup to get Invoice_date from the table data, but query 2 should already have the information ready and available from the leaf node.  Yet MSSQL still rates query 1, key lookup and all, as the cheaper plan.  Why is the plan that executes another lookup beating the one that should already have the data?
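For what it's worth, one way to make the comparison concrete would be to look at actual page reads rather than estimated costs, using the same index hints from the test suite:

```sql
-- Compare actual logical reads for the two hinted plans.
SET STATISTICS IO ON;

SELECT Customer_id, Invoice_id, Invoice_date
FROM Invoice_details WITH (INDEX(ix_invoice_customer))      -- key-lookup plan
WHERE Customer_id = 1234
ORDER BY Invoice_date DESC;

SELECT Customer_id, Invoice_id, Invoice_date
FROM Invoice_details WITH (INDEX(ix_invoice_customer_inc))  -- covering-index plan
WHERE Customer_id = 1234
ORDER BY Invoice_date DESC;

SET STATISTICS IO OFF;
-- The Messages tab then reports logical reads for each query.
```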
 
LVL 42

Accepted Solution

by:dqmq
dqmq earned 450 total points
>So when the query executes, both queries initially execute a scan of the nonclustered index.

Yes - and Q2's index, being twice as big, takes nearly twice as long to scan.  

>At this point in the query execution, query 1 *must* execute a clustered lookup to get Invoice_date from the table data, but query 2 should already have the information ready and available from the leaf node.

That's correct.  And that's exactly what I see in your Explain Plan:

Q1:
SELECT
  |--Sort
       |--Nested Loops
            |--Index Scan                     (io=.14)
            |--Clustered Index Seek     (io=.003)

Q2:
 SELECT
  |--Sort                        
       |--Index Scan                            (io=.24)


The i/o cost of Q2's additional scan time (.24 - .14 = .10) is about 33 times the i/o cost of Q1's additional seek (.003).  


The larger the table gets, the better Q1 will perform compared to Q2.  Delete most of the rows, and Q2 will perform better than Q1.  
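To see the size difference directly, the per-index page counts can be pulled from the DMVs with something like:

```sql
-- Page and size counts for every index on the test table.
-- used_page_count * 8 gives KB (8 KB pages).
SELECT i.name AS index_name,
       ps.used_page_count,
       ps.used_page_count * 8 AS used_kb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('Invoice_details');
```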

 
LVL 50

Author Comment

by:Steve Bink
Ah, I see now.  That makes perfect sense.  I was thinking of the included field as an increase in disk space, but did not really associate that with a longer scan time.  I should have seen it in the statistics as well.

Thanks ever so much for your patience, and the nice discussion on optimization.  70-432 is on Thursday...I hope they don't have this question on it.  :)

