Link to home
Start Free TrialLog in
Avatar of gdemaria
gdemariaFlag for United States of America

asked on

Best Practices for Cluster Index

I am creating a clustered index strategy for my reporting tables.

Reports  (report_id)

ReportProducts
    Pk   reportProduct_id
    Fk  report_id

ReportProductOptions
   Pk  reportProductOption_id
   Fk  reportProduct_id
   fk  report_id
         optionCode
         optionName
         lastSold_Date



I am going to create clustered index so that the report_id is always first  in the cluster.   I hope this will prevent the constant deadlocks I am getting.

My question is... if I am going to access my ReportProductOptions table directly to the primary key, do I need to use the report_id in the where clause?

update reportProductOptions
    set optionName = 'Hello'
where report_id = 123
and     reportProductOption_id = 34567


...or....

update reportProductOptions
    set optionName = 'Hello'
where reportProductOption_id = 34567



I suspect my clustered index would NOT hold the table's primary key

   reportProductOptions Cluster Index:    report_id, product_id, optionCode, optionName

Regular (non-cluster) Index on the primary key:  reportProductOptions  reportProductOption_id


Advise please?   Is this the right approach?
Avatar of Peter Chan
Peter Chan
Flag of Hong Kong image

Hi,
Yes, we usually refer to column (having the index) when referring or linking the tables. Index is for big table having lots of records inside.
Avatar of ste5an
As far as I understand you: You will have a clustered index and a different primary key?

Then the answer is yes, you need the report ID in the WHERE conditions. otherwise the clustered index may not be used and your execution plan may use a primary key scan/seek and lookups.

But you need to review your model. Cause such an strategy sounds like you have in reality a combined primary key. But even when you don't have it, then you need a two column foreign key relationship between ReportProducts and ReportProductOptions.

I would reconsider the entire approach. When it's about reporting, then chose a dimensional model instead of a normalized relational model. Cause reports are calculated once, there is normally no need to update them like in your sample code snippets.

btw, can you explain your use-case of reporting? Cause
Avatar of gdemaria

ASKER

Hi  ste5an - thanks for your response.

Our reports are very complex, they are not simply pulling data, they are running lots of calculations with drill downs to products, options and annual values.   At each level many values are calculated on a running basis (conditionally compounded every year).    I simplified the model in my example, there is an annual table ReportProductOptionYears and this table has about 5,000 record created for every report_id.  

Even though the tables are cleared out in the middle of the night, they can quickly reach over 1 million records.  The problem is deadlocks.  Every user is assigned a report_id and his transactions should never touch any other report_id, so how can there be deadlocks?    The advise of others have given is to cluster index starting with the report_id in every table.      That will hopefully reduce deadlocks of records the user will never touch.    I am trying to figure out the details of that.    For the deep child tables, do I cluster around the primary key?

The table ReportProductOptionYears  is 4 levels deep from the reporting table.    Do I setup a cluster like this:

Report_id,  ReportProductOption_Id,  YearNumber
and then I leave the primary key as it's own index (non-clustered)

Or do I just include the Primary Key as part of the clustered index?

Report_id,  ReportProductOptionYear_ID,  YearNumber



In order to update a value on the ReportProductOptionYears table, is it better to ...

Can I go directly to the primary key without the report_id...   if the primary key is not in the cluster but its own index?

update ReportProductOptionYears
    set costFactor = xxxx
where ReportProductOptionYear_ID = 111111

Or this....


update ReportProductOptionYears
    set costFactor = xxxx
where report_id = 1111111
and     ReportProductOptionYear_ID = 111111

How far do I go?   Do I include every fkey in the cluster and then add to the where clause?

update ReportProductOptionYears
    set costFactor = xxxx
where report_id = 1111111
and     ReportProductOption_ID = 111111
and     ReportProductOptionYear_ID = 111111
Hi,
If report_id is first indexed column of one index, you can do it like
update ReportProductOptionYears
    set costFactor = xxxx
where report_id = 1111111
and     ReportProductOption_ID = 111111
and     ReportProductOptionYear_ID = 111111

Open in new window

There's nothing wrong with multiple-key clustering indexes, especially for child tables (as these are).  Such combined keys produce more efficient execution plans across all queries.  That should indeed eliminate (or at least drastically reduce) the deadlocks you're currently seeing.

Here's my suggested indexing (Cl=clustered index):


ReportProducts
        reportProduct_id
    Fk  report_id
    Cl  (report_id, reportProduct_id)

ReportProductOptions
       reportProductOption_id
       reportProduct_id
    Fk report_id
         optionCode
         optionName
         lastSold_Date
    Fk (report_id, reportProduct_id)
    Cl (report_id, reportProduct_id, reportProductOption_id) --yes, 3 levels of keys.  If that's what the data structure needs,
        then that's what you should use, not artificially limit it to one column to follow some generic "rule".
        Of course make sure you've done proper diligence on the initial design and selection of keys to make sure they will be stable later.  Don't over-rush the design process.
For the deadlocks:
When deadlocks happen you need to look at the reason. Is it an INSERT or UPDATE?

In the first case you need to play with the fill factor. Using a value between 0.6 and 0.9 may help. In the second case you need to look into supporting indices. Your sample UPDATE

UPDATE reportProductOptions
SET    optionName = 'Hello'
WHERE  report_id = 123
       AND reportProductOption_id = 34567;

Open in new window

requires an covering index over ReportProductOptions ( Report_ID, ReportProductOption, OptionName ) and change the query to

UPDATE reportProductOptions
SET    optionName = 'Hello'
WHERE  report_id = 123
       AND reportProductOption_id = 34567
       AND optionName != 'Hello'
       OR optionName IS NULL;

Open in new window

For the query predicates:

Always use the clustered index columns in your WHERE predicates. Cause (Report_ID, ReportProductOption_ID) is for sure in reality a not strict monotone function. Thus using the first column of the clustered key should enable a index/table seek, where as without you should end in index scan otherwise.
This question needs an answer!
Become an EE member today
7 DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform.
View membership options
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.