Index Creation on Table with 30 Million Rows

Posted on 2011-02-11
Last Modified: 2012-05-11
Have a table with approx. 30 million rows in it.  It has 4 columns (3 int's, one date) plus a PK.  I need to be able to join on the 4 columns when doing an insert to make sure the row isn't already there.  I also need to be able to group on any combination of the 4, sort on any one and possible group by 2 1 or more.

I originally had an index on each column but the index's alone were eating up 32GB of space.  I switched to an additional index on the PK with the 4 columns included and the index size dropped 12GB but I've noticed a significant decrease in select performance.  Now I'm thinking of creating an index using all 4 columns (as opposed the index of the PK with the 4 columns included, I hope that makes sense).  But I've always thought that the only way that will give me good performance is if I'm using all 4 columns to join, sort, group, etc.

Am I basically stuck with a choice between performance and index size?  that if I want good performance I need to create a sep. index on each and just eat the disk space?
Question by:davidcahan
  • 5
  • 4
LVL 75

Expert Comment

by:Anthony Perkins
ID: 34876676
>>but I've noticed a significant decrease in select performance. <<
Have you considered updating stats?  
Have you checked to see if the indexes need to be defragged?
LVL 33

Accepted Solution

knightEknight earned 500 total points
ID: 34876697
Generally the PK is the clustered index.  Is the PK column one of your selectable columns?  If not, then re-create the PK as non-clustered first.  (Usually the PK will be selectable, but not always.)  This will take some time for 30M rows.

Since your table width is small (3 INTs and a DT), I suggest you make your clustered index on all four columns in the order in which they are most commonly appear in the WHERE and/or ORDER BY clause:

So if this is the most common query:

   select  ...
   where  col3 = ###
       and  col1 = ###

then make your clustered index on col3, col1, col2, col4.

As for the other non-clustered indexes, they will use less disk space if you make them one column each, but they will perform better if you include more columns (assuming the other columns appear in the SELECT portion of most of your queries).  Consider making the non-clustered indexes in a different file group on a different disk.  This will alleviate some of the space problem and improve performance too.

Author Comment

ID: 34876751
Will creating on all 4 adversely affect performance when I only end up using one in some of my where clauses?
LVL 33

Assisted Solution

knightEknight earned 500 total points
ID: 34876863
If the other columns are not used in the WHERE clause then you don't need separate indexes for them.  But if they are used in the SELECT then include them in the one index you do have.  Not sure that will save you any disk space, but it will improve performance.

In general, put the most selectable column fist in your index.  By "selectable" I mean the one with the most distinct values that appears often in your WHERE clause(s).  So if the column has values of only 0 and 1 (for example), it is not a good candidate to be first on the index.
LVL 33

Expert Comment

ID: 34876868
Of course, if you run a query with a WHERE clause that has a single column, and that column is not indexed, that query will be slow.
Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.


Author Comment

ID: 34878883
I'm guessing that the most selectable column in my case would be DateTime.  We keep track of each and everytime a file is played and in a day each file can play over 700 times.  If a unit has total files that play and each one plays 700 times that's a lot of different datetime's.

Wouldn't you agree?

Author Comment

ID: 34878904
sorry, that should have said "if a unit has 20 total files that play...".  

also, let me show you some common statements I'm writing against the table, maybe that will help. (i'm not writing syntactically correct SQL.  Just trying to give you a general idea)

Select count(*) from table
group by ContentID,UnitID,ZoneID,Date
having count(*) > 1

Insert into table
Select a,b,c,d from importtable left join table on a = contentid and b = unitid and c = zoneid and d=date
where is null

select unitname, contentname, date
      from table with(nolock)
      inner join units
      on table.unitid =  um_id
      inner join content
      on table.contentid = cm_id
      where between '2010-01-01' and '2010-12-31'
      and accountid = 174

LVL 33

Expert Comment

ID: 34887955
Besides selectability you must first consider usage.  If the datetime column is always or frequently used in your WHERE clauses then it is a good candidate for your clustered index.  The only caution I would add (if this is the case) is that if you are adding millions of new rows per day to this table, the inserts will get slower over time, even though the selects will still perform well (again, assuming the datetime column is in the WHERE clauses).

Author Comment

ID: 34888670
Do indexes come into play with JOIN's?
LVL 33

Assisted Solution

knightEknight earned 500 total points
ID: 34888685
Yes, the query execution plan will (try to) determine and use the best index for each table in the query.

Featured Post

U.S. Department of Agriculture and Acronis Access

With the new era of mobile computing, smartphones and tablets, wireless communications and cloud services, the USDA sought to take advantage of a mobilized workforce and the blurring lines between personal and corporate computing resources.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
How to SUM hours for the same record 1 33
Change variables in SQL table 6 81
SQL Encryption question 2 50
How to import SQL 2000 database to SQL 2014 5 99
I've encountered valid database schemas that do not have a primary key.  For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis.  However, occasionally due to user error or a scheduled task…
Introduction: When running hybrid database environments, you often need to query some data from a remote db of any type, while being connected to your MS SQL Server database. Problems start when you try to combine that with some "user input" pass…
This Micro Tutorial will give you a basic overview how to record your screen with Microsoft Expression Encoder. This program is still free and open for the public to download. This will be demonstrated using Microsoft Expression Encoder 4.
As a trusted technology advisor to your customers you are likely getting the daily question of, ‘should I put this in the cloud?’ As customer demands for cloud services increases, companies will see a shift from traditional buying patterns to new…

920 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now