Index Creation on Table with 30 Million Rows

Posted on 2011-02-11
Last Modified: 2012-05-11
Have a table with approx. 30 million rows in it.  It has 4 columns (3 int's, one date) plus a PK.  I need to be able to join on the 4 columns when doing an insert to make sure the row isn't already there.  I also need to be able to group on any combination of the 4, sort on any one and possible group by 2 1 or more.

I originally had an index on each column but the index's alone were eating up 32GB of space.  I switched to an additional index on the PK with the 4 columns included and the index size dropped 12GB but I've noticed a significant decrease in select performance.  Now I'm thinking of creating an index using all 4 columns (as opposed the index of the PK with the 4 columns included, I hope that makes sense).  But I've always thought that the only way that will give me good performance is if I'm using all 4 columns to join, sort, group, etc.

Am I basically stuck with a choice between performance and index size?  that if I want good performance I need to create a sep. index on each and just eat the disk space?
Question by:davidcahan
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 4
LVL 75

Expert Comment

by:Anthony Perkins
ID: 34876676
>>but I've noticed a significant decrease in select performance. <<
Have you considered updating stats?  
Have you checked to see if the indexes need to be defragged?
LVL 33

Accepted Solution

knightEknight earned 500 total points
ID: 34876697
Generally the PK is the clustered index.  Is the PK column one of your selectable columns?  If not, then re-create the PK as non-clustered first.  (Usually the PK will be selectable, but not always.)  This will take some time for 30M rows.

Since your table width is small (3 INTs and a DT), I suggest you make your clustered index on all four columns in the order in which they are most commonly appear in the WHERE and/or ORDER BY clause:

So if this is the most common query:

   select  ...
   where  col3 = ###
       and  col1 = ###

then make your clustered index on col3, col1, col2, col4.

As for the other non-clustered indexes, they will use less disk space if you make them one column each, but they will perform better if you include more columns (assuming the other columns appear in the SELECT portion of most of your queries).  Consider making the non-clustered indexes in a different file group on a different disk.  This will alleviate some of the space problem and improve performance too.

Author Comment

ID: 34876751
Will creating on all 4 adversely affect performance when I only end up using one in some of my where clauses?
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

LVL 33

Assisted Solution

knightEknight earned 500 total points
ID: 34876863
If the other columns are not used in the WHERE clause then you don't need separate indexes for them.  But if they are used in the SELECT then include them in the one index you do have.  Not sure that will save you any disk space, but it will improve performance.

In general, put the most selectable column fist in your index.  By "selectable" I mean the one with the most distinct values that appears often in your WHERE clause(s).  So if the column has values of only 0 and 1 (for example), it is not a good candidate to be first on the index.
LVL 33

Expert Comment

ID: 34876868
Of course, if you run a query with a WHERE clause that has a single column, and that column is not indexed, that query will be slow.

Author Comment

ID: 34878883
I'm guessing that the most selectable column in my case would be DateTime.  We keep track of each and everytime a file is played and in a day each file can play over 700 times.  If a unit has total files that play and each one plays 700 times that's a lot of different datetime's.

Wouldn't you agree?

Author Comment

ID: 34878904
sorry, that should have said "if a unit has 20 total files that play...".  

also, let me show you some common statements I'm writing against the table, maybe that will help. (i'm not writing syntactically correct SQL.  Just trying to give you a general idea)

Select count(*) from table
group by ContentID,UnitID,ZoneID,Date
having count(*) > 1

Insert into table
Select a,b,c,d from importtable left join table on a = contentid and b = unitid and c = zoneid and d=date
where is null

select unitname, contentname, date
      from table with(nolock)
      inner join units
      on table.unitid =  um_id
      inner join content
      on table.contentid = cm_id
      where between '2010-01-01' and '2010-12-31'
      and accountid = 174

LVL 33

Expert Comment

ID: 34887955
Besides selectability you must first consider usage.  If the datetime column is always or frequently used in your WHERE clauses then it is a good candidate for your clustered index.  The only caution I would add (if this is the case) is that if you are adding millions of new rows per day to this table, the inserts will get slower over time, even though the selects will still perform well (again, assuming the datetime column is in the WHERE clauses).

Author Comment

ID: 34888670
Do indexes come into play with JOIN's?
LVL 33

Assisted Solution

knightEknight earned 500 total points
ID: 34888685
Yes, the query execution plan will (try to) determine and use the best index for each table in the query.

Featured Post

Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I've encountered valid database schemas that do not have a primary key.  For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis.  However, occasionally due to user error or a scheduled task…
INTRODUCTION: While tying your database objects into builds and your enterprise source control system takes a third-party product (like Visual Studio Database Edition or Red-Gate's SQL Source Control), you can achieve some protection using a sing…
Come and listen to Percona CEO Peter Zaitsev discuss what’s new in Percona open source software, including Percona Server for MySQL ( and MongoDB (…
In this video, viewers are given an introduction to using the Windows 10 Snipping Tool, how to quickly locate it when it's needed and also how make it always available with a single click of a mouse button, by pinning it to the Desktop Task Bar. Int…

729 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question