Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win


Index Creation on Table with 30 Million Rows

Posted on 2011-02-11
Medium Priority
Last Modified: 2012-05-11
Have a table with approx. 30 million rows in it.  It has 4 columns (3 int's, one date) plus a PK.  I need to be able to join on the 4 columns when doing an insert to make sure the row isn't already there.  I also need to be able to group on any combination of the 4, sort on any one and possible group by 2 1 or more.

I originally had an index on each column but the index's alone were eating up 32GB of space.  I switched to an additional index on the PK with the 4 columns included and the index size dropped 12GB but I've noticed a significant decrease in select performance.  Now I'm thinking of creating an index using all 4 columns (as opposed the index of the PK with the 4 columns included, I hope that makes sense).  But I've always thought that the only way that will give me good performance is if I'm using all 4 columns to join, sort, group, etc.

Am I basically stuck with a choice between performance and index size?  that if I want good performance I need to create a sep. index on each and just eat the disk space?
Question by:davidcahan
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 4
LVL 75

Expert Comment

by:Anthony Perkins
ID: 34876676
>>but I've noticed a significant decrease in select performance. <<
Have you considered updating stats?  
Have you checked to see if the indexes need to be defragged?
LVL 33

Accepted Solution

knightEknight earned 2000 total points
ID: 34876697
Generally the PK is the clustered index.  Is the PK column one of your selectable columns?  If not, then re-create the PK as non-clustered first.  (Usually the PK will be selectable, but not always.)  This will take some time for 30M rows.

Since your table width is small (3 INTs and a DT), I suggest you make your clustered index on all four columns in the order in which they are most commonly appear in the WHERE and/or ORDER BY clause:

So if this is the most common query:

   select  ...
   where  col3 = ###
       and  col1 = ###

then make your clustered index on col3, col1, col2, col4.

As for the other non-clustered indexes, they will use less disk space if you make them one column each, but they will perform better if you include more columns (assuming the other columns appear in the SELECT portion of most of your queries).  Consider making the non-clustered indexes in a different file group on a different disk.  This will alleviate some of the space problem and improve performance too.

Author Comment

ID: 34876751
Will creating on all 4 adversely affect performance when I only end up using one in some of my where clauses?
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

LVL 33

Assisted Solution

knightEknight earned 2000 total points
ID: 34876863
If the other columns are not used in the WHERE clause then you don't need separate indexes for them.  But if they are used in the SELECT then include them in the one index you do have.  Not sure that will save you any disk space, but it will improve performance.

In general, put the most selectable column fist in your index.  By "selectable" I mean the one with the most distinct values that appears often in your WHERE clause(s).  So if the column has values of only 0 and 1 (for example), it is not a good candidate to be first on the index.
LVL 33

Expert Comment

ID: 34876868
Of course, if you run a query with a WHERE clause that has a single column, and that column is not indexed, that query will be slow.

Author Comment

ID: 34878883
I'm guessing that the most selectable column in my case would be DateTime.  We keep track of each and everytime a file is played and in a day each file can play over 700 times.  If a unit has total files that play and each one plays 700 times that's a lot of different datetime's.

Wouldn't you agree?

Author Comment

ID: 34878904
sorry, that should have said "if a unit has 20 total files that play...".  

also, let me show you some common statements I'm writing against the table, maybe that will help. (i'm not writing syntactically correct SQL.  Just trying to give you a general idea)

Select count(*) from table
group by ContentID,UnitID,ZoneID,Date
having count(*) > 1

Insert into table
Select a,b,c,d from importtable left join table on a = contentid and b = unitid and c = zoneid and d=date
where table.pk is null

select unitname, contentname, date
      from table with(nolock)
      inner join units
      on table.unitid =  um_id
      inner join content
      on table.contentid = cm_id
      where table.date between '2010-01-01' and '2010-12-31'
      and accountid = 174

LVL 33

Expert Comment

ID: 34887955
Besides selectability you must first consider usage.  If the datetime column is always or frequently used in your WHERE clauses then it is a good candidate for your clustered index.  The only caution I would add (if this is the case) is that if you are adding millions of new rows per day to this table, the inserts will get slower over time, even though the selects will still perform well (again, assuming the datetime column is in the WHERE clauses).

Author Comment

ID: 34888670
Do indexes come into play with JOIN's?
LVL 33

Assisted Solution

knightEknight earned 2000 total points
ID: 34888685
Yes, the query execution plan will (try to) determine and use the best index for each table in the query.

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

There are some very powerful Dynamic Management Views (DMV's) introduced with SQL 2005. The two in particular that we are going to discuss are sys.dm_db_index_usage_stats and sys.dm_db_index_operational_stats.   Recently, I was involved in a di…
INTRODUCTION: While tying your database objects into builds and your enterprise source control system takes a third-party product (like Visual Studio Database Edition or Red-Gate's SQL Source Control), you can achieve some protection using a sing…
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…
Want to learn how to record your desktop screen without having to use an outside camera. Click on this video and learn how to use the cool google extension called "Screencastify"! Step 1: Open a new google tab Step 2: Go to the left hand upper corn…

636 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question