?
Solved

Index Creation on Table with 30 Million Rows

Posted on 2011-02-11
10
Medium Priority
?
408 Views
Last Modified: 2012-05-11
Have a table with approx. 30 million rows in it.  It has 4 columns (3 int's, one date) plus a PK.  I need to be able to join on the 4 columns when doing an insert to make sure the row isn't already there.  I also need to be able to group on any combination of the 4, sort on any one and possible group by 2 1 or more.

I originally had an index on each column but the index's alone were eating up 32GB of space.  I switched to an additional index on the PK with the 4 columns included and the index size dropped 12GB but I've noticed a significant decrease in select performance.  Now I'm thinking of creating an index using all 4 columns (as opposed the index of the PK with the 4 columns included, I hope that makes sense).  But I've always thought that the only way that will give me good performance is if I'm using all 4 columns to join, sort, group, etc.

Am I basically stuck with a choice between performance and index size?  that if I want good performance I need to create a sep. index on each and just eat the disk space?
0
Comment
Question by:davidcahan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 4
10 Comments
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 34876676
>>but I've noticed a significant decrease in select performance. <<
Have you considered updating stats?  
Have you checked to see if the indexes need to be defragged?
0
 
LVL 33

Accepted Solution

by:
knightEknight earned 2000 total points
ID: 34876697
Generally the PK is the clustered index.  Is the PK column one of your selectable columns?  If not, then re-create the PK as non-clustered first.  (Usually the PK will be selectable, but not always.)  This will take some time for 30M rows.

Since your table width is small (3 INTs and a DT), I suggest you make your clustered index on all four columns in the order in which they are most commonly appear in the WHERE and/or ORDER BY clause:

So if this is the most common query:

   select  ...
   where  col3 = ###
       and  col1 = ###

then make your clustered index on col3, col1, col2, col4.

As for the other non-clustered indexes, they will use less disk space if you make them one column each, but they will perform better if you include more columns (assuming the other columns appear in the SELECT portion of most of your queries).  Consider making the non-clustered indexes in a different file group on a different disk.  This will alleviate some of the space problem and improve performance too.
0
 

Author Comment

by:davidcahan
ID: 34876751
Will creating on all 4 adversely affect performance when I only end up using one in some of my where clauses?
0
 [eBook] Windows Nano Server

Download this FREE eBook and learn all you need to get started with Windows Nano Server, including deployment options, remote management
and troubleshooting tips and tricks

 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 2000 total points
ID: 34876863
If the other columns are not used in the WHERE clause then you don't need separate indexes for them.  But if they are used in the SELECT then include them in the one index you do have.  Not sure that will save you any disk space, but it will improve performance.

In general, put the most selectable column fist in your index.  By "selectable" I mean the one with the most distinct values that appears often in your WHERE clause(s).  So if the column has values of only 0 and 1 (for example), it is not a good candidate to be first on the index.
0
 
LVL 33

Expert Comment

by:knightEknight
ID: 34876868
Of course, if you run a query with a WHERE clause that has a single column, and that column is not indexed, that query will be slow.
0
 

Author Comment

by:davidcahan
ID: 34878883
I'm guessing that the most selectable column in my case would be DateTime.  We keep track of each and everytime a file is played and in a day each file can play over 700 times.  If a unit has total files that play and each one plays 700 times that's a lot of different datetime's.

Wouldn't you agree?
0
 

Author Comment

by:davidcahan
ID: 34878904
sorry, that should have said "if a unit has 20 total files that play...".  

also, let me show you some common statements I'm writing against the table, maybe that will help. (i'm not writing syntactically correct SQL.  Just trying to give you a general idea)

Select count(*) from table
group by ContentID,UnitID,ZoneID,Date
having count(*) > 1

Insert into table
Select a,b,c,d from importtable left join table on a = contentid and b = unitid and c = zoneid and d=date
where table.pk is null

select unitname, contentname, date
      from table with(nolock)
      inner join units
      on table.unitid =  um_id
      inner join content
      on table.contentid = cm_id
      where table.date between '2010-01-01' and '2010-12-31'
      and accountid = 174

0
 
LVL 33

Expert Comment

by:knightEknight
ID: 34887955
Besides selectability you must first consider usage.  If the datetime column is always or frequently used in your WHERE clauses then it is a good candidate for your clustered index.  The only caution I would add (if this is the case) is that if you are adding millions of new rows per day to this table, the inserts will get slower over time, even though the selects will still perform well (again, assuming the datetime column is in the WHERE clauses).
0
 

Author Comment

by:davidcahan
ID: 34888670
Do indexes come into play with JOIN's?
0
 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 2000 total points
ID: 34888685
Yes, the query execution plan will (try to) determine and use the best index for each table in the query.
0

Featured Post

 [eBook] Windows Nano Server

Download this FREE eBook and learn all you need to get started with Windows Nano Server, including deployment options, remote management
and troubleshooting tips and tricks

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article will describe one method to parse a delimited string into a table of data.   Why would I do that you ask?  Let's say that you need to pass multiple parameters into a stored procedure to search for.  For our sake, we'll say that we wa…
Introduction: When running hybrid database environments, you often need to query some data from a remote db of any type, while being connected to your MS SQL Server database. Problems start when you try to combine that with some "user input" pass…
Do you want to know how to make a graph with Microsoft Access? First, create a query with the data for the chart. Then make a blank form and add a chart control. This video also shows how to change what data is displayed on the graph as well as form…
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…

741 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question