Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people, just like you, are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
Solved

Index Creation on Table with 30 Million Rows

Posted on 2011-02-11
10
404 Views
Last Modified: 2012-05-11
Have a table with approx. 30 million rows in it.  It has 4 columns (3 int's, one date) plus a PK.  I need to be able to join on the 4 columns when doing an insert to make sure the row isn't already there.  I also need to be able to group on any combination of the 4, sort on any one and possible group by 2 1 or more.

I originally had an index on each column but the index's alone were eating up 32GB of space.  I switched to an additional index on the PK with the 4 columns included and the index size dropped 12GB but I've noticed a significant decrease in select performance.  Now I'm thinking of creating an index using all 4 columns (as opposed the index of the PK with the 4 columns included, I hope that makes sense).  But I've always thought that the only way that will give me good performance is if I'm using all 4 columns to join, sort, group, etc.

Am I basically stuck with a choice between performance and index size?  that if I want good performance I need to create a sep. index on each and just eat the disk space?
0
Comment
Question by:davidcahan
  • 5
  • 4
10 Comments
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 34876676
>>but I've noticed a significant decrease in select performance. <<
Have you considered updating stats?  
Have you checked to see if the indexes need to be defragged?
0
 
LVL 33

Accepted Solution

by:
knightEknight earned 500 total points
ID: 34876697
Generally the PK is the clustered index.  Is the PK column one of your selectable columns?  If not, then re-create the PK as non-clustered first.  (Usually the PK will be selectable, but not always.)  This will take some time for 30M rows.

Since your table width is small (3 INTs and a DT), I suggest you make your clustered index on all four columns in the order in which they are most commonly appear in the WHERE and/or ORDER BY clause:

So if this is the most common query:

   select  ...
   where  col3 = ###
       and  col1 = ###

then make your clustered index on col3, col1, col2, col4.

As for the other non-clustered indexes, they will use less disk space if you make them one column each, but they will perform better if you include more columns (assuming the other columns appear in the SELECT portion of most of your queries).  Consider making the non-clustered indexes in a different file group on a different disk.  This will alleviate some of the space problem and improve performance too.
0
 

Author Comment

by:davidcahan
ID: 34876751
Will creating on all 4 adversely affect performance when I only end up using one in some of my where clauses?
0
Use Case: Protecting a Hybrid Cloud Infrastructure

Microsoft Azure is rapidly becoming the norm in dynamic IT environments. This document describes the challenges that organizations face when protecting data in a hybrid cloud IT environment and presents a use case to demonstrate how Acronis Backup protects all data.

 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 500 total points
ID: 34876863
If the other columns are not used in the WHERE clause then you don't need separate indexes for them.  But if they are used in the SELECT then include them in the one index you do have.  Not sure that will save you any disk space, but it will improve performance.

In general, put the most selectable column fist in your index.  By "selectable" I mean the one with the most distinct values that appears often in your WHERE clause(s).  So if the column has values of only 0 and 1 (for example), it is not a good candidate to be first on the index.
0
 
LVL 33

Expert Comment

by:knightEknight
ID: 34876868
Of course, if you run a query with a WHERE clause that has a single column, and that column is not indexed, that query will be slow.
0
 

Author Comment

by:davidcahan
ID: 34878883
I'm guessing that the most selectable column in my case would be DateTime.  We keep track of each and everytime a file is played and in a day each file can play over 700 times.  If a unit has total files that play and each one plays 700 times that's a lot of different datetime's.

Wouldn't you agree?
0
 

Author Comment

by:davidcahan
ID: 34878904
sorry, that should have said "if a unit has 20 total files that play...".  

also, let me show you some common statements I'm writing against the table, maybe that will help. (i'm not writing syntactically correct SQL.  Just trying to give you a general idea)

Select count(*) from table
group by ContentID,UnitID,ZoneID,Date
having count(*) > 1

Insert into table
Select a,b,c,d from importtable left join table on a = contentid and b = unitid and c = zoneid and d=date
where table.pk is null

select unitname, contentname, date
      from table with(nolock)
      inner join units
      on table.unitid =  um_id
      inner join content
      on table.contentid = cm_id
      where table.date between '2010-01-01' and '2010-12-31'
      and accountid = 174

0
 
LVL 33

Expert Comment

by:knightEknight
ID: 34887955
Besides selectability you must first consider usage.  If the datetime column is always or frequently used in your WHERE clauses then it is a good candidate for your clustered index.  The only caution I would add (if this is the case) is that if you are adding millions of new rows per day to this table, the inserts will get slower over time, even though the selects will still perform well (again, assuming the datetime column is in the WHERE clauses).
0
 

Author Comment

by:davidcahan
ID: 34888670
Do indexes come into play with JOIN's?
0
 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 500 total points
ID: 34888685
Yes, the query execution plan will (try to) determine and use the best index for each table in the query.
0

Featured Post

Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

I've encountered valid database schemas that do not have a primary key.  For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis.  However, occasionally due to user error or a scheduled task…
In this article we will get to know that how can we recover deleted data if it happens accidently. We really can recover deleted rows if we know the time when data is deleted by using the transaction log.
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…

840 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question