Solved

heap tables having data corruption

Posted on 2013-01-23
410 Views
Last Modified: 2013-02-12
Recently I found that a heap table had nonclustered index corruption. When I dropped all the nonclustered indexes, then created and dropped a clustered index, and then re-added the nonclustered indexes, it worked fine (the errors went away from DBCC CHECKDB).

Is this documented in MSDN? Could you suggest an authoritative tech article describing the corruption problems heap tables can have?
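For reference, the repair sequence described above might look like this (all object, column, and database names here are hypothetical placeholders, not taken from the asker's environment):

```sql
-- 1. Drop the corrupted nonclustered indexes.
DROP INDEX IX_MyHeap_Col1 ON dbo.MyHeap;

-- 2. Create and then drop a clustered index; building it rewrites
--    every row, which repairs bad page/row pointers in the heap.
CREATE CLUSTERED INDEX CIX_MyHeap_Temp ON dbo.MyHeap (SomeColumn);
DROP INDEX CIX_MyHeap_Temp ON dbo.MyHeap;

-- 3. Recreate the nonclustered indexes against the now-clean heap.
CREATE NONCLUSTERED INDEX IX_MyHeap_Col1 ON dbo.MyHeap (Col1);

-- 4. Verify the errors are gone.
DBCC CHECKDB ('MyDatabase') WITH NO_INFOMSGS;
```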
Question by:25112
16 Comments
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 179 total points
ID: 38813286
What error message were you getting from DBCC that led you to believe the table data (or was it the index data?) was corrupt? A heap table is just a table with the rows in no particular order - unless you're doing large scans of some kind of sequential key, there's really no requirement to have a clustered index, and most tables are fine as heaps.

If you had corruption in your indexes, dropping them and recreating them would resolve it, but I don't think that corruption would be tied to the fact that you're indexing a heap rather than a clustered table - I've never seen anything like that. If you can provide the actual error message, I'd be interested to look into it more to see what caused it, as that seems peculiar.
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38814653
I've quite often seen SQL have such issues with heaps (only!) if they add and remove a lot of rows over time.  [I've never seen this with a clustered table.]

I've had SQL report a heap was using 120GB when the whole db was only 30GB!

A periodic DBCC UPDATEUSAGE ( ) on heaps seems to help.
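The syntax, for reference (the database and table names are placeholders):

```sql
-- Correct page/row count inaccuracies for one heap:
DBCC UPDATEUSAGE ('MyDatabase', 'dbo.MyHeap') WITH COUNT_ROWS;

-- Or sweep the whole current database:
DBCC UPDATEUSAGE (0);  -- 0 = current database
```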

But I still suggest periodically rebuilding heaps using a clustered index.  Just how often this needs to be done you have to determine, based on table size and how critical the table is.  You can immediately drop the clustered index if you want to -- just the building of the clus index corrects any space/pointer issues.
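A sketch of that rebuild cycle (names are placeholders). Note that on SQL Server 2008 and later, ALTER TABLE ... REBUILD achieves a similar heap rebuild without creating an index at all:

```sql
-- Create, then immediately drop, a clustered index to rebuild the heap:
CREATE CLUSTERED INDEX CIX_Rebuild ON dbo.MyHeap (SomeColumn);
DROP INDEX CIX_Rebuild ON dbo.MyHeap;

-- SQL Server 2008+ alternative: rebuild the heap (and its NC indexes) directly.
ALTER TABLE dbo.MyHeap REBUILD;
```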
 
LVL 5

Author Comment

by:25112
ID: 38819191
Ryan, my situation is in
http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/Q_28004268.html

Just dropping the NC indexes and adding them back did not resolve the index corruption errors from DBCC. Then I dropped all the NC indexes, created a CI once, dropped the CI, and then rebuilt (created) the original NC indexes, and now there are no more errors. Hence it makes me think it's something related to the absence of a CI?

But I could not find any authoritative MSDN article saying so - hence this thread.
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38819230
You can find similar issues with heaps in this thread:

http://www.sqlservercentral.com/Forums/Topic1124708-391-1.aspx

Anyone who hasn't seen this problem before just hasn't worked that much with heaps.  It's an extremely common problem.

I don't know what MSDN says or doesn't say about it, but I know I've seen it dozens of times, whether they officially acknowledge it or not.

My strong suggestion is still to periodically cluster heaps, then drop the cluster, to keep them error-free and keep their allocations as reported by SQL reasonably accurate.
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 179 total points
ID: 38819506
Hmm - I've generally advised against creating a clustered index where I don't think it's necessary or there's no convenient clustering key, but I may have to reconsider in light of this. It's not something I've ever observed firsthand, but I suppose the heaps I've generally worked with are smaller (under a million rows), whereas the very large tables I work with are in data warehouses, and are generally clustered on a date or something similarly ascending.

If you're going to create the clustered index only to drop it, why not just keep it in place? As long as the key you're using fits the criteria (consistent, narrow, continually ascending, etc. - see http://www.sqlskills.com/blogs/kimberly/ever-increasing-clustering-key-the-clustered-index-debate-again/), why not keep it? It makes table scans a bit more consistent when they're necessary, and if it avoids problems like this, it only makes sense as generally good practice.
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38819549
>> why not keep a clus index in place? <<

I'm giving whoever set up the table credit that they knew a heap was better in that specific case.

A clustered index does have extra overhead when inserting rows into the table.

And nonclus index(es) will perform much less well with a clus index than w/o one.

In certain cases, then, using a clus index as just a "cleanup" mechanism makes perfect sense.
 
LVL 5

Author Comment

by:25112
ID: 38831675
>>And nonclus index(es) will perform much less well with a clus index than w/o one.

Are you saying
"And nonclus index(es) will perform much well with a clus index than w/o one."
or
"And nonclus index(es) will perform much less with a clus index than w/o one."

Sorry, I did not understand the meaning of that combination of words.
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38831753
Sorry: that language is convoluted.

Every nonclustered index performs better on a table that does not have a clustered index at all.

No clus index, better non-clus performance.
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 179 total points
ID: 38832210
I thought it was the other way around - clustered index = better performance in general (including with non-clustered indexes on the table), though there are obviously cases where that's not true (inserts in a heap are quicker because they can go wherever is free, rather than making physical space). I can't seem to find the article where I initially read that, though, so take it with a grain of salt.

I clearly need to do some performance testing of my own to confirm the speed either way.
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38832339
I don't feel the need to test that, simply because of the mechanics involved.

Non-clus index w/o clus index:
non-clus index row points directly to the table page itself.

Non-clus index w/ clus index:
non-clus index has clus key; clus key is then used to do lookup in the clus index; the clus index then gives the table page.

To me, the extra level of lookup must add overhead.
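The two lookup paths show up directly in the actual execution plan. A sketch, using hypothetical table and column names (none of these objects are from the thread):

```sql
-- Assumed: dbo.OrdersHeap and dbo.OrdersClustered have identical columns
-- (OrderID, CustomerID, OrderDate), an NC index on CustomerID, and the
-- clustered version is clustered on OrderID.  OrderDate is not in the NC
-- index, so both queries need a lookup to fetch it.

-- Heap: plan shows Index Seek + RID Lookup -- the NC index row carries a
-- direct file:page:slot pointer to the data row.
SELECT OrderID, OrderDate FROM dbo.OrdersHeap      WHERE CustomerID = 1001;

-- Clustered table: plan shows Index Seek + Key Lookup -- the NC index row
-- carries the clustering key, which must be re-seeked through the clustered
-- index b-tree to reach the data row.
SELECT OrderID, OrderDate FROM dbo.OrdersClustered WHERE CustomerID = 1001;
```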
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 179 total points
ID: 38837581
@ScottPletcher:

I was curious about this, so I did some quick benchmarks to compare the performance of various operations on a heap vs. a clustered table. Maybe I'm missing something obvious here, but while the difference wasn't dramatic, the clustered table came out faster in every operation except one (using a lookup against the non-clustered index to update that same column).

To do the test, I created a table with two columns - ID (INT) and contentint (INT). In one, I clustered the table on ID, and in the other, I created a non-clustered index on ID (both had a non-clustered index on the other column). Then the script runs through eight different tests:

 - ordered insert
 - unordered insert
 - update the "contentint" column after a lookup with ID
 - update the "contentint" column after a lookup with the column value
 - lookup data by ID
 - lookup data by other column
 - delete data by ID
 - delete data by other column

I've attached the script I used - it runs through each operation 10,000 times and tracks the total milliseconds required, and then performs that same test 15 times (10,000 operations each time) on both the heap and the clustered table (shown as the average of the 15 executions). Also, I ran the tests on both a server with SAN storage and my workstation, and the results showed the same pattern (though obviously, my desktop isn't nearly as fast).
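The attached script has the exact definitions; a minimal sketch of the two test tables as described (names assumed, not taken from the attachment) might look like:

```sql
-- Clustered variant: clustered on ID, NC index on the content column.
CREATE TABLE dbo.TestClustered (ID INT NOT NULL, contentint INT NOT NULL);
CREATE CLUSTERED INDEX CIX_TestClustered ON dbo.TestClustered (ID);
CREATE NONCLUSTERED INDEX IX_TestClustered_content
    ON dbo.TestClustered (contentint);

-- Heap variant: same columns, NC indexes on both.
CREATE TABLE dbo.TestHeap (ID INT NOT NULL, contentint INT NOT NULL);
CREATE NONCLUSTERED INDEX IX_TestHeap_ID      ON dbo.TestHeap (ID);
CREATE NONCLUSTERED INDEX IX_TestHeap_content ON dbo.TestHeap (contentint);
```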

Here are the results (numbers are MS per 10,000 operations, % improvement shown on the right):

[Picture of comparison results]
Given those results, am I doing something incorrect in the benchmark (I'm open to any feedback you might have about what I'm doing wrong to get these results)? Or is the clustered table generally faster, as I'm seeing here, in almost all operations? I wish I could find the original article that I remember - I could swear it was written by somebody like Paul Randal or Kimberly Tripp, but I can't find it for the life of me.
Clustered-index-performance-test.sql
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38837612
Key lookups are a well-known performance issue.  Those can only occur with clus indexes.

You have to look at the query plan to determine that.

On a table that small, I'm surprised there's any degree of difference between the two.

At any rate, I/O numbers -- specifically logical I/O -- would be much more useful and reliable than clock time, which is affected by a huge number of things.
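A quick way to capture logical I/O instead of clock time, as suggested (table names assumed from the benchmark description above):

```sql
SET STATISTICS IO ON;   -- reports logical/physical reads per statement
SET STATISTICS TIME ON; -- CPU vs. elapsed time, for completeness

SELECT contentint FROM dbo.TestHeap      WHERE ID = 12345;
SELECT contentint FROM dbo.TestClustered WHERE ID = 12345;
-- Compare the "logical reads" figures in the Messages tab.
```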
 
LVL 75

Assisted Solution

by:Anthony Perkins
Anthony Perkins earned 35 total points
ID: 38837666
>> I wish I could find the original article that I remember <<
I suspect you are referring to this article:
http://www.sqlskills.com/blogs/kimberly/the-clustered-index-debate-continues/
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38837722
I disagree with her (IMO overly broad) assertion about the proper clustered index key, based on my experience over decades and the proven results of selecting non-identity clustering keys.

Yes, I know she's famous and well respected, and I deeply respect her too.

But I think she's using more lab/theoretical tests than real-world tests.  [Besides, that was 2007: maybe she's changed her mind somewhat since then :-) .]

And, yes, you do then have to learn to use FILLFACTOR.  And, sometimes, do more frequent table reorgs/rebuilds; for most tables that's not a big issue.

Over 70% of the tables originally clustered by identity get HUGE performance gains when I cluster them by the proper, natural key, even if the first clus key column is not inherently unique.  It's dropped I/O on common joins by up to 97%.

Kalen Delaney covers the choices on clustering well in her books on query tuning and optimization.

The single column, mono-increasing, blah blah may indeed slightly reduce the overhead of the INSERT (at least in theory; in reality, you lose it back and more because you have to add a gazillion non-clus covering indexes then to cover basic queries, which means INSERTs often end up costing more because of all the extra indexes to maintain).  I've literally removed 5,000+ indexes across three instances here while getting huge performance gains.

And to perhaps gain performance on that one INSERT, you lose performance on the 10-1000+ SELECTs that typically follow it.

Very often the natural keys are also mono-increasing (because they are identity columns, and clus index keys, from a different "master" table), negating the need to use an identity to get that anyway.  For example, order_id or batch_number.  It's just silly to cluster an order_items table by a stand-alone identity when a combination ( order_id, order_item_id|item_number ) is available.
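A sketch of that order_items example (hypothetical schema, not from the thread):

```sql
CREATE TABLE dbo.order_items (
    order_id    INT NOT NULL,  -- identity / clustering key of the parent orders table
    item_number INT NOT NULL,  -- 1, 2, 3, ... within the order
    product_id  INT NOT NULL,
    qty         INT NOT NULL
);

-- Cluster on the natural composite key rather than a stand-alone identity:
CREATE UNIQUE CLUSTERED INDEX CIX_order_items
    ON dbo.order_items (order_id, item_number);
```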

I know many people push identity as the only real choice when clustering a table.  To me, that's far too simplistic.  The clustering key is the critical choice for performance on a table.  No one-extremely-simple-rule-fits-(almost)-every-table works here.  It does allow you to ignore FILLFACTOR, so if they don't know how to properly use that, I guess that inclines them to the "never look at the columns, just slap an identity cluster on the table" approach.
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 179 total points
ID: 38839653
For most of my career, I'd shied away from using an identity column, only clustering the table where it felt appropriate to do so, and in those cases, only using an appropriate key that I'd be using to select ranges of data (as you're saying). I'd never noticed any performance problems in heap tables, and honestly felt as though it was unnecessary to slap an identity column on the table just so you could cluster it.

Thanks for the clarification - I was caught off guard by the assertion I'd seen (can't remember if it was Kimberly or somebody else) that general IO operations on clustered tables is always faster, and I figured there was more to the story than that.
 
LVL 69

Accepted Solution

by:ScottPletcher
ScottPletcher earned 286 total points
ID: 38840375
>> general IO operations on clustered tables is always faster <<

That's a bit overstated.  General I/O ops on a clustered table are faster than on a heap that has had many INSERTs run against it.

But if you use a heap for a single mass load, such as for staging data, it is inherently more efficient than a clustered table because a clus index does not have to be built.  That's the single best use for heaps: as a load-once, read-all staging table.

Even if you subsequently need to build a non-clus index or two, the heap can win out in that situation.

But heaps do indeed have poorer performance when you do subsequent inserts into an existing heap table, as Ms Tripp noted.  The only really efficient way to re-use a heap table is to truncate it and reload it from scratch.
