heap tables having data corruption

recently i found that a heap table had non clustered index corruption. when i dropped all non clustered index, then created and dropped a clustered index, and then added the non clustered index it worked fine (errors went away from dbcc checkdb)..

is this documented in msdn? could you suggest an authoritative tech article suggesting the problems with heap tables and corruption?
Who is Participating?
Scott PletcherConnect With a Mentor Senior DBACommented:
>> general IO operations on clustered tables is always faster <<

That's a bit overstated.   General i/o ops on clustered tables are faster on heaps that have many INSERTs run on them.  

But if you use heaps to a single mass load, such as for staging data, they are inherently more efficient than a clustered table because a clus index does not have to be built.  That's the single best use for heaps: as a load once, read all staging table.

Even if you subsequently need to build a non-clus index or two, the heap can win out in that situation.

But heaps do indeed have poorer performance when you do subsequent inserts to an existing heap table, as Ms Tripp noted.  The only really efficient way to re-use a heap table is to truncate it and reload it from scratch.
Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:
What error message were you getting from DBCC that lead you to believe the table data (or was it the index data?) was corrupt? A heap table is just a table with the rows in no particular order - unless you're doing large scans of some kind of sequential key, there's really no requirement to have a clustered indexes, and most tables are fine as heaps.

If you had corruption in your indexes, dropping them and recreating them would resolve it, but I don't think that corruption would be tied to the fact that you're indexing a heap rather than a clustered table - I've never seen anything like that. If you can provide the actual error message, I'd be interested to look into it more to see what caused it, as that seem peculiar.
Scott PletcherConnect With a Mentor Senior DBACommented:
I've quite often seen SQL have such issues with heaps (only!) if they add and remove a lot of rows over time.  [I've never seen this with a clustered table.]

I've had SQL report a heap was using 120GB when the whole db was only 30GB!

A periodic DBCC UPDATEUSAGE ( ) on heaps seems to help.

But I still suggest periodically rebuilding heaps using a clustered index.  Just how often this needs done you have to determine, based on table size and how critical the table is.  You can immediately drop the clustered index if you want to -- just the building of the clus index corrects any space/pointer issues.
Build your data science skills into a career

Are you ready to take your data science career to the next step, or break into data science? With Springboard’s Data Science Career Track, you’ll master data science topics, have personalized career guidance, weekly calls with a data science expert, and a job guarantee.

25112Author Commented:
Ryan, my situation is in

just the fact that dropping the nc index and adding them back did not resolve the index corruption errors from DBCC.. then i dropped all NC index and then created CI once and dropped CI and then rebuilt (create) the original NC index, and now there is no more errors.. hence makes me think something related to absence of CI?

but i could not find any authoritative MSDN article suggesting so.. hence this thread.
Scott PletcherConnect With a Mentor Senior DBACommented:
You can find similar issues with heaps in this thread:


Anyone who hasn't seen this problem before just hasn't worked that much with heaps.  It's an extremely common problem.

I don't know what MSDN says or doesn't say about it, but I know I've seen it dozens of times, whether they officially acknowledge it or not.

My strong suggestion is still to periodically cluster heaps, then drop the cluster, to keep them error-free and keep their allocations as reported by SQL reasonably accurate.
Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:
Hmm - I've generally advocated away from creating a clustered index where I don't think it's necessary or there's not a convenient clustering key, but I may have to reconsider in light of this. It's not something I've ever observed firsthand, but I suppose the heaps I've generally worked with are smaller (under a million rows), whereas the very large tables I work with are in data warehouses, and are generally clustered on date or something similarly-ascending.

If you're going to create the clustered index only to drop it, why not just keep it in place? As long as the key you're using fits the criteria (consistent, narrow, continually ascending, etc - see http://www.sqlskills.com/blogs/kimberly/ever-increasing-clustering-key-the-clustered-index-debate-again/), why not keep it in place? it makes table scans a bit more consistent when they're necessary, and if it avoid problems like this, it only makes sense as generally good practice.
Scott PletcherConnect With a Mentor Senior DBACommented:
>> why not keep a clus index in place? <<

I'm giving whoever set up the table credit that they knew a heap was better in that specific case.

A clustered index does have extra overhead when inserting rows into the table.

And nonclus index(es) will perform much less well with a clus index than w/o one.

In certain cases, then, using a clus index as just a "cleanup" mechanism makes perfect sense.
25112Author Commented:
>>And nonclus index(es) will perform much less well with a clus index than w/o one.

Are you saying
"And nonclus index(es) will perform much well with a clus index than w/o one."
"And nonclus index(es) will perform much less with a clus index than w/o one."

sorry i did not understand the meaning of the combo of the words.
Scott PletcherConnect With a Mentor Senior DBACommented:
Sorry: that language is convoluted.

Every nonclustered index performs better on a table that does not have a clustered index at all.

No clus index, better non-clus performance.
Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:
I thought it was the other way around - clustered index = better performance in general (including with non-clustered indexes on the table), though there are obviously cases where that's not true (inserts in a heap are quicker because they can go wherever is free, rather than making physical space). I can't seem to find the article where I initially read that, though, so take it with a grain of salt.

I clearly need to do some performance testing of my own to confirm the speed either way.
Scott PletcherConnect With a Mentor Senior DBACommented:
I don't feel the need to test that, simply because of the mechanics involved.

Non-clus index w/o clus index:
non-clus index row points directly to the table page itself.

Non-clus index w/ clus index:
non-clus index has clus key; clus key is then used to do lookup in the clus index; the clus index then gives the table page.

To me, the extra level of lookup must add overhead.
Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:

I was curious about this, so I did some quick benchmarks to compare performance of various operations on a heap vs. a clustered table. Maybe I'm missing something obvious here, and while the difference wasn't dramatic, the clustered table came out faster in every operation except one (using a lookup against the non-clustered index to update that same column).

To do the test, I created a table with two columns - ID (INT) and contentint (INT). In one, I clustered the table on ID, and in the other, I created a non-clustered index on ID (both had a non-clustered index on the other column). Then it runs through eight different tests:

 - ordered insert
 - unordered insert
 - update the "contentint" column after a lookup with ID
 - update the "contentint" column after a lookup with the column value
 - lookup data by ID
 - lookup data by other column
 - delete data by ID
 - delete data by other column

I've attached the script I used - it runs through each operation 10,000 times and tracks the total milliseconds required, and then performs that same test 15 times (10,000 operations each time) on both the heap and the clustered table (shown as the average of the 15 executions). Also, I ran the tests on both a server with SAN storage and my workstation, and the results showed the same pattern (though obviously, my desktop isn't nearly as fast).

Here are the results (numbers are MS per 10,000 operations, % improvement shown on the right):

Picture of comparison results
Given those results, am I doing something incorrect in the benchmark (I'm open to any feedback you might have about what I'm doing wrong to get these results)? Or is the clustered table generally faster, as I'm seeing here, in almost all operations? I wish I could find the original article that I remember - I could swear it was written by somebody like Paul Randal or Kimberly Tripp, but I can't find it for the life of me.
Scott PletcherConnect With a Mentor Senior DBACommented:
Key lookups are a well-known performance issue.  Those can only occur with clus indexes.

You have to look at the query plan to determine that.

On a table that small, I'm suprised there's any degree of difference between the two.

At any rate, I/O numbers -- specifically logical I/O -- would be much more useful and reliable than clock time, which is affected by a huge number of things.
Anthony PerkinsConnect With a Mentor Commented:
I wish I could find the original article that I remember
I suspect you are referring to this article:
Scott PletcherConnect With a Mentor Senior DBACommented:
I disagree with her, IMO, overly-broad assertion about the proper clustered index key based on my experience over decades, and the proven results of selecting a non-identity clustering key.

Yes, I know she's famous and well respected, and I deeply respect her too.

But I think she's using more lab/theoretical tests than real-world tests.  [Besides, that was 2007: maybe she's changed her mind somewhat since then :-) .]

And, yes, you do then have to learn to use FILLFACTOR.  And, sometimes, do more frequent table reorgs/rebuilds; for most tables that's not a big issue.

Over 70% of the tables originally clustered by identity get HUGE performance gains when I cluster them by the proper, natural key, even if the first clus key column is not inherently unique.  It's dropped I/O on common joins by up to 97%.

Kalen Delaney covers the choices on clustering well in her books on query tuning and optimization.

The single column, mono-increasing, blah blah may indeed slightly reduce the overhead of the INSERT (at least in theory; in reality, you lose it back and more because you have to add a gazillion non-clus covering indexes then to cover basic queries, which means INSERTs often end up costing more because of all the extra indexes to maintain).  I've literally removed 5,000+ indexes across three instances here while getting huge performance gains.

And to perhaps gain performance on that one INSERT, you lose performance on the 10-1000+ SELECTs that typically follow it.

Very often the natural keys also are mono-increasing (because they are identity columns, and clus index keys, from a different "master" table), negating the need to use identity to get that anyway.  For example, order_id or batch_number.  It's just silly to cluster an order_items table by a stand-alone identity when a comination ( order_id, order_item_id|item_number ) is available.

I know many people push identity as the only real choice when clustering a table.  To me, that's far too simplistic.  The clustering key is the critical choice for performance on a table.  No one-extremely-simple-rule-fits-(almost)-every-table works here.  It does allow you to ignore FILLFACTOR, so if they don't how to properly use that, I guess that inclines them to the "never look at the columns, just slap an identity cluster on the table" approach.
Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:
For most of my career, I'd shied away from using an identity column, only clustering the table where it felt appropriate to do so, and in those cases, only using an appropriate key that I'd be using to select ranges of data (as you're saying). I'd never noticed any performance problems in heap tables, and honestly felt as though it was unnecessary to slap an identity column on the table just so you could cluster it.

Thanks for the clarification - I was caught off guard by the assertion I'd seen (can't remember if it was Kimberly or somebody else) that general IO operations on clustered tables is always faster, and I figured there was more to the story than that.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.