Solved

VERY LARGE TABLE - Performance - Design - Insert - Delete

Posted on 2004-08-27
273 Views
Last Modified: 2007-12-19
Hi Everyone!

We have a very large table in a database (100 million - 200 million rows or even more) in MS SQL Server 2000.
Each row contains a timestamp and we have a clustered index on the timestamp. (Please observe the clustered index on the time when answering the questions).
Our recovery model is set to simple as we do not care about the transaction log.
Many new records (about 1 million/day) are inserted into the table 24/7.

Question 1: When we delete old data this takes a very long time. What can we do to speed up the deletion of old data? Does the server have to move rows due to the clustered index (in my eyes this should not matter, because we are always deleting from the beginning of the table and inserting at the end)? We cannot do a TRUNCATE TABLE because we do not want to delete all data, only the oldest!

Question 2: Does the table size matter? Or can we grow one table in the database up to 1 TB without affecting performance? Or would we gain from splitting the one big table into several smaller ones (currently we only have two discs with RAID-1)? I understand that we would gain from this if we had more discs. But I wonder if it matters when we only have one disc?

Question 3: Is it important (do we need it at all) to rebuild the indexes if we are always deleting from the beginning and inserting new data at the end? In my opinion this should not be necessary, as we have the clustered index on the timestamp and therefore 8K pages and 64K extents will be deallocated at the beginning and new extents allocated at the end?

Question 4: Do we gain performance by NOT inserting data at the same time as we are doing deletes? I think that this should not matter that much because again new data is inserted at the end of the table and old data is deleted from the beginning.

Question 5: I have on rare occasions seen that we get a lock resource error (native error 1204) when inserting and deleting at the same time. I think it is strange, because the need for locks should not be that big when there are only two threads running at the same time in the database: one that inserts data at the end of the table and one that deletes at the beginning. Any comments on why this error occurs?

PLEASE help me with the above questions!!!
I am very grateful for all the help I can get!

Best Regards
/Stefan
Question by: stefanlennerbrant
17 Comments
 
LVL 5

Accepted Solution

by:
hkamal earned 125 total points
ID: 11911281
stefanlennerbrant, that is quite a big table you have there. My initial thought would be to use vertical partitioning and add logic to your related stored procs to become archiving-aware (use a key table that points to the correct archive table for a given date range).


Question 1: You have to ask yourself: do you report on the data MUCH more than you change it (i.e. do you have a transactional or a reporting function)? If you're doing a lot of inserts a day I would question the use of clustering. You could be creating needless contention on inserts.

Question 2: As per my original point, I would definitely look into partitioning - we have a 200 million row table here which is queued up for partitioning. Any query on it is painfully slow. Remember, non-clustered indexes are actually much better at range scans (where mydate >= somedate).

Question 3: AIUI, although in your case (clustered) the data is physically reordered, this is not the only benefit of rebuilding indexes. Selectivity and cardinality stats are kept up to date, and (in the case of non-clustered indexes) fragmentation is eliminated.

Question 4: It depends what locking strategy you have on the table; datarow locking is escalated to table (allpage) locking, so datapage locking may be better. This could affect your simultaneous updates.

Question 5: It does seem strange. Check the locking as mentioned in point 4 and experiment with recreating the index as non-clustered.

 
LVL 4

Expert Comment

by:bragn
ID: 11911545
Hi stefanlennerbrant

Is the clustered index the only index?
The clustered index should be selected from the way the data is accessed in select statements I should think.

A large number of indices slows down deletes, inserts and updates considerably.  Consider fewer indices.

You're creating a very large transaction with your deletion - and that degrades performance.
Consider deleting in smaller (that is - not as large) chunks.  A smaller date/time range in each transaction.
I think the server might run out of locks in general when you do your deletes, and that explains error 1204.
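A minimal sketch of what I mean by chunked deletion, assuming a table `BigTable` with an integer timestamp column `ts` (both names are illustrative, not from your post):

```sql
-- Delete old rows in batches of 1000 so each transaction stays small
-- and locks are released between rounds (SQL Server 2000 idiom).
DECLARE @cutoff int, @rows int
SET @cutoff = 20040101      -- keep everything from 2004-01-01 onwards
SET @rows = 1

SET ROWCOUNT 1000           -- cap every DELETE at 1000 rows
WHILE @rows > 0
BEGIN
    DELETE FROM BigTable WHERE ts < @cutoff
    SET @rows = @@ROWCOUNT  -- drops to 0 once no old rows remain
END
SET ROWCOUNT 0              -- reset so later statements are unaffected
```

In autocommit mode each DELETE commits on its own, so the locks from one batch do not accumulate into the next.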

Best regards,
bragn.
 

Author Comment

by:stefanlennerbrant
ID: 11911640
To bragn,
Thanks for your answer.

The clustered index is not the only index. We also have some nonclustered indexes.
Unfortunately, we need all the indexes that we have, they have been carefully selected to support searching of the data (which is done rarely and not at the same time as deleting).

And, we are deleting small chunks (1000 rows) at a time to avoid locking out the inserting of data.
So we are not doing one large transaction - the delete thread loops and deletes 1000 rows at a time - and still I find it strange that we run out of general locks!

Best Regards
/Stefan
 
LVL 4

Expert Comment

by:bragn
ID: 11911676
Hi Stefan.

You are right.  My theory on the locks does not hold water in that case - unless you're not committing between deletions or the server is badly configured.

These pages might be helpful....
http://doc.ddart.net/mssql/sql2000/html/trblsql/tr_reslsyserr_1_6gxg.htm
http://support.microsoft.com/default.aspx?scid=kb;en-us;224453


Best regards,
bragn.

 

Author Comment

by:stefanlennerbrant
ID: 11911681
To hkamal,

Thanks for your thoughts.

Yes, it is a large table!

1: We do not report on the data much. Inserts are continuously done at the end of the table, thanks to the clustered index. Why do you question the use of clustering? What do you mean by "contention on inserts"?

2: OK, I have to read about vertical partitioning and what that means! What would we gain by using vertical partitioning?

Ps. We never do any updates! Only inserts and deletes!

Best Regards
/Stefan
 
LVL 5

Expert Comment

by:hkamal
ID: 11911896
I questioned the use of clustering because it has a higher overhead, which is worth it if your (select) queries benefit more from it. You didn't say if you tend to do range scans (which I assume you do). There is a very good paper on index costings, and for range scans a non-clustered index is better.

Vertical partitioning I think would be VERY beneficial for you, especially as you seem to delete regularly from the bottom of the table, suggesting your data does age. Of all the methods described here, I suspect partitioning will have the biggest impact in your case.

I didn't mean updates in the sql sense, I refer to deletes and inserts collectively as updates.

For updates, try to reduce the round-trip from client to server, possibly by encompassing the entire transaction in a stored procedure so it remains server-bound.

Use explicit transactions rather than implicit transactions. You may have an implicit transaction open without your knowledge, which hurts concurrency (in a looped delete/insert, an implicit transaction is potentially left open, holding locks and blocking other users from the data). Run the DBCC OPENTRAN command from Query Analyzer to find out more.
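For example (a sketch only - the table and column names are hypothetical), each batch can be delimited explicitly, and DBCC OPENTRAN will show whether anything has been left open:

```sql
-- Make sure no implicit transaction lingers between batches.
SET IMPLICIT_TRANSACTIONS OFF

BEGIN TRANSACTION
    SET ROWCOUNT 1000
    DELETE FROM BigTable WHERE ts < 20040101
    SET ROWCOUNT 0
COMMIT TRANSACTION

-- In Query Analyzer: report the oldest active transaction, if any.
DBCC OPENTRAN
```

If DBCC OPENTRAN reports an old transaction while your delete loop is idle, something is holding a transaction (and its locks) open.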

If you have other indexes, check that they're not splitting, and use PAD_INDEX.

Just a few things to check

 

Author Comment

by:stefanlennerbrant
ID: 11912042
Thanks hkamal,

where can I find the paper you mention in "There is a very good paper on index costings, and for range scans a non-clustered index is better"?

We are definitely doing range scans when deleting, and sometimes when searching.

/Stefan
 
LVL 5

Expert Comment

by:hkamal
ID: 11912457
Unfortunately I had the PDF but cannot locate the original URL. After a detailed description of each index type, it had a table of various scenarios along with processor costings.

The thinking is this: with a clustered index, the server locates the first matching record and (boosted by SQL Server's look-ahead feature) reads successive records, thus requiring fewer pages. With a range query it cannot benefit from this, as only a few records will match per page, but it still has to load the leaf, causing more pages to be read into memory (compared to a tiny one-column non-clustered index using up far fewer pages to load for the same range).

Re my earlier point re contention, O'Reilly's Transact-SQL Optimization and Tuning suggests:
Avoid creating a clustered index on a monotonically increasing key [timestamp in your case]. It forces all new rows to go to the end of the table... such indexes force all users inserting data to compete for the last page of the table (which is why I said contention; however, if your inserts are done in batch mode rather than multi-user interactive mode, this is less of an issue).

HTH
 
LVL 4

Expert Comment

by:bragn
ID: 11912782
Hi all

The data is stored around the clustered index.  That is - the leaf level of the index contains the data as well as the index key.  So when you're scanning the clustered index the data is read as well.  Not only that, it's read in page chunks, so if you want surrounding data (that is - the data that sorts with the surrounding index keys) the server doesn't need to read another page (up to the limit of the page size, of course).
In practice, the clustered index decides how the physical data is ordered.

So if your select statement is like:
select * from table where table.column1 > 1 and table.column1 < 100
then you benefit from a clustered index on column1.

You're dealing with a very large database and a lot more inserts/deletes than I'm used to, but I think the locks, transactions and updating of indexes are much more time consuming than the seek itself when it comes to those inserts/deletes.
A good clustered index, on the other hand, is very effective with regard to selects.

Inside Microsoft SQL Server 2000 by Kalen Delaney takes a good look at these matters.

Best regards,
bragn.





 
LVL 5

Expert Comment

by:hkamal
ID: 11913027
stefanlennerbrant, as I mentioned originally, you may also wish to check your locking strategy. If you have datarow locking, set it to page-lock from the outset and avoid escalation straight to table-lock. This is quite common for fighting deadlocks. Or, if you know in advance of a query that you'll exceed the pre-configured page-lock threshold beyond which the optimizer will escalate to table-lock, you may wish to pre-empt page-lock allocations and specify table-lock from the outset.
Check out PAGLOCK also (which allows you to go on reporting on historical data without hindering users adding to the end of the table).
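As a sketch of those hints (the table and column names are illustrative, not from the original post), SQL Server 2000 accepts PAGLOCK and TABLOCK as per-statement table hints:

```sql
-- Ask for page-level locks on the batch delete: fewer lock structures
-- than row locks, and less pressure to escalate to a table lock.
DELETE FROM BigTable WITH (PAGLOCK) WHERE ts < 20040101

-- Or take a table lock deliberately when a statement will touch most
-- of the table anyway, instead of letting the optimizer escalate mid-way.
SELECT COUNT(*) FROM BigTable WITH (TABLOCK) WHERE ts < 20040101
```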
 
LVL 34

Expert Comment

by:arbert
ID: 11915139
I would consider horizontal partitioning - not vertical.  It all depends on whether you have a date column (or another column) that is a prime candidate for the partitioning column.  As far as RAID, RAID 10 is your best bet for delete/update operations - the write speed is awesome.  Vertical partitioning splits the less frequently used columns into another table; horizontal partitioning splits the rows between tables.

You also want to make sure your log file is on a different raid group/disks than your data files....
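A sketch of what horizontal partitioning can look like in SQL Server 2000, using a UNION ALL partitioned view over monthly member tables (all of the names, and the two-column key, are illustrative; the int dates follow the poster's YYYYMMDD storage):

```sql
-- Each member table holds one month. The CHECK constraint on the
-- partitioning column lets the optimizer prune member tables, and the
-- partitioning column must be part of the primary key.
CREATE TABLE Data200407 (
    ts      int NOT NULL,
    id      int NOT NULL,
    payload varchar(100) NULL,
    CONSTRAINT PK_Data200407 PRIMARY KEY CLUSTERED (ts, id),
    CONSTRAINT CK_Data200407 CHECK (ts >= 20040701 AND ts < 20040801)
)

CREATE TABLE Data200408 (
    ts      int NOT NULL,
    id      int NOT NULL,
    payload varchar(100) NULL,
    CONSTRAINT PK_Data200408 PRIMARY KEY CLUSTERED (ts, id),
    CONSTRAINT CK_Data200408 CHECK (ts >= 20040801 AND ts < 20040901)
)
GO

-- The view presents the members as one logical table.
CREATE VIEW AllData AS
SELECT ts, id, payload FROM Data200407
UNION ALL
SELECT ts, id, payload FROM Data200408
```

Aging out July then becomes DROP TABLE Data200407 plus re-creating the view - a metadata operation instead of deleting ~30 million rows.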
 
LVL 34

Expert Comment

by:arbert
ID: 11915154
"If you have datarows locking, set it to page-lock from the outset and avoid escalation straight to table-lock. This is quite common for fighting deadlocks."

It's also a good way to create deadlocks.  Personally, I would use profiler and perfmon to figure out what's causing the excess locking in the first place.  Lots of disk io (which I suspect you have on a large delete) will cause excess locking....
 
LVL 2

Expert Comment

by:nexusSam
ID: 11916976
arbert, my comment that you quoted came directly from O'Reilly - we can argue the merits of each locking scheme for some time; ultimately it's down to what works for this application.
stefanlennerbrant, of course I was referring to horizontal partitioning all along (hence my references to archiving and logic for key pointers to different date-range archives). Vertical partitioning is when you move some columns into a different table, but with the same number of rows.

 
LVL 34

Expert Comment

by:arbert
ID: 11917212
Chill, I wasn't attacking...

"we can argue the merits of each locking scheme for some time, ultimately it's down to what works for this application."

I'm not arguing anything--I'm simply saying don't put a bandaid on a broken leg.  Changing the locking methods, for the most part, simply masks a problem...Just because it's in black and white doesn't mean it's always worth printing...
 
LVL 34

Expert Comment

by:arbert
ID: 11917218
hmm, it also looks like you have multiple accounts nexus--that's a violation of EE agreement...
 

Author Comment

by:stefanlennerbrant
ID: 11929952
Arbert, do you know if the clustered index has to be rebuilt much when doing the delete as described?
(The clustered index is on a date, stored as an integer, and the delete is done in a range, where we first do:
set rowcount 1000
delete from zz where date >= yy and date <= xx
)

/Stefan

 
LVL 34

Expert Comment

by:arbert
ID: 11930818
Saying rebuilt probably isn't the correct term; it has to be maintained.  The overhead isn't all that great - it just depends on the number of rows.  That's where partitioning would really help you a lot.  Partition on the date column.