Stefan Lennerbrant asked:

VERY LARGE TABLE - Performance - Design - Insert - Delete

Hi Everyone!

We have a very large table (100-200 million rows, or even more) in an MS SQL Server 2000 database.
Each row contains a timestamp, and we have a clustered index on that timestamp. (Please keep this clustered index in mind when answering the questions.)
Our recovery model is set to simple, as we do not care about the transaction log.
Many new records (about 1 million/day) are inserted into the table 24/7.

Question 1: When we delete old data it takes a very long time. What can we do to speed up the deletion of old data? Does the server have to move rows because of the clustered index (in my eyes this should not matter, because we are always deleting from the beginning of the table and inserting at the end)? We cannot TRUNCATE the table, because we do not want to delete all the data, only the oldest!

Question 2: Does the table size matter? Can we grow one table in the database up to 1 TB without affecting performance, or would we gain from splitting the one big table into several smaller ones (currently we only have two disks in RAID-1)? I understand that we would gain from this if we had more disks, but does it matter when we effectively only have one?

Question 3: Is it important (do we need it at all) to rebuild the indexes if we always delete from the beginning and insert new data at the end? In my opinion this should not be necessary: since the clustered index is on the timestamp, 8 KB pages and 64 KB extents should be deallocated at the beginning while new extents are allocated at the end.

Question 4: Do we gain performance by NOT inserting data at the same time as we are doing deletes? I think this should not matter much, because again, new data is inserted at the end of the table and old data is deleted from the beginning.

Question 5: On rare occasions I have seen a lock resource error (native error 1204) when inserting and deleting at the same time. I find this strange, because the need for locks should not be that big when only two threads are running in the database at the same time: one inserting data at the end of the table and one deleting it at the beginning. Any comments on why this error occurs?

PLEASE help me with the above questions!!!
I am very grateful for all the help I can get!

Best Regards
/Stefan
bragn:

Hi stefanlennerbrant

Is the clustered index the only index?
The clustered index should be chosen based on how the data is accessed in SELECT statements, I should think.

A large number of indices slows down deletes, inserts and updates considerably. Consider fewer indices.

You're creating a very large transaction with your deletion, and that degrades performance.
Consider deleting in smaller chunks, i.e. a smaller date/time range in each transaction.
I think the server might simply run out of locks when you do your deletes, and that would explain error 1204.

Best regards,
bragn.
Stefan Lennerbrant (asker):

To bragn,
Thanks for your answer.

The clustered index is not the only index; we also have some nonclustered indexes.
Unfortunately we need all the indexes we have; they have been carefully selected to support searches of the data (which are done rarely, and not at the same time as deletes).

And we are already deleting in small chunks (1,000 rows at a time) to avoid locking out the inserts.
So we are not running one large transaction; the delete thread loops and deletes 1,000 rows per iteration, and I still find it strange that we run out of locks!
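For reference, the loop does roughly this (table and column names here are placeholders for our real schema):

declare @from int, @to int
select @from = 20040101, @to = 20040131   -- dates are stored as integers

set rowcount 1000   -- SQL 2000: cap the rows affected per statement
delete from EventLog where EventDate >= @from and EventDate <= @to
while @@rowcount > 0   -- each delete commits on its own
    delete from EventLog where EventDate >= @from and EventDate <= @to
set rowcount 0   -- reset the session limit afterwards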

Best Regards
/Stefan
Hi Stefan.

You are right.  My theory on the locks does not hold water in that case - unless you're not committing between deletions or the server is badly configured.

These pages might be helpful....
http://doc.ddart.net/mssql/sql2000/html/trblsql/tr_reslsyserr_1_6gxg.htm
http://support.microsoft.com/default.aspx?scid=kb;en-us;224453
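
If the server really has run out of locks, you can check and raise the server-wide 'locks' setting; a minimal sketch (the value is an example only, not a recommendation):

exec sp_configure 'show advanced options', 1
reconfigure
exec sp_configure 'locks'          -- 0 means dynamic lock allocation
exec sp_configure 'locks', 20000   -- example value only
reconfigure with override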


Best regards,
bragn.

To hkamal,

Thanks for your thoughts.

Yes, it is a large table!

1: We do not report on the data much. Inserts go continuously into the end of the table, thanks to the clustered index. Why do you question the use of clustering? What do you mean by "contention on inserts"?

2: OK, I will have to read up on vertical partitioning and what it means! What would we gain by using vertical partitioning?

Ps. We never do any updates! Only inserts and deletes!

Best Regards
/Stefan
I questioned the use of clustering because it has a higher overhead, which is only worth it if your (SELECT) queries benefit from it. You didn't say whether you tend to do range scans (which I assume you do). There is a very good paper on index costings; for range scans, a non-clustered index is better.

Vertical partitioning, I think, would be VERY beneficial for you, especially as you seem to delete regularly from the bottom of the table, which suggests your data ages. Of all the methods described here, I suspect partitioning will have the biggest impact in your case.

I didn't mean updates in the SQL sense; I refer to deletes and inserts collectively as updates.

For updates, try to reduce the round-trips from client to server, possibly by encompassing the entire transaction in a stored procedure so it remains server-bound.

Use explicit transactions rather than implicit transactions. You may have an implicit transaction open without your knowledge, which hurts concurrency (in a looped delete/insert, an implicit transaction can stay open, holding locks and blocking other users from the data). Run the DBCC OPENTRAN command from Query Analyzer to find out more.
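
A sketch of both points together (procedure, table and column names here are made up): keep the loop server-bound in a procedure, commit each chunk explicitly, and use DBCC OPENTRAN to confirm nothing stays open:

create procedure dbo.PurgeOldRows @cutoff int as
    set nocount on
    set implicit_transactions off   -- make sure nothing stays open by accident
    set rowcount 1000
    while 1 = 1
    begin
        begin tran   -- explicit, short-lived transaction per chunk
        delete from dbo.EventLog where EventDate < @cutoff
        if @@rowcount = 0
        begin
            commit tran
            break
        end
        commit tran   -- locks released after every chunk
    end
    set rowcount 0
go

dbcc opentran   -- reports the oldest active transaction, if any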

If you have other indexes, check that they're not splitting, and use PAD_INDEX.
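
For instance, in SQL 2000 syntax (index, table and column names are placeholders):

-- pad the intermediate index pages as well as the leaf level,
-- leaving room that reduces page splits under constant inserts
create nonclustered index IX_EventLog_Source
    on dbo.EventLog (SourceId)
    with pad_index, fillfactor = 80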

Just a few things to check

Thanks hkamal,

where can I find the paper you mention ("There is a very good paper on index costings and for range scans, a non-clustered index is better")?

We are definitely doing range scans when deleting, and sometimes when searching.

/Stefan
Unfortunately I had the PDF but cannot locate the original URL. After a detailed description of each index type, it had a table of various scenarios along with processor costings.

The thinking is this: with a clustered index, the server locates the first matching record and (boosted by SQL Server's read-ahead feature) reads successive records, thus requiring fewer pages. With a range query it cannot benefit from this, as only a few records will match per page, yet it still has to load the leaf level, causing more pages to be read into memory (compared to a tiny one-column non-clustered index, which needs far fewer pages for the same range).

Re my earlier point about contention: O'Reilly's Transact-SQL Optimization and Tuning suggests:
Avoid creating a clustered index on a monotonically increasing key [the timestamp in your case]. It forces all new rows to go to the end of the table... such indexes force all users inserting data to compete for the last page of the table (which is why I said contention; however, if your inserts are done in batch mode rather than multi-user interactive mode, this is less of an issue).

HTH
Hi all

The data is stored around the clustered index. That is, the leaf level of the index contains the data as well as the index key, so when you're scanning the clustered index the data is read as well. Not only that, it's read in page-sized chunks, so if you want surrounding data (that is, the data that sorts next to the surrounding index keys) the server doesn't need to read another page (up to the limit of the page size, of course).
In practice, the clustered index decides how the physical data is ordered.

So if your SELECT statement is like:
select * from table where table.column1 > 1 and table.column1 < 100
then you benefit from a clustered index on column1.

You're dealing with a much larger database and a lot more inserts/deletes than I'm used to, but I think the locks, transactions and index maintenance are much more time-consuming than the seek itself for those inserts/deletes.
A good clustered index, on the other hand, is very effective for SELECTs.

Inside Microsoft SQL Server 2000 by Kalen Delaney takes a good look at these matters.

Best regards,
bragn.

stefanlennerbrant, as I mentioned originally, you may also wish to check your locking strategy. If you have row-level locking, set it to page locking from the outset and avoid escalation straight to a table lock; this is quite common for fighting deadlocks. Or, if you know in advance that a query will exceed the configured page-lock threshold beyond which the optimizer escalates to a table lock, you may wish to pre-empt the page-lock allocations and specify a table lock from the outset.
Check out the PAGLOCK hint as well (which lets you keep reporting on historical data without hindering the users adding to the end of the table).
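A hypothetical example of the hint applied to the chunked delete (table and column names are made up):

declare @cutoff int
set @cutoff = 20040101

set rowcount 1000
delete from dbo.EventLog with (paglock)   -- page locks from the outset
    where EventDate < @cutoff
set rowcount 0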
I would consider horizontal partitioning, not vertical. It all depends on whether you have a date column (or another column) that is a prime candidate for the partitioning column. As for RAID, RAID 10 is your best bet for delete/update operations; the write speed is awesome. Vertical partitioning splits the less frequently used columns into another table; horizontal partitioning splits the rows between tables.

You also want to make sure your log file is on a different RAID group/disks than your data files.
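
In SQL 2000, horizontal partitioning is usually done with a partitioned view over member tables. A minimal sketch, with made-up names:

-- one member table per month, each with a CHECK constraint on the
-- partitioning column (which must be part of the primary key)
create table dbo.EventLog_200401 (
    EventDate int not null check (EventDate between 20040101 and 20040131),
    EventId   int not null,
    Payload   varchar(255) null,
    constraint PK_EventLog_200401 primary key clustered (EventDate, EventId)
)

create table dbo.EventLog_200402 (
    EventDate int not null check (EventDate between 20040201 and 20040229),
    EventId   int not null,
    Payload   varchar(255) null,
    constraint PK_EventLog_200402 primary key clustered (EventDate, EventId)
)
go

-- the view presents the members as one table; the optimizer uses the
-- non-overlapping CHECK constraints to touch only the relevant member
create view dbo.EventLogAll as
    select EventDate, EventId, Payload from dbo.EventLog_200401
    union all
    select EventDate, EventId, Payload from dbo.EventLog_200402
go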
"If you have datarows locking, set it to page-lock from the outset and avoid escalation straight to table-lock. This is quite common for fighting deadlocks."

It's also a good way to create deadlocks.  Personally, I would use Profiler and Perfmon to figure out what's causing the excess locking in the first place.  Lots of disk I/O (which I suspect you have on a large delete) will cause excess locking.
arbert, my comment that you quoted came directly from O'Reilly. We can argue the merits of each locking scheme for some time; ultimately it's down to what works for this application.
stefanlennerbrant, of course I was referring to horizontal partitioning all along (hence my references to archiving and the logic for key pointers to different date-range archives). Vertical partitioning is when you move some columns into a different table, but keep the same number of rows.

Chill, I wasn't attacking...

"we can argue the merits of each locking scheme for some time, ultimately it's down to what works for this application."

I'm not arguing anything--I'm simply saying don't put a band-aid on a broken leg.  Changing the locking methods, for the most part, simply masks a problem...Just because it's in black and white doesn't mean it's always worth printing...
hmm, it also looks like you have multiple accounts, nexus--that's a violation of the EE agreement...
Arbert, do you know if the clustered index has to be rebuilt much when doing the delete as described?
(The clustered index is on a date, stored as an integer, and the delete is done in a range, where we first do:
set rowcount 1000
delete from zz where date >= yy and date <= xx)

/Stefan

Saying "rebuilt" probably isn't the correct term; it has to be maintained.  The overhead isn't all that great--it just depends on the number of rows.  That's where partitioning would really help you a lot.  Partition on the date column.
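
With a partitioned view like the one sketched above, purging old data stops being a row-by-row delete altogether: redefine the view and drop the oldest member (names again are the made-up ones from the sketch).

alter view dbo.EventLogAll as
    select EventDate, EventId, Payload from dbo.EventLog_200402
go
drop table dbo.EventLog_200401   -- instant deallocation, no row-by-row delete
go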