SQL CHECKSUM as a way to index a large text-oriented table with frequent INSERTS and ocaasional SELECTS?

kkamm
kkamm used Ask the Experts™
on
I have a SQL Server 2008 database table that accepts auditing information. This information is primarily text-based but has a numeric field that acts as a foreign key for SELECT operations. Most of the activity on this table is INSERT-based (no UPDATES though) with an occasional [SELECT WHERE] query to generate reports.

Currently there is a numeric IDENTITY PK autonumbering column with a clustered index on it. The table is constantly growing, nearing 1 million rows, and my concern is that INSERT performance will degrade over time due to index updating,etc.

My question is 2-fold:

1) Is there any benefit to even having an index for this table given the nature of the activity it receives?

2) If an index IS used then would a CHECKSUM column based on the other fields be a candidate for a non-clustered index without adversely affecting performance?
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Topic Advisor, Page Editor
Distinguished Expert 2018
Commented:
It really depends on the "WHERE" clause as to what might be a good index.

If only ever inserting, then arguably dont need an index. By having a clustered index it will store the data on the node of the index, and in your case probably doesnt add too much overhead because it is an identity and will be adding at the end of the index tree each time.

A checksum or hash value is sometimes handy for large text fields, but then there is a definite overhead in calculating the values, and then being able to re-use that in any meaningful way.

If there is something for the "WHERE" clause, that would benefit from an index, then I would be inclined to keep the current clustered primary key because that key value will be used as the pointer in the new index and because the PK is clustered then it points to the actual data.

Another thought, you can always defrag a clustered table a lot easier than a plain hash table (ie no index) in terms of maintenace.

So, maybe you can talk a bit more about that occassional "where" clause ? And what types of performance problems you are seeing / thinking of ?
Commented:
Even with audit data you have to perform some analysis from time to time. So you'd need to have an index on something besides identity column (record id; date; whatever).

Index obviously affects data modification performance. Although it something you need to deal with. 1M rows is small table. So don't worry much about overhead.

Speaking of non-clustered indexes, the general rule is - index is not used if value is not selective enough. E.g. when SQL Server estimates that it needs to perform key lookup operation (getting data from CI for NCI row) for more than X% of the rows in the table. X varies, but very small (1-2%).

Index on persistent calculated column via checksum could help in case if you need to search for equality on large text column. It helps to workaround 900 bytes NCI key size limitation. But it's useless for the search by substring. Check here: http://aboutsqlserver.com/2010/12/05/sunday-t-sql-tip-equality-predicate-on-large-varchar-column/

One other thing I'd like to mention about those indexes - "truly random indexes" on checksum, hash, guids introduce excessive fragmentation as well as performance of large batch operations is not good on the very large tables. I had a case when I had to optimize large ~200GB table with 3 indexes - one of them was index generated by hashbyte. By removing that index I was able to decrease the time of insert operation of 50K rows from 45 seconds down to 5. But for 1M rows it's not a big deal.

Generally speaking - all indexing questions are always on "It depends" category.

Author

Commented:
The occasional 'WHERE' involves one INT field that usually returns a small number (50-100) of rows. There are text fields that are returned but they are not being used as part of the WHERE clause.

It looks like a CHECKSUM field index would be beneficial if I had text data that was lengthy and recreatable on the query side (e.g DNA sequences, boilerplate legal language, etc.) and subject to frequent SELECT WHERE or UPDATE operations. The text fields I have are all free-form and would only be queried for keywords, if at all.
Success in ‘20 With a Profitable Pricing Strategy

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Commented:
In that case (keywords), checksum would not help you. Full-text search perhaps is the better option.
Mark WillsTopic Advisor, Page Editor
Distinguished Expert 2018

Commented:
Well, given the new bits of information, I still stand by my original post.

I would keep the identity as a clustered key and given the selection of about 100 from 1000000 rows, would be tempted to add a second index on that int column otherwise you will likely be doing table scans everytime.

There is no need (nor use) for checksum as far as I can ascertain, and certainly no performance increase over the current.

Just make sure your database is well sized and you keep control of any excessive fragmentation.

Author

Commented:
Yeah - the CHECKSUM seems like the wrong tool here.

I enabled a NCI on the INT field I mentioned and the SELECT WHERE was predictably faster whereas a single row INSERT was slightly slower. Given the majority of activity is INSERT statements I may just keep the PK index for now and enable the NCI if I see more SELECT activity in the future.

Thanks for the help.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial