- Community Pick
- Experts Exchange Approved
In my case, one database table with the IIS logs had over a 100,000,000 rows without duplicates! The usual solutions -- to create a temporary or duplicate table, a while loop or a dreaded resource consuming cursor -- are not always practical.
So in SQL 2005 I've found a very useful function that allowed me to delete duplicates with one query!
First off you have to determine if there are duplicates. For these examples I'm going to create the following table:
Next, I'll populate it with duplicates:
Detecting duplicates is easy with the following query:
Notice that I remove one from the COUNT(*) to exclude the one valid row. You actually don't need the HAVING COUNT(*) > 1 statement because if a row isn't duplicated COUNT(*) returns 1 which becomes 0. However, it improves readability.
Also observe that I used CHECKSUM(*) which saves typing all the column names! CHECKSUM() is under utilized by many developers. It is a great way to generate a unique hash for one or more columns to use for analyzing data.
Finally here is the solution to delete the duplicate rows.
I use the ROW_NUMBER() function added to SQL 2005 along with the OVER clause to order the sub query by my checksum and create row numbers for this sub query.
Because the sub queries have collapsed all the columns into a hash valued generated by CHECKSUM() it's far smaller than a duplicate or temporary table even if there are millions of rows.
The algorithm then simply creates what is essentially a temporary identity column for the rows, keeps the row with the maximum value and deletes the duplicates! I could do it with the minimum value however, I don't know if there would be a benefit.
You can convert this strategy to use divide and conquer and use the TOP clause to work with reduce subsets of the table.
Notes
- To be clear once again, in a well designed database all key business data should have constraints applied to protect the data from conditions like this. For example a primary key or a unique index.
- If this is essential data that wasn't protected with constraints I'd recommend using HASHBYTES() or manually typing in all the column names instead of using CHECKSUM() or BINARY_CHECKSUM()
- In SQL 2000 you could simulate the same the technique by adding a temporary IDENTITY COLUMN then removing it. However on a very large table it may timeout when adding the column.
- If any columns are NTEXT, TEXT, IMAGE et cetera the CHECKSUM() will fail.
- CHECKSUM() is case insensitive. BINARY_CHECKSUM() is case sensitive
Cheers
Ted
by: StiggySon on 2010-07-10 at 02:41:44ID: 16825