I've encountered valid database schemas that do not have a primary key. For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis. However, occasionally, due to user error or a scheduled task design flaw, duplicates have been created.
All essential business data should always be protected with constraints such as a primary key or a unique index.
In my case, one database table with the IIS logs had over 100,000,000 rows with duplicates! The usual solutions -- creating a temporary or duplicate table, a WHILE loop, or a dreaded, resource-consuming cursor -- are not always practical.
So in SQL 2005 I found a very useful function that allowed me to delete the duplicates with a single query!
First off, you have to determine whether there are duplicates. For these examples I'm going to create the following table:
CREATE TABLE [dbo].[Duplicates]([DuplicatesNonNull] [varchar](50) NOT NULL) ON [PRIMARY]
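Here is the kind of query I use to find the duplicates (a sketch; the ChkSum and DuplicateCount aliases are my own):

SELECT CHECKSUM(*) AS ChkSum, COUNT(*) - 1 AS DuplicateCount
FROM dbo.Duplicates
GROUP BY CHECKSUM(*)
HAVING COUNT(*) > 1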
Notice that I subtract one from COUNT(*) to exclude the one valid row. You don't actually need the HAVING COUNT(*) > 1 clause, because if a row isn't duplicated COUNT(*) returns 1, which becomes 0 after the subtraction. However, it improves readability.
Also observe that I used CHECKSUM(*), which saves typing all the column names! CHECKSUM() is underutilized by many developers. It is a great way to generate a hash of one or more columns to use for analyzing data.
Revision: As pointed out in the article comments, CHECKSUM() can produce false positives (rows that match when they shouldn't) in very large data sets. However, if the table contains essential data, it should never have been allowed to get into this state!
Finally, here is the solution to delete the duplicate rows.
DELETE T
FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
      FROM dbo.Duplicates) T
WHERE T.RowNb NOT IN (SELECT MAX(RowNb)
                      FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
                            FROM dbo.Duplicates) T
                      GROUP BY ChkSum)
I use the ROW_NUMBER() function, added in SQL 2005, along with the OVER clause to order the subquery by my checksum and generate row numbers for it.
Because the subqueries collapse all the columns into a hash value generated by CHECKSUM(), the working set is far smaller than a duplicate or temporary table, even if there are millions of rows.
The algorithm then simply creates what is essentially a temporary identity column for the rows, keeps the row with the maximum value, and deletes the duplicates! I could use the minimum value instead; however, I don't know whether there would be a benefit.
You can also convert this into a divide-and-conquer strategy by using the TOP clause to work with reduced subsets of the table.
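As a sketch of that idea (the batch size of 100,000 is arbitrary), the delete can be run in passes until nothing is left to remove:

WHILE 1 = 1
BEGIN
    DELETE TOP (100000) T
    FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
          FROM dbo.Duplicates) T
    WHERE T.RowNb NOT IN (SELECT MAX(RowNb)
                          FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
                                FROM dbo.Duplicates) T
                          GROUP BY ChkSum);
    -- Stop once a pass deletes nothing.
    IF @@ROWCOUNT = 0 BREAK;
END

The row numbers are recomputed on each pass, but the logic of keeping one row per checksum still holds, so each batch only removes duplicates.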
To be clear once again: in a well-designed database, all key business data should have constraints applied, such as a primary key or a unique index, to protect the data from conditions like this.
If this is essential data that wasn't protected with constraints, I'd recommend using HASHBYTES() or manually typing in all the column names instead of using CHECKSUM() or BINARY_CHECKSUM().
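HASHBYTES() takes a single value rather than a column list, so a sketch for our one-column example table could look like this (the 'SHA1' algorithm is available in SQL 2005; the aliases are my own):

SELECT HASHBYTES('SHA1', DuplicatesNonNull) AS HashVal, COUNT(*) - 1 AS DuplicateCount
FROM dbo.Duplicates
GROUP BY HASHBYTES('SHA1', DuplicatesNonNull)
HAVING COUNT(*) > 1

With several columns you would concatenate them with a separator before hashing, for example HASHBYTES('SHA1', Col1 + '|' + Col2).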
In SQL 2000 you could simulate the same technique by adding a temporary IDENTITY column and then removing it. However, on a very large table it may time out when adding the column.
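A sketch of that SQL 2000 approach for our example table (note that once the identity column exists you can no longer group by CHECKSUM(*), because the new column would make every row's checksum unique):

ALTER TABLE dbo.Duplicates ADD TempId INT IDENTITY(1, 1);

DELETE D
FROM dbo.Duplicates D
WHERE D.TempId NOT IN (SELECT MAX(TempId)
                       FROM dbo.Duplicates
                       GROUP BY DuplicatesNonNull);

ALTER TABLE dbo.Duplicates DROP COLUMN TempId;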
If any columns are NTEXT, TEXT, IMAGE, et cetera, CHECKSUM(*) will fail.
CHECKSUM() is case insensitive (it follows the column's collation), while BINARY_CHECKSUM() is case sensitive.
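For example, under a typical case-insensitive collation:

SELECT CHECKSUM('Hello') AS A, CHECKSUM('hello') AS B;                -- A = B
SELECT BINARY_CHECKSUM('Hello') AS A, BINARY_CHECKSUM('hello') AS B;  -- A <> B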