
Delete Duplicate Rows Without a Primary Key in Microsoft SQL 2005

Ted Bouskill, Software Development Manager
CERTIFIED EXPERT
Almost 30 years of diverse professional software development experience.
I've encountered valid database schemas that do not have a primary key.  For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis.  However, occasionally, due to user error or a scheduled-task design flaw, duplicates have been created.

All essential business data should always be protected with constraints such as a primary key or a unique index.
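As a quick illustration, a unique index (or primary key) on the columns that identify a row makes SQL Server reject duplicate inserts up front.  The table and column names below are hypothetical, just to show the shape of the constraint:

-- Hypothetical IIS log table: the unique index rejects duplicate inserts at the source
CREATE UNIQUE INDEX IX_IISLog_Entry
   ON dbo.IISLog (LogDate, ClientIP, UriStem, UriQuery)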

In my case, one database table with the IIS logs had over 100,000,000 rows, not counting the duplicates!  The usual solutions -- creating a temporary or duplicate table, a while loop, or a dreaded resource-consuming cursor -- are not always practical.

So in SQL 2005 I found a very useful function that allowed me to delete the duplicates with a single query!

First off you have to determine if there are duplicates.  For these examples I'm going to create the following table:
CREATE TABLE [dbo].[Duplicates]([DuplicatesNonNull] [varchar](50) NOT NULL) ON [PRIMARY]



Next, I'll populate it with duplicates:
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')


Detecting duplicates is easy with the following query:
SELECT SUM(NbOfDuplicates) NbOfDuplicates
FROM
   (SELECT COUNT(*) - 1 NbOfDuplicates FROM dbo.Duplicates GROUP BY CHECKSUM(*) HAVING COUNT(*) > 1) T


Notice that I subtract one from the COUNT(*) to exclude the one valid row in each group.  You don't actually need the HAVING COUNT(*) > 1 clause, because if a row isn't duplicated COUNT(*) returns 1, which becomes 0 and adds nothing to the sum.  However, it improves readability.
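For reference, here is the same count without the HAVING clause; rows that aren't duplicated simply contribute 0 to the sum:

SELECT SUM(NbOfDuplicates) NbOfDuplicates
FROM
   (SELECT COUNT(*) - 1 NbOfDuplicates FROM dbo.Duplicates GROUP BY CHECKSUM(*)) T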

Also observe that I used CHECKSUM(*), which saves typing all the column names!  CHECKSUM() is underutilized by many developers.  It is a great way to generate a hash of one or more columns to use when analyzing data.
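For example, against the sample table above, the hash can be computed over an explicit column list just as easily; CHECKSUM(*) simply saves listing the columns out:

-- Hash an explicit column list (only one column in this sample table)
SELECT DuplicatesNonNull, CHECKSUM(DuplicatesNonNull) AS ChkSum
FROM dbo.Duplicates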

Revision: As pointed out in the article comments, CHECKSUM() can produce false positives (rows that match when they shouldn't) in very large data sets.  However, if the table contains essential data it should never have been allowed to get into this state!

Finally here is the solution to delete the duplicate rows.
DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb)
      FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
      GROUP BY ChkSum
   )


I use the ROW_NUMBER() function, added in SQL 2005, along with the OVER clause to order the subquery by the checksum and generate a row number for each row in the subquery.

Because the subqueries collapse all the columns into a hash value generated by CHECKSUM(), the working set is far smaller than a duplicate or temporary table, even if there are millions of rows.

The algorithm then simply creates what is essentially a temporary identity column for the rows, keeps the row with the maximum row number in each group, and deletes the duplicates!  I could keep the minimum value instead; however, I don't know whether there would be any benefit.
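For comparison (not the query used above, just an equivalent sketch built on the same CHECKSUM(*) assumption), the keep-one-row-per-group idea can also be written with PARTITION BY, which numbers the rows inside each checksum group and deletes everything after the first:

-- Equivalent sketch: number the rows within each CHECKSUM(*) group
-- and delete every row except the first in its group
WITH NumberedRows AS
(
   SELECT ROW_NUMBER() OVER(PARTITION BY CHECKSUM(*) ORDER BY CHECKSUM(*)) RowNb
   FROM dbo.Duplicates
)
DELETE FROM NumberedRows WHERE RowNb > 1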

You can also convert this strategy to divide and conquer, using the TOP clause to work with reduced subsets of the table, as sketched below.
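Here is one possible shape of that approach, reusing the delete from above and assuming a batch size of 10,000 rows; each pass removes up to one batch of duplicates until none remain:

-- Sketch: delete the duplicates in batches of 10,000 rows
WHILE 1 = 1
BEGIN
   DELETE TOP (10000) T
   FROM
      (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
   WHERE
      T.RowNb NOT IN
      (
         SELECT MAX(RowNb)
         FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
         GROUP BY ChkSum
      )

   IF @@ROWCOUNT = 0 BREAK
END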

Notes

To be clear once again, in a well-designed database all key business data should have constraints applied to protect the data from conditions like this.  For example, a primary key or a unique index.
If this is essential data that wasn't protected with constraints, I'd recommend using HASHBYTES() or manually typing in all the column names instead of using CHECKSUM() or BINARY_CHECKSUM() (see the sketch after these notes).
In SQL 2000 you could simulate the same technique by adding a temporary IDENTITY column and then removing it.  However, on a very large table it may time out when adding the column.
If any columns are NTEXT, TEXT, IMAGE, et cetera, CHECKSUM() will fail.
CHECKSUM() is case insensitive.  BINARY_CHECKSUM() is case sensitive.
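As a sketch of the HASHBYTES() alternative mentioned above, using the single-column dbo.Duplicates example (SHA1 is just one of the algorithms HASHBYTES() supports; for a multi-column table you would concatenate the columns, with a delimiter, before hashing):

-- Same delete as above, but keyed on HASHBYTES() instead of CHECKSUM()
DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY HASHBYTES('SHA1', DuplicatesNonNull)) RowNb,
           HASHBYTES('SHA1', DuplicatesNonNull) HashVal
    FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb)
      FROM (SELECT ROW_NUMBER() OVER(ORDER BY HASHBYTES('SHA1', DuplicatesNonNull)) RowNb,
                   HASHBYTES('SHA1', DuplicatesNonNull) HashVal
            FROM dbo.Duplicates) T
      GROUP BY HashVal
   )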

Cheers
Ted

Comments (12)

Ted Bouskill, Software Development Manager
CERTIFIED EXPERT
Top Expert 2009

Author

Commented:
Hi guys

I'm sorry, but I've written enough code to realize there isn't always a *perfect* or 'one size fits all' solution to every problem.  The public dialog in this thread is enough to help anyone make a decision about when to use this solution.  Your comments are acting as caveats for extreme cases.

Personally, the only time I create tables without constraints like a primary key or even a unique key is for logging.  The only cases of duplicate rows I've ever encountered were logging tables or non-essential data.  In my web applications, due to intermittent issues, we accept that some logs are occasionally lost, so if this solution removes a few extra logs due to false positives we tend not to worry.

If either of you chooses to write an article about why this should never happen to business-critical data, because there should be constraints to prevent duplicates, then by all means do so and I'll even add a link to it in this article.

@cyberwiki: Your last comment was rude and I am offended.  I have a real job with real data and am working successfully for a Fortune 500 company.

Have you ever worked with a table in excess of 1 billion rows, and would you ever create a mission-critical table without constraints like a unique primary key?  I feel your comments have moved beyond being constructive or relevant.

I'm not going to assume that all the readers aren't smart enough to use this as a foundation for solving problems.  They are free to remove the CHECKSUM() and use the column names themselves.
CERTIFIED EXPERT
Expert of the Quarter 2010
Expert of the Year 2010

Commented:
>> The algorithm then simply creates what is essentially a temporary identity column for the rows

I am sorry if it offends you, but I call it as I see it.  Let me preface this by saying I entered this discussion because you claimed "CHECKSUM() is underutilized by many developers" - that really offends me, because you do not even fully understand it.

People reading your article will think that CHECKSUM really produces unique values, just as you yourself believe and propagate.

If you feel you need to, please bring in a moderator and delete my comments.

http://blogs.clarience.com/davide/?m=200709

>> Have you ever worked with a table in excess of 1 billion rows

Well, whadayaknow - yes! Thanks for asking.

Anyway, since you insist on reigniting the "extreme case", "billions of rows" argument, here is my table with 2 records, of which 1 got deleted.
use tempdb;
if OBJECT_ID('tempdb..duplicates') > 0 drop table Duplicates
GO
create table Duplicates (A uniqueidentifier, B int, C varchar(25), D int);
GO
insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 1, 'hello world', 3;
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 1, 'hello world', 3;

SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates;

DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb) FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T GROUP BY ChkSum
   );

SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates;


CERTIFIED EXPERT
Expert of the Quarter 2010
Expert of the Year 2010

Commented:
These collide as well, when the values are seemingly completely different on all 4 fields.

insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 'avb', 'BCZ', 'ckz';
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 'AAE', 'BRQ', 'CAW';

Checksum BOL reference
http://msdn.microsoft.com/en-us/library/ms189788%28SQL.100%29.aspx
Ted Bouskill, Software Development Manager
CERTIFIED EXPERT
Top Expert 2009

Author

Commented:
@cyberwiki: I don't mind constructive discussion, but portions of your comment were rude and yes, I am going to report it to the moderators.

NOTE: In your examples you have a GUID, which is unique.  In that case I wouldn't use CHECKSUM(), which means I think you are missing my point.  I'm not saying that CHECKSUM() is a final solution, and yes, it can be abused (like anything can).

Check out my test of your example when I used the values as strings (which is likely in a log), along with some other examples, in the spirit of friendly discussion.
SELECT CHECKSUM('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT CHECKSUM('EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 'AAE', 'BRQ', 'CAW')

Results: 
1711681669
390310668

SELECT CHECKSUM(CAST('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' AS uniqueidentifier), 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT CHECKSUM(CAST('EFE7F2C5-19F9-42B9-9C16-21BED41E882B' AS uniqueidentifier), 'AAE', 'BRQ', 'CAW')

Results:
107127020
107127020

SELECT BINARY_CHECKSUM(CAST('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' AS uniqueidentifier), 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT BINARY_CHECKSUM(CAST('EFE7F2C5-19F9-42B9-9C16-21BED41E882B' AS uniqueidentifier), 'AAE', 'BRQ', 'CAW')

Results:
1025885929
1023845076


Racim BOUDJAKDJI, Database Architect - DBA - Data Scientist
CERTIFIED EXPERT

Commented:
@tedbilly

I believe that cyberwiki is simply trying to point out the limitations of using a CHECKSUM-based approach for dealing with duplicate tuples, and most of his comments are perfectly valid.

Knowing these limitations does not mean it is a bad solution; it simply completes it.

No bad intent in that.

Regards...

