Delete Duplicate Rows Without a Primary Key in Microsoft SQL 2005

AID: 3380
  • Status: Published

9080 points

  • Bytedbilly
  • TypeTips/Tricks
  • Posted on2010-07-08 at 23:37:21
Awards
  • Community Pick
  • Experts Exchange Approved
I've encountered valid database schemas that do not have a primary key.  For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis.  However, occasionally due to user error or a scheduled task design flaw duplicates have been created.

All essential business data should always be protected with constraints such as a primary key or a unique index.


In my case, one database table with the IIS logs had over a 100,000,000 rows without duplicates!  The usual solutions -- to create a temporary or duplicate table, a while loop or a dreaded resource consuming cursor -- are not always practical.

So in SQL 2005 I've found a very useful function that allowed me to delete duplicates with one query!

First off you have to determine if there are duplicates.  For these examples I'm going to create the following table:
CREATE TABLE [dbo].[Duplicates]([DuplicatesNonNull] [varchar](50) NOT NULL) ON [PRIMARY]
                                    
1:

Select allOpen in new window



Next, I'll populate it with duplicates:
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
                                    
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:

Select allOpen in new window


Detecting duplicates is easy with the following query:
SELECT SUM(NbOfDuplicates) NbOfDuplicates FROM (SELECT COUNT(*)-1 NbOfDuplicates FROM dbo.Duplicates GROUP BY CHECKSUM(*)) T HAVING COUNT(*) > 1
                                    
1:

Select allOpen in new window


Notice that I remove one from the COUNT(*) to exclude the one valid row.  You actually don't need the HAVING COUNT(*) > 1 statement because if a row isn't duplicated COUNT(*) returns 1 which becomes 0.  However, it improves readability.

Also observe that I used CHECKSUM(*) which saves typing all the column names!  CHECKSUM() is under utilized by many developers.  It is a great way to generate a unique hash for one or more columns to use for analyzing data.

Revision: As pointed out in the article comments CHECKSUM() could create false positives (matches that shouldn't match) in very large data sets.  However, if the table contains essential data it should never have been allowed to be in this state!


Finally here is the solution to delete the duplicate rows.
DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb) FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T GROUP BY ChkSum
   )
                                    
1:
2:
3:
4:
5:
6:
7:
8:

Select allOpen in new window


I use the ROW_NUMBER() function added to SQL 2005 along with the OVER clause to order the sub query by my checksum and create row numbers for this sub query.

Because the sub queries have collapsed all the columns into a hash valued generated by CHECKSUM() it's far smaller than a duplicate or temporary table even if there are millions of rows.

The algorithm then simply creates what is essentially a temporary identity column for the rows, keeps the row with the maximum value and deletes the duplicates!  I could do it with the minimum value however, I don't know if there would be a benefit.

You can convert this strategy to use divide and conquer and use the TOP clause to work with reduce subsets of the table.

Notes


  • To be clear once again, in a well designed database all key business data should have constraints applied to protect the data from conditions like this.  For example a primary key or a unique index.

  • If this is essential data that wasn't protected with constraints I'd recommend using HASHBYTES() or manually typing in all the column names instead of using CHECKSUM() or BINARY_CHECKSUM()

  • In SQL 2000 you could simulate the same the technique by adding a temporary IDENTITY COLUMN then removing it.  However on a very large table it may timeout when adding the column.

  • If any columns are NTEXT, TEXT, IMAGE et cetera the CHECKSUM() will fail.

  • CHECKSUM() is case insensitive.  BINARY_CHECKSUM() is case sensitive


Cheers
Ted
Asked On
2010-07-08 at 23:37:21ID3380
Tags

SQL 2005

,

DUPLICATES

Topic

SQL Server 2005

Views
3812

Comments

Expert Comment

by: StiggySon on 2010-07-10 at 02:41:44ID: 16825

great article.  I had to do a similar exercise two years ago and finally settled on this technique.

Expert Comment

by: cyberkiwi on 2010-07-12 at 02:38:55ID: 16868

With statements like "CHECKSUM() is under utilized by many developers.  It is a great way to generate a unique hash for one or more columns to use for analyzing data.", you are asking to be flamed.


First, some references.  SQL Server 2000 Books Online, but it will serve.

http://msdn.microsoft.com/en-us/library/aa258245%28SQL.80%29.aspx

(1) The hash value is intended for use in building hash indices.

(2) If one of the values in the expression list changes, the checksum of the list also usually changes. However, there is a small chance that the checksum will not change.

Now, on to the article.

I don't know where you get your reference from, but you seem to imply that it is "under utilized" because it is not used in the specific way you have.  There is a reason for that, which is that your understanding of hashes is flawed.  I underlined the phrase you used, which is the underpinning of your strategy.

Hashes are not unique.  Hashes are meant to reduce data to some manageable distinct (but not  unique) groupings.  A common use of hashes is for passwords.  Given a hashing algorithm, source data can always produce the same hashed output which is all you need to retain without knowing or storing the password itself.  With this in mind, it is useful for building hash indices, but can never be primary keys (fails uniqueness test).

Point 2 from the article only reinforces the point that non-duplicate data can produce the same checksum, resulting in your deletion script removing more than it should.

I would also like to point out that were it to work, or if you substitute your "shortcut" for properly  naming all columns in the table, the deletion query should be as simple as the code box without another nested subquery performing all the same data collection again.
DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb > 1

                                        
1:
2:
3:
4:
5:

Select allOpen in new window

Author Comment

by: tedbilly on 2010-07-12 at 11:22:37ID: 16884

@cyberwiki: CHECKSUM() isn't used as often as it could be.  Sure functions can be overused, however, I still feel this useful function isn't used as often as it could be.

I've tried to create hash collisions with CHECKSUM() and haven't been able to in the cases in this situation where I use a very large key with multiple columns.  Microsoft engineers want to avoid hash collisions in their indexes so they've done a good job with the function to avoid the collisions.

Finally try your query using all the sample code I provided, the result of my query is
Duplicate2
Duplicate3
Unique1
Unique2
Unique3
Duplicate1
                                        
1:
2:
3:
4:
5:
6:

Select allOpen in new window



When I tried your query my result set is:
Unique1
                                        
1:

Select allOpen in new window



Your query is removing far too many rows.

Expert Comment

by: cyberkiwi on 2010-07-12 at 22:44:54ID: 16923

No wonder you are using group by.
I have always added the PARTITION (sibling of group by) clause into the query in one go.

DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(PARTITION BY CHECKSUM(*) ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb > 1

Anyway, just because _you_ haven't been able to generate a hash collision doesn't mean you get to assume that for people reading your article.  Especially if Microsoft themselves say so.  It goes without saying that since a checksum is only 4 bytes long, the maximum number of unique hashes is 4+ billion if the entire range of 32 bits is used.  If you had 5 billion rows, collisions are sure to happen.  If you had less, you just have to hit 2 that produce the same checksum.

Author Comment

by: tedbilly on 2010-07-14 at 21:56:46ID: 17024

Hmm, well a table with 4 billion 256 byte rows would be a 1 TB table (assuming it had no indexes) so I think that you've described an extreme case because I doubt that anyone that ever reads my article has a single table that large.

Even then, the technique is still valid if you split the columns into multiple CHECKSUM() functions.  IE: If you had six columns you could use CHECKSUM(Col1,Col2,Col3), CHECKSUM(Col4,Col5,Col6) which would generate a dataset that is smaller than the whole table.

Expert Comment

by: mark_wills on 2010-07-16 at 05:59:44ID: 17055

Interesting Article...

Can imagine checksum(*) causing some issues as cyberkiwi has suggested. I have seen it happen.

More importantly, tables that have Text, Image etc data types checksum(*) will fail.

More importantly, if checking (say) a log as you suggest, then can well imagine a date/time coming into play and causing havoc when trying to identify dupes (being a definition of repeating values in specific columns). In such a case checksum(*) will produce more distinct answers and so miss the dupes.

For that reason I do believe that CHECKSUM(Col1,Col2,Col3) is the more accurate way to go. But, if defining columns, then why use checksum() at all ? You say it is to avoid having to specify columns, and my personal opinion (not saying it is right or wrong) is to always specify column in these types of cases.

Expert Comment

by: cyberkiwi on 2010-07-16 at 06:07:11ID: 17056

ted,

You are still missing the crucial point.
I am pointing out the 4 billion possible values a 4-byte, 32-bit hash can hold.
You can only hash that much data uniquely before the next one will dup against one of the existing hashes, let's call this pair X and Y.
Now IGNORE the 4 billion records that never existed.  Say your table merely had X and Y.
The CHECKSUM(*) values duplicate each other.

Does that make sense ?

Author Comment

by: tedbilly on 2010-07-16 at 12:42:37ID: 17065

Hi guys

I'm sorry but I've written enough code to realize there isn't always a *perfect* or 'one size fits all' solution to every problem.  The public dialog in this thread is enough to help anyone make a decision about when to use this solution.  Your comments are acting as caveat's for extreme cases.

Personally the only time I create tables without constraints like a primary key or even a unique key is logging.  The only case of duplicate rows I've ever encountered was logging tables or non essential data.  In my web applications due to intermittent issues we accept that some logs are often lost so if this solution removes some extra logs due to false positives we tend not to worry.

If either of you choose to write an article about why this should never happen to business critical data because there should be constraints to prevent duplicates then by all means do so and I'll even add a link to it in this article.

@cyberwiki: Your last comment was rude and I am offended.  I have a real job with real data and am working successfully for a Fortune 500 company.

Have you ever worked with a table in excess of 1 billion rows and would you ever create mission critical table without constraints like a unique primary key?  I feel your comments have moved beyond being constructive or relevant.

I'm not going to assume that all the readers aren't smart enough to use this as a foundation for solving problems.  They are free to remove the CHECKSUM() and use the column names themselves.

Expert Comment

by: cyberkiwi on 2010-07-16 at 14:31:34ID: 17073

>> The algorithm then simply creates what is essentially a temporary identity column for the rows

I am sorry if it offends you, but I call it as i see it.  Let me preface by saying I entered this because you claimed "CHECKSUM() is under utilized by many developers" - that really offends me because you do not even fully understand it.

People reading your article will think that CHECKSUM really produces unique values, just as you yourself believe and propagate.

If you feel you need to, please bring in a moderator and delete my comments.

http://blogs.clarience.com/davide/?m=200709

>> Have you ever worked with a table in excess of 1 billion rows

Well, whadayaknow - yes! Thanks for asking.

Anyway, since you insist on reigniting the "extreme case", "billions of rows" argument, here is my table with 2 records on which 1 got deleted.
use tempdb;
if OBJECT_ID('tempdb..duplicates') > 0 drop table Duplicates
GO
create table Duplicates (A uniqueidentifier, B int, C varchar(25), D int);
GO
insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 1, 'hello world', 3;
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 1, 'hello world', 3;

SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates;

DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb) FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T GROUP BY ChkSum
   );

SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates;
                                        
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:

Select allOpen in new window

Expert Comment

by: cyberkiwi on 2010-07-16 at 14:37:42ID: 17074

These collide as well, when the values are seemingly completely different on all 4 fields.

insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 'avb', 'BCZ', 'ckz';
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 'AAE', 'BRQ', 'CAW';

Checksum BOL reference
http://msdn.microsoft.com/en-us/library/ms189788%28SQL.100%29.aspx

Author Comment

by: tedbilly on 2010-07-16 at 14:43:19ID: 17075

@cyberwiki: I don't mind constructive discussion,  portions of your comment was rude and yes I am going to report it to the moderators.

NOTE: In your examples you have a GUID which is unique.  In that case I wouldn't use CHECKSUM() which means I think you are missing my point.  I'm not saying the CHECKSUM() is a final solution and yes it can be abused (like anything can).

Check out my test of your example when I used them as strings (which is likely in a log) along with some other examples which is in the spirit of friendly discussion.
SELECT CHECKSUM('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT CHECKSUM('EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 'AAE', 'BRQ', 'CAW')

Results: 
1711681669
390310668

SELECT CHECKSUM(CAST('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' AS uniqueidentifier), 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT CHECKSUM(CAST('EFE7F2C5-19F9-42B9-9C16-21BED41E882B' AS uniqueidentifier), 'AAE', 'BRQ', 'CAW')

Results:
107127020
107127020

SELECT BINARY_CHECKSUM(CAST('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' AS uniqueidentifier), 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT BINARY_CHECKSUM(CAST('EFE7F2C5-19F9-42B9-9C16-21BED41E882B' AS uniqueidentifier), 'AAE', 'BRQ', 'CAW')

Results:
1025885929
1023845076

                                        
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:

Select allOpen in new window

Expert Comment

by: Racimo on 2010-10-14 at 07:24:20ID: 20509

@tedbilly

I believe that ciberwiki is simply trying to point out the limitations in using CHECKSUM based approach for dealing with duplicates tuples and pretty much most his comments are perfectly valid.  

Knowing these limitations does not say it is a bad solution, it simply completes it.  

No bad intent in that.

Regards...

Add your Comment

Please Sign up or Log in to comment on this article.

Join Experts Exchange Today

Gain Access to all our Tech Resources

Get personalized answers

Ask unlimited questions

Access Proven Solutions

Search 3.2 million solutions

Read In-Depth How-To Guides

1000+ articles, demos, & tips

Watch Step by Step Tutorials

Learn direct from top tech pros

And Much More!

Your complete tech resource

See Plans and Pricing

30-day free trial. Register in 60 seconds.

Loading Advertisement...

Top MS SQL Server 2005 Experts

  1. ScottPletcher

    195,617

    Guru

    8,500 points yesterday

    Profile
    Rank: Genius
  2. jogos

    176,191

    Guru

    668 points yesterday

    Profile
    Rank: Sage
  3. acperkins

    140,953

    Master

    1,000 points yesterday

    Profile
    Rank: Genius
  4. TempDBA

    113,707

    Master

    1,168 points yesterday

    Profile
    Rank: Sage
  5. matthewspatrick

    93,824

    Master

    1,600 points yesterday

    Profile
    Rank: Savant
  6. lcohan

    93,302

    Master

    2,000 points yesterday

    Profile
    Rank: Genius
  7. dtodd

    84,612

    Master

    0 points yesterday

    Profile
    Rank: Genius
  8. mwvisa1

    76,166

    Master

    0 points yesterday

    Profile
    Rank: Genius
  9. ValentinoV

    76,011

    Master

    1,800 points yesterday

    Profile
    Rank: Genius
  10. ralmada

    55,844

    Master

    400 points yesterday

    Profile
    Rank: Genius
  11. anujnb

    54,164

    Master

    0 points yesterday

    Profile
    Rank: Wizard
  12. angelIII

    53,846

    Master

    10 points yesterday

    Profile
    Rank: Elite
  13. EugeneZ

    53,602

    Master

    0 points yesterday

    Profile
    Rank: Genius
  14. HainKurt

    49,150

    0 points yesterday

    Profile
    Rank: Genius
  15. Buttercup1

    48,568

    0 points yesterday

    Profile
    Rank: Master
  16. huslayer

    40,600

    0 points yesterday

    Profile
    Rank: Sage
  17. appari

    39,400

    0 points yesterday

    Profile
    Rank: Genius
  18. tim_cs

    34,200

    0 points yesterday

    Profile
    Rank: Wizard
  19. wdosanjos

    33,836

    0 points yesterday

    Profile
    Rank: Genius
  20. dqmq

    31,136

    0 points yesterday

    Profile
    Rank: Genius
  21. Cluskitt

    30,940

    0 points yesterday

    Profile
    Rank: Wizard
  22. SJCFL-Admin

    30,877

    0 points yesterday

    Profile
    Rank: Master
  23. jimhorn

    29,975

    0 points yesterday

    Profile
    Rank: Genius
  24. Brichsoft

    28,107

    0 points yesterday

    Profile
    Rank: Sage
  25. momi_sabag

    27,903

    0 points yesterday

    Profile
    Rank: Genius

Hall Of Fame