Delete Duplicate Rows Without a Primary Key in Microsoft SQL 2005

By Ted Bouskill
Almost 30 years of diverse professional software development experience.
I've encountered valid database schemas that do not have a primary key. For example, I use LogParser from Microsoft to push IIS logs into a SQL database table for processing and analysis. However, occasionally, due to user error or a scheduled-task design flaw, duplicates have been created.

All essential business data should always be protected with constraints such as a primary key or a unique index.

In my case, one database table with the IIS logs had over 100,000,000 rows, not counting the duplicates! The usual solutions -- creating a temporary or duplicate table, a while loop, or a dreaded resource-consuming cursor -- are not always practical.

So, in SQL 2005, I found a very useful function that allowed me to delete duplicates with one query!

First off, you have to determine whether there are duplicates. For these examples I'm going to create the following table:
CREATE TABLE [dbo].[Duplicates]([DuplicatesNonNull] [varchar](50) NOT NULL) ON [PRIMARY]

Next, I'll populate it with duplicates:
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')

Detecting duplicates is easy with the following query:
SELECT SUM(NbOfDuplicates) NbOfDuplicates
FROM
   (SELECT COUNT(*) - 1 NbOfDuplicates FROM dbo.Duplicates GROUP BY CHECKSUM(*) HAVING COUNT(*) > 1) T

Notice that I subtract one from COUNT(*) to exclude the one valid row in each group. You don't actually need the HAVING COUNT(*) > 1 clause, because if a row isn't duplicated COUNT(*) returns 1, which becomes 0 and adds nothing to the sum; however, it improves readability.

Also observe that I used CHECKSUM(*) which saves typing all the column names!  CHECKSUM() is under utilized by many developers.  It is a great way to generate a unique hash for one or more columns to use for analyzing data.
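
For comparison, here is the same duplicate check written against the explicit column of the sample table. This is only a sketch -- on a real table you would list every column -- but it also shows which values are duplicated and how many extra copies each one has:
-- Detect duplicates by grouping on the column(s) directly instead of CHECKSUM(*)
SELECT DuplicatesNonNull, COUNT(*) - 1 NbOfDuplicates
FROM dbo.Duplicates
GROUP BY DuplicatesNonNull
HAVING COUNT(*) > 1

Against the sample rows above, this lists Duplicate1, Duplicate2 and Duplicate3 with 2, 4 and 2 extra rows respectively -- the same 8 extra rows the CHECKSUM(*) query sums up.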

Revision: As pointed out in the article comments, CHECKSUM() can produce false positives (rows that match when they shouldn't) in very large data sets. However, if the table contains essential business data, it should never have been allowed to get into this state in the first place!

Finally, here is the solution to delete the duplicate rows:
DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb) FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T GROUP BY ChkSum
   )

I use the ROW_NUMBER() function, added in SQL 2005, along with the OVER clause to order the subquery by the checksum and generate row numbers for it.

Because the subqueries collapse all the columns into a hash value generated by CHECKSUM(), the working set is far smaller than a duplicate or temporary table, even if there are millions of rows.

The algorithm then creates what is essentially a temporary identity column for the rows, keeps the row with the maximum value in each checksum group, and deletes the duplicates! I could just as easily keep the minimum value; I don't know whether there would be any benefit either way.
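
As pointed out in the article comments, the same delete can also be written more compactly by partitioning the row numbers by the checksum, so the NOT IN subquery isn't needed. A minimal sketch against the sample table:
-- Number the rows within each checksum group and delete everything after the first
DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(PARTITION BY CHECKSUM(*) ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb > 1

Every row after the first in each checksum group gets RowNb greater than 1 and is deleted, which is exactly what the MAX(RowNb) filter above achieves.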

You can also take a divide-and-conquer approach and use the TOP clause to work on reduced subsets of the table.
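
For example, here is a minimal sketch of that batched approach, reusing the PARTITION BY form above; the 10,000-row batch size is arbitrary, and the loop simply repeats until a pass removes nothing:
-- Trim the duplicates in small batches so a very large table is cleaned up
-- in many short transactions instead of one huge one.
WHILE 1 = 1
BEGIN
   ;WITH T AS
   (
      SELECT ROW_NUMBER() OVER(PARTITION BY CHECKSUM(*) ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
      FROM dbo.Duplicates
   )
   DELETE TOP (10000) FROM T WHERE RowNb > 1

   IF @@ROWCOUNT = 0 BREAK
END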

Notes

• To be clear once again: in a well-designed database, all key business data should have constraints applied, such as a primary key or a unique index, to protect the data from conditions like this.
• If this is essential data that wasn't protected with constraints, I'd recommend using HASHBYTES() or typing in all the column names instead of using CHECKSUM() or BINARY_CHECKSUM() -- see the sketch after this list.
• In SQL 2000 you could simulate the same technique by adding a temporary IDENTITY column and then removing it. However, on a very large table it may time out when adding the column.
• If any columns are NTEXT, TEXT, IMAGE, et cetera, CHECKSUM(*) will fail.
• CHECKSUM() is case insensitive; BINARY_CHECKSUM() is case sensitive.
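
Here is a minimal sketch of that HASHBYTES() alternative against the single-column sample table. For a multi-column table you would have to concatenate the columns (with delimiters) into one value yourself, and in SQL 2005 the input to HASHBYTES() is limited to 8,000 bytes:
-- Detect duplicates with a SHA-1 hash instead of CHECKSUM() to reduce the
-- chance of false positives.
SELECT HASHBYTES('SHA1', DuplicatesNonNull) RowHash, COUNT(*) - 1 NbOfDuplicates
FROM dbo.Duplicates
GROUP BY HASHBYTES('SHA1', DuplicatesNonNull)
HAVING COUNT(*) > 1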

Cheers
Ted
12 Comments
 

Expert Comment by StiggySon:
Great article. I had to do a similar exercise two years ago and finally settled on this technique.
 
Expert Comment by cyberkiwi (LVL 58):
With statements like "CHECKSUM() is under utilized by many developers.  It is a great way to generate a unique hash for one or more columns to use for analyzing data.", you are asking to be flamed.


First, some references.  SQL Server 2000 Books Online, but it will serve.

http://msdn.microsoft.com/en-us/library/aa258245%28SQL.80%29.aspx

(1) The hash value is intended for use in building hash indices.

(2) If one of the values in the expression list changes, the checksum of the list also usually changes. However, there is a small chance that the checksum will not change.

Now, on to the article.

I don't know where you get your reference from, but you seem to imply that it is "under utilized" because it is not used in the specific way you have.  There is a reason for that, which is that your understanding of hashes is flawed.  I underlined the phrase you used, which is the underpinning of your strategy.

Hashes are not unique.  Hashes are meant to reduce data to some manageable distinct (but not  unique) groupings.  A common use of hashes is for passwords.  Given a hashing algorithm, source data can always produce the same hashed output which is all you need to retain without knowing or storing the password itself.  With this in mind, it is useful for building hash indices, but can never be primary keys (fails uniqueness test).

Point 2 from the article only reinforces the point that non-duplicate data can produce the same checksum, resulting in your deletion script removing more than it should.

I would also like to point out that were it to work, or if you substitute your "shortcut" for properly  naming all columns in the table, the deletion query should be as simple as the code box without another nested subquery performing all the same data collection again.

DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb > 1

 
Author Comment by Ted Bouskill (LVL 51):
@cyberkiwi: Sure, functions can be overused; however, I still feel CHECKSUM() isn't used as often as it could be.

I've tried to create hash collisions with CHECKSUM() and haven't been able to in situations like this one, where I use a very large key spanning multiple columns. Microsoft's engineers want to avoid hash collisions in their indexes, so they've done a good job of minimizing collisions in the function.

Finally, try your query against all the sample code I provided. The result of my query is:
Duplicate2
Duplicate3
Unique1
Unique2
Unique3
Duplicate1

When I tried your query, my result set is:
Unique1

Your query is removing far too many rows.
Expert Comment by cyberkiwi (LVL 58):
No wonder you are using GROUP BY.
I have always added the PARTITION BY clause (a sibling of GROUP BY) into the query in one go.

DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(PARTITION BY CHECKSUM(*) ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb > 1

Anyway, just because _you_ haven't been able to generate a hash collision doesn't mean you get to assume that for people reading your article, especially when Microsoft themselves say so. It goes without saying that since a checksum is only 4 bytes long, the maximum number of unique hashes is 4+ billion if the entire range of 32 bits is used. If you had 5 billion rows, collisions are sure to happen. If you had fewer, you just need to hit 2 rows that produce the same checksum.
 
Author Comment by Ted Bouskill (LVL 51):
Hmm, well, a table with 4 billion 256-byte rows would be a 1 TB table (assuming it had no indexes), so I think you've described an extreme case; I doubt anyone who reads my article has a single table that large.

Even then, the technique is still valid if you split the columns across multiple CHECKSUM() calls. I.e., if you had six columns you could use CHECKSUM(Col1,Col2,Col3), CHECKSUM(Col4,Col5,Col6), which would still generate a dataset smaller than the whole table.
 
Expert Comment by Mark Wills (LVL 51):
Interesting Article...

I can imagine checksum(*) causing some issues, as cyberkiwi has suggested. I have seen it happen.

More importantly, checksum(*) will fail on tables that have Text, Image, etc. data types.

Also, if checking (say) a log as you suggest, I can well imagine a date/time column coming into play and causing havoc when trying to identify dupes (a dupe being repeating values in specific columns). In such a case checksum(*) will produce more distinct answers and so miss the dupes.

For that reason I do believe that CHECKSUM(Col1,Col2,Col3) is the more accurate way to go. But if you are defining columns anyway, why use checksum() at all? You say it is to avoid having to specify columns, and my personal opinion (not saying it is right or wrong) is to always specify the columns in these types of cases.

 
Expert Comment by cyberkiwi (LVL 58):
Ted,

You are still missing the crucial point.
I am pointing out the 4 billion possible values a 4-byte, 32-bit hash can hold.
You can only hash that much data uniquely before the next value will dup against one of the existing hashes; let's call this pair X and Y.
Now IGNORE the 4 billion records that never existed. Say your table merely had X and Y.
The CHECKSUM(*) values duplicate each other.

Does that make sense?

 
Author Comment by Ted Bouskill (LVL 51):
Hi guys

I'm sorry, but I've written enough code to realize there isn't always a *perfect* or 'one size fits all' solution to every problem. The public dialog in this thread is enough to help anyone decide when to use this solution. Your comments act as caveats for extreme cases.

Personally, the only time I create tables without constraints like a primary key or even a unique key is for logging. The only cases of duplicate rows I've ever encountered were in logging tables or non-essential data. In my web applications, due to intermittent issues, we accept that some logs are lost, so if this solution removes a few extra logs due to false positives we tend not to worry.

If either of you chooses to write an article about why this should never happen to business-critical data, because there should be constraints to prevent duplicates, then by all means do so and I'll even add a link to it in this article.

@cyberkiwi: Your last comment was rude and I am offended. I have a real job with real data and work successfully for a Fortune 500 company.

Have you ever worked with a table in excess of 1 billion rows, and would you ever create a mission-critical table without constraints like a unique primary key? I feel your comments have moved beyond being constructive or relevant.

I'm not going to assume that all the readers aren't smart enough to use this as a foundation for solving problems. They are free to remove the CHECKSUM() and use the column names themselves.
 
Expert Comment by cyberkiwi (LVL 58):
>> The algorithm then simply creates what is essentially a temporary identity column for the rows

I am sorry if it offends you, but I call it as I see it. Let me preface by saying I entered this discussion because you claimed "CHECKSUM() is under utilized by many developers" - that really offends me because you do not even fully understand it.

People reading your article will think that CHECKSUM really produces unique values, just as you yourself believe and propagate.

If you feel you need to, please bring in a moderator and delete my comments.

http://blogs.clarience.com/davide/?m=200709

>> Have you ever worked with a table in excess of 1 billion rows

Well, whadayaknow - yes! Thanks for asking.

Anyway, since you insist on reigniting the "extreme case", "billions of rows" argument, here is my table with 2 records on which 1 got deleted.
use tempdb;
if OBJECT_ID('tempdb..duplicates') > 0 drop table Duplicates
GO
create table Duplicates (A uniqueidentifier, B int, C varchar(25), D int);
GO
insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 1, 'hello world', 3;
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 1, 'hello world', 3;

SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates;

DELETE T
FROM
   (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T
WHERE
   T.RowNb NOT IN
   (
      SELECT MAX(RowNb) FROM (SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates) T GROUP BY ChkSum
   );

SELECT ROW_NUMBER() OVER(ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum FROM dbo.Duplicates;

 
Expert Comment by cyberkiwi (LVL 58):
These collide as well, when the values are seemingly completely different on all 4 fields.

insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 'avb', 'BCZ', 'ckz';
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 'AAE', 'BRQ', 'CAW';

Checksum BOL reference
http://msdn.microsoft.com/en-us/library/ms189788%28SQL.100%29.aspx
 
Author Comment by Ted Bouskill (LVL 51):
@cyberkiwi: I don't mind constructive discussion, but portions of your comment were rude, and yes, I am going to report it to the moderators.

NOTE: In your examples you have a GUID, which is unique. In that case I wouldn't use CHECKSUM(), which means I think you are missing my point. I'm not saying CHECKSUM() is a final solution, and yes, it can be abused (like anything can).

Check out my test of your example where I used the values as strings (which is likely in a log), along with some other examples, in the spirit of friendly discussion:
SELECT CHECKSUM('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT CHECKSUM('EFE7F2C5-19F9-42B9-9C16-21BED41E882B', 'AAE', 'BRQ', 'CAW')

Results: 
1711681669
390310668

SELECT CHECKSUM(CAST('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' AS uniqueidentifier), 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT CHECKSUM(CAST('EFE7F2C5-19F9-42B9-9C16-21BED41E882B' AS uniqueidentifier), 'AAE', 'BRQ', 'CAW')

Results:
107127020
107127020

SELECT BINARY_CHECKSUM(CAST('3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' AS uniqueidentifier), 'avb', 'BCZ', 'ckz')
UNION ALL
SELECT BINARY_CHECKSUM(CAST('EFE7F2C5-19F9-42B9-9C16-21BED41E882B' AS uniqueidentifier), 'AAE', 'BRQ', 'CAW')

Results:
1025885929
1023845076

 
Expert Comment by Racim BOUDJAKDJI (LVL 23):
@tedbilly

I believe that cyberkiwi is simply trying to point out the limitations of using a CHECKSUM-based approach for dealing with duplicate tuples, and most of his comments are perfectly valid.

Knowing these limitations does not mean it is a bad solution; it simply completes it.

No bad intent in that.

Regards...
