CREATE TABLE [dbo].[Duplicates]([DuplicatesNonNull] [varchar](50) NOT NULL) ON [PRIMARY]
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate2')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Unique3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate1')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
INSERT INTO Duplicates (DuplicatesNonNull) VALUES ('Duplicate3')
Detecting duplicates is easy with the following query:
SELECT SUM(NbOfDuplicates) NbOfDuplicates
FROM
(
    SELECT COUNT(*) - 1 NbOfDuplicates
    FROM dbo.Duplicates
    GROUP BY CHECKSUM(*)
    HAVING COUNT(*) > 1
) T
Notice that I subtract one from the COUNT(*) to exclude the one row we want to keep. You don't actually need the HAVING COUNT(*) > 1 clause, because a row that isn't duplicated has a COUNT(*) of 1, which becomes 0 and contributes nothing to the sum. However, it improves readability.
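If you'd rather compare the column values directly instead of their checksums, the same count can be computed by grouping on the columns themselves. Here is a sketch against the sample table above:

```sql
-- Count extra copies per distinct value, comparing the column
-- directly rather than via CHECKSUM(*).
SELECT SUM(NbOfDuplicates) NbOfDuplicates
FROM
(
    SELECT COUNT(*) - 1 NbOfDuplicates
    FROM dbo.Duplicates
    GROUP BY DuplicatesNonNull
    HAVING COUNT(*) > 1
) T
```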
DELETE T
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
    FROM dbo.Duplicates
) T
WHERE T.RowNb NOT IN
(
    SELECT MAX(RowNb)
    FROM
    (
        SELECT ROW_NUMBER() OVER (ORDER BY CHECKSUM(*)) RowNb, CHECKSUM(*) ChkSum
        FROM dbo.Duplicates
    ) T
    GROUP BY ChkSum
)
I use the ROW_NUMBER() function, added in SQL Server 2005, along with the OVER clause to order the subquery by my checksum and assign a row number to each row of the subquery.
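The same delete can also be written more compactly with a CTE and PARTITION BY, numbering the rows within each checksum group and deleting everything after the first. This is a sketch of an equivalent approach, not the article's original query; the ORDER BY (SELECT NULL) makes it arbitrary which copy survives:

```sql
-- Number rows within each CHECKSUM(*) group; delete every row
-- after the first. Which copy survives is arbitrary.
;WITH Numbered AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY CHECKSUM(*)
                              ORDER BY (SELECT NULL)) RowNb
    FROM dbo.Duplicates
)
DELETE FROM Numbered
WHERE RowNb > 1;
```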
Comments (12)
Author
Commented: I'm sorry, but I've written enough code to realize there isn't always a *perfect* or 'one size fits all' solution to every problem. The public dialog in this thread is enough to help anyone decide when to use this solution. Your comments are acting as caveats for extreme cases.
Personally, the only time I create tables without constraints like a primary key or even a unique key is for logging. The only cases of duplicate rows I've ever encountered were in logging tables or non-essential data. In my web applications, due to intermittent issues, we accept that some logs are lost, so if this solution removes some extra logs due to false positives we tend not to worry.
If either of you choose to write an article about why this should never happen to business critical data because there should be constraints to prevent duplicates then by all means do so and I'll even add a link to it in this article.
@cyberwiki: Your last comment was rude and I am offended. I have a real job with real data and am working successfully for a Fortune 500 company.
Have you ever worked with a table in excess of 1 billion rows, and would you ever create a mission-critical table without constraints like a unique primary key? I feel your comments have moved beyond being constructive or relevant.
I'm not going to assume that all the readers aren't smart enough to use this as a foundation for solving problems. They are free to remove the CHECKSUM() and use the column names themselves.
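For instance, swapping CHECKSUM(*) for the actual column in the article's delete gives an exact-comparison version. This is a sketch against the sample table from the article:

```sql
-- The article's delete with CHECKSUM(*) replaced by the column
-- itself, so rows are compared exactly rather than by hash.
DELETE T
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY DuplicatesNonNull) RowNb,
           DuplicatesNonNull
    FROM dbo.Duplicates
) T
WHERE T.RowNb NOT IN
(
    SELECT MAX(RowNb)
    FROM
    (
        SELECT ROW_NUMBER() OVER (ORDER BY DuplicatesNonNull) RowNb,
               DuplicatesNonNull
        FROM dbo.Duplicates
    ) T2
    GROUP BY DuplicatesNonNull
)
```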
Commented:
I am sorry if it offends you, but I call it as I see it. Let me preface by saying I entered this discussion because you claimed "CHECKSUM() is under utilized by many developers" - that really offends me, because you do not even fully understand it.
People reading your article will think that CHECKSUM really produces unique values, just as you yourself believe and propagate.
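As a concrete illustration of the kind of collision being described here (an editorial sketch, not part of the original comment): CHECKSUM() on character data is collation-dependent, so under a case-insensitive collation two strings differing only in case produce the same checksum and a checksum-based dedup would treat them as duplicates of each other:

```sql
-- Under a case-insensitive collation (the SQL Server default),
-- these two distinct strings produce identical checksums.
SELECT CHECKSUM('McDonald') ChkA, CHECKSUM('MCDONALD') ChkB;
-- ChkA = ChkB whenever the collation treats the strings as equal.
```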
If you feel you need to, please bring in a moderator and delete my comments.
http://blogs.clarience.com/davide/?m=200709
>> Have you ever worked with a table in excess of 1 billion rows
Well, whadayaknow - yes! Thanks for asking.
Anyway, since you insist on reigniting the "extreme case", "billions of rows" argument, here is my table with 2 records on which 1 got deleted.
Commented:
insert Duplicates select '3DB7D309-A8F4-47C4-BA90-0
insert Duplicates select 'EFE7F2C5-19F9-42B9-9C16-2
Checksum BOL reference
http://msdn.microsoft.com/en-us/library/ms189788%28SQL.100%29.aspx
Author
Commented: NOTE: In your examples you have a GUID, which is unique. In that case I wouldn't use CHECKSUM(), which means I think you are missing my point. I'm not saying CHECKSUM() is a final solution, and yes, it can be abused (like anything can).
Check out my test of your example, where I used the values as strings (which is likely in a log), along with some other examples, all in the spirit of friendly discussion.
Commented:
I believe that cyberwiki is simply trying to point out the limitations of using a CHECKSUM-based approach for dealing with duplicate tuples, and most of his comments are perfectly valid.
Knowing these limitations does not make it a bad solution; it simply completes it.
No bad intent in that.
Regards...