deleting duplicate records in a database without a primary key

It's always a big challenge to find and delete duplicate record from a database table which does not have a primary or unique key in the table. It becomes a more critical situation when the total number of records increases.

So, this article describes some methods on how to find the duplicate records and or delete them. Choosing which method is up to you and depends on things like table size, data types, even SQL Server version and for very large tables, the amount of free disk space.

Our first challenge is to decide which columns define "a duplicate". It could be every column, but things like a datetime "date_added" might be the only field which is different and should be excluded from checking. Then, in a table like a "product barcodes" then it might be just the barcode that needs to be checked. So, you do have to first consider what columns you must use to check for duplicates.

Example 1 - The simple structure

In this example, say you have a table named "EE_table" with the following fields col1,col2,col3,col4,colN. There is nothing special about these columns and are all needed to test for a duplicate.

First let us check if there are any duplicates by running the queries as below.

-- 1st a straight row count
                      SELECT   count(*) 
                      FROM     EE_Table
                      
                      -- 2nd a select using our definition for a duplicate showing just the duplicates
                      SELECT   Col1,Col2,col3,col4,colN 
                      FROM     EE_Table 
                      GROUP BY col1,Col2,col3,col4,colN 
                      HAVING   count(*) > 1
                      
                      -- 3rd which means the rows that we really want to keep are (or use the group by above)
                      SELECT   DISTINCT col1,col2,col3,col4,colN 
                      FROM     EE_Table

Open in new window

Then check how many records are returned by the 2nd Query Result. If there were none, then might need to check the columns being used in the query and adjust. The reason for the first query is to get a total count, and the row count from the 3rd query is what we will be left with.

Always double check, and only when satisfied, you can move on to the next step.

First use the below command to backup the original table data into a second table. It is always a good practice to take backup of original table data into a backup table. If you are performing any bulk operation, if there is any mistake in the live table you have always the backup data to recover from. It can depend on disk space, so be a bit careful, and you might need to save your last SQL backup instead.

-- Take a copy using "select into" the table you do select into must not already exist
                      SELECT * INTO EE_Table_Back FROM EE_Table
                      
                      -- now empty out the original table - now truncate might fail if there are constraints, but our sample is simple and this works
                      TRUNCATE TABLE EE_Table
                      
                      -- then Insert records back with duplicate records filtered out.
                      INSERT EE_Table (col1,col2,col3,col4, colN)
                      SELECT DISTINCT col1,col2,col3,col4,colN FROM EE_Table_back

Open in new window

Example 2 - Selective Columns

In this example, with the same "EE_table" as above, we now find that Col2 is "date_added" a datetime that needs to be excluded from our test. This adds a level of complexity because now we cannot simply copy all the fields, we now have to start identifying individual rows by excluding some columns.

Queries 1 and 2 from the first example are still OK, just use the appropriate columns, this time excluding Col2. The 3rd query is the more difficult one. The rows we want to keep have to include all columns.

-- 1st a straight row count
                      SELECT   count(*) 
                      FROM     EE_Table
                      
                      -- 2nd a select using our definition for a duplicate showing just the duplicates
                      SELECT   Col1,col3,col4,colN 
                      FROM     EE_Table 
                      GROUP BY col1,col3,col4,colN 
                      HAVING   count(*) > 1
                      
                      -- 3rd query.  SQL 2005 and up, we can use the row_number() function and a CTE query
                      ;WITH Dupes AS
                      (Select row_number() over (partition by col1,col3,col4,colN order by col2) as rn, * from EE_Table)
                      SELECT * FROM dupes WHERE rn = 1

Open in new window

In the above 3rd Query, if we said rn > 1 then it shows the duplicated rows and can be used instead of query 2 above

So, as long as we can easily "select the ones we want" then we can use the same approach in example one. Take a backup, Truncate, and populate the ones we want from the backup. Code sample below shows the new insert statement using the row_number() function. That would be run after the backup and truncate of course.

;WITH Dupes as
                      (Select row_number() over (partition by col1,col3,col4,colN order by col2) as rn, * from EE_Table_Back)
                      INSERT EE_Table (Col1, Col2, Col3, Col4, ColN)
                      SELECT Col1, Col2, Col3, Col4, ColN FROM dupes WHERE rn = 1      
                      
                      -- or, since we have a back up any way, we can use the CTE to directly edit the table
                      
                      ;WITH Dupes AS
                      (Select row_number() over (partition by col1,col3,col4,colN order by col2) as rn, * from EE_Table)
                      DELETE FROM dupes WHERE rn > 1    
                      
                      -- if you replace the "delete" with "select * " you get a list of the duplicates as mentioned above.

Open in new window

Example 3 - Using Identities

In this example, with the same "EE_table" as above, with Col2 to be excluded from our test. We can actually "fix in place" by adding some details to our table. Sounds funny, but adding more can sometimes mean less work. This method will also help if you cannot use the row_number() function in example 2.

What we are going to do is add an identity column to our table and then use that to help uniquely identify rows we do not want. We can then drop the column afterwards.

-- add a new identity column to our table - might need to be BIGINT datatype for very large tables, and always take a backup before hand
                      ALTER TABLE EE_Table ADD dupe_id int identity
                      
                      -- now we can select just the duplicates
                      SELECT * FROM EE_Table
                      WHERE dupe_id <> (select min(dupe_id) 
                                        from ee_table d 
                                        where d.col1 = EE_table.col1 
                                        and d.col3 = EE_Table.col3
                                        and d.col4 = EE_Table.col4
                                        and d.colN = EE_Table.colN)
                      
                      -- once we are happy with the select, all we need to do is change into a delete
                      DELETE EE_Table
                      WHERE dupe_id <> (select min(dupe_id) 
                                        from ee_table d 
                                        where d.col1 = EE_table.col1 
                                        and d.col3 = EE_Table.col3
                                        and d.col4 = EE_Table.col4
                                        and d.colN = EE_Table.colN)
                      
                      -- and now remove that column
                      Alter table EE_Table drop column dupe_id

Open in new window

Instead of dropping the column you might consider leaving it there. After all, unless you stop the duplicates from happening in the first place, they will happen again. You might even consider turning it into a surrogate primary key (but that is a different article).

Using an identity can also help fix fragmentation in a large heap table as well. But that is also another article.

By keeping it fairly simple, and always being as explicit as possible by using the correct columns you will be in charge of your cleanup processes. You can use the above as a basis for more advanced methods if needed (like generating row_number() to compare) and/or controlling the number of rows (using rowcounts or top() as an example). If you have very very large tables and need to clean up batches at a time instead of trying to do everything at once, then you will need to use some alternate methods (e.g. cannot use the truncate when using a batch type approach, you have to do it "in situ").

Friends, I have had the same problem in the past where my database had a master table of members and child table of Member Tree , and the member tree table had about 3-4 crore records (that is 3 or 4 tens of million). I have used the above simple techniques (example 1 in this case) and it worked perfectly.

I have used these tricks 4or 5 times now with the perfect results. I am sure they will also work for you. Always take a backup of some description before you start, and always check what you are deleting by displaying samples and testing first.

Thanks
Ravindra

Comments (2)

Mark Wills

Topic Advisor

CERTIFIED EXPERT

Distinguished Expert 2018

Commented: 2010-07-16

Good article, thanks Ravindra.

Deleting duplicates can be hard work, and I think you have covered off the more popular approaches pretty well.

I also like the importance of taking a backup first :)

rdivilbiss

Top Expert 2005

Commented: 2010-07-18

Dups have always caused me grief. This will be bookmarked as a ready reference.

Thanks,
Rod