Delete duplicate records in SQL database using stored procedure

Posted on 2007-12-04
Last Modified: 2008-09-11
I have a SQL server 2000 database that has over a million records in it.  I need to create a stored procedure that i can run that we check for which records have duplicates and then delete all of the duplicates except for one.  The criteria for the duplicates is based on five different columns.  I am hoping somebody can give me some guideance on how to do this.  
Question by:Mike_Stevens
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 17

Expert Comment

by:Daniel Reynolds
ID: 20405242
Can you provide the structure of the table? Is there a unique key?

Author Comment

ID: 20405276

Column: RecordNumber -  Unique Key
Column: IDNum
Column: Code
Column: Date
Column: Time
Column: State

All of the columns except RecordNumber need to be used in criteria.  Sample below

RecordNumber  IDNum  Code  Date         Time   State
------------------  ---------  ------  -------        ------   --------
100                   12345     A     12122007  1630    CA
101                   12345     A     12122007  1630    CA
102                   12345     A     12122007  1630    CA
103                   12345     A     12122007  1630    CA
104                   12345     A     12122007  1630    CA

Using the sample data above, procedure needs to delete all but one of the records.


Expert Comment

ID: 20405356
I would create a new table with the unique index on those fields, export the data from your old table into this new one (which will only take one of each duplicate because of the index), then clear out your old table and move the records from your new table back to the old table.
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

LVL 17

Accepted Solution

Daniel Reynolds earned 500 total points
ID: 20405880
Here is some code you can run iteratively until the rows affected = 0

DELETE  dbo.tblStatus
--SELECT top 1 * 
FROM dbo.tblStatus
WHERE Exists (Select Count(*), IDNum, Code, [Date], [Time], [State]
                              from dbo.tblStatus
                             Group By IDNum, Code, [Date], [Time], [State]
                             Having Count(*) > 1)

Open in new window

LVL 17

Expert Comment

by:Daniel Reynolds
ID: 20405903
And here is a link with additional information.
and one that shows how to do it with a little bit more programming, but all at once.

LVL 11

Expert Comment

ID: 20408127
Code provided by xDJR1875: will delete all record which have duplicate data but Mike_Stevens intended to leave a row for the duplicate data.

You can try as below, it will only left the max record number for the duplicates data.

DELETE  FROM tblStatus
        SELECT MaxRecordNumber, IDNum, Code, Date, Time, State
              SELECT Max(RecordNumber) AS MaxRecordNumber, IDNum, Code, Date, Time, State
              FROM  dbo.tblStatus
              GROUP BY IDNum, Code, Date, Time, State
          ) A
         WHERE A.MaxRecordNumber = tblStatus.RecordNumber AND
                     A.IDNum = tblStatus.IDNum AND A.Code = tblStatus.Code AND
                     A.Date = tblStatus.Date AND A.Time = tblStatus.Time AND
                     A.State = tblStatus.State
LVL 17

Expert Comment

by:Daniel Reynolds
ID: 20408152
Actually, with the SET ROWCOUNT = 1
The code does not delete all duplicate records.
LVL 11

Expert Comment

ID: 20408626
Sorry, my mistake, but for the example of data, the duplicate rows can be more than 2, set rowcount =1  will only delete one of the records which having duplicate.
LVL 17

Expert Comment

by:Daniel Reynolds
ID: 20409336
You are correct that is why I suggest Mike run it iteratively until rows affected = 0.

It is not the most elegant of solutions, but maybe one of  the simplest.

Expert Comment

ID: 21502162
    FROM    MyTable
        FROM        MyTable
        GROUP BY    DuplicatevalueColumn1, DuplicateValueColumn2,

Reference:Pinal Dave

Featured Post

How our DevOps Teams Maximize Uptime

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us. Read the use case whitepaper.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Removing SCCM 2016 4 53
T-SQL: Wrong Result 7 39
Suppress if value zero or NULL in crystal report 2 44
Section based report in SSRS 14 35
Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
Using examples as well as descriptions, and references to Books Online, show the documentation available for date manipulation functions and by using a select few of these functions, show how date based data can be manipulated with these functions.
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.

752 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question