Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1735
  • Last Modified:

Delete duplicate records in SQL database using stored procedure

I have a SQL server 2000 database that has over a million records in it.  I need to create a stored procedure that i can run that we check for which records have duplicates and then delete all of the duplicates except for one.  The criteria for the duplicates is based on five different columns.  I am hoping somebody can give me some guideance on how to do this.  
0
Mike_Stevens
Asked:
Mike_Stevens
1 Solution
 
Daniel ReynoldsSoftware Applications Developer / IntegratorCommented:
Can you provide the structure of the table? Is there a unique key?
0
 
Mike_StevensAuthor Commented:

tblStatus
Column: RecordNumber -  Unique Key
Column: IDNum
Column: Code
Column: Date
Column: Time
Column: State

All of the columns except RecordNumber need to be used in criteria.  Sample below

RecordNumber  IDNum  Code  Date         Time   State
------------------  ---------  ------  -------        ------   --------
100                   12345     A     12122007  1630    CA
101                   12345     A     12122007  1630    CA
102                   12345     A     12122007  1630    CA
103                   12345     A     12122007  1630    CA
104                   12345     A     12122007  1630    CA

Using the sample data above, procedure needs to delete all but one of the records.

0
 
randomjamesCommented:
I would create a new table with the unique index on those fields, export the data from your old table into this new one (which will only take one of each duplicate because of the index), then clear out your old table and move the records from your new table back to the old table.
0
Learn Veeam advantages over legacy backup

Every day, more and more legacy backup customers switch to Veeam. Technologies designed for the client-server era cannot restore any IT service running in the hybrid cloud within seconds. Learn top Veeam advantages over legacy backup and get Veeam for the price of your renewal

 
Daniel ReynoldsSoftware Applications Developer / IntegratorCommented:
Here is some code you can run iteratively until the rows affected = 0

SET ROWCOUNT 1
DELETE  dbo.tblStatus
--SELECT top 1 * 
FROM dbo.tblStatus
WHERE Exists (Select Count(*), IDNum, Code, [Date], [Time], [State]
                              from dbo.tblStatus
                             Group By IDNum, Code, [Date], [Time], [State]
                             Having Count(*) > 1)

Open in new window

0
 
Daniel ReynoldsSoftware Applications Developer / IntegratorCommented:
And here is a link with additional information.
http://support.microsoft.com/kb/139444
and one that shows how to do it with a little bit more programming, but all at once.
http://www.sql-server-performance.com/articles/dev/dv_delete_duplicates_p1.aspx

0
 
yuchingCommented:
Code provided by xDJR1875: will delete all record which have duplicate data but Mike_Stevens intended to leave a row for the duplicate data.

You can try as below, it will only left the max record number for the duplicates data.

DELETE  FROM tblStatus
WHERE  NOT EXISTS (
        SELECT MaxRecordNumber, IDNum, Code, Date, Time, State
        FROM
        (
              SELECT Max(RecordNumber) AS MaxRecordNumber, IDNum, Code, Date, Time, State
              FROM  dbo.tblStatus
              GROUP BY IDNum, Code, Date, Time, State
          ) A
         WHERE A.MaxRecordNumber = tblStatus.RecordNumber AND
                     A.IDNum = tblStatus.IDNum AND A.Code = tblStatus.Code AND
                     A.Date = tblStatus.Date AND A.Time = tblStatus.Time AND
                     A.State = tblStatus.State
)
0
 
Daniel ReynoldsSoftware Applications Developer / IntegratorCommented:
Actually, with the SET ROWCOUNT = 1
The code does not delete all duplicate records.
0
 
yuchingCommented:
Sorry, my mistake, but for the example of data, the duplicate rows can be more than 2, set rowcount =1  will only delete one of the records which having duplicate.
0
 
Daniel ReynoldsSoftware Applications Developer / IntegratorCommented:
You are correct that is why I suggest Mike run it iteratively until rows affected = 0.

It is not the most elegant of solutions, but maybe one of  the simplest.
0
 
yordan_georgievCommented:
DELETE
    FROM    MyTable
    WHERE    ID NOT IN
    (
    SELECT    MAX(ID)
        FROM        MyTable
        GROUP BY    DuplicatevalueColumn1, DuplicateValueColumn2,
                DuplicateValueColumn2)

Reference:Pinal Dave
http://blog.sqlauthority.com/2007/03/01/sql-server-delete-duplicate-records-rows/
0

Featured Post

Veeam and MySQL: How to Perform Backup & Recovery

MySQL and the MariaDB variant are among the most used databases in Linux environments, and many critical applications support their data on them. Watch this recorded webinar to find out how Veeam Backup & Replication allows you to get consistent backups of MySQL databases.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now