Solved

hOW TO DELETE DUPLICATE RECORDS?

Posted on 2011-03-18
17
264 Views
Last Modified: 2012-05-11
I have a table with 2 col.  One is the self-inc PK, which I want to keep, 2nd is varchar (string).  What's the query like for removing rows that have duplicate string in  2nd column.  
I saw something like this on some web site but not sure if this is correct
delete from Table
where ID in
(SELECT MAX(ID) FROM TABLE WHERE NAME IN (SELECT NAME FROM TABLE
GROUP BY NAME
HAVING COUNT(*)>1)
)
0
Comment
Question by:lapucca
  • 6
  • 5
  • 3
  • +2
17 Comments
 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35169750
that won't do.

should be something like this


DELETE FROM [TABLE]
WHERE ID NOT IN (
	SELECT MIN(ID) FROM [TABLE]
	GROUP BY [Name]
)

Open in new window

0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 35169773
@JoeNuvo

I believe you need to include the HAVING clause  = )
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 35169774
Actually...  N/M    I must be tired!
0
 
LVL 3

Expert Comment

by:ssisworo
ID: 35169846

delete from Table
where ID in
(SELECT MAX(ID) FROM TABLE WHERE NAME IN (SELECT NAME FROM TABLE
GROUP BY NAME
HAVING COUNT(*)>1)
)

yes... this is work.
but you need repeatly run until... no duplicate
because it just delete the MAX ID where having duplicate NAME
if you have 4 ID with duplicate NAME... you need run this query 3 times,
and no effect for the fourth run and so on
0
 

Author Comment

by:lapucca
ID: 35170106
Is that right?  I would have to run this several times?
0
 
LVL 3

Expert Comment

by:ssisworo
ID: 35170154
yes, of course. I often run a query like this over and over again to clean the data.
if you're not sure, backup first.

here my sample (i run four times) :
QUERY
 query
RESULT
 RESULT
0
 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35170157
The code in your question will delete "1 row" of duplicate record regardless of how many ID.
from below example
1, John
2, Mary
3, John
4, Danny
5, Mary
6, John

There are 3 John, 2 Mary, 1 Danny  which need to delete 2 John and 1 Mary
you have to run your given code 4 rounds.

Why 4 rounds?  Because you will know it's time to stop when it delete nothing (0 row affected)
0
 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35170162
so, if you really like to delete 1 row at a time.

this code will do

Declare @loop int
SET @loop = 0

WHILE @loop = 0
BEGIN
	delete from [Table]
	where ID in
	(SELECT MAX(ID) FROM [TABLE] WHERE NAME IN (SELECT [NAME] FROM [TABLE]
	GROUP BY [NAME]
	HAVING COUNT(*)>1)
	)

	IF @@ROWCOUNT = 0 SET @loop = 1
END

Open in new window

0
Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35170169
But if you want a little better of code, here is it

Declare @loop int
SET @loop = 0

WHILE @loop = 0
BEGIN
	delete from [Table]
	where ID in
	(SELECT MAX(ID) FROM [TABLE]
	GROUP BY [NAME]
	HAVING COUNT(*)>1
	)

	IF @@ROWCOUNT = 0 SET @loop = 1
END

Open in new window


What is the different? this code will delete duplicate of all name, but just 1 per each name.
So, from old example, need to delete 2 John and 1 Marry.  This code will stop at round 3.
(Round to run = Max Duplicate)
0
 

Author Comment

by:lapucca
ID: 35170170
No, I am looking to run the query once and removes all duplicate rows.  Forget the query I found from some site.  Can someone help me for a query that would do this.  Thank you.
0
 
LVL 23

Expert Comment

by:OP_Zaharin
ID: 35170174
hi lapucca. if you can share with us the data structure that you currently have and show the example how it's duplicate, some of us might be able to point you to the right track :)
0
 
LVL 3

Expert Comment

by:ssisworo
ID: 35170184
see...
1. no effect for the fourth run
2. ID = 7 deleted because double data
    here the explain :
     delete from TABLE1 where ID in
     (SELECT MAX(ID) FROM TABLE1 WHERE NAME IN (SELECT NAME FROM TABLE1
      GROUP BY NAME    HAVING COUNT(*)>1))
    a. (SELECT NAME FROM TABLE1 GROUP BY NAME    HAVING COUNT(*)>1) ===> four
    b. (SELECT MAX(ID) FROM TABLE1 WHERE NAME IN ('four') ===> 7
    c. delete from TABLE1 where ID in ('7')
    d. ID = 7 deleted from TABLE1



0
 
LVL 11

Accepted Solution

by:
JoeNuvo earned 250 total points
ID: 35170187
best method, should be my comment ID 35170169
since it will utilized index when perform delete.
It may loop few times, however, you just call it once.

But, if you want it to run without a loop
then my 1st comment (ID 35169750) will do it for you.

just replace [Table], [ID], [Name]  with your actual table/field name

0
 
LVL 3

Assisted Solution

by:ssisworo
ssisworo earned 250 total points
ID: 35170203
here... for only 1 run :

--------------------------

delete from TABLE1 where ID in
(SELECT (ID) FROM TABLE1 WHERE NAME IN (SELECT NAME FROM TABLE1 GROUP BY NAME HAVING COUNT(*)>1)) and
ID not in
(SELECT MIN(ID) FROM TABLE1 WHERE NAME IN (SELECT NAME FROM TABLE1 GROUP BY NAME HAVING COUNT(*)>1) GROUP BY NAME)


----------------------------------------------------------------------

QUERY :
 QUERY
RESULT :
 result
0
 

Author Closing Comment

by:lapucca
ID: 35170217
Thanks everyone.  I'm sorry for not being very clear from the beginning.
0
 
LVL 3

Expert Comment

by:ssisworo
ID: 35170223

yeah.... the query from JoeNuvo is much better :

here, the screen shoot

QUERY :

 QUERY
RESULT :

 RESULT
0
 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35170233
Hi ssisworo,

My query may more simple, but performance wise, maybe your "Assist Solution" will be better.
Since it got more change of utilized index, and also reduce number of concern ID.

but I'm not going to confirm it ^_^
0

Featured Post

Backup Your Microsoft Windows Server®

Backup all your Microsoft Windows Server – on-premises, in remote locations, in private and hybrid clouds. Your entire Windows Server will be backed up in one easy step with patented, block-level disk imaging. We achieve RTOs (recovery time objectives) as low as 15 seconds.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

If you have heard of RFC822 date formats, they can be quite a challenge in SQL Server. RFC822 is an Internet standard format for email message headers, including all dates within those headers. The RFC822 protocols are available in detail at:   ht…
This article describes how to use the timestamp of existing data in a database to allow Tableau to calculate the prior work day instead of relying on case statements or if statements to calculate the days of the week.
This Micro Tutorial will teach you how to censor certain areas of your screen. The example in this video will show a little boy's face being blurred. This will be demonstrated using Adobe Premiere Pro CS6.
This tutorial gives a high-level tour of the interface of Marketo (a marketing automation tool to help businesses track and engage prospective customers and drive them to purchase). You will see the main areas including Marketing Activities, Design …

911 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now