Solved

How to delete duplicate rows

Posted on 2014-04-25
22
306 Views
Last Modified: 2014-09-02
Hi all:
I've some duplicated rows in my table (tktrans).  the only different field between duplicated record is (Sn). I would like to delete on of the duplicate rows.
0
Comment
Question by:khaled salem
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 4
  • 4
  • +3
22 Comments
 
LVL 143

Accepted Solution

by:
Guy Hengel [angelIII / a3] earned 250 total points
ID: 40022566
you posted in both ORACLE and MS SQL server, please clarify ...

you may want to read up this article, though it doesn't explain the DELETE part, but helps to understand the basic concept nevertheless:
http://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html

so,you need to decide which of the 2 (+) rows you want to keep, let's say lowest value of Sn, and which field(s) are to be taken as "key" fields to consider the records "duplicated"

the below code assumes it's 1 field to determin the "duplication", you can add more condition if it's several columns
DELETE  t
 FROM yourtable T
  WHERE EXISTS( SELECT NULL FROM yourtable n
                     WHERE n.Sn > t.SN
                        AND n.keyfield = t.keyfield
  )

Open in new window


before the delete, you may want to run some SELECT along the same lines
0
 
LVL 8

Assisted Solution

by:ProjectChampion
ProjectChampion earned 125 total points
ID: 40022581
This should do the trick:

With CTE as (Select *, RN = Row_Number()  OVER (Partition by Col1, ..., Coln Order BY Col1,...Coln, Sn) From tktrans)
Delete CTE Where RN > 1;

Col1,..., Coln are all columns in your table except SN.
0
 
LVL 32

Assisted Solution

by:awking00
awking00 earned 125 total points
ID: 40022583
Just a little more info. Can there be multiple duplicates (i.e. triplicates, quadruplicates, etc.)? If so, do you just want to delete one or do you want to delete all but? We would need to know the criteria for the one(s) you want to keep. Perhaps you can provide some sample data and how you would want it to look like after the deletes.
0
PeopleSoft Has Never Been Easier

PeopleSoft Adoption Made Smooth & Simple!

On-The-Job Training Is made Intuitive & Easy With WalkMe's On-Screen Guidance Tool.  Claim Your Free WalkMe Account Now

 
LVL 32

Expert Comment

by:awking00
ID: 40022599
>>the only different field between duplicated record is (Sn).<<
This is also unclear to me. Do you mean that the Sn value could be different, but if all the other fields are the same, it would be considered a duplicate to be deleted? Or do you mean that, if the same Sn value appears more than once, then the repeated values are considered duplicate records? Again, some sample data and expected results would help greatly in clarifying your intent.
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40022635
The advantage of the method in my previous post is that even if all columns in the duplicate rows are identical it'll still get rid of all duplicates.
0
 
LVL 32

Expert Comment

by:awking00
ID: 40022675
ProjectChampion, I agree that your method would remove all duplicate rows, but the asker stated he wanted to remove one of the duplicate rows. If he means that literally (which I doubt but still would like clarification), your method would also work by changing to "where rn = 1". The other unclear issue is just how the Sn field relates to the duplicates. Additionally, your method would perform much better if there were a primary key (or keys) to partition on rather than every column, which is why I wanted to see some sample data.
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40022748
Changing to  "where rn = 1" would delete all unique rows and one copy of each duplicate row, leaving only the second, third, ... copies of the duplicates. The WHERE clause should remain as "RN > 1".

Also partitioning on PK (if any) will be pointless as PK are by definition unique, so any partitioning (or Group By) must include the identical columns, which according to khaledsalem's post "the only different field between duplicated record is (Sn)" means all columns except Sn.
0
 
LVL 32

Expert Comment

by:awking00
ID: 40022817
ProjectChampion,
First let me say I wasn't by any means trying to be critical of your proposed solutions. Actually, I think they're precisely what I would propose if I had the same confidence in what the asker's intent is as you seem to. However, because I didn't have that confidence, I simply wanted to get more clarification. :-)

Sorry, I meant to say "where rn = 2" if the intent was to only delete one row.

>>"the only different field between duplicated record is (Sn)" means all columns except Sn.<<
I'm glad you understand what that meant, but I felt it was extremely vague and just wanted a further explanation (in fact, I still do).
0
 

Author Comment

by:khaled salem
ID: 40022992
Dear Guy Hengel [angelIII / a3]
I am trying your answer its looks like correct 100%. but the result different.
I have found the 17 duplicate rows by the using code
select  tktno, count(tktno)
from tktrans
where airlineid=148 and issuedate>='2014-01-01' AND tktstatus='VO'
group by tktno
having count(tktno) >1

Open in new window

when try to use your code found 54 duplicate rows
 
select t.sn, t.tktno, t.tktstatus, t.issuedate 
  from tktrans t
 where issuedate>='2014-01-01' and t.tktstatus='VO' and t.airlineid=148 and  t.sn = 
    ( SELECT MAX(i.sn)
        from tktrans i
       where i.tktno = t.tktno and i.tktstatus='VO'
    )
    order by t.tktno

Open in new window

0
 
LVL 143

Assisted Solution

by:Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3] earned 250 total points
ID: 40023175
your first code returns how many different values you have that have duplicates (or more)
your second query is missing the AND i.airlineid = 148 ... and is returning all the rows that have the same sn value, which might be several rows ...
0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 40028961
what about this ?

insert into tktrans_tmp select tktno, issuedate, tktstatus, ... , max/min(sn)
from tktrans
group by  tktno, issuedate, tktstatus, ...

rename table tkrans to tktrans_with_dups , tktrans_tmp to tktrans
drop table tktrans_with_dups

you can try the select query alone first in order to make sure it gives the data you expect
0
 
LVL 2

Expert Comment

by:Eric Mulovhedzi
ID: 40030274
If there is a duplicate, that means there must be a field on table (tktrans) that yo do not want  appearing in more than one raw on the resultant query. Assuming the field name is (Sn):

in My SQL you will use:

DELETE `Sn` 
FROM `tktrans` tk1, `tktrans` tk2 
WHERE tk1.id < tk2.id AND tk1.`Sn` = tk2.`Sn`

Open in new window


Bu its much safer to double check this by doing a SELECT first instead of DELETE i.e.

SELECT 
    `id`,`Sn` 
FROM `tktrans` tk1, `tktrans` tk2 
WHERE tk1.id < tk2.id AND tk1.`Sn` = tk2.`Sn`

Open in new window


This is so, provided your table (tktrans) as a unique field called 'id', which most people will usually make a unique table field and auto-increment by default but lets leave that topic for another day.
0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 40291959
0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 40295346
@guy: don't really care about points, but i'm interested in what you think is wrong about the solution i posted
0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 40295604
Your technique would surely work, but has the side effect of "dropping" the table. I only would apply this on really large tables with many rows to be deleted,  where the simple delete would eventually fail....
0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 40297932
duplicating the data would be a side effect, which would also happen with other techniques in many cases, but for some columns only because the server requires a temporary table to handle the order/group by clause

i hardly see why dropping would be viewed as a side effect, however. this approach has the above mentioned disadvantage, but replacing a table with another is an instantaneous and atomic operation, and avoids scattered row locking. if rows are timestamped, it can also be devised in such a way that the second table can be built incrementally so you'd actually lock the table for a fraction of a second only. depending on the use cases, (namely the frequency and duration of write operations on the table) either approach can be better
0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 40298005
dropping a table has the following "side effects",
A) it drops plenty of other things assigned to the table:
  * constraints
  * indexes
  * permissions
  * statistics
  * replication
  * partitions

B) you may not be able to "drop/replace" the table if
 * it's linked in a foreign key being the "master" table
 * a schema triggers is not allowing the drop
 * table is "schema bound" to views

surely all of those things can be "listed" on the existing table and be recreated, but its just tedious, hence that technique should really and only be used in very rare cases.
0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 40298219
interesting point, thanks for sharing.

actually, apart from statistics,

- A is not really an issue because most db engines will allow the rest to be copied using a single "create table ... like ..." query. (possibly with an extra including clause for example in postgres)

- constraints on the other hand would require some extra scripting : this table's constraint will be kept, but foreign keys referencing columns from the subject table will not (you'd need to add a "cascade constraints" clause to the drop statement)

- (unfrequent) cases where views would be an issue would require to remap the views to the new table before performing the update. note that a regular view would not need any modification unless it is referenced by foreign constraints which is not even feasible in many db engines and definitely very bad practice.

thanks again for pointing out. actually, if some of the table columns are foreign keys of other tables, i would not recommend it either without further digging
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I'm trying, I really am. But I've seen so many wrong approaches involving date(time) boundaries I despair about my inability to explain it. I've seen quite a few recently that define a non-leap year as 364 days, or 366 days and the list goes on. …
This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
This video shows how to configure and send email from and Oracle database using both UTL_SMTP and UTL_MAIL, as well as comparing UTL_SMTP to a manual SMTP conversation with a mail server.
This video explains what a user managed backup is and shows how to take one, providing a couple of simple example scripts.

752 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question