• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 331

How to delete duplicate rows

Hi all:
I have some duplicated rows in my table (tktrans). The only field that differs between the duplicated records is (Sn). I would like to delete one of the duplicate rows.
Asked by: khaled salem
4 Solutions
 
Guy Hengel [angelIII / a3] (Billing Engineer) Commented:
You posted this in both Oracle and MS SQL Server; please clarify ...

You may want to read up on this article; it doesn't explain the DELETE part, but it helps to understand the basic concept nevertheless:
http://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html

So, you need to decide which of the 2 (or more) rows you want to keep, let's say the one with the lowest value of Sn, and which field(s) are to be taken as the "key" fields that make the records "duplicates".

The code below assumes a single field determines the "duplication"; you can add more conditions if several columns are involved.
DELETE t
  FROM yourtable t
 WHERE EXISTS ( SELECT NULL
                  FROM yourtable n
                 WHERE n.Sn < t.Sn              -- another row with a lower Sn exists for the same key,
                   AND n.keyfield = t.keyfield  -- so this row is not the one to keep
               )



Before the delete, you may want to run a SELECT along the same lines.
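For example, a sketch of such a preview (yourtable, keyfield and Sn are the same placeholder names as in the DELETE above):

SELECT t.*
  FROM yourtable t
 WHERE EXISTS ( SELECT NULL
                  FROM yourtable n
                 WHERE n.Sn < t.Sn
                   AND n.keyfield = t.keyfield
               );   -- these are exactly the rows the DELETE would remove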
 
ProjectChampion Commented:
This should do the trick:

With CTE as (
    Select *, RN = Row_Number() OVER (Partition by Col1, ..., Coln Order by Sn)
    From tktrans
)
Delete From CTE Where RN > 1;

Col1,..., Coln are all columns in your table except SN.
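As a sketch, with tktno and issuedate used purely as illustrative stand-ins for Col1, ..., Coln, you can preview what would be deleted before running the Delete:

With CTE as (
    Select *, RN = Row_Number() OVER (Partition by tktno, issuedate Order by Sn)
    From tktrans
)
Select * From CTE Where RN > 1;   -- the rows the Delete would remove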
 
awking00 Commented:
Just a little more info. Can there be multiple duplicates (i.e. triplicates, quadruplicates, etc.)? If so, do you just want to delete one, or do you want to delete all but one? We would need to know the criteria for the one(s) you want to keep. Perhaps you can provide some sample data and how you would want it to look after the deletes.

 
awking00 Commented:
>>the only different field between duplicated record is (Sn).<<
This is also unclear to me. Do you mean that the Sn value could be different, but if all the other fields are the same, it would be considered a duplicate to be deleted? Or do you mean that, if the same Sn value appears more than once, then the repeated values are considered duplicate records? Again, some sample data and expected results would help greatly in clarifying your intent.
 
ProjectChampion Commented:
The advantage of the method in my previous post is that even if all columns in the duplicate rows are identical it'll still get rid of all duplicates.
 
awking00 Commented:
ProjectChampion, I agree that your method would remove all duplicate rows, but the asker stated he wanted to remove one of the duplicate rows. If he means that literally (which I doubt but still would like clarification), your method would also work by changing to "where rn = 1". The other unclear issue is just how the Sn field relates to the duplicates. Additionally, your method would perform much better if there were a primary key (or keys) to partition on rather than every column, which is why I wanted to see some sample data.
 
ProjectChampion Commented:
Changing to "where rn = 1" would delete all unique rows and one copy of each duplicate row, leaving only the second, third, ... copies of the duplicates. The WHERE clause should remain as "RN > 1".

Also, partitioning on the PK (if any) would be pointless, as PKs are by definition unique; any partitioning (or Group By) must include the identical columns, which, according to khaledsalem's post ("the only different field between duplicated record is (Sn)"), means all columns except Sn.
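A tiny illustration with made-up values of why "RN > 1" is the right filter:

DECLARE @t TABLE (ColA int, Sn int);
INSERT @t VALUES (1, 5), (1, 8), (1, 9);   -- three copies of the same row, differing only in Sn

SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY ColA ORDER BY Sn)
FROM @t;
-- Sn = 5 gets RN = 1, Sn = 8 gets RN = 2, Sn = 9 gets RN = 3:
-- "RN > 1" deletes the Sn = 8 and 9 copies and keeps one row,
-- whereas "RN = 1" would delete the Sn = 5 row and leave the other two copies.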
 
awking00 Commented:
ProjectChampion,
First let me say I wasn't by any means trying to be critical of your proposed solutions. Actually, I think they're precisely what I would propose if I had the same confidence in what the asker's intent is as you seem to. However, because I didn't have that confidence, I simply wanted to get more clarification. :-)

Sorry, I meant to say "where rn = 2" if the intent was to only delete one row.

>>"the only different field between duplicated record is (Sn)" means all columns except Sn.<<
I'm glad you understand what that meant, but I felt it was extremely vague and just wanted a further explanation (in fact, I still do).
 
khaled salem (Software Developer, Author) Commented:
Dear Guy Hengel [angelIII / a3]
I am trying your answer; it looks 100% correct, but the results are different.
I found 17 duplicate rows using this code:
select  tktno, count(tktno)
from tktrans
where airlineid=148 and issuedate>='2014-01-01' AND tktstatus='VO'
group by tktno
having count(tktno) >1


When I try to use your code, I find 54 duplicate rows:
 
select t.sn, t.tktno, t.tktstatus, t.issuedate 
  from tktrans t
 where issuedate>='2014-01-01' and t.tktstatus='VO' and t.airlineid=148 and  t.sn = 
    ( SELECT MAX(i.sn)
        from tktrans i
       where i.tktno = t.tktno and i.tktstatus='VO'
    )
    order by t.tktno

 
Guy Hengel [angelIII / a3] (Billing Engineer) Commented:
Your first query returns the different tktno values that have duplicates (or more); 17 is a count of values, not of duplicate rows.
Your second query is missing the AND i.airlineid = 148 condition in the subquery, and it returns all the rows that have that same sn value, which might be several rows ...
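A sketch of that second query with the condition added to the subquery (still only a preview, no delete):

select t.sn, t.tktno, t.tktstatus, t.issuedate
  from tktrans t
 where t.issuedate >= '2014-01-01'
   and t.tktstatus = 'VO'
   and t.airlineid = 148
   and t.sn = ( select max(i.sn)
                  from tktrans i
                 where i.tktno = t.tktno
                   and i.tktstatus = 'VO'
                   and i.airlineid = 148 )   -- the missing condition
 order by t.tktno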
 
skullnobrains Commented:
what about this ?

insert into tktrans_tmp
select tktno, issuedate, tktstatus, ... , min(sn)   -- or max(sn), whichever row you want to keep
from tktrans
group by tktno, issuedate, tktstatus, ...

rename table tktrans to tktrans_with_dups, tktrans_tmp to tktrans;
drop table tktrans_with_dups;

you can try the select query alone first in order to make sure it gives the data you expect
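For instance, a preview of that select with tktno, issuedate and tktstatus standing in for the full column list (the real list would be every column except sn):

select tktno, issuedate, tktstatus, min(sn) as sn
  from tktrans
 group by tktno, issuedate, tktstatus;
-- once this returns the de-duplicated data you expect, the same select can feed the insert into tktrans_tmp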
 
Eric Mulovhedzi Commented:
If there is a duplicate, that means there must be a field in table (tktrans) that you do not want appearing in more than one row of the resulting query. Assuming the field name is (Sn):

In MySQL you would use:

DELETE tk2
FROM `tktrans` tk1, `tktrans` tk2
WHERE tk1.id < tk2.id AND tk1.`Sn` = tk2.`Sn`   -- keeps the lowest id per Sn, deletes the rest



But it's much safer to double-check this by doing a SELECT first instead of the DELETE, i.e.

SELECT
    tk2.`id`, tk2.`Sn`
FROM `tktrans` tk1, `tktrans` tk2
WHERE tk1.id < tk2.id AND tk1.`Sn` = tk2.`Sn`



This works provided your table (tktrans) has a unique field called 'id', which most people will usually make a unique, auto-incrementing field by default, but let's leave that topic for another day.
 
 
skullnobrains Commented:
@guy: don't really care about points, but i'm interested in what you think is wrong about the solution i posted
 
Guy Hengel [angelIII / a3] (Billing Engineer) Commented:
Your technique would surely work, but it has the side effect of "dropping" the table. I would only apply this on really large tables with many rows to be deleted, where a simple delete would eventually fail....
 
skullnobrains Commented:
duplicating the data would be a side effect, which would also happen with other techniques in many cases, though for some columns only, because the server requires a temporary table to handle the order/group by clause

i hardly see why dropping would be viewed as a side effect, however. this approach has the above-mentioned disadvantage, but replacing a table with another is an instantaneous and atomic operation, and it avoids scattered row locking. if rows are timestamped, it can also be devised in such a way that the second table is built incrementally, so you'd actually lock the table for a fraction of a second only. depending on the use case (namely the frequency and duration of write operations on the table), either approach can be better
 
Guy Hengel [angelIII / a3] (Billing Engineer) Commented:
dropping a table has the following "side effects",
A) it drops plenty of other things assigned to the table:
  * constraints
  * indexes
  * permissions
  * statistics
  * replication
  * partitions

B) you may not be able to "drop/replace" the table if
 * it is referenced by a foreign key as the "master" table
 * a schema trigger does not allow the drop
 * the table is "schema bound" to views

Surely all of those things can be "listed" from the existing table and recreated, but it's just tedious, hence that technique should really only be used in very rare cases.
 
skullnobrains Commented:
interesting point, thanks for sharing.

actually, apart from statistics,

- A is not really an issue, because most db engines allow the rest to be copied with a single "create table ... like ..." query (possibly with an extra INCLUDING clause, for example in postgres)

- constraints, on the other hand, would require some extra scripting: this table's constraints will be kept, but foreign keys referencing columns from the subject table will not (you'd need to add a "cascade constraints" clause to the drop statement)

- (infrequent) cases where views would be an issue would require remapping the views to the new table before performing the swap. note that a regular view would not need any modification unless it is referenced by foreign constraints, which is not even feasible in many db engines and is definitely very bad practice.

thanks again for pointing that out. actually, if some of the table's columns are foreign keys of other tables, i would not recommend it either without further digging
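A rough sketch of that variant in MySQL, reusing the column names from the earlier post (the real list would be every column except sn):

create table tktrans_tmp like tktrans;   -- copies columns, indexes and defaults
insert into tktrans_tmp (tktno, issuedate, tktstatus, sn)
select tktno, issuedate, tktstatus, min(sn)
  from tktrans
 group by tktno, issuedate, tktstatus;
rename table tktrans to tktrans_with_dups, tktrans_tmp to tktrans;
drop table tktrans_with_dups;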
