Solved

How to delete duplicate rows

Posted on 2014-04-25
284 Views
Last Modified: 2014-09-02
Hi all:
I have some duplicated rows in my table (tktrans). The only field that differs between duplicated records is (Sn). I would like to delete one of the duplicate rows.
Question by:khaledsalem
22 Comments
 
LVL 142

Accepted Solution

by:
Guy Hengel [angelIII / a3] earned 250 total points
ID: 40022566
You posted in both Oracle and MS SQL Server, please clarify ...

You may want to read this article; it doesn't cover the DELETE part, but it helps to understand the basic concept nevertheless:
http://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html

So, you need to decide which of the 2 (or more) rows you want to keep, let's say the one with the lowest value of Sn, and which field(s) are to be taken as "key" fields to consider the records "duplicated".

The code below assumes a single field determines the "duplication"; you can add more conditions if it's several columns:
DELETE t
  FROM yourtable t
 WHERE EXISTS ( SELECT NULL
                  FROM yourtable n
                 WHERE n.Sn < t.Sn   -- a lower Sn exists, so this row is not the keeper
                   AND n.keyfield = t.keyfield )



before the delete, you may want to run some SELECT along the same lines
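For instance, a preview along those lines might look like this (a sketch using the same hypothetical yourtable/keyfield names, and assuming the row with the lowest Sn is the one to keep, as suggested above):

```sql
-- Rows for which a lower-Sn sibling exists, i.e. the candidates for deletion
SELECT t.*
  FROM yourtable t
 WHERE EXISTS ( SELECT NULL
                  FROM yourtable n
                 WHERE n.Sn < t.Sn
                   AND n.keyfield = t.keyfield )
 ORDER BY t.keyfield, t.Sn;
```

If this returns exactly the rows you expect to lose, the DELETE with the same EXISTS predicate should be safe to run.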
 
LVL 8

Assisted Solution

by:ProjectChampion
ProjectChampion earned 125 total points
ID: 40022581
This should do the trick:

WITH CTE AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (PARTITION BY Col1, ..., Coln
                                   ORDER BY Col1, ..., Coln, Sn)
    FROM tktrans
)
DELETE CTE
WHERE RN > 1;

Col1, ..., Coln are all the columns in your table except Sn.
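As a cautious first step, the same CTE can be run as a SELECT to preview what would be deleted (a sketch using the same Col1, ..., Coln placeholders; ordering by Sn alone is enough inside each partition, since the partition columns are constant there):

```sql
WITH CTE AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (PARTITION BY Col1, ..., Coln ORDER BY Sn)
    FROM tktrans
)
SELECT *
FROM CTE
WHERE RN > 1;   -- these are the rows the DELETE would remove
```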
 
LVL 31

Assisted Solution

by:awking00
awking00 earned 125 total points
ID: 40022583
Just a little more info. Can there be multiple duplicates (i.e. triplicates, quadruplicates, etc.)? If so, do you just want to delete one, or do you want to delete all but one? We would need to know the criteria for the one(s) you want to keep. Perhaps you can provide some sample data and show how you would want it to look after the deletes.
 
LVL 31

Expert Comment

by:awking00
ID: 40022599
>>the only different field between duplicated record is (Sn).<<
This is also unclear to me. Do you mean that the Sn value could be different, but if all the other fields are the same, it would be considered a duplicate to be deleted? Or do you mean that, if the same Sn value appears more than once, then the repeated values are considered duplicate records? Again, some sample data and expected results would help greatly in clarifying your intent.
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40022635
The advantage of the method in my previous post is that even if all columns in the duplicate rows are identical it'll still get rid of all duplicates.
 
LVL 31

Expert Comment

by:awking00
ID: 40022675
ProjectChampion, I agree that your method would remove all duplicate rows, but the asker stated he wanted to remove one of the duplicate rows. If he means that literally (which I doubt but still would like clarification), your method would also work by changing to "where rn = 1". The other unclear issue is just how the Sn field relates to the duplicates. Additionally, your method would perform much better if there were a primary key (or keys) to partition on rather than every column, which is why I wanted to see some sample data.
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40022748
Changing to  "where rn = 1" would delete all unique rows and one copy of each duplicate row, leaving only the second, third, ... copies of the duplicates. The WHERE clause should remain as "RN > 1".
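A hypothetical illustration of the difference:

```sql
-- Suppose tktrans contains these (Col1, Sn) combinations:
--   ('A', 1)  ->  RN = 1
--   ('A', 2)  ->  RN = 2
--   ('B', 7)  ->  RN = 1
--
-- WHERE RN > 1 deletes only ('A', 2), keeping one copy of every row;
-- WHERE RN = 1 would delete ('A', 1) and ('B', 7), i.e. every unique row
-- plus the first copy of each duplicate, leaving only the extra copies.
```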

Also, partitioning on the PK (if any) would be pointless, as PKs are by definition unique, so any partitioning (or GROUP BY) must include the identical columns, which according to khaledsalem's post "the only different field between duplicated record is (Sn)" means all columns except Sn.
 
LVL 31

Expert Comment

by:awking00
ID: 40022817
ProjectChampion,
First let me say I wasn't by any means trying to be critical of your proposed solutions. Actually, I think they're precisely what I would propose if I had the same confidence in what the asker's intent is as you seem to. However, because I didn't have that confidence, I simply wanted to get more clarification. :-)

Sorry, I meant to say "where rn = 2" if the intent was to only delete one row.

>>"the only different field between duplicated record is (Sn)" means all columns except Sn.<<
I'm glad you understand what that meant, but I felt it was extremely vague and just wanted a further explanation (in fact, I still do).
 

Author Comment

by:khaledsalem
ID: 40022992
Dear Guy Hengel [angelIII / a3]
I am trying your answer and it looks 100% correct, but the results differ.
I found 17 ticket numbers with duplicates using this code:
select  tktno, count(tktno)
from tktrans
where airlineid=148 and issuedate>='2014-01-01' AND tktstatus='VO'
group by tktno
having count(tktno) >1


When I try to use your approach, I find 54 duplicate rows:
 
select t.sn, t.tktno, t.tktstatus, t.issuedate 
  from tktrans t
 where issuedate>='2014-01-01' and t.tktstatus='VO' and t.airlineid=148 and  t.sn = 
    ( SELECT MAX(i.sn)
        from tktrans i
       where i.tktno = t.tktno and i.tktstatus='VO'
    )
    order by t.tktno



 
LVL 142

Assisted Solution

by:Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3] earned 250 total points
ID: 40023175
Your first query returns the distinct tktno values that have duplicates (counting each value once, however many copies exist), not the duplicate rows themselves.
Your second query is missing AND i.airlineid = 148 in the subquery, and it is returning all the rows that have the same sn value, which might be several rows ...
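With that filter added to the subquery, the check would read (a sketch based on the asker's own query):

```sql
SELECT t.sn, t.tktno, t.tktstatus, t.issuedate
  FROM tktrans t
 WHERE t.issuedate >= '2014-01-01'
   AND t.tktstatus = 'VO'
   AND t.airlineid = 148
   AND t.sn = ( SELECT MAX(i.sn)
                  FROM tktrans i
                 WHERE i.tktno = t.tktno
                   AND i.tktstatus = 'VO'
                   AND i.airlineid = 148 )   -- the missing filter
 ORDER BY t.tktno;
```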
 
LVL 26

Expert Comment

by:skullnobrains
ID: 40028961
What about this?

INSERT INTO tktrans_tmp
SELECT tktno, issuedate, tktstatus, ..., MAX(sn)   -- or MIN(sn)
  FROM tktrans
 GROUP BY tktno, issuedate, tktstatus, ...;

RENAME TABLE tktrans TO tktrans_with_dups, tktrans_tmp TO tktrans;
DROP TABLE tktrans_with_dups;

You can try the SELECT query alone first to make sure it gives the data you expect.
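That standalone check might look like this (a sketch; the ... stands for the remaining non-Sn columns, as in the INSERT above):

```sql
SELECT tktno, issuedate, tktstatus, /* ..., */ MAX(sn) AS sn   -- or MIN(sn)
  FROM tktrans
 GROUP BY tktno, issuedate, tktstatus /* , ... */;
```

Each group here collapses to a single row, so the result is exactly what the new table would contain.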
 
LVL 2

Expert Comment

by:Eric Mulovhedzi
ID: 40030274
If there is a duplicate, that means there must be a field on table (tktrans) that you do not want appearing in more than one row in the resulting query. Assuming the field name is (Sn):

In MySQL you would use:

DELETE tk1
FROM `tktrans` tk1, `tktrans` tk2
WHERE tk1.id < tk2.id AND tk1.`Sn` = tk2.`Sn`



But it's much safer to double-check this by doing a SELECT first instead of the DELETE, i.e.

SELECT
    tk1.`id`, tk1.`Sn`
FROM `tktrans` tk1, `tktrans` tk2
WHERE tk1.id < tk2.id AND tk1.`Sn` = tk2.`Sn`



This works provided your table (tktrans) has a unique field called 'id', which most people usually make a unique, auto-increment field by default, but let's leave that topic for another day.
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 40291959
 
LVL 26

Expert Comment

by:skullnobrains
ID: 40295346
@guy: don't really care about points, but i'm interested in what you think is wrong about the solution i posted
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 40295604
Your technique would surely work, but it has the side effect of "dropping" the table. I would only apply this to really large tables with many rows to be deleted, where a simple delete would eventually fail...
 
LVL 26

Expert Comment

by:skullnobrains
ID: 40297932
Duplicating the data would be a side effect; that also happens with other techniques in many cases, though only for some columns, because the server requires a temporary table to handle the ORDER/GROUP BY clause.

I hardly see why dropping would be viewed as a side effect, however. This approach has the above-mentioned disadvantage, but replacing a table with another is an instantaneous and atomic operation, and it avoids scattered row locking. If rows are timestamped, it can also be devised so that the second table is built incrementally, so you'd actually lock the table for only a fraction of a second. Depending on the use case (namely the frequency and duration of write operations on the table), either approach can be better.
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 40298005
Dropping a table has the following "side effects":
A) it drops plenty of other things assigned to the table:
  * constraints
  * indexes
  * permissions
  * statistics
  * replication
  * partitions

B) you may not be able to "drop/replace" the table if
 * it's referenced by a foreign key as the "master" table
 * a schema trigger does not allow the drop
 * the table is "schema bound" to views

Surely all of those things can be "listed" on the existing table and be recreated, but it's just tedious, hence that technique should only be used in very rare cases.
 
LVL 26

Expert Comment

by:skullnobrains
ID: 40298219
interesting point, thanks for sharing.

actually, apart from statistics,

- A is not really an issue, because most DB engines will allow the rest to be copied with a single "create table ... like ..." query (possibly with an extra INCLUDING clause, for example in Postgres)
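For example (a sketch; the exact clauses vary by engine):

```sql
-- MySQL: copies column definitions and indexes, but not foreign keys
CREATE TABLE tktrans_tmp LIKE tktrans;

-- PostgreSQL: the INCLUDING clause controls what is copied
-- CREATE TABLE tktrans_tmp (LIKE tktrans INCLUDING ALL);
```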

- constraints, on the other hand, require some extra scripting: this table's own constraints will be kept, but foreign keys referencing columns of the subject table will not (you'd need to add a "cascade constraints" clause to the drop statement)

- the (infrequent) cases where views would be an issue would require remapping the views to the new table before performing the swap. Note that a regular view would not need any modification unless it is referenced by foreign constraints, which is not even feasible in many DB engines and is definitely very bad practice.

Thanks again for pointing that out. Actually, if some of the table's columns are referenced by foreign keys of other tables, I would not recommend this approach either without further digging.
