Solved

remove duplicate rows (mysql) / preserve newest row

Posted on 2009-04-06
12
735 Views
Last Modified: 2012-05-06
I use the following queries to delete duplicate rows:
$dbh->do( "ALTER IGNORE TABLE data ADD UNIQUE INDEX dupidx ( field1, field2, field3, field4 )" );
$dbh->do( "ALTER TABLE leads DROP INDEX dupidx" );

However, this way it preserves the oldest entry and deletes the following rows with same data. IS there a way to delete the old one and keep the newest row of the duplicates?
0
Comment
Question by:mypatek
  • 5
  • 5
  • 2
12 Comments
 
LVL 25

Expert Comment

by:lwadwell
ID: 24082220
Hi mypatek,

how do you know which one is "newer" than the others?

lwadwell
0
 

Author Comment

by:mypatek
ID: 24085002
Hi lwadwell

There is an integer field in the table, "transid" with increasing numbers

i.e.

400000
400001
400002
...

and so on.. as higher the number the newer the row
0
 
LVL 25

Expert Comment

by:lwadwell
ID: 24092505
mypatek,

Your best bet here is to use a DELETE statement.  Something like ...

delete t1 from yourtable t1, yourtable t2
 where t1.dup_col1 = t2.dup_col1
   and t1.dup_col2 = t2.dup_col2
   and t1.dup_col3 = t2.dup_col3
   and t1.transid  < t2.transid;

lwadwell
0
 

Author Comment

by:mypatek
ID: 24102268
is this not achievable without creating a second table? as there are about 200k rows in this table I'm a bit concerned about performance issues
0
 
LVL 25

Expert Comment

by:lwadwell
ID: 24102422
mypatek,

there is no second table in the DELETE statement.  The same table ... the one you wish to delete from ... is referenced twice.

lwadwell
0
 

Author Comment

by:mypatek
ID: 24119385
ok, thanks for the hint.

howeber, i'm not fully satisfied yet with your solution as mysql crashes with your proposal:

DELETE t2 FROM tablename as t1, tablename as t2 where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 and t1.col4 = t2.col4 and t1.col5 = t2.col5 and t1.transid < t2.transid



The job is done in about 2-3 seconds with the following statement:

ALTER IGNORE TABLE tablename ADD UNIQUE INDEX dupidx ( col1, col2, col3, col4, col5 )

but it keeps the oldest of multiple identical rows. Is it possible to add a ORDER BY statement to this to delete the older rows first?
0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 25

Expert Comment

by:lwadwell
ID: 24120825
mypatek,

in your last post you had my proposal as "DELETE t2 ..." that should be "DELETE t1 ..." to meet your requirement, I assume that was just a typo.

To the best of my knowledge ... there is nothing you can do with the ALTER statement.  There is no order by you can add.  mysql is just adding the rows to the index ... one row at a time and removing rows that would be duplicate.  It is not trying to keep the oldest, it just keeps the one it added to the index first - which in this case happen to be the oldest.  To change this behaviour you would need to change mysql or reorganize the table in the database - this would need a second table; and even then I cannot guarentee it would work.

lwadwell
0
 
LVL 6

Expert Comment

by:BALMUKUND KESHAV
ID: 24120894
If your table is not too large, you can write a small script like this :
use table as recordset order by unquie field, transid desc
rs.movefirst
do while not rs.eof()
 fld=unique field
rs.movenext
if rs.fields(unique field)=fld
  delete
end if
loop

The above  script will delete those records which are duplicate, if your table has triplicate or multi records .. then it will not work, Try at your own risk,

Regards,

Bm Keshav



0
 

Author Comment

by:mypatek
ID: 24173056
Thank you, Bm Keshav. Unfortunately the table does not only contain duplicates, but also triplicates etc.

however, the syntax from Iwadwell is working now after creating a corresponding index for that table. no performance issues anymore.

one last thing needs to be resolved:
how can i get out the values of a field of all deleted rows? the name of the field is 'transid' and i need it to delete other data outside of the database corresponding with the deleted row.

it is running within perl..

i wrote the following (not tested yet) is it correct, or is there a simpler way to achieve this?:

my $sth = $dbh->prepare("SELECT transid FROM tablename as t1, tablename as t2 where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 and t1.col4 = t2.col4 and t1.col5 = t2.col5 and t1.transid < t2.transid'");

$sth->execute;
$rows = $sth->rows;

if ($rows) {
   $sth->bind_columns(undef, \$transid);
   while (my @row=$sth->fetchrow_array) {
      my $rowref=$dbh->selectall_arrayref("delete from... LIMIT 1");
   }
   $sth->finish;
}
0
 
LVL 6

Expert Comment

by:BALMUKUND KESHAV
ID: 24181795
No idea about Perl???????

Bm Keshav

0
 
LVL 25

Accepted Solution

by:
lwadwell earned 125 total points
ID: 24182236
BMKESHAV,

As long as this is done before the data is deleted from the main table ... your concept is right.  I have simplified the code and have shown it below.  Note: I am showing two different DELETE's to show how multiple tables can be done in the one loop; remove or add as needed.

I do need to ask why you were specifiying "LIMIT 1" on your DELETE statement?  As transid is the FK in the other tables ... would you not want to delete all rows with that FK?

Otherwise ... if you have deleted the rows from the main table, you could accomplish this in a single DELETE per table:
DELETE FROM othertablename as d WHERE NOT EXISTS (SELECT 1 FROM tablename as t1 WHERE t1.transid = d.transid);

N.B. as always, for optimal performance - appropriate indexes need to exist.

lwadwell
my $sth = $dbh->prepare("SELECT t1.transid FROM tablename as t1, tablename as t2 where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 and t1.col4 = t2.col4 and t1.col5 = t2.col5 and t1.transid < t2.transid");
my $dh1 = $dbh->prepare("DELETE FROM othertablename WHERE transid=?");
my $dh2 = $dbh->prepare("DELETE FROM anothertable WHERE transid=?"); 
$sth->execute(); 
while (my @row=$sth->fetchrow_array) { 
    $dh1->execute($row[0]);
    $dh2->execute($row[0]);
}

Open in new window

0
 

Author Closing Comment

by:mypatek
ID: 31567279
It works now like a charme, Thank you!
0

Featured Post

Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

As they say in love and is true in SQL: you can sum some Data some of the time, but you can't always aggregate all Data all the time! Introduction: By the end of this Article it is my intention to bring the meaning and value of the above quote to…
This article describes how to use the timestamp of existing data in a database to allow Tableau to calculate the prior work day instead of relying on case statements or if statements to calculate the days of the week.
Sending a Secure fax is easy with eFax Corporate (http://www.enterprise.efax.com). First, Just open a new email message.  In the To field, type your recipient's fax number @efaxsend.com. You can even send a secure international fax — just include t…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

8 Experts available now in Live!

Get 1:1 Help Now