Solved

remove duplicate rows (mysql) / preserve newest row

Posted on 2009-04-06
12
738 Views
Last Modified: 2012-05-06
I use the following queries to delete duplicate rows:
$dbh->do( "ALTER IGNORE TABLE data ADD UNIQUE INDEX dupidx ( field1, field2, field3, field4 )" );
$dbh->do( "ALTER TABLE leads DROP INDEX dupidx" );

However, this way it preserves the oldest entry and deletes the following rows with same data. IS there a way to delete the old one and keep the newest row of the duplicates?
0
Comment
Question by:mypatek
  • 5
  • 5
  • 2
12 Comments
 
LVL 25

Expert Comment

by:lwadwell
ID: 24082220
Hi mypatek,

how do you know which one is "newer" than the others?

lwadwell
0
 

Author Comment

by:mypatek
ID: 24085002
Hi lwadwell

There is an integer field in the table, "transid" with increasing numbers

i.e.

400000
400001
400002
...

and so on.. as higher the number the newer the row
0
 
LVL 25

Expert Comment

by:lwadwell
ID: 24092505
mypatek,

Your best bet here is to use a DELETE statement.  Something like ...

delete t1 from yourtable t1, yourtable t2
 where t1.dup_col1 = t2.dup_col1
   and t1.dup_col2 = t2.dup_col2
   and t1.dup_col3 = t2.dup_col3
   and t1.transid  < t2.transid;

lwadwell
0
 

Author Comment

by:mypatek
ID: 24102268
is this not achievable without creating a second table? as there are about 200k rows in this table I'm a bit concerned about performance issues
0
 
LVL 25

Expert Comment

by:lwadwell
ID: 24102422
mypatek,

there is no second table in the DELETE statement.  The same table ... the one you wish to delete from ... is referenced twice.

lwadwell
0
 

Author Comment

by:mypatek
ID: 24119385
ok, thanks for the hint.

howeber, i'm not fully satisfied yet with your solution as mysql crashes with your proposal:

DELETE t2 FROM tablename as t1, tablename as t2 where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 and t1.col4 = t2.col4 and t1.col5 = t2.col5 and t1.transid < t2.transid



The job is done in about 2-3 seconds with the following statement:

ALTER IGNORE TABLE tablename ADD UNIQUE INDEX dupidx ( col1, col2, col3, col4, col5 )

but it keeps the oldest of multiple identical rows. Is it possible to add a ORDER BY statement to this to delete the older rows first?
0
DevOps Toolchain Recommendations

Read this Gartner Research Note and discover how your IT organization can automate and optimize DevOps processes using a toolchain architecture.

 
LVL 25

Expert Comment

by:lwadwell
ID: 24120825
mypatek,

in your last post you had my proposal as "DELETE t2 ..." that should be "DELETE t1 ..." to meet your requirement, I assume that was just a typo.

To the best of my knowledge ... there is nothing you can do with the ALTER statement.  There is no order by you can add.  mysql is just adding the rows to the index ... one row at a time and removing rows that would be duplicate.  It is not trying to keep the oldest, it just keeps the one it added to the index first - which in this case happen to be the oldest.  To change this behaviour you would need to change mysql or reorganize the table in the database - this would need a second table; and even then I cannot guarentee it would work.

lwadwell
0
 
LVL 6

Expert Comment

by:BALMUKUND KESHAV
ID: 24120894
If your table is not too large, you can write a small script like this :
use table as recordset order by unquie field, transid desc
rs.movefirst
do while not rs.eof()
 fld=unique field
rs.movenext
if rs.fields(unique field)=fld
  delete
end if
loop

The above  script will delete those records which are duplicate, if your table has triplicate or multi records .. then it will not work, Try at your own risk,

Regards,

Bm Keshav



0
 

Author Comment

by:mypatek
ID: 24173056
Thank you, Bm Keshav. Unfortunately the table does not only contain duplicates, but also triplicates etc.

however, the syntax from Iwadwell is working now after creating a corresponding index for that table. no performance issues anymore.

one last thing needs to be resolved:
how can i get out the values of a field of all deleted rows? the name of the field is 'transid' and i need it to delete other data outside of the database corresponding with the deleted row.

it is running within perl..

i wrote the following (not tested yet) is it correct, or is there a simpler way to achieve this?:

my $sth = $dbh->prepare("SELECT transid FROM tablename as t1, tablename as t2 where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 and t1.col4 = t2.col4 and t1.col5 = t2.col5 and t1.transid < t2.transid'");

$sth->execute;
$rows = $sth->rows;

if ($rows) {
   $sth->bind_columns(undef, \$transid);
   while (my @row=$sth->fetchrow_array) {
      my $rowref=$dbh->selectall_arrayref("delete from... LIMIT 1");
   }
   $sth->finish;
}
0
 
LVL 6

Expert Comment

by:BALMUKUND KESHAV
ID: 24181795
No idea about Perl???????

Bm Keshav

0
 
LVL 25

Accepted Solution

by:
lwadwell earned 125 total points
ID: 24182236
BMKESHAV,

As long as this is done before the data is deleted from the main table ... your concept is right.  I have simplified the code and have shown it below.  Note: I am showing two different DELETE's to show how multiple tables can be done in the one loop; remove or add as needed.

I do need to ask why you were specifiying "LIMIT 1" on your DELETE statement?  As transid is the FK in the other tables ... would you not want to delete all rows with that FK?

Otherwise ... if you have deleted the rows from the main table, you could accomplish this in a single DELETE per table:
DELETE FROM othertablename as d WHERE NOT EXISTS (SELECT 1 FROM tablename as t1 WHERE t1.transid = d.transid);

N.B. as always, for optimal performance - appropriate indexes need to exist.

lwadwell
my $sth = $dbh->prepare("SELECT t1.transid FROM tablename as t1, tablename as t2 where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 and t1.col4 = t2.col4 and t1.col5 = t2.col5 and t1.transid < t2.transid");
my $dh1 = $dbh->prepare("DELETE FROM othertablename WHERE transid=?");
my $dh2 = $dbh->prepare("DELETE FROM anothertable WHERE transid=?"); 
$sth->execute(); 
while (my @row=$sth->fetchrow_array) { 
    $dh1->execute($row[0]);
    $dh2->execute($row[0]);
}

Open in new window

0
 

Author Closing Comment

by:mypatek
ID: 31567279
It works now like a charme, Thank you!
0

Featured Post

DevOps Toolchain Recommendations

Read this Gartner Research Note and discover how your IT organization can automate and optimize DevOps processes using a toolchain architecture.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

In database programming, custom sort order seems to be necessary quite often, at least in my experience and time here at EE. Within the realm of custom sorting is the sorting of numbers and text independently (i.e., treating the numbers as number…
If you find yourself in this situation “I have used SELECT DISTINCT but I’m getting duplicates” then I'm sorry to say you are using the wrong SQL technique as it only does one thing which is: produces whole rows that are unique. If the results you a…
Video by: Mark
This lesson goes over how to construct ordered and unordered lists and how to create hyperlinks.
Hi friends,  in this video  I'll show you how new windows 10 user can learn the using of windows 10. Thank you.

863 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now