asked on

Delete duplicates from table using dense_rank() in DB2

Hello all,

I haven't seen this subject in the context of DB2 so I thought I'd ask. I'm trying to delete duplicates from a table within DB2 using dense_rank and columns col1, col2, col3, and col4. I think i'm halfway there but need help in forming the delete portion of the statement:

These select statement shows me what I want, meaning that I see the duplicates according to the RANK_DATA column I've created (showing two 1's). But from there I'm not quite sure how to isolate just one of those guys and delete it without deleting all instances of that row. Any help is appreciated!

Thanks,
Glen

select 
col1, 
col2, 
col3, 
col4,
dense_rank() over (order by col1, col2, col3, col4) as rank-data
from schema1.mytable with ur;

Open in new window

SOLUTION

Kent Olsen

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

jisoo411

ASKER

Hi Kent,

Thanks for the response! Yeah I saw someone suggest using dense_rank() in order to delete duplicates. It was in Oracle though and I couldn't figure out how to make it work in DB2:

delete from <table_name> where rowid in
(select rn from
(Select rowid rn,
dense_rank() over (partition by col1,col2.. order by rowid) ln
from <table_name>)
where ln <> 1)

But I think this guy assumed that the table had a surrogate key (which I don't have).

Unfortunately I can't utilize a view for my specific scenario. But a funny looking nested sql statement would work. Would it look like this?

DELETE FROM (
SELECT col1, col2, col3, col4, row_number() over (partition by col1, col2, col3, col4) as RN FROM mytable) x WHERE x.RN > 1;

Thanks,
Glen

jisoo411

ASKER

I should add that my table is also partitioned by date:

SELECT col1, col2, col3, col4, row_number () over (partition by col1, col2, col3, col4) FROM mytable WHERE col5 = 'YYYY-MM-DD';

Kent Olsen

Hi Glen,

Excellent. That should work just fine, though you don't need to identify the derived table

DELETE FROM (
SELECT col1, col2, col3, col4, row_number() over (partition by col1, col2, col3, col4) as RN FROM mytable) x
WHERE x.RN > 1;

simplifies to

DELETE FROM (
SELECT col1, col2, col3, col4, row_number() over (partition by col1, col2, col3, col4) as RN FROM mytable)
WHERE RN > 1;

Kent

Kent Olsen

Missed your last post...

When you say that you're partitioned by date, do you mean that the table is partitioned (with some dates residing in one partition and other dates in a second partition) or that your query needs to includes a date column in the partition clause?

Both are trivial. If the table is partitioned, you don't really care. DB2 will handle it. If you need to include a data column to determine the row_number, just include it in the partition list.

Kent

jisoo411

ASKER

Sorry I should have phrased it better. One of the columns is a date column which I use to clean up subsets of data based on date. So I figure putting WHERE col5 = 'YYYY-MM-DD' right after FROM mytable in the nexted select will suffice. So now I have the following query:

DELETE FROM (SELECT col1, col2, col3, col4, row_number() over (partition by col1, col2, col3, col4) as RN FROM mytable WHERE col5 = 'YYYY-MM-DD') WHERE RN > 1;

I tried to execute but I get an error message:

SQL State: 42601
Vendor Code: -104
Message: [SQL0104] Token ( was not valid. Valid tokens: <IDENTIFIER>. Cause. . . .: A syntax error was detected at token (. Token ( is not a valid token. A partial list of valid tokens is <IDENTIFIER>...

I'm guessing that DB2 is complaining that the subselect doesn't have an aliase or something? I tried that and still got the same error message.

Glen

Kent Olsen

Hi Glen,

I suspect a typo somewhere. Below is a test script that works fine for me. I've tried several variations of data and filter.

Kent

--drop table kk;
create table kk ( c1 varchar (10), c2 varchar (10), c3 varchar (10), d date);

insert into kk values ('a', 'b', 'c', '2005-01-01');
insert into kk values ('a', 'b', 'c', '2008-01-01');
insert into kk values ('a', 'b', 'd', '2008-01-01');
insert into kk values ('a', 'b', 'd', '2008-01-01');
insert into kk values ('a', 'b', 'd', '2008-01-01');

select c1, c2, c3, row_number () over (partition by c1, c2, c3) 
from kk
where d > '2007-01-01';

delete from
(
  select c1, c2, c3, row_number () over (partition by c1, c2, c3) as rn 
  from kk
  where d > '2007-01-01'
)
where rn > 1;

Open in new window

jisoo411

ASKER

I copied your test script and it works all the way down until the delete statement where it gives me that same token error regarding that first parenthesis.

The other weird thing is I put together a test script utilizing the create view method you gave earlier and that one works perfectly. Strange?

Kent Olsen

What database and version are you using?

jisoo411

ASKER

I'm not exactly sure but I'm using UDB on iSeries.

Kent Olsen

Ahhhh.

iSeries. IBM actually makes 3 different DB2s -- iSeries, Mainframe, and LUW. The SQL that we're testing works on LUW and should work on the mainframe. iSeries is different.

Here's a common example:

DELETE FROM LIBA/FILEA F1 WHERE RRN(F1) > (select MIN(RRN(F2)) FROM FILEA F2 WHERE F2.INVNO=F1.INVNO)

Can you morph that into something what will work for you?

Kent

jisoo411

ASKER

Yeah I'm starting to realize how different iSeries DB2 is (I come from an AIX/DB2 9.1 background). I'm not really familiar enough with iSeries UDB syntax to adjust that delete statement. I could try to make that create view method work though since it executes well manually (I'm working this stuff into Informatica).

ASKER CERTIFIED SOLUTION

Kent Olsen

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

jisoo411

ASKER

Awesome, that delete statement worked! Thanks a bunch Kent.