I am trying to delete some duplicate records in a Postgres 9.6 database. In particular I want to delete any the entry that has a duplicate step_id (bigint) AND the smallest size of a bytea field called "doc2" using the length function.
For example in this section of the data:
step_id | doc2 | doc2size |
10580774 | binary data | 318 |
10580774 | binary data | 369 |
10580775 | binary data | 318 |
10580775 | binary data | 379 |
10580776 | binary data | 318 |
10580776 | binary data | 373 |
10580777 | binary data | 318 |
10580777 | binary data | 370 |
I would want to delete the 1st, 3rd, 5th and 7th rows of data (ie. those with the smallest doc2size of the duplicated entries)
Here is the query to I used to pull this data:
select step_id, doc2, length(doc2) as doc2size
from "TAStepsDoc" X
WHERE (select count(*) from "TAStepsDoc" Y where X.step_id = Y.step_id) > 1
ORDER BY step_id
I have figured out how to delete the duplicate rows generally but I can't seem to work out how to delete the duplicate row with the smallest length(doc2) result. Any assistance would be appreciated.
ALTER TABLE mytable ADD COLUMN id SERIAL PRIMARY KEY
orALTER TABLE mytable ADD COLUMN id SERIAL
with CTE as (
select
*
, row_number() over(partition by step_id order by length(doc2) DESC) as rn
from mytable
)
delete
from mytable t
where exists (select NULL
from CTE
where t.id = CTE.id and CTE.rn > 1
CREATE TABLE mytable(
step_id INTEGER NOT NULL, doc2 VARCHAR(11) NOT NULL
);
INSERT INTO mytable(step_id,doc2) VALUES (1,'same size');
INSERT INTO mytable(step_id,doc2) VALUES (1,'same size');
INSERT INTO mytable(step_id,doc2) VALUES (10580774,'XXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580774,'XXXXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580775,'XX');
INSERT INTO mytable(step_id,doc2) VALUES (10580775,'XXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580776,'XXXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580776,'XX');
INSERT INTO mytable(step_id,doc2) VALUES (10580777,'X');
INSERT INTO mytable(step_id,doc2) VALUES (10580777,'XXXX');
ALTER TABLE mytable ADD COLUMN tempid SERIAL
/* ACTUALLY DELETE THE ROWS */
with CTE as (
select
*
, row_number() over(partition by step_id order by length(doc2) DESC) as rn
from mytable
)
delete
from mytable t
where exists (select NULL
from CTE
where t.tempid = CTE.tempid and CTE.rn > 1)
ALTER TABLE mytable DROP COLUMN tempid
select *, length(doc2)from mytable
step_id | doc2 | length |
---|---|---|
1 | same size | 9 |
10580774 | XXXXXX | 6 |
10580775 | XXXX | 4 |
10580776 | XXXXX | 5 |
10580777 | XXXX | 4 |
/* ACTUALLY DELETE THE ROWS */
with CTE as (
select
ctid
, row_number() over(partition by step_id order by length(doc2) DESC) as rn
from mytable
)
delete
from mytable t
where ctid in (select ctid
from CTE
where rn > 1)
If not, the problem you will face is that doc2 might be equal in size in some instances and when this happens it will be impossible to match to a single row unless there is an ID field (or equivalent). The following will work on your sample data
Open in new window
BUT it will delete BOTH highlighted rows that I addedThe result is:
Open in new window
ALSO you have the added problem that the doc2size isn't part of the persistent data, and I have assumed that I can usean equality predicate on the binary data which may be untrue.
The problem would be much easier, and more accurate, if every row had a unique identifier.
see: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=aed0be67b99da1645ae8420bb11016a4