Leonard Gallion asked:

How to delete duplicate rows using the length function in PostgreSQL 9.6

I am trying to delete some duplicate records in a PostgreSQL 9.6 database. In particular, I want to delete any entry that has a duplicate step_id (bigint) AND the smaller size of a bytea field called "doc2", as measured by the length function.


For example, in this section of the data:

+----------+-------------+----------+
| step_id  | doc2        | doc2size |
+----------+-------------+----------+
| 10580774 | binary data |      318 |
| 10580774 | binary data |      369 |
| 10580775 | binary data |      318 |
| 10580775 | binary data |      379 |
| 10580776 | binary data |      318 |
| 10580776 | binary data |      373 |
| 10580777 | binary data |      318 |
| 10580777 | binary data |      370 |
+----------+-------------+----------+

 

I would want to delete the 1st, 3rd, 5th and 7th rows of data (i.e. those with the smallest doc2size among the duplicated entries).


Here is the query I used to pull this data:

select step_id, doc2, length(doc2) as doc2size
from "TAStepsDoc" X
WHERE (select count(*) from "TAStepsDoc" Y where X.step_id = Y.step_id) > 1
ORDER BY step_id


I have figured out how to delete the duplicate rows generally, but I can't seem to work out how to delete the duplicate row with the smaller length(doc2) result. Any assistance would be appreciated.

Tags: PostgreSQL, SQL, Databases

PortletPaul

Is there a unique identifier for each row in the table? e.g. "ID"

If not, the problem you will face is that doc2 might be equal in size in some instances, and when this happens it will be impossible to match a single row unless there is an ID field (or equivalent). The following will work on your sample data:

CREATE TABLE mytable(
   step_id  INTEGER  NOT NULL
  ,doc2     VARCHAR(11) NOT NULL
  ,doc2size INTEGER  NOT NULL
);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (1,'same size',24);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (1,'same size',24);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580774,'318',318);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580774,'369',369);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580775,'318',318);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580775,'379',379);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580776,'318',318);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580776,'373',373);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580777,'318',318);
INSERT INTO mytable(step_id,doc2,doc2size) VALUES (10580777,'370',370);

with CTE as (
  select *
       , row_number() over (partition by step_id order by doc2size DESC) as rn
  from mytable
)
delete from mytable t
where exists (
      select null
      from CTE
      where t.step_id = CTE.step_id
        and t.doc2 = CTE.doc2
        and CTE.rn > 1);


BUT it will delete BOTH of the highlighted rows that I added (the two step_id = 1 rows of equal size).
The result is:
+----------+------+----------+
| step_id  | doc2 | doc2size |
+----------+------+----------+
| 10580774 |  369 |      369 |
| 10580775 |  379 |      379 |
| 10580776 |  373 |      373 |
| 10580777 |  370 |      370 |
+----------+------+----------+



ALSO, you have the added problem that doc2size isn't part of the persistent data, and I have assumed that I can use an equality predicate on the binary data, which may be untrue.

The problem would be much easier, and more accurate, if every row had a unique identifier.

see: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=aed0be67b99da1645ae8420bb11016a4
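The delete pattern above can also be rehearsed outside Postgres. Below is a minimal sketch using Python's bundled sqlite3 module (window functions need SQLite 3.25+), with invented sample values, matching on step_id and doc2 exactly as the query above does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable(step_id INTEGER NOT NULL, doc2 TEXT NOT NULL);
INSERT INTO mytable VALUES
  (10580774, 'XXX'), (10580774, 'XXXXXX'),
  (10580775, 'XX'),  (10580775, 'XXXX');
""")

# Rank each row within its step_id by descending length, then delete any row
# whose (step_id, doc2) pair is ranked below first place.
conn.execute("""
WITH cte AS (
  SELECT step_id, doc2,
         ROW_NUMBER() OVER (PARTITION BY step_id ORDER BY LENGTH(doc2) DESC) AS rn
  FROM mytable
)
DELETE FROM mytable
WHERE EXISTS (SELECT 1 FROM cte
              WHERE mytable.step_id = cte.step_id
                AND mytable.doc2 = cte.doc2
                AND cte.rn > 1)
""")

rows = conn.execute("SELECT step_id, doc2 FROM mytable ORDER BY step_id").fetchall()
print(rows)  # → [(10580774, 'XXXXXX'), (10580775, 'XXXX')]
```

As noted above, identical (step_id, doc2) pairs of equal length would all match rn > 1 and be deleted together, which is why a real row identifier is preferable.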
 
Leonard Gallion

ASKER
Unfortunately there is no unique identifier right now. That field was supposed to be step_id, BUT due to some unknown database error, duplicate entries were created for step_id even though it was marked as a unique field. Altogether there are 1539 duplicate rows that need to go so I can reset the step_id field to be unique.

Also, unfortunately, there are duplicate step_id & doc2 rows whose doc2 values have equal lengths, so I need to retain one of the two.
 
Is there any way to create a new, temporary column and use something like ROW_NUMBER() to create a unique ID in that new field, use that ID to clean up the data, then drop that column after the cleanup?

The overall project is that I need to restore this (very large) Postgres database onto a new server (using backup and restore). To do that, I must temporarily remove the unique constraint on this table so the restore completes without errors, then clean these duplicate records through one or more queries, and finally restore the constraint and put the database into production to complete the migration.
ASKER CERTIFIED SOLUTION
PortletPaul

PortletPaul

You can add the needed column with either of these:
ALTER TABLE mytable ADD COLUMN id SERIAL PRIMARY KEY


or
ALTER TABLE mytable ADD COLUMN id SERIAL



see it running here
https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=30cf83e419d928dd616f34c8a99e3adb

Once you have cleaned-up the data then you can remove that column if needed.
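For anyone wanting to rehearse the add-column / delete / drop-column cycle before running it on real data, here is a rough sqlite3 sketch with invented values (SQLite has no SERIAL, so the temporary id is backfilled from rowid instead, and DROP COLUMN needs SQLite 3.35+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable(step_id INTEGER NOT NULL, doc2 TEXT NOT NULL);
INSERT INTO mytable VALUES (10580776, 'XXXXX'), (10580776, 'XX');
""")

# 1. Add a temporary unique id (SQLite lacks SERIAL, so backfill from rowid)
conn.execute("ALTER TABLE mytable ADD COLUMN tempid INTEGER")
conn.execute("UPDATE mytable SET tempid = rowid")

# 2. Delete every row that is not the longest doc2 for its step_id
conn.execute("""
WITH cte AS (
  SELECT tempid,
         ROW_NUMBER() OVER (PARTITION BY step_id ORDER BY LENGTH(doc2) DESC) AS rn
  FROM mytable
)
DELETE FROM mytable WHERE tempid IN (SELECT tempid FROM cte WHERE rn > 1)
""")

# 3. Drop the scaffolding column once the cleanup is done (SQLite 3.35+)
conn.execute("ALTER TABLE mytable DROP COLUMN tempid")

rows = conn.execute("SELECT step_id, doc2 FROM mytable").fetchall()
print(rows)  # → [(10580776, 'XXXXX')]
```

The same three steps apply in Postgres with SERIAL and the queries shown in this thread; this is only a stand-in to check the logic.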
Leonard Gallion

ASKER
A huge thanks for the link to dbfiddle! I was looking for a sandbox to test queries on and couldn't find one before :)

Let me play with the query you built for a bit and see if I can swap out the doc2size field with length(doc2) instead.
PortletPaul

You can include length(doc2) in the calculation of the row numbers, like this:

with CTE as (
  select *
       , row_number() over (partition by step_id order by length(doc2) DESC) as rn
  from mytable
)
delete from mytable t
where exists (select NULL
              from CTE
              where t.id = CTE.id and CTE.rn > 1);


Leonard Gallion

ASKER
Just to end the documentation chain, here are the full steps combined:
https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=afdff883129cde72f77eb16eb8218f88
CREATE TABLE mytable(
   step_id  INTEGER     NOT NULL,
   doc2     VARCHAR(11) NOT NULL
);
INSERT INTO mytable(step_id,doc2) VALUES (1,'same size');
INSERT INTO mytable(step_id,doc2) VALUES (1,'same size');
INSERT INTO mytable(step_id,doc2) VALUES (10580774,'XXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580774,'XXXXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580775,'XX');
INSERT INTO mytable(step_id,doc2) VALUES (10580775,'XXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580776,'XXXXX');
INSERT INTO mytable(step_id,doc2) VALUES (10580776,'XX');
INSERT INTO mytable(step_id,doc2) VALUES (10580777,'X');
INSERT INTO mytable(step_id,doc2) VALUES (10580777,'XXXX');


ALTER TABLE mytable ADD COLUMN tempid SERIAL


/* ACTUALLY DELETE THE ROWS */
with CTE as (
select
*
, row_number() over(partition by step_id order by length(doc2) DESC) as rn
from mytable
)
delete
from mytable t
where exists (select NULL
              from CTE
              where t.tempid = CTE.tempid and CTE.rn > 1)


ALTER TABLE mytable DROP COLUMN tempid



select *, length(doc2) from mytable


+----------+-----------+--------+
| step_id  | doc2      | length |
+----------+-----------+--------+
|        1 | same size |      9 |
| 10580774 | XXXXXX    |      6 |
| 10580775 | XXXX      |      4 |
| 10580776 | XXXXX     |      5 |
| 10580777 | XXXX      |      4 |
+----------+-----------+--------+

slightwv (䄆 Netminder)

FYSA:
You can do this w/o the tempid.

/* ACTUALLY DELETE THE ROWS */
with CTE as (
select
ctid
, row_number() over(partition by step_id order by length(doc2) DESC) as rn
from mytable
)
delete
from mytable t
where ctid in (select ctid
               from CTE
               where rn > 1);

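ctid is PostgreSQL's built-in physical row locator, which is why the query above needs no extra column even when duplicate rows are byte-for-byte identical. A rough sqlite3 analogue (SQLite's rowid plays the ctid role here; values are invented) shows that exactly one of two identical rows survives:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable(step_id INTEGER NOT NULL, doc2 TEXT NOT NULL);
INSERT INTO mytable VALUES
  (1, 'same size'), (1, 'same size'),   -- identical duplicates
  (10580777, 'X'),  (10580777, 'XXXX');
""")

# rowid stands in for Postgres ctid: a per-row locator that needs no extra column
conn.execute("""
WITH cte AS (
  SELECT rowid AS rid,
         ROW_NUMBER() OVER (PARTITION BY step_id ORDER BY LENGTH(doc2) DESC) AS rn
  FROM mytable
)
DELETE FROM mytable WHERE rowid IN (SELECT rid FROM cte WHERE rn > 1)
""")

rows = conn.execute("SELECT step_id, doc2 FROM mytable ORDER BY step_id").fetchall()
print(rows)  # → [(1, 'same size'), (10580777, 'XXXX')]
```

One caveat worth flagging as an assumption: in Postgres itself, ctid values can change under concurrent updates or VACUUM FULL, so this cleanup is best run while the table is quiescent.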