Link to home
Start Free TrialLog in
Avatar of trevor1940
trevor1940

asked on

SQL: query for removing duplicates

Hi

I'm attempting to clean up a database and need to remove duplicate entries

manually I'm doing this
select * from person where person.tmdb_id = 17867;
-- Gives 
id            name                       tmdb_id      profile_path
412	Thomas McCarthy	17867	/bz5ppWS58ryvKG4QQpIsqRXY68w.jpg
28522	Tom McCarthy	17867	/bz5ppWS58ryvKG4QQpIsqRXY68w.jpg

-- Pick 1 usually the lowest id 

update movielinkperson set personId = 412 where personId = 28522;
delete  from person where id = 28522;

Open in new window



Is there a way I can do this via script?

this found 58 entries (OK not a big number but I'm lazy)

select COUNT(*) from person GROUP BY person.tmdb_id  HAVING  COUNT(*)>1;

Open in new window

Avatar of lcohan
lcohan
Flag of Canada image

To be 100% safe I suggest you test the script below against a copy of the actual table first to make sure it deleted duplicate records you want and nothing else. This assuming the "person" table is not huge - which by its name it cant be so...please follow the steps below to remove the duplicates from a testing person_temp table and only if the results are as expected then run the same against your actual "person" table. Even there...I suggest you make a copy of the "person" table to a "person_bak" AS IS before removing the duplicates - just in case...
Note the script below keeps the "earliest" entry from the table or "the lowest id" as you mentioned however if you want to keep the "latest" then just replace the MIN with MAX and the ASC with DESC in the ORDER BY below.



--create a copy of your table for testing purpose
select * into person_tmp from person;

--check table to see duplicate values
select COUNT(tmdb_id),tmdb_id,min(id) from person_tmp GROUP BY tmdb_id  HAVING  COUNT(*)>1;

--Deleting duplicate values
DELETE a
FROM person_tmp a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY tmdb_id ORDER BY ID ASC) AS rnk FROM person_tmp ) b
ON a.ID=b.ID
WHERE b.rnk>1

--check table to make sure duplicate data was removed
select COUNT(tmdb_id),tmdb_id,min(id) from person_tmp GROUP BY tmdb_id  HAVING  COUNT(*)>1;
Avatar of trevor1940
trevor1940

ASKER

Hi

That seems to remove the duplicates from peron_tmp but How dose this update the movielinkperson table

as in the opening question?

update movielinkperson set personId = 412 where personId = 28522;
OR
update movielinkperson set personId = <KEPT personId > where personId = <REMOVED personId >;

Open in new window

Do you have all the person IDs in movielinkperson table?
ASKER CERTIFIED SOLUTION
Avatar of Sharath S
Sharath S
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi Run your query in sqlfiddle which looks like it's working

FYI  The  movielinkperson Table has
id, movie_id , personid , and role it basically links a person to a movie

I wrote a short C# script to do this