Link to home
Start Free TrialLog in
Avatar of Richard
RichardFlag for United States of America

asked on

Remove duplicate entries from MSSQL table

I have a data table (Table S) which has about 60 fields.  Entries are considered duplicate if 3 fields (S.FName, S.LName, S.Company) contain the same data .  Some of the duplicates were used to create an entry in another table (Table P).  A field in table S (S.SID) is used to link the two tables.  What I need to do is remove the entries from table S which don't have an entry in table P.  I can find the duplicates using:
select Company,FName,LName,count(Sid) from TableS 
	group by  Company,LName ,FName having count(Sid)> 1 order by Company

Open in new window

My problem is that I need the S.SID value to determine if that record has been used to create an entry in table P.  If I include S.SID in the Select statement, I need to include it in the Group By statement or I get an error.  If it is in the Group By statement, I get no data because the S.SID values are unique.
Any thoughts on this.  I'm sure there must be same way to do this in T-SQL.
Avatar of Jim Horn
Jim Horn
Flag of United States of America image

>What I need to do is remove the entries from table S which don't have an entry in table P.
DELETE s
FROM TableS as s
   -- LEFT means include all the rows in s ...
   LEFT JOIN TableP p ON s.SID = p.SID
-- ... that do not have a related row in p
WHERE p.SID IS NULL

Open in new window


To handle the duplicates issue, check out SQL Server Delete Duplicate Rows, which has T-SQL code that illustrates how to use custom 'what columns define duplicate' and 'what columns (if any) define which one to keep/delete' to pull off a delete.

Good luck.
Jim
To find them use something like :

select Company,FName,LName,count(Sid) from TableS
left join TableP on TableS.Sid = TableP.Sid
where TableP.Sid is NULL
group by  Company,LName ,FName having count(Sid)> 1 order by Company

what version of SQL are you on ? There are some window functions from 2012 onwards that makes count() a bit more powerful... Rest assured you can definitely do this in T-SQL.
Sorry Jim, werent any posts when I opened...
Avatar of Richard

ASKER

Hi Jim,
For testing, I modified your solution to:
Select s.SID,s.Company ,  s.FName ,s.LName ,p.SID 
FROM TableS as s
   -- LEFT means include all the rows in s ...
   LEFT JOIN TableP  p ON s.SID = p.SID
-- ... that do not have a related row in p
WHERE p.SID IS NULL order by s.Company

Open in new window

Always test first!
This found ALL the rows in TableS which don't have an entry in TableP!
My original table contains data as:
SID        Company     FName     LName      (A bunch of other fields which may or may not be the same - these are "Don't Care")
12345     Comp1        John        Jones
23412     Comp1        John        Jones     (Duplicate of the one above)
43214     Comp2        Fred        Jones
45673     Comp3        Alex        Wilson
23478     Comp4        Fred       Jones     (Not duplicate - different company)
43542     Comp2        Fred       Jones     (Duplicate with SID 43214 above)

My original data file contains about 6800 records of which about 800 are duplicated.  When the deletions are done, the new table will have 6400.  There is no more that 1 duplicate of an entry (ex: the first two rows shown above make up a duplicate.  There are no more "Comp1,John,Jones" entries in the table.

I only want to delete the rows from TableS which 1) are duplicates and 2) don't have an entry in TableP.  My feeling is that some type of CTE would find the duplicates, using the logic I showed above, then from that resultant table see which ones don't exist in TableP using the SID value.
Avatar of Richard

ASKER

Hi Mark,
I'm using SQL Server 2008
Hello TopCatOnyx

Please try my solution given below

Delete u FROM
(
      SELECT SID,Company,FName,LName,count(*) OVER (PARTITION BY Company,FName,LName) counts from TableS
) as u WHERE u.counts > 1
AND  NOT EXISTS ( SELECT NULL FROM tableP p WHERE u.SID = p.SID )

Vaibhav
>> Always test first!

Absolutely !

But you do need to test for the dupe part as well. Doing a left join will find entries in TableS that dont exist in TableP which you have found.

But they may or may not be duplicates so you do need the group by for TableS as well - otherwise you only have half the answer.

I am now rethinking that the SID to remove is the SID that created the dupe and remove that SID because that one alone doesnt also exist in TableP whereas the other one does.

So, no point for checking for SID's until AFTER the dupes are found. once the dupes are found, then we can go looking for candidate sids....

Right ?

So, once we know the candidate TableS, we then use that to find our orphans. We just need to know which SID to delete

Using :
12345     Comp1        John        Jones
23412     Comp1        John        Jones     (Duplicate of the one above)

If 12345 doesnt exist in TableP but 23412 does, then we need to target 12345 for delete

But what if neither SID's exist TableP ?

Do we keep them both, or just keep at least one. What criteria - maybe keep the lowest SID, or keep the highest SID ?  

We probably need something more like

Select D.Company,D.FName,D.LName, S.SID from
   (select Company,FName,LName,count(Sid) SC from TableS
      group by  Company,LName ,FName having count(Sid)> 1 ) D
inner join TableS S on D.Company = S.company and D.FName = S.FName and D.LName = D.LName

then we have our list of SID's to check

Sound OK so far ?

Then just need feedback on which SID

(and will provide better code - mainly concept to make very sure I understand)
Hello TopCatOnyx

Sorry I have updated my last comment. Please try.

Delete u FROM
(
      SELECT SID,Company,FName,LName,ROW_NUMBER() OVER (PARTITION BY Company,FName,LName) row_numbe from TableS
) as u WHERE u.row_numbe > 1
AND  NOT EXISTS ( SELECT NULL FROM tableP p WHERE u.SID = p.SID )

Vaibhav
ASKER CERTIFIED SOLUTION
Avatar of Mark Wills
Mark Wills
Flag of Australia image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Richard

ASKER

Hi Vaibhav and Mark,
Sorry about the response delay - had to get some sleep.
Anyway,
Vaibhav:
I tried your modified code (using SELECT rather than DELETE!!!).  It returned all the duplicate rows (Both rows) which don't exist in TableP.  We only want to delete one of those rows and keep the other.
Select u.sid,u.Company ,u.FName ,u.LName  FROM
(
      SELECT SID,Company,FName,LName,count(*) OVER (PARTITION BY Company,FName,LName) counts from TableS
) as u WHERE u.counts > 1 
AND  NOT EXISTS ( SELECT NULL FROM TableP  p WHERE u.SID = p.SID )

Open in new window

This is good - it gives the set from which we delete.  However, as written, it will delete all the duplicates (both records).  We don't care which one is deleted, but we need to keep one.

Mark:
The SID is auto generated when the TableS data is input.  It is the IDENTITY field.  When an entry is made in TableP, the TableS IDENTITY is used to link the two records together.  So we take the TableS.SID to see if a record exists in TableP.  There is no possibility of triplicates.

The SID field is the only field which is common to both tables.

The SID field doesn't exist in the dups_CTE table, so can't be used in subsequent SELECTs.  That is the main problem I had when I started this.  You can't have the SID field in the SELECT statement because then it needs to be in the GROUP BY statement, but all SIDs are different (as it is the IDENTITY field of the TableS).

Also, there is a possibility that there is more than 1 record in TableP for a given SID.  That shouldn't make any difference, if ANY record exists in TableS then the associated record in TableS is NOT deleted.  There is also the possibility that records exist in TableP for both SIDs from TableS - in which case both of the duplicate records are kept.

Also, finally,
Hopefully this is a one shot need - to fix some data which got corrupted (same data entered into the table twice), however I want to keep it around in case someone screws up in the future!
Understand all of that - thanks.

Do you even try the solution ?

It does work even though it might read as if I have not, or have included, errors or omissions. the CTE (Common Table Expression) was tested using the temporary tables (also provided).

For your own use, just do a global replace of #TableS for your TableS and #TableP for your TableP.

If you have tried it, what happened ? Did you get errors ?

Oh, and the dupes_CTE is NOT a table it is the first part of a CTE - a result set that can be referenced and reused.
Avatar of Richard

ASKER

Hi Mark,
I did try it. In the second part - on line 11 your inner join need to use the SID field (only field which is common to both tables).  I replaced the references to the "Company, FName, & LNAME" with the SID as "inner join TableS S on S.SubID = D.Subid". As I noted in my last post, the SID field is the only common field between the tow tables.  D.SID  doesn't exist (SQL gives the error: "The multi-part identifier 'S.SID' could not be bound") in the dups_CTE.  When executing the code, I get the error: "Invalid column name 'Sid'".

I agree with you that a CTE doesn't necessarily create an actual table, it does create a "virtual" table which has many of the capabilities of a "real" table.
Line 11 must not include SID - we are still gathering, it is essentially joining back on itself so we can get to the Sid
Line 12 is where it determines the existence in the second table.

The theory is you only have to change #TableS to whatever your real TableS is called, and #TableP to whatever it is called.

Should Not need to change other parts. Other than real table names and real column names. Honestly.

Seriously, dont change the code, just use the temp tables as an example first and work through those first.
I have been known to write code where the only comment is :
Dont touch this part it is magic

Might be guilty of it here, so let me explain

Step 1 - get dupes  

select Company,FName,LName
  from #TableS  
  group by  Company,LName ,FName
  having count(Sid)> 1

Step 2 -- Find if the duplicates have Sids existing in other table but to get the SID we need to go back to detailed data in TableS, so, join back on itself

  Select D.Company,D.FName,D.LName, S.SID, ROW_NUMBER() OVER (PARTITION BY D.Company,D.FName,D.LName order by SID) RN  
  from dupes_CTE D
  inner join #TableS S on D.Company = S.company and D.FName = S.FName and D.LName = S.LName  
  where SID not in (select sid from #TableP)

Step 3 -- get the Sids we probably dont want to keep

  Select D.Sid,D.Company,D.Fname,D.Lname , case when H.RN is NULL then 'Delete this one' else 'Keeper' end Msg              
  from Dupe_Hdr D
  left join Dupe_Hdr H on D.Company = H.company and D.FName = H.FName and D.LName = H.LName and D.RN < H.RN                    
  where H.RN is NULL      --   see Notes 2 and 3

If you dont understand or disagree with the steps, then lets discuss before you change any code.

I cannot see your changes, and it sounds as if you doubt my code - let's talk....
Avatar of Richard

ASKER

Hi Mark,
Sorry - it wasn't intuitive obvious (to me) what you were doing in line 11.  I don't doubt your code, just that I (and many others on EE) have gotten responses with typos.  All that aside....

I took your suggestion and set up a new database with just you tables and data. This looks good. I end up with the records from TableS which should be dropped.

Now all I have left is to filter TableS to a select few (about 6,000 out of 2.5 mil or so total records in the real table).  The field which distinguishes the subset is PName and contains a value like 'PN1217T4'.  Only the 6,000 records contain that value. There is a chance that duplicates may exist in other subsets, but I'm not trying to fix them (nor can I change anything outside of my subset of 6000 records).  I tried using a "WHERE" clause on the CTE - after the "HAVING" but that doesn't work.  I suppose we could add the PName field to the SELECT statement.  Is there a better way?
You mean In step 1 ?

If that is in step 1 then the 'where' must be after the 'from' but before the 'group by'

select Company,FName,LName
  from #TableS  
where PName = 'PN1217T4'
  group by  Company,LName ,FName
  having count(Sid)> 1


And completely understand about typo's
Which is why I applauded your "Always test first" approach :)
Avatar of Richard

ASKER

Hi Mark,
one last stupid question -
I get a list from the final select statement which contains all the TableS records to delete.  In that list is the SID of the record to delete.  How do I actually delete the records?  I tried delete from TableS where SID = Bad_Sid.SID and ...= Bad_Sid(2), but neither works.
Hello TopCatOnyx,

Please take backup of your table TableS. After that please use below command. I think it will work.

;WITH CTE AS
(
      Select u.sid,u.Company ,u.FName ,u.LName , ROW_NUMBER() OVER (PARTITION BY u.Company ,u.FName ,u.LName ORDER BY u.sid DESC) rnk FROM
      (
              SELECT SID,Company,FName,LName,count(*) OVER (PARTITION BY Company,FName,LName) counts from TableS
      ) as u WHERE u.counts > 1
      AND  NOT EXISTS ( SELECT NULL FROM TableP  p WHERE u.SID = p.SID )
)
DELETE FROM CTE WHERE rnk > 1

Vaibhav
Avatar of Richard

ASKER

Hi Vaibhav and Mark,
Vaibhav:
I tried your latest idea, and nothing happened.  I'm using  small test tables, and nothing was deleted from TableS

Mark:
Works like a champ.
Avatar of Richard

ASKER

Super solution Mark.
Very responsive - I needed to have the Database fixed by Monday!
Thank you very much.
Sorry TopCatOnyx,

Didnt see your 'one last question' until just now...

But judging by your comment above, I guess you figured it out.

The delete statement was in the code, just commented out. Hope you found it OK.

Thank you very much for your patience and showing trust in the solution. Much appreciated.

Cheers,
Mark Wills