Remove duplicate entries from MSSQL table

I have a data table (Table S) which has about 60 fields.  Entries are considered duplicate if 3 fields (S.FName, S.LName, S.Company) contain the same data .  Some of the duplicates were used to create an entry in another table (Table P).  A field in table S (S.SID) is used to link the two tables.  What I need to do is remove the entries from table S which don't have an entry in table P.  I can find the duplicates using:
select Company,FName,LName,count(Sid) from TableS 
	group by  Company,LName ,FName having count(Sid)> 1 order by Company

Open in new window

My problem is that I need the S.SID value to determine if that record has been used to create an entry in table P.  If I include S.SID in the Select statement, I need to include it in the Group By statement or I get an error.  If it is in the Group By statement, I get no data because the S.SID values are unique.
Any thoughts on this.  I'm sure there must be same way to do this in T-SQL.
TopCatOnyxAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Jim HornMicrosoft SQL Server Developer, Architect, and AuthorCommented:
>What I need to do is remove the entries from table S which don't have an entry in table P.
DELETE s
FROM TableS as s
   -- LEFT means include all the rows in s ...
   LEFT JOIN TableP p ON s.SID = p.SID
-- ... that do not have a related row in p
WHERE p.SID IS NULL

Open in new window


To handle the duplicates issue, check out SQL Server Delete Duplicate Rows, which has T-SQL code that illustrates how to use custom 'what columns define duplicate' and 'what columns (if any) define which one to keep/delete' to pull off a delete.

Good luck.
Jim
0
Mark WillsTopic AdvisorCommented:
To find them use something like :

select Company,FName,LName,count(Sid) from TableS
left join TableP on TableS.Sid = TableP.Sid
where TableP.Sid is NULL
group by  Company,LName ,FName having count(Sid)> 1 order by Company

what version of SQL are you on ? There are some window functions from 2012 onwards that makes count() a bit more powerful... Rest assured you can definitely do this in T-SQL.
0
Mark WillsTopic AdvisorCommented:
Sorry Jim, werent any posts when I opened...
0
Ultimate Tool Kit for Technology Solution Provider

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy now.

TopCatOnyxAuthor Commented:
Hi Jim,
For testing, I modified your solution to:
Select s.SID,s.Company ,  s.FName ,s.LName ,p.SID 
FROM TableS as s
   -- LEFT means include all the rows in s ...
   LEFT JOIN TableP  p ON s.SID = p.SID
-- ... that do not have a related row in p
WHERE p.SID IS NULL order by s.Company

Open in new window

Always test first!
This found ALL the rows in TableS which don't have an entry in TableP!
My original table contains data as:
SID        Company     FName     LName      (A bunch of other fields which may or may not be the same - these are "Don't Care")
12345     Comp1        John        Jones
23412     Comp1        John        Jones     (Duplicate of the one above)
43214     Comp2        Fred        Jones
45673     Comp3        Alex        Wilson
23478     Comp4        Fred       Jones     (Not duplicate - different company)
43542     Comp2        Fred       Jones     (Duplicate with SID 43214 above)

My original data file contains about 6800 records of which about 800 are duplicated.  When the deletions are done, the new table will have 6400.  There is no more that 1 duplicate of an entry (ex: the first two rows shown above make up a duplicate.  There are no more "Comp1,John,Jones" entries in the table.

I only want to delete the rows from TableS which 1) are duplicates and 2) don't have an entry in TableP.  My feeling is that some type of CTE would find the duplicates, using the logic I showed above, then from that resultant table see which ones don't exist in TableP using the SID value.
0
TopCatOnyxAuthor Commented:
Hi Mark,
I'm using SQL Server 2008
0
Vaibhav GoelMSBI , SQL ConsultantCommented:
Hello TopCatOnyx

Please try my solution given below

Delete u FROM
(
      SELECT SID,Company,FName,LName,count(*) OVER (PARTITION BY Company,FName,LName) counts from TableS
) as u WHERE u.counts > 1
AND  NOT EXISTS ( SELECT NULL FROM tableP p WHERE u.SID = p.SID )

Vaibhav
0
Mark WillsTopic AdvisorCommented:
>> Always test first!

Absolutely !

But you do need to test for the dupe part as well. Doing a left join will find entries in TableS that dont exist in TableP which you have found.

But they may or may not be duplicates so you do need the group by for TableS as well - otherwise you only have half the answer.

I am now rethinking that the SID to remove is the SID that created the dupe and remove that SID because that one alone doesnt also exist in TableP whereas the other one does.

So, no point for checking for SID's until AFTER the dupes are found. once the dupes are found, then we can go looking for candidate sids....

Right ?

So, once we know the candidate TableS, we then use that to find our orphans. We just need to know which SID to delete

Using :
12345     Comp1        John        Jones
23412     Comp1        John        Jones     (Duplicate of the one above)

If 12345 doesnt exist in TableP but 23412 does, then we need to target 12345 for delete

But what if neither SID's exist TableP ?

Do we keep them both, or just keep at least one. What criteria - maybe keep the lowest SID, or keep the highest SID ?  

We probably need something more like

Select D.Company,D.FName,D.LName, S.SID from
   (select Company,FName,LName,count(Sid) SC from TableS
      group by  Company,LName ,FName having count(Sid)> 1 ) D
inner join TableS S on D.Company = S.company and D.FName = S.FName and D.LName = D.LName

then we have our list of SID's to check

Sound OK so far ?

Then just need feedback on which SID

(and will provide better code - mainly concept to make very sure I understand)
0
Vaibhav GoelMSBI , SQL ConsultantCommented:
Hello TopCatOnyx

Sorry I have updated my last comment. Please try.

Delete u FROM
(
      SELECT SID,Company,FName,LName,ROW_NUMBER() OVER (PARTITION BY Company,FName,LName) row_numbe from TableS
) as u WHERE u.row_numbe > 1
AND  NOT EXISTS ( SELECT NULL FROM tableP p WHERE u.SID = p.SID )

Vaibhav
0
Mark WillsTopic AdvisorCommented:
Havent heard back about which Sid to keep, and  been thinking ...
what if both Sid's have entries in TableP - does that mean they are not Dupes ?
Do we need to cater for triplicates ?
Relying on Sid  as per  ...because the S.SID values are unique

So, while I have been thinking and waiting for feedback, thought I would play with some test data. So, I used your example above :
-- Drop table #TableP
Create table #TableP (SID int)
go
Insert #TableP values (23412)
;
-- Drop table #TableS
Create table #TableS (SID int, Company varchar(20), Fname varchar(20), Lname varchar(20))
go
insert #TableS values
(12345,'Comp1','John','Jones'),
(23412,'Comp1','John','Jones'), --     (Duplicate of the one above)
(43214,'Comp2','Fred','Jones'),
(45673,'Comp3','Alex','Wilson'),
(23478,'Comp4','Fred','Jones'), --     (Not duplicate - different company)
(43542,'Comp2','Fred','Jones') --     (Duplicate with SID 43214 above)
;

Open in new window


and now some query stuff - a lot more involved, but very deliberately step by step. Might be able to further optimise, but that depends on being a once off or not.
;with dupes_CTE as
( -- get duplicates
  select Company,FName,LName 
  from #TableS  
  group by  Company,LName ,FName 
  having count(Sid)> 1
), Dupe_Hdr as          
( -- Find if the duplicates have Sids existing in other table
  Select D.Company,D.FName,D.LName, S.SID, ROW_NUMBER() OVER (PARTITION BY D.Company,D.FName,D.LName order by SID) RN         -- See Note 1
  from dupes_CTE D
  inner join #TableS S on D.Company = S.company and D.FName = S.FName and D.LName = S.LName
  where SID not in (select sid from #TableP)
), Bad_Sid as
( -- get the Sids we probably dont want to keep
  Select D.Sid,D.Company,D.Fname,D.Lname --, case when H.RN is NULL then 'Delete this one' else 'Keeper' end Msg               -- see note 2
  from Dupe_Hdr D
  left join Dupe_Hdr H on D.Company = H.company and D.FName = H.FName and D.LName = H.LName and D.RN < H.RN                     
  where H.RN is NULL                                                                                                           -- see note 3
) -- Do something with offending Sids
Select * from Bad_Sid order by 2,3,4,1                                                                                         -- see note 4
--delete #Tables where Sid in (select sid from Bad_Sid)                                                                        -- see note 5

/*
Note 1  -  this is the part where we need to decide the Order By 
Note 2  -  You can uncomment after the D.Lname to get some diagnostics 
Note 3  -  CAUTION : To make use of diagnostics, can comment out the 'where H.RN is NULL' 
        -  MUST use 'where H.RN is NULL' after diagnostics VERY IMPORTANT before you run the delete
Note 4  -  The select will show the Sid that as destined for Delete
Note 5  -  If the select is Ok, then make it do the delete instead of select - comment out select and uncomment the delete.
        -  MUST use 'where H.RN is NULL' after diagnostics VERY IMPORTANT before you run the delete
        -  BACKUP before you begin - at very least SELECT * into TABLES_BACKUP from TABLES

*/

Open in new window


Before you delete, and after the Test, please let me know your thoughts, concerns, ot any other comment you care to share...

Cheers,
Mark Wills
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
TopCatOnyxAuthor Commented:
Hi Vaibhav and Mark,
Sorry about the response delay - had to get some sleep.
Anyway,
Vaibhav:
I tried your modified code (using SELECT rather than DELETE!!!).  It returned all the duplicate rows (Both rows) which don't exist in TableP.  We only want to delete one of those rows and keep the other.
Select u.sid,u.Company ,u.FName ,u.LName  FROM
(
      SELECT SID,Company,FName,LName,count(*) OVER (PARTITION BY Company,FName,LName) counts from TableS
) as u WHERE u.counts > 1 
AND  NOT EXISTS ( SELECT NULL FROM TableP  p WHERE u.SID = p.SID )

Open in new window

This is good - it gives the set from which we delete.  However, as written, it will delete all the duplicates (both records).  We don't care which one is deleted, but we need to keep one.

Mark:
The SID is auto generated when the TableS data is input.  It is the IDENTITY field.  When an entry is made in TableP, the TableS IDENTITY is used to link the two records together.  So we take the TableS.SID to see if a record exists in TableP.  There is no possibility of triplicates.

The SID field is the only field which is common to both tables.

The SID field doesn't exist in the dups_CTE table, so can't be used in subsequent SELECTs.  That is the main problem I had when I started this.  You can't have the SID field in the SELECT statement because then it needs to be in the GROUP BY statement, but all SIDs are different (as it is the IDENTITY field of the TableS).

Also, there is a possibility that there is more than 1 record in TableP for a given SID.  That shouldn't make any difference, if ANY record exists in TableS then the associated record in TableS is NOT deleted.  There is also the possibility that records exist in TableP for both SIDs from TableS - in which case both of the duplicate records are kept.

Also, finally,
Hopefully this is a one shot need - to fix some data which got corrupted (same data entered into the table twice), however I want to keep it around in case someone screws up in the future!
0
Mark WillsTopic AdvisorCommented:
Understand all of that - thanks.

Do you even try the solution ?

It does work even though it might read as if I have not, or have included, errors or omissions. the CTE (Common Table Expression) was tested using the temporary tables (also provided).

For your own use, just do a global replace of #TableS for your TableS and #TableP for your TableP.

If you have tried it, what happened ? Did you get errors ?

Oh, and the dupes_CTE is NOT a table it is the first part of a CTE - a result set that can be referenced and reused.
0
TopCatOnyxAuthor Commented:
Hi Mark,
I did try it. In the second part - on line 11 your inner join need to use the SID field (only field which is common to both tables).  I replaced the references to the "Company, FName, & LNAME" with the SID as "inner join TableS S on S.SubID = D.Subid". As I noted in my last post, the SID field is the only common field between the tow tables.  D.SID  doesn't exist (SQL gives the error: "The multi-part identifier 'S.SID' could not be bound") in the dups_CTE.  When executing the code, I get the error: "Invalid column name 'Sid'".

I agree with you that a CTE doesn't necessarily create an actual table, it does create a "virtual" table which has many of the capabilities of a "real" table.
0
Mark WillsTopic AdvisorCommented:
Line 11 must not include SID - we are still gathering, it is essentially joining back on itself so we can get to the Sid
Line 12 is where it determines the existence in the second table.

The theory is you only have to change #TableS to whatever your real TableS is called, and #TableP to whatever it is called.

Should Not need to change other parts. Other than real table names and real column names. Honestly.

Seriously, dont change the code, just use the temp tables as an example first and work through those first.
0
Mark WillsTopic AdvisorCommented:
I have been known to write code where the only comment is :
Dont touch this part it is magic

Might be guilty of it here, so let me explain

Step 1 - get dupes  

select Company,FName,LName
  from #TableS  
  group by  Company,LName ,FName
  having count(Sid)> 1

Step 2 -- Find if the duplicates have Sids existing in other table but to get the SID we need to go back to detailed data in TableS, so, join back on itself

  Select D.Company,D.FName,D.LName, S.SID, ROW_NUMBER() OVER (PARTITION BY D.Company,D.FName,D.LName order by SID) RN  
  from dupes_CTE D
  inner join #TableS S on D.Company = S.company and D.FName = S.FName and D.LName = S.LName  
  where SID not in (select sid from #TableP)

Step 3 -- get the Sids we probably dont want to keep

  Select D.Sid,D.Company,D.Fname,D.Lname , case when H.RN is NULL then 'Delete this one' else 'Keeper' end Msg              
  from Dupe_Hdr D
  left join Dupe_Hdr H on D.Company = H.company and D.FName = H.FName and D.LName = H.LName and D.RN < H.RN                    
  where H.RN is NULL      --   see Notes 2 and 3

If you dont understand or disagree with the steps, then lets discuss before you change any code.

I cannot see your changes, and it sounds as if you doubt my code - let's talk....
0
TopCatOnyxAuthor Commented:
Hi Mark,
Sorry - it wasn't intuitive obvious (to me) what you were doing in line 11.  I don't doubt your code, just that I (and many others on EE) have gotten responses with typos.  All that aside....

I took your suggestion and set up a new database with just you tables and data. This looks good. I end up with the records from TableS which should be dropped.

Now all I have left is to filter TableS to a select few (about 6,000 out of 2.5 mil or so total records in the real table).  The field which distinguishes the subset is PName and contains a value like 'PN1217T4'.  Only the 6,000 records contain that value. There is a chance that duplicates may exist in other subsets, but I'm not trying to fix them (nor can I change anything outside of my subset of 6000 records).  I tried using a "WHERE" clause on the CTE - after the "HAVING" but that doesn't work.  I suppose we could add the PName field to the SELECT statement.  Is there a better way?
0
Mark WillsTopic AdvisorCommented:
You mean In step 1 ?

If that is in step 1 then the 'where' must be after the 'from' but before the 'group by'

select Company,FName,LName
  from #TableS  
where PName = 'PN1217T4'
  group by  Company,LName ,FName
  having count(Sid)> 1


And completely understand about typo's
Which is why I applauded your "Always test first" approach :)
0
TopCatOnyxAuthor Commented:
Hi Mark,
one last stupid question -
I get a list from the final select statement which contains all the TableS records to delete.  In that list is the SID of the record to delete.  How do I actually delete the records?  I tried delete from TableS where SID = Bad_Sid.SID and ...= Bad_Sid(2), but neither works.
0
Vaibhav GoelMSBI , SQL ConsultantCommented:
Hello TopCatOnyx,

Please take backup of your table TableS. After that please use below command. I think it will work.

;WITH CTE AS
(
      Select u.sid,u.Company ,u.FName ,u.LName , ROW_NUMBER() OVER (PARTITION BY u.Company ,u.FName ,u.LName ORDER BY u.sid DESC) rnk FROM
      (
              SELECT SID,Company,FName,LName,count(*) OVER (PARTITION BY Company,FName,LName) counts from TableS
      ) as u WHERE u.counts > 1
      AND  NOT EXISTS ( SELECT NULL FROM TableP  p WHERE u.SID = p.SID )
)
DELETE FROM CTE WHERE rnk > 1

Vaibhav
0
TopCatOnyxAuthor Commented:
Hi Vaibhav and Mark,
Vaibhav:
I tried your latest idea, and nothing happened.  I'm using  small test tables, and nothing was deleted from TableS

Mark:
Works like a champ.
0
TopCatOnyxAuthor Commented:
Super solution Mark.
Very responsive - I needed to have the Database fixed by Monday!
Thank you very much.
0
Mark WillsTopic AdvisorCommented:
Sorry TopCatOnyx,

Didnt see your 'one last question' until just now...

But judging by your comment above, I guess you figured it out.

The delete statement was in the code, just commented out. Hope you found it OK.

Thank you very much for your patience and showing trust in the solution. Much appreciated.

Cheers,
Mark Wills
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.