Delete/merge some records based on a count

I am trying to delete/merge some records from my dataset as there are a nuber of duplicates but I'm struggling to figure out how to do this. I have the following code:

SELECT     fnam03, snam03, had103, hpc103, hpc203, dtob03, COUNT(*) AS No_of_records
FROM         dbo.ptp03
GROUP BY fnam03, snam03, had103, hpc103, hpc203, dtob03
HAVING      (COUNT(*) > 1) and snam03 is not null
order by fnam03, snam03

and this allows me to detect the duplicates in my dataset (about 849 from about 12000). However I also need to be able to perform some other calculations prior to the merging taking place:

If there are 2 records (as most of the duplicates are) then I need to check a field called RGNO03 to see if either record has this populated. If one of them does then I wish to keep this record and discard the other. If RGNO03 is NULL in both records then it should check RCNO03 and the record with the highest number in this field should be retained and the lower record discarded. However, if RGNO03 is popluated in both records then no de-duplication should take place.

BTW RGNO03 and RCNO03 are both in the PTP03 table (the same table as the rest of the query fields)

The same principals should be applied to records that have a count value of 3 or 4 (there aren't that many of them so I could do this in stages if needed).

Anyone got any suggestions on a possible way forward on this one?
LVL 2
Steven O'NeillSolutions ArchitectAsked:
Who is Participating?
 
Mark WillsTopic AdvisorCommented:
OK, first step is to create a backup prior to doing anything - found this is the first step toward success - like an insurance policy that you never want to cash in - and follows along the lines or "murphy's law"...

Then we will remove those NULL rgno03 where there is a populated rgno03, That will leave us with either populated rgn03's or, NULL rgno03's and if the latter, then we will end up keeping the tempid which represents the highest rcno03...

If that sounds OK, then look at the following... And as a first pass would suggest you change the "delete" to a " select *" to visualize what is happening...
select * into ptp03_backup_dupes from ptp03
 
delete from ptp03
where rgno03 is NULL 
and exists (select null from ptp03 p where p.fnam03=ptp03.fnam03 and p.snam03=ptp03.snam03 and p.had103=ptp03.had103 and p.hpc203=ptp03.hpc203 and p.dtob03=ptp03.dtob03 and p.rgno03 is not null)
 
delete from ptp03
where rgno03 is NULL 
and not exists (select null from ptp03 p where p.fnam03=ptp03.fnam03 and p.snam03=ptp03.snam03 and p.had103=ptp03.had103 and p.hpc203=ptp03.hpc203 and p.dtob03=ptp03.dtob03 and p.rgno03 is not null)
and rcno03 <>  (select top 1 rcno03 from ptp03 p where p.fnam03=ptp03.fnam03 and p.snam03=ptp03.snam03 and p.had103=ptp03.had103 and p.hpc203=ptp03.hpc203 and p.dtob03=ptp03.dtob03 order by rcno03 desc, tempid desc)

Open in new window

0
 
Mark WillsTopic AdvisorCommented:
OK, might be a bit convaluted...

but, Keep if RGN003 is populated. If not (in all instances) keep the highest RCN003.


so, probably easiest to do this in two parts... First being get rid of the null RGNo03 where there is a populated one, Second being keep the highest RCNo03 when all RGNo03 are null.

One question first...  is it an "oh" then zero three ?
Next question, is RCNo03 a unique identifier per record, and if not, is there one ?
0
 
Steven O'NeillSolutions ArchitectAuthor Commented:
Hi Mark
Thanx for offering to help out here. To answer you're questions:

1. Yes it is rgno03 (with an oh) and rcno03 (again with an oh);
2. No rcno03 isn't unique but we do have a uniquie identifier called TempID. It's an integer.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.