Avatar of RUA Volunteer2?
RUA Volunteer2?
Flag for United States of America asked on

What are the best practices for placement of SQL when checking for duplicates? WHERE clause JOINS, beginning or end?

Where do you put a checking for and removing duplicates query when you need to remove all extras. The goal is to check name, date of birth, address of person.  Then to make sure that I am getting only the last person from their most recent visit or date time stamp occurrence. (We'll call it their most recent date of service) or in this case "s._serviced_date"
I think it will look something like this
Select Name, Address, Date of Birth From Personnel
Where Max(Service_Date) ----Do I need any date formatting here or just the field?
Group By Name, Address, Date of Birth
Having Count(*) >1

Please feel free to check that. Especially If MAX() has to have special date format. May not need any?
THE BIG QUESTION HERE IS WHERE DO I PLACE IT AND WHY. WHAT ARE THE BEST PRACTICES OR WHAT IS GOOD POLICY TO DO IT ONE WAY OVER ANOTHER? the following is a mockup of the query. 

declare @from_dob date = dateadd(year,-45,'20210101')
declare @to_dob date = dateadd(year,-40,'20211231')
declare @research_date date = '20150101'
declare @region varchar(100)
        select @region = dbo.z_bi_get_region(db_name())
 
select 
@region as region
, personnel_id
, last_name
, first_name
, gender_code
, birth_date
,datediff(year,birth_date,cast(getdate() as date)) as age_at_run_date
, mobile_phone_number
, home_phone_number
, primary_email_address
, primary_address_line1
, primary_address_line2
, primary_address_city
, primary_address_state
, primary_address_zipcode
, death_flag 
 ,isnull(s1.disease_found,'N') as disease_found
 ,@from_dob as from_DOB, @to_dob as to_DOB, @research_date as research_date
 
from
   c_personnel pers
   outer apply
   (
     select top 1 'Y' as disease_found from c_research s
       inner join c_research_item si on s.research_key=si.research_key and si.active_flag='Y'
       inner join l_procedure_code pc on si.procedure_code=pc.procedure_code 
       inner join l_disease_type dt on pc.disease_type_code=mt.disease_type_code and dt.disease_type_code in ('Cancer','Heart','Kidney',’Lung’,'Accident',’Neurological’)
       inner join l_location st on s._location_code=st.location_code
     where 
       pers.personnel_key=s.personnel_key
       and s._serviced_date  >= @research_date
       AND st.location_code NOT IN ('HP','CLNK') AND st.practice_code <> 'DocNABox'
   ) as s1
where 
   pers.birth_date >= @from_dob 
   and pers.birth_date < dateadd(d,1,@to_dob)
   and pers.master_personnel_key is null
   AND pers.issuer_of_personnel_id = 'system' 
   and pers.gender_code='M'
   and isnull(pers.death_flag,'N')='N'
   and isnull(s1.disease_found,'N')='N'
  order by datediff(year,birth_date,cast(getdate() as date))

Open in new window

THE BIG QUESTION HERE IS WHERE DO I PLACE IT AND WHY. WHAT ARE THE BEST PRACTICES OR WHAT IS GOOD POLICY TO DO IT ONE WAY OVER ANOTHER? the following is a mockup of the query. 

Open in new window



Microsoft AccessMicrosoft SQL ServerMicrosoft Server OSSSRSSQL

Avatar of undefined
Last Comment
RUA Volunteer2?

8/22/2022 - Mon
ASKER CERTIFIED SOLUTION
arnold

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
SOLUTION
Jim Horn

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
RUA Volunteer2?

ASKER
One thing I was hoping to get is "where does the duplicate checker typically go". Does it go in the where clause? I don't know or see a clear place to put it? Does it go at the very end of the existing query that delivers the data. I kind of like Jim Horn's information as it gives you options. Especially the Soft Delete. That would work best in my situation.
What I get confused about is if I have the records returned in a large select statement as above. Do I encapsulate that whole query and define it as "Data" then say Can someone give me an example of placement "where" does the Soft Delete code go" Do I need to create a CTE or insert a table? I can often find an example of the code it is "where does the code go" What has to happen prior. Do I have to give the query above a name and use the insert function then below the "framed query" then do I create a separate query where the soft delete goes...?
SOLUTION
arnold

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
RUA Volunteer2?

ASKER
If you are referring to Select Distinct ...not select Unique. I think you are mistaken sir. Select Unique is only applicable to Oracle this is MS SQL Server.
RUA Volunteer2?

ASKER
I am looking at Jim Horns Soft Delete of duplicates gentlemen. I do not know what to do with the above code. The problem is I DO NOT KNOW WHERE to put the code now. Do I insert a new table somewhere and then use the code Jim provided. I do not know if I need to insert a new table or somehow incorporate the query I have above turning it into some kind of inserted table?
-- Set all is_active values to 'Y' (just for the demo)  
UPDATE ....update what? The table? Am I supposed to define a table here or an existing one that relates to the query above?
 SET is_active='Y' GO.  .....So this defines the field/column as a Y
-- Now set all duplicate rows is_active='N'
;with " Some identifier like a single character s2 as ( SELECT *,  row_number() over(partition by last_name, first_name, scheduled_start_date, birth_date, address1 order by ISNULL(birth_date,''), MAX(scheduled_start_date)) as row_number    
FROM "some reference to a table" )
UPDATE "short reference to table"  
SET is_active='N'
WHERE row_number > 1

Open in new window


I have no idea what needs to be done in relation to the query I provided above in the beginning of the question vs what I need to do with this. I am confused about WHERE this would go and what needs to happen to the query above in order for it to accept the code Jim provided. The data I am have has duplicate rows of names and addresses of the same people who may have had a phone number or address change. The goal is to use the most recent visit as the tru current record to count only once. otherwise we have 400k records returned. 

Open in new window


Your help has saved me hundreds of hours of internet surfing.
fblack61
SOLUTION
arnold

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
RUA Volunteer2?

ASKER
That was helpful. Thank you both.