[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 500
  • Last Modified:

Name variation matching

We have a number of jobs that perform matches across different tables. The initial match is based on the forename and surname of the person in question being equal.  If the forename and surname is the same and, a record is created in another table and further matching is performed to give a final match rating.

We have just purchased a database containing million of name variations.  These are in two tables (one for forenames and one for surnames) with the following columns:
"      Source_Name  This contains a list of Names that you can search on (E.g. Steve)
"      Name_Variation  This list the variation of the sources name. Each variation is on a new row. (I.e. there may be 2 rows with the source name Steve. One with the variation of Stephen and the other with  the variation Steven)
"      Score  This is a rating system with 100 being a close match and 75 being a loose match.

I am having trouble getting the matching script to return results in an efficient manner.  I believe that this is caused by the cross join that I have used in the below script.  Is there a more efficient way of doing it?

SELECT Distinct
       AVP.ID,
	   PER.PersonID,
	   1,
	   1
FROM PERSON P 
Cross JOIN C6..PERSON_SUBSET PER
WHERE P.Forename <> ''
       AND
       P.Surname <> ''
And (P.Forename = PER.EnglishForename  Or PER.PERSONID in (
			Select PersonID
			From C6..Person_Subset	
			Where EnglishForename In 
				(Select Name_Variation 
				from Names_Database..Forename   
				Where [Source_Name] = P.Forename and Score >= 95)))
AND (P.Surname = PER.EnglishSurname OR PER.PERSONID in (
			Select PersonID 
			From C6..Person_Subset
			Where EnglishSurname in
				(Select Name_Variation
				from Names_Database..Surname 
				Where [Source_Name] = P.Surname and Score >= 95)))

Open in new window

0
iis-it
Asked:
iis-it
  • 2
1 Solution
 
GregTSmithCommented:
I've done my best to make sure the syntax is correct, but it's a little tough without actually being able to execute the query.  Something like the following should yield the same result set.
SELECT 
  P.ID, -- I changed AVP to P here... not sure what AVP was an alias to...?
  X.PersonID, 
  1, 
  1 
FROM 
  PERSON P 
  INNER JOIN (
    SELECT 
      PS.PersonID, 
      PS.EnglishForename, 
      PS.EnglishSurname, 
      F.Source_Name [AlternateForename], 
      S.Source_Name [AlternateSurname] 
    FROM   
      C6..PERSON_SUBSET PS 
      INNER JOIN Names_Database..Forename F ON 
        F.Name_Variation = PS.EnglishForename 
          AND F.Score >= 95 
      INNER JOIN Names_Database..Surname S ON 
        S.Name_Variation = PS.EnglishSurname 
          AND S.Score >= 95 
  ) X ON 
    LEN(P.Forename) != 0 
      AND LEN(P.Surname) != 0 
      AND (
        X.EnglishForename = P.Forename 
        OR X.AlternateForename = P.Forename 
      )
      AND (
        X.EnglishSurname = P.Surname 
        OR X.AlternateSurname = P.Surname 
      )
GROUP BY 
  P.ID, 
  X.PersonID 

Open in new window

0
 
iis-itAuthor Commented:
Thanks for your answer. A member of the SQL Server central forums (RBarryYoung) came up with the most efficient solution.
create proc spSurnameMatching_TempTAnswer
 
AS
--====== Create the 95% Forename table + all Person names
Create table #Forename95_U_P(
  
Source_Name varchar(20)
 
, Name_Variation varchar(20)
 
, Primary Key (Source_name, Name_Variation)
)
INSERT into #Forename95_U_P
 
Select Source_name, Name_variation
  
--From Names_Database..Forename
  
From Forename
  
Where Score >= 95
INSERT into #Forename95_U_P
 
Select distinct Forename, Forename
  
From PERSON P
  
Left Join #Forename95_U_P U
     
ON P.Forename = U.Source_name
        
And P.Forename = U.Name_variation
  
Where U.Source_name IS NULL
----====== Create the 95% Surname table + all Person names
Create table #Surname95_U_P(
   
Source_Name varchar(20)
 
, Name_Variation varchar(20)
 
, Primary Key (Source_name, Name_Variation)
)
INSERT into #Surname95_U_P
 
Select Source_name, Name_variation
  
--From Names_Database..Surname
  
From Surname
  
Where Score >= 95
INSERT into #Surname95_U_P
 
Select distinct Surname, Surname
  
From PERSON P
    
Left Join #Surname95_U_P U
      
ON P.Surname = U.Source_name
         
And P.Surname = U.Name_variation
  
Where U.Source_name IS NULL
--====== Find all Person-subsets that 95% match persons
;WITH ctePersToPersSubs as (
   
Select *
     
FROM PERSON P
    
INNER JOIN PERSON_SUBSET PER
      
ON PER.EnglishForename IN (
        
Select fn.Name_Variation
         
From #Forename95_U_P fn
         
Where fn.Source_Name = P.Forename
         
)
      
AND PER.EnglishSurname IN (
           
Select fn.Name_Variation
         
From #Surname95_U_P fn
          
Where fn.Source_Name = P.Surname
          
)
)
SELECT Distinct
   
P.ID,
   
P.PersonID,
   
1,
   
1
 
From ctePersToPersSubs P
 
WHERE P.Forename <> ''
  
AND  P.Surname <> ''
--====== clean-up temp tables
drop table #Forename95_U_P
drop table #Surname95_U_P
--==================

Open in new window

0
 
GregTSmithCommented:
I'm surprised that the temp table solution is faster since it is both reading and writing the records, where my example only reads...  you still might get better result from the query I provided by adding indexes...
0

Featured Post

Prep for the ITIL® Foundation Certification Exam

December’s Course of the Month is now available! Enroll to learn ITIL® Foundation best practices for delivering IT services effectively and efficiently.

  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now