Solved

Name variation matching

Posted on 2009-05-15
3
494 Views
Last Modified: 2012-05-07
We have a number of jobs that perform matches across different tables. The initial match is based on the forename and surname of the person in question being equal.  If the forename and surname is the same and, a record is created in another table and further matching is performed to give a final match rating.

We have just purchased a database containing million of name variations.  These are in two tables (one for forenames and one for surnames) with the following columns:
"      Source_Name  This contains a list of Names that you can search on (E.g. Steve)
"      Name_Variation  This list the variation of the sources name. Each variation is on a new row. (I.e. there may be 2 rows with the source name Steve. One with the variation of Stephen and the other with  the variation Steven)
"      Score  This is a rating system with 100 being a close match and 75 being a loose match.

I am having trouble getting the matching script to return results in an efficient manner.  I believe that this is caused by the cross join that I have used in the below script.  Is there a more efficient way of doing it?

SELECT Distinct
       AVP.ID,
	   PER.PersonID,
	   1,
	   1
FROM PERSON P 
Cross JOIN C6..PERSON_SUBSET PER
WHERE P.Forename <> ''
       AND
       P.Surname <> ''
And (P.Forename = PER.EnglishForename  Or PER.PERSONID in (
			Select PersonID
			From C6..Person_Subset	
			Where EnglishForename In 
				(Select Name_Variation 
				from Names_Database..Forename   
				Where [Source_Name] = P.Forename and Score >= 95)))
AND (P.Surname = PER.EnglishSurname OR PER.PERSONID in (
			Select PersonID 
			From C6..Person_Subset
			Where EnglishSurname in
				(Select Name_Variation
				from Names_Database..Surname 
				Where [Source_Name] = P.Surname and Score >= 95)))

Open in new window

0
Comment
Question by:iis-it
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 3

Expert Comment

by:GregTSmith
ID: 24401360
I've done my best to make sure the syntax is correct, but it's a little tough without actually being able to execute the query.  Something like the following should yield the same result set.
SELECT 
  P.ID, -- I changed AVP to P here... not sure what AVP was an alias to...?
  X.PersonID, 
  1, 
  1 
FROM 
  PERSON P 
  INNER JOIN (
    SELECT 
      PS.PersonID, 
      PS.EnglishForename, 
      PS.EnglishSurname, 
      F.Source_Name [AlternateForename], 
      S.Source_Name [AlternateSurname] 
    FROM   
      C6..PERSON_SUBSET PS 
      INNER JOIN Names_Database..Forename F ON 
        F.Name_Variation = PS.EnglishForename 
          AND F.Score >= 95 
      INNER JOIN Names_Database..Surname S ON 
        S.Name_Variation = PS.EnglishSurname 
          AND S.Score >= 95 
  ) X ON 
    LEN(P.Forename) != 0 
      AND LEN(P.Surname) != 0 
      AND (
        X.EnglishForename = P.Forename 
        OR X.AlternateForename = P.Forename 
      )
      AND (
        X.EnglishSurname = P.Surname 
        OR X.AlternateSurname = P.Surname 
      )
GROUP BY 
  P.ID, 
  X.PersonID 

Open in new window

0
 

Accepted Solution

by:
iis-it earned 0 total points
ID: 24412418
Thanks for your answer. A member of the SQL Server central forums (RBarryYoung) came up with the most efficient solution.
create proc spSurnameMatching_TempTAnswer
 
AS
--====== Create the 95% Forename table + all Person names
Create table #Forename95_U_P(
  
Source_Name varchar(20)
 
, Name_Variation varchar(20)
 
, Primary Key (Source_name, Name_Variation)
)
INSERT into #Forename95_U_P
 
Select Source_name, Name_variation
  
--From Names_Database..Forename
  
From Forename
  
Where Score >= 95
INSERT into #Forename95_U_P
 
Select distinct Forename, Forename
  
From PERSON P
  
Left Join #Forename95_U_P U
     
ON P.Forename = U.Source_name
        
And P.Forename = U.Name_variation
  
Where U.Source_name IS NULL
----====== Create the 95% Surname table + all Person names
Create table #Surname95_U_P(
   
Source_Name varchar(20)
 
, Name_Variation varchar(20)
 
, Primary Key (Source_name, Name_Variation)
)
INSERT into #Surname95_U_P
 
Select Source_name, Name_variation
  
--From Names_Database..Surname
  
From Surname
  
Where Score >= 95
INSERT into #Surname95_U_P
 
Select distinct Surname, Surname
  
From PERSON P
    
Left Join #Surname95_U_P U
      
ON P.Surname = U.Source_name
         
And P.Surname = U.Name_variation
  
Where U.Source_name IS NULL
--====== Find all Person-subsets that 95% match persons
;WITH ctePersToPersSubs as (
   
Select *
     
FROM PERSON P
    
INNER JOIN PERSON_SUBSET PER
      
ON PER.EnglishForename IN (
        
Select fn.Name_Variation
         
From #Forename95_U_P fn
         
Where fn.Source_Name = P.Forename
         
)
      
AND PER.EnglishSurname IN (
           
Select fn.Name_Variation
         
From #Surname95_U_P fn
          
Where fn.Source_Name = P.Surname
          
)
)
SELECT Distinct
   
P.ID,
   
P.PersonID,
   
1,
   
1
 
From ctePersToPersSubs P
 
WHERE P.Forename <> ''
  
AND  P.Surname <> ''
--====== clean-up temp tables
drop table #Forename95_U_P
drop table #Surname95_U_P
--==================

Open in new window

0
 
LVL 3

Expert Comment

by:GregTSmith
ID: 24413276
I'm surprised that the temp table solution is faster since it is both reading and writing the records, where my example only reads...  you still might get better result from the query I provided by adding indexes...
0

Featured Post

[Webinar] Code, Load, and Grow

Managing multiple websites, servers, applications, and security on a daily basis? Join us for a webinar on May 25th to learn how to simplify administration and management of virtual hosts for IT admins, create a secure environment, and deploy code more effectively and frequently.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Read about achieving the basic levels of HRIS security in the workplace.
A company’s centralized system that manages user data, security, and distributed resources is often a focus of criminal attention. Active Directory (AD) is no exception. In truth, it’s even more likely to be targeted due to the number of companies …
Video by: Steve
Using examples as well as descriptions, step through each of the common simple join types, explaining differences in syntax, differences in expected outputs and showing how the queries run along with the actual outputs based upon a simple set of dem…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question