?
Solved

Name variation matching

Posted on 2009-05-15
3
Medium Priority
?
497 Views
Last Modified: 2012-05-07
We have a number of jobs that perform matches across different tables. The initial match is based on the forename and surname of the person in question being equal.  If the forename and surname is the same and, a record is created in another table and further matching is performed to give a final match rating.

We have just purchased a database containing million of name variations.  These are in two tables (one for forenames and one for surnames) with the following columns:
"      Source_Name  This contains a list of Names that you can search on (E.g. Steve)
"      Name_Variation  This list the variation of the sources name. Each variation is on a new row. (I.e. there may be 2 rows with the source name Steve. One with the variation of Stephen and the other with  the variation Steven)
"      Score  This is a rating system with 100 being a close match and 75 being a loose match.

I am having trouble getting the matching script to return results in an efficient manner.  I believe that this is caused by the cross join that I have used in the below script.  Is there a more efficient way of doing it?

SELECT Distinct
       AVP.ID,
	   PER.PersonID,
	   1,
	   1
FROM PERSON P 
Cross JOIN C6..PERSON_SUBSET PER
WHERE P.Forename <> ''
       AND
       P.Surname <> ''
And (P.Forename = PER.EnglishForename  Or PER.PERSONID in (
			Select PersonID
			From C6..Person_Subset	
			Where EnglishForename In 
				(Select Name_Variation 
				from Names_Database..Forename   
				Where [Source_Name] = P.Forename and Score >= 95)))
AND (P.Surname = PER.EnglishSurname OR PER.PERSONID in (
			Select PersonID 
			From C6..Person_Subset
			Where EnglishSurname in
				(Select Name_Variation
				from Names_Database..Surname 
				Where [Source_Name] = P.Surname and Score >= 95)))

Open in new window

0
Comment
Question by:iis-it
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 3

Expert Comment

by:GregTSmith
ID: 24401360
I've done my best to make sure the syntax is correct, but it's a little tough without actually being able to execute the query.  Something like the following should yield the same result set.
SELECT 
  P.ID, -- I changed AVP to P here... not sure what AVP was an alias to...?
  X.PersonID, 
  1, 
  1 
FROM 
  PERSON P 
  INNER JOIN (
    SELECT 
      PS.PersonID, 
      PS.EnglishForename, 
      PS.EnglishSurname, 
      F.Source_Name [AlternateForename], 
      S.Source_Name [AlternateSurname] 
    FROM   
      C6..PERSON_SUBSET PS 
      INNER JOIN Names_Database..Forename F ON 
        F.Name_Variation = PS.EnglishForename 
          AND F.Score >= 95 
      INNER JOIN Names_Database..Surname S ON 
        S.Name_Variation = PS.EnglishSurname 
          AND S.Score >= 95 
  ) X ON 
    LEN(P.Forename) != 0 
      AND LEN(P.Surname) != 0 
      AND (
        X.EnglishForename = P.Forename 
        OR X.AlternateForename = P.Forename 
      )
      AND (
        X.EnglishSurname = P.Surname 
        OR X.AlternateSurname = P.Surname 
      )
GROUP BY 
  P.ID, 
  X.PersonID 

Open in new window

0
 

Accepted Solution

by:
iis-it earned 0 total points
ID: 24412418
Thanks for your answer. A member of the SQL Server central forums (RBarryYoung) came up with the most efficient solution.
create proc spSurnameMatching_TempTAnswer
 
AS
--====== Create the 95% Forename table + all Person names
Create table #Forename95_U_P(
  
Source_Name varchar(20)
 
, Name_Variation varchar(20)
 
, Primary Key (Source_name, Name_Variation)
)
INSERT into #Forename95_U_P
 
Select Source_name, Name_variation
  
--From Names_Database..Forename
  
From Forename
  
Where Score >= 95
INSERT into #Forename95_U_P
 
Select distinct Forename, Forename
  
From PERSON P
  
Left Join #Forename95_U_P U
     
ON P.Forename = U.Source_name
        
And P.Forename = U.Name_variation
  
Where U.Source_name IS NULL
----====== Create the 95% Surname table + all Person names
Create table #Surname95_U_P(
   
Source_Name varchar(20)
 
, Name_Variation varchar(20)
 
, Primary Key (Source_name, Name_Variation)
)
INSERT into #Surname95_U_P
 
Select Source_name, Name_variation
  
--From Names_Database..Surname
  
From Surname
  
Where Score >= 95
INSERT into #Surname95_U_P
 
Select distinct Surname, Surname
  
From PERSON P
    
Left Join #Surname95_U_P U
      
ON P.Surname = U.Source_name
         
And P.Surname = U.Name_variation
  
Where U.Source_name IS NULL
--====== Find all Person-subsets that 95% match persons
;WITH ctePersToPersSubs as (
   
Select *
     
FROM PERSON P
    
INNER JOIN PERSON_SUBSET PER
      
ON PER.EnglishForename IN (
        
Select fn.Name_Variation
         
From #Forename95_U_P fn
         
Where fn.Source_Name = P.Forename
         
)
      
AND PER.EnglishSurname IN (
           
Select fn.Name_Variation
         
From #Surname95_U_P fn
          
Where fn.Source_Name = P.Surname
          
)
)
SELECT Distinct
   
P.ID,
   
P.PersonID,
   
1,
   
1
 
From ctePersToPersSubs P
 
WHERE P.Forename <> ''
  
AND  P.Surname <> ''
--====== clean-up temp tables
drop table #Forename95_U_P
drop table #Surname95_U_P
--==================

Open in new window

0
 
LVL 3

Expert Comment

by:GregTSmith
ID: 24413276
I'm surprised that the temp table solution is faster since it is both reading and writing the records, where my example only reads...  you still might get better result from the query I provided by adding indexes...
0

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this blog post, we’ll look at how using thread_statistics can cause high memory usage.
In today's business world, data is more important than ever for informing marketing campaigns. Accessing and using data, however, may not come naturally to some creative marketing professionals. Here are four tips for adapting to wield data for insi…
This is a high-level webinar that covers the history of enterprise open source database use. It addresses both the advantages companies see in using open source database technologies, as well as the fears and reservations they might have. In this…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question