grg99:
"I would use some ranking function that compares their billing information..."
Unless I can think of something better, I have these three string-based functions:
f1: Leftmost 3 characters of first name and leftmost 4 characters of last name
f2: Leftmost 10 characters of email
f3: Lastname + state
So if f1(order #1) = f1(order #2), f1 calls the two orders equal. Now, these functions are not equally reliable. f2 has a very low false positive rate, for example. So I will probably (?) use a weighted sum of these: f = w1*f1 + w2*f2 + w3*f3
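A minimal sketch of the three key functions and the weighted sum, in Python. The field names ("first", "last", "email", "state") and the integer weights are assumptions made up for illustration; the only thing taken from the description above is that f2 should count for the most.

```python
def f1(order):
    """Leftmost 3 characters of first name + leftmost 4 of last name."""
    return (order["first"][:3] + order["last"][:4]).lower()

def f2(order):
    """Leftmost 10 characters of email."""
    return order["email"][:10].lower()

def f3(order):
    """Last name + state."""
    return (order["last"] + order["state"]).lower()

# Hypothetical weights (out of 10). f2 gets the largest weight because
# it is described as having a very low false positive rate.
WEIGHTS = ((f1, 3), (f2, 5), (f3, 2))

def confidence(a, b):
    """Sum the weights of every key function that calls a and b equal."""
    return sum(weight for key, weight in WEIGHTS if key(a) == key(b))

order_1 = {"first": "Robert", "last": "Johnson",
           "email": "rjohnson@example.com", "state": "NY"}
order_2 = {"first": "Rob", "last": "Johnston",
           "email": "rjohnson@example.com", "state": "NY"}

# f1 and f2 agree, f3 does not, so the score is 3 + 5.
print(confidence(order_1, order_2))
```

Whatever threshold separates "probably the same customer" from "probably not" would have to be tuned against orders already known to be duplicates.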
You'll have to forgive the rustiness/nonexistence of my math skills, but here is my reasoning regarding some sort of matrix: if I'm comparing every order against every other, I'd essentially have a 2-D matrix, N x N, where N is the number of orders being compared. Each element Xij would be the confidence returned by this function for orders i and j. It would be symmetric about its diagonal. Doing some Googling, I ran across some pages that mention "Cluster Analysis" and "distance matrices", which seem to be related to my problem... but they are too dense for me to read. Am I heading off the deep end here, perhaps making the problem too complicated?
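For what it's worth, the matrix idea can be sketched in a few lines of Python. Because the matrix is symmetric, only the N*(N-1)/2 pairs above the diagonal need to be computed. The function names, the stand-in confidence function (same email scores 1), and the threshold are all assumptions for illustration.

```python
from itertools import combinations

def confidence_matrix(orders, confidence):
    """Build the symmetric N x N matrix X where X[i][j] is the confidence
    that orders i and j belong to the same customer."""
    n = len(orders)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Compute each pair once and mirror it across the diagonal.
        matrix[i][j] = matrix[j][i] = confidence(orders[i], orders[j])
    return matrix

def likely_duplicates(matrix, threshold):
    """Return the (i, j) pairs, i < j, whose confidence meets the threshold."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] >= threshold]

# Stand-in confidence function: 1 if the emails match, else 0.
def same_email(a, b):
    return 1 if a["email"] == b["email"] else 0

orders = [{"email": "a@example.com"},
          {"email": "b@example.com"},
          {"email": "a@example.com"}]

X = confidence_matrix(orders, same_email)
print(likely_duplicates(X, threshold=1))
```

The diagonal is left at 0 here since comparing an order with itself tells you nothing. The number of comparisons still grows with the square of N, so for a large order table it may be worth comparing only orders that share some cheap key first.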
- hbz

by: grg99, posted on 2004-10-28 at 12:08:15, ID: 12437423
I would use some ranking function that compares their billing information but makes allowances for the common variations:
-----
Missing "mr" or "Mrs" or "ms"
Missing word on end of street address "drive", "st", "street"
Extra words on end of street "Apt 204"
Extra digits on zip code (ZIP+4)
Alternate city name (Queens instead of New York, Cambridge instead of Boston); the easiest way is to check whether the alternate name maps to the same or an adjacent ZIP code.
SLIGHT misspelling of name.
Missing middle name, or an initial instead of the middle name
Alternate names: Bob for Robert, etc.
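Several of the allowances above can be handled by normalizing each field before comparing. Here is a rough Python sketch; the word lists and the one-entry nickname table are tiny hypothetical examples, and a real table would need to be far larger. City aliases and misspellings are not covered here.

```python
import re

TITLES = {"mr", "mrs", "ms"}
STREET_SUFFIXES = {"drive", "st", "street"}
NICKNAMES = {"bob": "robert"}  # hypothetical, far from complete

def normalize_name(name):
    """Drop titles and middle names/initials, expand known nicknames."""
    words = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    words = [w for w in words if w not in TITLES]
    if not words:
        return ""
    first = NICKNAMES.get(words[0], words[0])
    if len(words) == 1:
        return first
    return f"{first} {words[-1]}"

def normalize_street(street):
    """Drop a trailing 'Apt ...' and a trailing street-type word."""
    s = re.sub(r"\bapt\b.*$", "", street.lower())
    words = re.sub(r"[^a-z0-9\s]", " ", s).split()
    while words and words[-1] in STREET_SUFFIXES:
        words.pop()
    return " ".join(words)

def normalize_zip(zip_code):
    """Keep only the first five digits, so ZIP+4 matches plain ZIP."""
    return re.sub(r"\D", "", zip_code)[:5]

print(normalize_name("Mr. Bob Q. Smith"))           # robert smith
print(normalize_street("123 Main Street Apt 204"))  # 123 main
print(normalize_zip("10001-1234"))                  # 10001
```

Two orders would then be compared on the normalized values rather than on the raw text the customer typed.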
-------------------------
Important fields to NOT consider similar include:
Anything but a slight misspelling of the name
Any change in numeric address.
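These two rules work as hard vetoes rather than as part of a score. A rough Python sketch follows, using the standard library's difflib similarity ratio as a stand-in for "slight misspelling". The 0.85 cutoff and the field names are guesses for illustration, not tested values.

```python
import re
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # hypothetical cutoff for a "slight" misspelling

def house_number(street):
    """Return the leading street number as a string, or None if absent."""
    match = re.match(r"\s*(\d+)", street)
    return match.group(1) if match else None

def names_close(a, b, threshold=NAME_THRESHOLD):
    """True if the two names differ by no more than a slight misspelling."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def may_be_same_customer(a, b):
    """Apply both vetoes: any change in street number, or more than a
    slight difference in the name, rules the pair out."""
    if house_number(a["street"]) != house_number(b["street"]):
        return False
    return names_close(a["name"], b["name"])

a = {"name": "Jon Smith", "street": "123 Main St"}
b = {"name": "John Smith", "street": "123 Main Street"}
c = {"name": "John Smith", "street": "125 Main St"}

print(may_be_same_customer(a, b))  # True
print(may_be_same_customer(b, c))  # False: the street number changed
```

Pairs that survive the vetoes would still go through the softer comparisons; pairs that fail are never called duplicates no matter how well the other fields agree.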
----------------
I don't see how anything as clean as a matrix is going to be of much use, as this is a rather human-quirk-based problem.
--------------------------
You'll have to resign yourself to some false positive matches. For example, our insurance agent had two clients whose medical files got badly mixed up: they had identical (and unusual) names, the same DOB, and the same physician.