[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 285
  • Last Modified:

find duplicates but with typing errors

I need to find duplicates in a table. But the Access' find-duplicates-wizard is not enough. If, for example, I have a patient name 'Smith', and it is duplicated in another record but with some kind of typing mistake - like 'Smigh', I need some kind of a mechanism that will consider 'Smigh' as a duplicate of 'Smith', that will warn me that 'Smigh' MIGHT be a duplicate of 'Smith'. Maybe it isn't - maybe there's a real patient whose name is Smigh, but I want to be warned that it could be a duplicate with typing mistake. Any suggestions? by SQL satement? or by VBA code?
0
NNOAM1
Asked:
NNOAM1
  • 2
  • 2
1 Solution
 
lwadwellCommented:
A common technique used to determine possible matches called the "Levenshtein Distance".  It is an algorithm that compares two strings and calculates the least number of single character changes to make the two values the same.  For example the distance between your 'Smith' and 'Smigh' is 1 ... a good candidate match.

Have a read of this article:
http://bytes.com/topic/access/insights/909002-levenshtein-approximate-string-matching
0
 
NNOAM1Author Commented:
Considering the fact that my table contains 144000 records, and every single patient-name should be checked against ALL the others - makes me wonder if using "Levenshtein Distance" is practical for my case.
0
 
lwadwellCommented:
In that case ... doing a Levenshtein for all records would take a while - but I still recommend its use.  In the end you need to determine 'how close' a match you are wiling to look at.  If the distance was > 3 ... I would suggest it was a weak match and not worth looking at (even > 2 may be a good limit).  In the end I would recommend a layered approach.  
Level 1: already equal ... no need to compare.
Level 2: length difference ... if the string lengths differ by > 3 - no need to look
Level 3: a soundex compare (http://msdn.microsoft.com/en-us/library/office/aa662178(v=office.11).aspx) ... this check whether they 'sound' the same phonetically.  This is quite a weak method though... that is why I suggest Levenshtein.
Level 4: Levensthein distance.
0
 
frankhelkCommented:
There's not much of a chance to circumvent the checking, and the L.-distance seems not much different from other ways to evaluate the similarity. Maybe you can improve that a bit (just an idea, never tried by myself):

sorting the entries
checking each entry only against the following 2
sorting again, while leaving the first character out
checking each entry only against the following 2
and so on...

Another try might be to fully compare only entries that are not much different in length, +-1 i.e. For that you should calculate the length of every entry ONCE and store it for comparisons. That would prevent calculating the full distance for obvious different string pair like "Smith"/"Leuttheuser-Schnarrenberger".

On the other side ... it's just 2,1E+10 comprarisons ... with a decent machine, the data completey loaded into RAM instead of a database od disk, and a good multithreaded attempt you'll be done fast. Contemplating about it, that might be a good task for a CUDA application.

[kidding] If everything fails, make a BOINC project from it ... [kidding off]
0
 
NNOAM1Author Commented:
Thank you!
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 2
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now