find duplicates but with typing errors

Posted on 2012-09-09
Last Modified: 2012-09-10
I need to find duplicates in a table, but Access's Find Duplicates wizard is not enough. If, for example, I have a patient name 'Smith', and it is duplicated in another record with a typing mistake, like 'Smigh', I need a mechanism that will warn me that 'Smigh' MIGHT be a duplicate of 'Smith'. Maybe it isn't; maybe there's a real patient whose name is Smigh. But I want to be warned that it could be a duplicate with a typing mistake. Any suggestions? By SQL statement? Or by VBA code?
Question by:NNOAM1
    LVL 25

    Expert Comment

    A common technique for finding possible matches is the "Levenshtein distance".  It is an algorithm that compares two strings and calculates the smallest number of single-character changes (insertions, deletions or substitutions) needed to make the two values the same.  For example, the distance between your 'Smith' and 'Smigh' is 1 ... a good candidate match.
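
To make the definition above concrete, here is a minimal sketch of the distance calculation. It is written in Python purely for illustration (the question asks about SQL or VBA, and the same loop ports to a VBA function); the name `levenshtein` is a hypothetical helper, not something Access provides.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]
```

With this sketch, `levenshtein("Smith", "Smigh")` gives 1, since a single substitution (t to g) is enough.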

    Have a read of this article:

    Author Comment

    Considering the fact that my table contains 144000 records, and every single patient-name should be checked against ALL the others - makes me wonder if using "Levenshtein Distance" is practical for my case.
    LVL 25

    Accepted Solution

    In that case ... running Levenshtein over all records would take a while, but I still recommend its use.  In the end you need to decide 'how close' a match you are willing to look at.  If the distance is > 3, I would call it a weak match and not worth looking at (even > 2 may be a good limit).  I would recommend a layered approach:
    Level 1: already equal ... no need to compare.
    Level 2: length difference ... if the string lengths differ by > 3, no need to look.
    Level 3: a Soundex compare ... this checks whether they 'sound' the same phonetically.  This is quite a weak method though, which is why I suggest Levenshtein.
    Level 4: Levenshtein distance.
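
The four levels above could be chained roughly as follows. This is only a sketch in Python under some assumptions that the comment does not state: names are compared case-insensitively, a Soundex match is reported as a candidate in its own right, and the function names (`soundex`, `levenshtein`, `classify`) and the result labels are hypothetical. The Soundex here is the common four-character American variant.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
SOUNDEX_CODES = {c: d for d, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                       "4": "L", "5": "MN", "6": "R"}.items()
                 for c in group}

def soundex(name: str) -> str:
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            result += code
        if c not in "HW":  # H and W do not separate repeated codes; vowels do
            prev = code
    return (result + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def classify(a: str, b: str, max_length_gap: int = 3, max_distance: int = 2) -> str:
    a, b = a.strip().lower(), b.strip().lower()  # assumption: ignore case
    if a == b:                                   # Level 1: already equal
        return "exact"
    if abs(len(a) - len(b)) > max_length_gap:    # Level 2: lengths too far apart
        return "no match"
    code = soundex(a)
    if code and code == soundex(b):              # Level 3: phonetic match
        return "sounds alike"
    if levenshtein(a, b) <= max_distance:        # Level 4: edit distance
        return "close spelling"
    return "no match"
```

Note that 'Smith' and 'Smigh' get different Soundex codes (S530 and S520), so that pair is only caught at Level 4, which fits the remark that Soundex alone is a weak check.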
    LVL 13

    Expert Comment

    There's not much chance of avoiding the checking altogether, and the Levenshtein distance does not seem much different from other ways of evaluating similarity. Maybe you can reduce the work a bit (just an idea, I have never tried it myself):

    sorting the entries
    checking each entry only against the following 2
    sorting again, while leaving the first character out
    checking each entry only against the following 2
    and so on...
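
The sort-and-compare-neighbours idea listed above might look like the sketch below. It is in Python for illustration only; `candidate_pairs`, `window`, `max_skip` and `max_distance` are hypothetical names and defaults, and the distance function is a plain Levenshtein implementation included so the block runs on its own. Exact duplicates are dropped first, on the assumption that they are found separately.

```python
def levenshtein(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def candidate_pairs(names, window=2, max_skip=2, max_distance=2):
    """Sort the names, compare each one only with the next `window` entries,
    then repeat with the first 1, 2, ... characters left out of the sort key."""
    unique = set(names)  # exact duplicates are assumed to be handled elsewhere
    found = set()
    for skip in range(max_skip + 1):
        ordered = sorted(unique, key=lambda s: s[skip:])
        for i, a in enumerate(ordered):
            for b in ordered[i + 1 : i + 1 + window]:
                if levenshtein(a, b) <= max_distance:
                    found.add((min(a, b), max(a, b)))  # store each pair once
    return found
```

Each pass costs one sort plus about `window` comparisons per name, instead of comparing every name with every other one. The trade-off is that a typo which lands two names far apart in every sort order will be missed.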

    Another approach might be to fully compare only entries that do not differ much in length, e.g. by at most +-1. For that you should calculate the length of every entry ONCE and store it for the comparisons. That would avoid calculating the full distance for obviously different string pairs like "Smith"/"Leuttheuser-Schnarrenberger".
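
A possible way to apply the length filter is to group the names by length once and only generate pairs from the same or neighbouring groups. The sketch below is Python for illustration; `pairs_with_similar_length` and `tolerance` are hypothetical names. It only produces the candidate pairs, which would then be passed to whatever distance function is used.

```python
from collections import defaultdict

def pairs_with_similar_length(names, tolerance=1):
    """Yield each pair of distinct names whose lengths differ by at most
    `tolerance`. Every length is computed once, when the name is bucketed."""
    by_length = defaultdict(list)
    for name in set(names):
        by_length[len(name)].append(name)
    for length in sorted(by_length):
        bucket = sorted(by_length[length])
        # Pairs within the same length bucket.
        for i, a in enumerate(bucket):
            for b in bucket[i + 1:]:
                yield a, b
        # Pairs with slightly longer names; shorter ones were already
        # paired when their own bucket was processed.
        for delta in range(1, tolerance + 1):
            for a in bucket:
                for b in by_length.get(length + delta, ()):
                    yield a, b
```

With the default tolerance, "Smith" is paired with "Smigh" and "Smyths" but never with "Leuttheuser-Schnarrenberger", so the expensive distance calculation is skipped for that pair.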

    On the other hand ... it's just 2.1E+10 comparisons (144,000 squared, or about 1.0E+10 if each unordered pair is checked only once) ... with a decent machine, the data completely loaded into RAM instead of a database on disk, and a good multithreaded approach, you'll be done fast. Thinking about it, that might be a good task for a CUDA application.

    [kidding] If everything fails, make a BOINC project from it ... [kidding off]

    Author Closing Comment

    Thank you!
