"Fuzzy" string comparison

Posted on 1998-01-30
Medium Priority
Last Modified: 2008-02-01
This query is more concerned with programming techniques than VB specifically.

I've recently had a large database dumped on me and job one is to clean up misspelt and miskeyed entries.  For instance, the database contains a field called "PC Model" and within this field "Compaq Deskpro" has been variously entered as "Coqmap Deksrops" "Compack Desqproo", etc.

I have coded a number of sort routines to compare actual entries to ideal entries on the basis of "like" comparisons, number and order of consonants, etc.

This is working after a fashion but what I really need is a robust alogrithm/programming device to make sense of actual data as compared to ideal data. The ideal tool would have an output based on probabilities, e.g.

"Copmaq Diskrop" is most likely to be "Compaq Deskpro"... and least likely to be "IBM AT"

Does anyone know of any resource where I can review theories, flow-charts, metalanguage code, or anything that will allow me to code the kind of solution I am looking for?

Question by:neilvenables
  • 4
  • 2
  • 2
  • +1

Author Comment

ID: 1454308
Edited text of question

Author Comment

ID: 1454309
Adjusted points to 400

Expert Comment

ID: 1454310
You can use a custom control for spell checking (there are plenty available) and create a custom dictionary containing only computer names to reduce the list the spellchecker checks against.

I'll not click answer so that you might wait for a code solution if anyone has one.  But if this ends up to be the best solution, I hope you'll give me credit for the answer.

The new generation of project management tools

With monday.com’s project management tool, you can see what everyone on your team is working in a single glance. Its intuitive dashboards are customizable, so you can create systems that work for you.


Expert Comment

ID: 1454311
By the way, A custom control written efficiently in C is likely to do checking faster than a VB algorithm.

Accepted Solution

anthonyc earned 1600 total points
ID: 1454312
This is tough, because you basically want to program artificial intelligence.  I would not let the code you write make the overall decision on what fixed value to put into the fields.  However, it could be to your advantage to have it select the "most likely" answer and then press OK to accept it.  That I could try to implement.  Basically, you would need to do this:

1)  Generate a list of VALID computer entries.  This will serve as your source list.

2)  Build another table to contain KNOWN mappings.  This table will be used to contain previously defined solutions.  For example, if IBM AT is typed "IPM AT" and you hit OK to select IBM AT, the system shouldn't ask you the next time IPM AT is typed in.

3)  Write this query.

select strValidValue as ValidValue, strValidValue as TypedValue
  from tblValidValues
 union strValidValue as ValidValue, strTypedValue as TypedValue
  from tblMappedValues

and put it in a recordset

4)  Now write this query and loop through it

select distinct computername
  from sourcetable

5) Select a percision (first 2 chars, first 3, etc)

5) For each computername in the sourcetable, do this

rsValidValues.findfirst "ucase$(strTypedValue) = """ & ucase$(rsValues!computername) & """"

6)  If a record is found in the above (rsValidValues.nomatch = false) then use rsValidValues!strValidValue as your computer name.

7)  If no record is presented, then do

rsvalidvalues.findfirst "strTypedName like """ & left$(rsValues!computername, intPercision) & "*""" 'this matches to the correct percision

8)  Present a listbox of all VALID computer entries, with the one found in step 7 selected (if nothing found, select nothing)

9)  When a user selects a valid entry, save it, and what was typed in tblMappedValues

Eventually, it'll get smart and you won't have to do much interaction.  In the beginning you are going to have to "train" it :)

Expert Comment

ID: 1454313
You might also want to have a look at Soundex-algorithm. The Soundex code is an indexing system which translates names into a 4 digit code consisting of 1 letter and 3 numbers. The most familiar application of Soundex is its use by the US Bureau of the Census to create an index for individuals listed in the US census records after 1880.

Now, for your examples with different (mis)spellings it returns following codes

Compaq Deskpro -> C512
Coqmap Deksrops -> C251
Compack Desqproo -> C512
Copmaq Diskrop -> C152

I added some produced by hitting several adjacent keys when typing (eg. F instead of D)

conpaw feslpto -> C511
conpaw seslpto -> C512
xonpaw seslpto -> X512

See a pattern?

One possibility is to insert a field with Soundex code into your database, you could build search routines for computer names which sound like "Compaq Deskpro" using different variations of the soundex code of the correct entry and then offer the user the possibility to correct them.

Eg. for Compaq Deskpro

... Like "C512" ...
... Like "?512" ...
... Like "C51?" ... etc.

Anyway, these are man-made errors and only people can reliably (?) correct them. Eg. we used to have a small computer manufacturer called IPM, but as you said anthonyc, most likely nowadays it would refer to IBM, but who knows.

If you like to see VB code for soundex-algorithm send me e-mail at Lasse.Rantanen@sci.fi

Expert Comment

ID: 1454314
Just one thing more ... if you happen to use some SQL server, it propable has soundex and related functions already.

A useful function is DIFFERENCE which compares soundex codes for two entries and returns a value describing how close they are

SELECT DIFFERENCE('Compaq Deskpro', 'Cimpaq Deskpro')

returns 4 (highest similarity)

SELECT DIFFERENCE('Compaq Deskpro', 'Xompaq Deskpro')

will return 2, a lower probability to be same strings.

Values returned are 0 to 4.

Author Comment

ID: 1454315
Excellent! Thanks for the help guys.

Author Comment

ID: 1454316
Excellent! Thanks for the help guys.

Featured Post

Never miss a deadline with monday.com

The revolutionary project management tool is here!   Plan visually with a single glance and make sure your projects get done.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I was working on a PowerPoint add-in the other day and a client asked me "can you implement a feature which processes a chart when it's pasted into a slide from another deck?". It got me wondering how to hook into built-in ribbon events in Office.
This article describes some techniques which will make your VBA or Visual Basic Classic code easier to understand and maintain, whether by you, your replacement, or another Experts-Exchange expert.
As developers, we are not limited to the functions provided by the VBA language. In addition, we can call the functions that are part of the Windows operating system. These functions are part of the Windows API (Application Programming Interface). U…
This lesson covers basic error handling code in Microsoft Excel using VBA. This is the first lesson in a 3-part series that uses code to loop through an Excel spreadsheet in VBA and then fix errors, taking advantage of error handling code. This l…
Suggested Courses
Course of the Month4 days, 19 hours left to enroll

601 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question