"Fuzzy" string comparison

This query is more concerned with programming techniques than VB specifically.

I've recently had a large database dumped on me and job one is to clean up misspelt and miskeyed entries.  For instance, the database contains a field called "PC Model" and within this field "Compaq Deskpro" has been variously entered as "Coqmap Deksrops" "Compack Desqproo", etc.

I have coded a number of sort routines to compare actual entries to ideal entries on the basis of "like" comparisons, number and order of consonants, etc.

This is working after a fashion but what I really need is a robust alogrithm/programming device to make sense of actual data as compared to ideal data. The ideal tool would have an output based on probabilities, e.g.

"Copmaq Diskrop" is most likely to be "Compaq Deskpro"... and least likely to be "IBM AT"

Does anyone know of any resource where I can review theories, flow-charts, metalanguage code, or anything that will allow me to code the kind of solution I am looking for?

Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

neilvenablesAuthor Commented:
Edited text of question
neilvenablesAuthor Commented:
Adjusted points to 400
You can use a custom control for spell checking (there are plenty available) and create a custom dictionary containing only computer names to reduce the list the spellchecker checks against.

I'll not click answer so that you might wait for a code solution if anyone has one.  But if this ends up to be the best solution, I hope you'll give me credit for the answer.

The Ultimate Tool Kit for Technolgy Solution Provi

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy for valuable how-to assets including sample agreements, checklists, flowcharts, and more!

By the way, A custom control written efficiently in C is likely to do checking faster than a VB algorithm.
This is tough, because you basically want to program artificial intelligence.  I would not let the code you write make the overall decision on what fixed value to put into the fields.  However, it could be to your advantage to have it select the "most likely" answer and then press OK to accept it.  That I could try to implement.  Basically, you would need to do this:

1)  Generate a list of VALID computer entries.  This will serve as your source list.

2)  Build another table to contain KNOWN mappings.  This table will be used to contain previously defined solutions.  For example, if IBM AT is typed "IPM AT" and you hit OK to select IBM AT, the system shouldn't ask you the next time IPM AT is typed in.

3)  Write this query.

select strValidValue as ValidValue, strValidValue as TypedValue
  from tblValidValues
 union strValidValue as ValidValue, strTypedValue as TypedValue
  from tblMappedValues

and put it in a recordset

4)  Now write this query and loop through it

select distinct computername
  from sourcetable

5) Select a percision (first 2 chars, first 3, etc)

5) For each computername in the sourcetable, do this

rsValidValues.findfirst "ucase$(strTypedValue) = """ & ucase$(rsValues!computername) & """"

6)  If a record is found in the above (rsValidValues.nomatch = false) then use rsValidValues!strValidValue as your computer name.

7)  If no record is presented, then do

rsvalidvalues.findfirst "strTypedName like """ & left$(rsValues!computername, intPercision) & "*""" 'this matches to the correct percision

8)  Present a listbox of all VALID computer entries, with the one found in step 7 selected (if nothing found, select nothing)

9)  When a user selects a valid entry, save it, and what was typed in tblMappedValues

Eventually, it'll get smart and you won't have to do much interaction.  In the beginning you are going to have to "train" it :)

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
You might also want to have a look at Soundex-algorithm. The Soundex code is an indexing system which translates names into a 4 digit code consisting of 1 letter and 3 numbers. The most familiar application of Soundex is its use by the US Bureau of the Census to create an index for individuals listed in the US census records after 1880.

Now, for your examples with different (mis)spellings it returns following codes

Compaq Deskpro -> C512
Coqmap Deksrops -> C251
Compack Desqproo -> C512
Copmaq Diskrop -> C152

I added some produced by hitting several adjacent keys when typing (eg. F instead of D)

conpaw feslpto -> C511
conpaw seslpto -> C512
xonpaw seslpto -> X512

See a pattern?

One possibility is to insert a field with Soundex code into your database, you could build search routines for computer names which sound like "Compaq Deskpro" using different variations of the soundex code of the correct entry and then offer the user the possibility to correct them.

Eg. for Compaq Deskpro

... Like "C512" ...
... Like "?512" ...
... Like "C51?" ... etc.

Anyway, these are man-made errors and only people can reliably (?) correct them. Eg. we used to have a small computer manufacturer called IPM, but as you said anthonyc, most likely nowadays it would refer to IBM, but who knows.

If you like to see VB code for soundex-algorithm send me e-mail at Lasse.Rantanen@sci.fi
Just one thing more ... if you happen to use some SQL server, it propable has soundex and related functions already.

A useful function is DIFFERENCE which compares soundex codes for two entries and returns a value describing how close they are

SELECT DIFFERENCE('Compaq Deskpro', 'Cimpaq Deskpro')

returns 4 (highest similarity)

SELECT DIFFERENCE('Compaq Deskpro', 'Xompaq Deskpro')

will return 2, a lower probability to be same strings.

Values returned are 0 to 4.
neilvenablesAuthor Commented:
Excellent! Thanks for the help guys.
neilvenablesAuthor Commented:
Excellent! Thanks for the help guys.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Visual Basic Classic

From novice to tech pro — start learning today.