Link to home
Start Free TrialLog in
Avatar of neilvenables
neilvenables

asked on

"Fuzzy" string comparison

This query is more concerned with programming techniques than VB specifically.

I've recently had a large database dumped on me and job one is to clean up misspelt and miskeyed entries.  For instance, the database contains a field called "PC Model" and within this field "Compaq Deskpro" has been variously entered as "Coqmap Deksrops" "Compack Desqproo", etc.

I have coded a number of sort routines to compare actual entries to ideal entries on the basis of "like" comparisons, number and order of consonants, etc.

This is working after a fashion but what I really need is a robust alogrithm/programming device to make sense of actual data as compared to ideal data. The ideal tool would have an output based on probabilities, e.g.

"Copmaq Diskrop" is most likely to be "Compaq Deskpro"... and least likely to be "IBM AT"

Does anyone know of any resource where I can review theories, flow-charts, metalanguage code, or anything that will allow me to code the kind of solution I am looking for?

Cheers,
Avatar of neilvenables
neilvenables

ASKER

Edited text of question
Adjusted points to 400
You can use a custom control for spell checking (there are plenty available) and create a custom dictionary containing only computer names to reduce the list the spellchecker checks against.

I'll not click answer so that you might wait for a code solution if anyone has one.  But if this ends up to be the best solution, I hope you'll give me credit for the answer.

By the way, A custom control written efficiently in C is likely to do checking faster than a VB algorithm.
ASKER CERTIFIED SOLUTION
Avatar of anthonyc
anthonyc

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
You might also want to have a look at Soundex-algorithm. The Soundex code is an indexing system which translates names into a 4 digit code consisting of 1 letter and 3 numbers. The most familiar application of Soundex is its use by the US Bureau of the Census to create an index for individuals listed in the US census records after 1880.

Now, for your examples with different (mis)spellings it returns following codes

Compaq Deskpro -> C512
Coqmap Deksrops -> C251
Compack Desqproo -> C512
Copmaq Diskrop -> C152

I added some produced by hitting several adjacent keys when typing (eg. F instead of D)

conpaw feslpto -> C511
conpaw seslpto -> C512
xonpaw seslpto -> X512

See a pattern?

One possibility is to insert a field with Soundex code into your database, you could build search routines for computer names which sound like "Compaq Deskpro" using different variations of the soundex code of the correct entry and then offer the user the possibility to correct them.

Eg. for Compaq Deskpro

... Like "C512" ...
... Like "?512" ...
... Like "C51?" ... etc.

Anyway, these are man-made errors and only people can reliably (?) correct them. Eg. we used to have a small computer manufacturer called IPM, but as you said anthonyc, most likely nowadays it would refer to IBM, but who knows.

If you like to see VB code for soundex-algorithm send me e-mail at Lasse.Rantanen@sci.fi
Just one thing more ... if you happen to use some SQL server, it propable has soundex and related functions already.

A useful function is DIFFERENCE which compares soundex codes for two entries and returns a value describing how close they are

SELECT DIFFERENCE('Compaq Deskpro', 'Cimpaq Deskpro')

returns 4 (highest similarity)

SELECT DIFFERENCE('Compaq Deskpro', 'Xompaq Deskpro')

will return 2, a lower probability to be same strings.

Values returned are 0 to 4.
Excellent! Thanks for the help guys.
Excellent! Thanks for the help guys.