Solved

"Fuzzy" string comparison

Posted on 1998-01-30
9
289 Views
Last Modified: 2008-02-01
This query is more concerned with programming techniques than VB specifically.

I've recently had a large database dumped on me and job one is to clean up misspelt and miskeyed entries.  For instance, the database contains a field called "PC Model" and within this field "Compaq Deskpro" has been variously entered as "Coqmap Deksrops" "Compack Desqproo", etc.

I have coded a number of sort routines to compare actual entries to ideal entries on the basis of "like" comparisons, number and order of consonants, etc.

This is working after a fashion but what I really need is a robust alogrithm/programming device to make sense of actual data as compared to ideal data. The ideal tool would have an output based on probabilities, e.g.

"Copmaq Diskrop" is most likely to be "Compaq Deskpro"... and least likely to be "IBM AT"

Does anyone know of any resource where I can review theories, flow-charts, metalanguage code, or anything that will allow me to code the kind of solution I am looking for?

Cheers,
0
Comment
Question by:neilvenables
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 2
  • 2
  • +1
9 Comments
 

Author Comment

by:neilvenables
ID: 1454308
Edited text of question
0
 

Author Comment

by:neilvenables
ID: 1454309
Adjusted points to 400
0
 
LVL 8

Expert Comment

by:mrmick
ID: 1454310
You can use a custom control for spell checking (there are plenty available) and create a custom dictionary containing only computer names to reduce the list the spellchecker checks against.

I'll not click answer so that you might wait for a code solution if anyone has one.  But if this ends up to be the best solution, I hope you'll give me credit for the answer.

0
Online Training Solution

Drastically shorten your training time with WalkMe's advanced online training solution that Guides your trainees to action. Forget about retraining and skyrocket knowledge retention rates.

 
LVL 8

Expert Comment

by:mrmick
ID: 1454311
By the way, A custom control written efficiently in C is likely to do checking faster than a VB algorithm.
0
 
LVL 6

Accepted Solution

by:
anthonyc earned 400 total points
ID: 1454312
This is tough, because you basically want to program artificial intelligence.  I would not let the code you write make the overall decision on what fixed value to put into the fields.  However, it could be to your advantage to have it select the "most likely" answer and then press OK to accept it.  That I could try to implement.  Basically, you would need to do this:

1)  Generate a list of VALID computer entries.  This will serve as your source list.

2)  Build another table to contain KNOWN mappings.  This table will be used to contain previously defined solutions.  For example, if IBM AT is typed "IPM AT" and you hit OK to select IBM AT, the system shouldn't ask you the next time IPM AT is typed in.

3)  Write this query.

select strValidValue as ValidValue, strValidValue as TypedValue
  from tblValidValues
 union strValidValue as ValidValue, strTypedValue as TypedValue
  from tblMappedValues

and put it in a recordset

4)  Now write this query and loop through it

select distinct computername
  from sourcetable

5) Select a percision (first 2 chars, first 3, etc)

5) For each computername in the sourcetable, do this

rsValidValues.findfirst "ucase$(strTypedValue) = """ & ucase$(rsValues!computername) & """"

6)  If a record is found in the above (rsValidValues.nomatch = false) then use rsValidValues!strValidValue as your computer name.

7)  If no record is presented, then do

rsvalidvalues.findfirst "strTypedName like """ & left$(rsValues!computername, intPercision) & "*""" 'this matches to the correct percision

8)  Present a listbox of all VALID computer entries, with the one found in step 7 selected (if nothing found, select nothing)

9)  When a user selects a valid entry, save it, and what was typed in tblMappedValues


Eventually, it'll get smart and you won't have to do much interaction.  In the beginning you are going to have to "train" it :)
0
 
LVL 4

Expert Comment

by:rantanen
ID: 1454313
You might also want to have a look at Soundex-algorithm. The Soundex code is an indexing system which translates names into a 4 digit code consisting of 1 letter and 3 numbers. The most familiar application of Soundex is its use by the US Bureau of the Census to create an index for individuals listed in the US census records after 1880.

Now, for your examples with different (mis)spellings it returns following codes

Compaq Deskpro -> C512
Coqmap Deksrops -> C251
Compack Desqproo -> C512
Copmaq Diskrop -> C152

I added some produced by hitting several adjacent keys when typing (eg. F instead of D)

conpaw feslpto -> C511
conpaw seslpto -> C512
xonpaw seslpto -> X512

See a pattern?

One possibility is to insert a field with Soundex code into your database, you could build search routines for computer names which sound like "Compaq Deskpro" using different variations of the soundex code of the correct entry and then offer the user the possibility to correct them.

Eg. for Compaq Deskpro

... Like "C512" ...
... Like "?512" ...
... Like "C51?" ... etc.

Anyway, these are man-made errors and only people can reliably (?) correct them. Eg. we used to have a small computer manufacturer called IPM, but as you said anthonyc, most likely nowadays it would refer to IBM, but who knows.

If you like to see VB code for soundex-algorithm send me e-mail at Lasse.Rantanen@sci.fi
0
 
LVL 4

Expert Comment

by:rantanen
ID: 1454314
Just one thing more ... if you happen to use some SQL server, it propable has soundex and related functions already.

A useful function is DIFFERENCE which compares soundex codes for two entries and returns a value describing how close they are

SELECT DIFFERENCE('Compaq Deskpro', 'Cimpaq Deskpro')

returns 4 (highest similarity)

SELECT DIFFERENCE('Compaq Deskpro', 'Xompaq Deskpro')

will return 2, a lower probability to be same strings.

Values returned are 0 to 4.
0
 

Author Comment

by:neilvenables
ID: 1454315
Excellent! Thanks for the help guys.
0
 

Author Comment

by:neilvenables
ID: 1454316
Excellent! Thanks for the help guys.
0

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article describes some techniques which will make your VBA or Visual Basic Classic code easier to understand and maintain, whether by you, your replacement, or another Experts-Exchange expert.
Since upgrading to Office 2013 or higher installing the Smart Indenter addin will fail. This article will explain how to install it so it will work regardless of the Office version installed.
As developers, we are not limited to the functions provided by the VBA language. In addition, we can call the functions that are part of the Windows operating system. These functions are part of the Windows API (Application Programming Interface). U…
Get people started with the process of using Access VBA to control Excel using automation, Microsoft Access can control other applications. An example is the ability to programmatically talk to Excel. Using automation, an Access application can laun…
Suggested Courses
Course of the Month4 days, 1 hour left to enroll

630 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question