Solved

"Fuzzy" string comparison

Posted on 1998-01-30
9
281 Views
Last Modified: 2008-02-01
This query is more concerned with programming techniques than VB specifically.

I've recently had a large database dumped on me and job one is to clean up misspelt and miskeyed entries.  For instance, the database contains a field called "PC Model" and within this field "Compaq Deskpro" has been variously entered as "Coqmap Deksrops" "Compack Desqproo", etc.

I have coded a number of sort routines to compare actual entries to ideal entries on the basis of "like" comparisons, number and order of consonants, etc.

This is working after a fashion but what I really need is a robust alogrithm/programming device to make sense of actual data as compared to ideal data. The ideal tool would have an output based on probabilities, e.g.

"Copmaq Diskrop" is most likely to be "Compaq Deskpro"... and least likely to be "IBM AT"

Does anyone know of any resource where I can review theories, flow-charts, metalanguage code, or anything that will allow me to code the kind of solution I am looking for?

Cheers,
0
Comment
Question by:neilvenables
  • 4
  • 2
  • 2
  • +1
9 Comments
 

Author Comment

by:neilvenables
ID: 1454308
Edited text of question
0
 

Author Comment

by:neilvenables
ID: 1454309
Adjusted points to 400
0
 
LVL 8

Expert Comment

by:mrmick
ID: 1454310
You can use a custom control for spell checking (there are plenty available) and create a custom dictionary containing only computer names to reduce the list the spellchecker checks against.

I'll not click answer so that you might wait for a code solution if anyone has one.  But if this ends up to be the best solution, I hope you'll give me credit for the answer.

0
 
LVL 8

Expert Comment

by:mrmick
ID: 1454311
By the way, A custom control written efficiently in C is likely to do checking faster than a VB algorithm.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 6

Accepted Solution

by:
anthonyc earned 400 total points
ID: 1454312
This is tough, because you basically want to program artificial intelligence.  I would not let the code you write make the overall decision on what fixed value to put into the fields.  However, it could be to your advantage to have it select the "most likely" answer and then press OK to accept it.  That I could try to implement.  Basically, you would need to do this:

1)  Generate a list of VALID computer entries.  This will serve as your source list.

2)  Build another table to contain KNOWN mappings.  This table will be used to contain previously defined solutions.  For example, if IBM AT is typed "IPM AT" and you hit OK to select IBM AT, the system shouldn't ask you the next time IPM AT is typed in.

3)  Write this query.

select strValidValue as ValidValue, strValidValue as TypedValue
  from tblValidValues
 union strValidValue as ValidValue, strTypedValue as TypedValue
  from tblMappedValues

and put it in a recordset

4)  Now write this query and loop through it

select distinct computername
  from sourcetable

5) Select a percision (first 2 chars, first 3, etc)

5) For each computername in the sourcetable, do this

rsValidValues.findfirst "ucase$(strTypedValue) = """ & ucase$(rsValues!computername) & """"

6)  If a record is found in the above (rsValidValues.nomatch = false) then use rsValidValues!strValidValue as your computer name.

7)  If no record is presented, then do

rsvalidvalues.findfirst "strTypedName like """ & left$(rsValues!computername, intPercision) & "*""" 'this matches to the correct percision

8)  Present a listbox of all VALID computer entries, with the one found in step 7 selected (if nothing found, select nothing)

9)  When a user selects a valid entry, save it, and what was typed in tblMappedValues


Eventually, it'll get smart and you won't have to do much interaction.  In the beginning you are going to have to "train" it :)
0
 
LVL 4

Expert Comment

by:rantanen
ID: 1454313
You might also want to have a look at Soundex-algorithm. The Soundex code is an indexing system which translates names into a 4 digit code consisting of 1 letter and 3 numbers. The most familiar application of Soundex is its use by the US Bureau of the Census to create an index for individuals listed in the US census records after 1880.

Now, for your examples with different (mis)spellings it returns following codes

Compaq Deskpro -> C512
Coqmap Deksrops -> C251
Compack Desqproo -> C512
Copmaq Diskrop -> C152

I added some produced by hitting several adjacent keys when typing (eg. F instead of D)

conpaw feslpto -> C511
conpaw seslpto -> C512
xonpaw seslpto -> X512

See a pattern?

One possibility is to insert a field with Soundex code into your database, you could build search routines for computer names which sound like "Compaq Deskpro" using different variations of the soundex code of the correct entry and then offer the user the possibility to correct them.

Eg. for Compaq Deskpro

... Like "C512" ...
... Like "?512" ...
... Like "C51?" ... etc.

Anyway, these are man-made errors and only people can reliably (?) correct them. Eg. we used to have a small computer manufacturer called IPM, but as you said anthonyc, most likely nowadays it would refer to IBM, but who knows.

If you like to see VB code for soundex-algorithm send me e-mail at Lasse.Rantanen@sci.fi
0
 
LVL 4

Expert Comment

by:rantanen
ID: 1454314
Just one thing more ... if you happen to use some SQL server, it propable has soundex and related functions already.

A useful function is DIFFERENCE which compares soundex codes for two entries and returns a value describing how close they are

SELECT DIFFERENCE('Compaq Deskpro', 'Cimpaq Deskpro')

returns 4 (highest similarity)

SELECT DIFFERENCE('Compaq Deskpro', 'Xompaq Deskpro')

will return 2, a lower probability to be same strings.

Values returned are 0 to 4.
0
 

Author Comment

by:neilvenables
ID: 1454315
Excellent! Thanks for the help guys.
0
 

Author Comment

by:neilvenables
ID: 1454316
Excellent! Thanks for the help guys.
0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

There are many ways to remove duplicate entries in an SQL or Access database. Most make you temporarily insert an ID field, make a temp table and copy data back and forth, and/or are slow. Here is an easy way in VB6 using ADO to remove duplicate row…
I’ve seen a number of people looking for examples of how to access web services from VB6.  I’ve been using a test harness I built in VB6 (using many resources I found online) that I use for small projects to work out how to communicate with web serv…
Get people started with the utilization of class modules. Class modules can be a powerful tool in Microsoft Access. They allow you to create self-contained objects that encapsulate functionality. They can easily hide the complexity of a process from…
This lesson covers basic error handling code in Microsoft Excel using VBA. This is the first lesson in a 3-part series that uses code to loop through an Excel spreadsheet in VBA and then fix errors, taking advantage of error handling code. This l…

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now