Solved

Need some string searhcing help

Posted on 2006-10-22
2
195 Views
Last Modified: 2010-04-16
Hi,

I am creating a string match scoring algorithm to help match street names, in spite of user mis-spellings.  And I need a way to find a score a letter based on its location.

For example:

If I'm looking for Maple St, and am testing Mapple St

The sheer number of matching chracters is important, and I check for that.  But it's not enough.  Too many false positives.  So I added a scoring methodology that checks for word chunks, twl letters in size.

MA, AP, PL, LE versus MA, AP, PP, PL, LE

and that helps.  But believe it or not, I still have problems.

My algorithm of course picks Mapple St as the best match if the correcly spelled one is not in the list.  The problem I have is when Maple St is not in the list and there is nothing close.  I need a way to differentiate from words that have lots of letter matches and word chunk matches, and words that clearly are different.

WOODLAND versus DOGWOOD, say.

So I want to search based on position as well. Let's go back to Maple...

I would like to iterate through every letter in Maple St, one at a time.

I would like to get a score back from Mapple St for each letter, based on each matching character's proximity to the position in the original street name.

'e', in position 4, would get a score 1

Maple St              3 - it's in the exactly right position
Mapple St            2 - it's off by one
Maapple St          1 - it's off by two

otherise there's no score allowed for the 'e'

If you have any better ideas, I'd love to hear them.  But I'm more interested in how to execute this with C#.

Thanks,
Bob

0
Comment
Question by:ba272
2 Comments
 
LVL 35

Accepted Solution

by:
Raynard7 earned 500 total points
ID: 17786181
Hi,

To do this what most search engines use is a combination of sundex and levenshtein distance calculations.

The first (soundex) evaluates the similarity in sounds using the consonants between two strings. There are very many different ways of doing this - giving weighting to repeated letters, some bring it to 4 digits others are unlimited - below is a working example of a soundex algorithm using c#.  Many databases have these functions built in.

Soundex
http://www.csharphelp.com/archives2/archive394.html

Levenshtein Distance is the difference between two words - by adding , deleting, changing characters.  Using this you can measure the similarity between the words - if they are within a certain tolerance then you can suggest them as being similar.

Levenshtein Distance
http://www.merriampark.com/ld.htm



There is plenty on google if you search for either of these terms which can give a much more detailed explanation than I can here.
0
 

Author Comment

by:ba272
ID: 17786252
Thanks.  I found a C# version and tested it.  It works okay.  But before I decide I'd like to find other aproached that may give me more control over the scoring methodology.

I'm still interested in hearing about other position searching functions there are built into C#.

Thanks.
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Suggested Solutions

Bit flags and bit flag manipulation is perhaps one of the most underrated strategies in programming, likely because most programmers developing in high-level languages rely too much on the high-level features, and forget about the low-level ones. Th…
Article by: Najam
Having new technologies does not mean they will completely replace old components.  Recently I had to create WCF that will be called by VB6 component.  Here I will describe what steps one should follow while doing so, please feel free to post any qu…
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now