Solved

Need some string searhcing help

Posted on 2006-10-22
2
198 Views
Last Modified: 2010-04-16
Hi,

I am creating a string match scoring algorithm to help match street names, in spite of user mis-spellings.  And I need a way to find a score a letter based on its location.

For example:

If I'm looking for Maple St, and am testing Mapple St

The sheer number of matching chracters is important, and I check for that.  But it's not enough.  Too many false positives.  So I added a scoring methodology that checks for word chunks, twl letters in size.

MA, AP, PL, LE versus MA, AP, PP, PL, LE

and that helps.  But believe it or not, I still have problems.

My algorithm of course picks Mapple St as the best match if the correcly spelled one is not in the list.  The problem I have is when Maple St is not in the list and there is nothing close.  I need a way to differentiate from words that have lots of letter matches and word chunk matches, and words that clearly are different.

WOODLAND versus DOGWOOD, say.

So I want to search based on position as well. Let's go back to Maple...

I would like to iterate through every letter in Maple St, one at a time.

I would like to get a score back from Mapple St for each letter, based on each matching character's proximity to the position in the original street name.

'e', in position 4, would get a score 1

Maple St              3 - it's in the exactly right position
Mapple St            2 - it's off by one
Maapple St          1 - it's off by two

otherise there's no score allowed for the 'e'

If you have any better ideas, I'd love to hear them.  But I'm more interested in how to execute this with C#.

Thanks,
Bob

0
Comment
Question by:ba272
2 Comments
 
LVL 35

Accepted Solution

by:
Raynard7 earned 500 total points
ID: 17786181
Hi,

To do this what most search engines use is a combination of sundex and levenshtein distance calculations.

The first (soundex) evaluates the similarity in sounds using the consonants between two strings. There are very many different ways of doing this - giving weighting to repeated letters, some bring it to 4 digits others are unlimited - below is a working example of a soundex algorithm using c#.  Many databases have these functions built in.

Soundex
http://www.csharphelp.com/archives2/archive394.html

Levenshtein Distance is the difference between two words - by adding , deleting, changing characters.  Using this you can measure the similarity between the words - if they are within a certain tolerance then you can suggest them as being similar.

Levenshtein Distance
http://www.merriampark.com/ld.htm



There is plenty on google if you search for either of these terms which can give a much more detailed explanation than I can here.
0
 

Author Comment

by:ba272
ID: 17786252
Thanks.  I found a C# version and tested it.  It works okay.  But before I decide I'd like to find other aproached that may give me more control over the scoring methodology.

I'm still interested in hearing about other position searching functions there are built into C#.

Thanks.
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Summary: Persistence is the capability of an application to store the state of objects and recover it when necessary. This article compares the two common types of serialization in aspects of data access, readability, and runtime cost. A ready-to…
Entity Framework is a powerful tool to help you interact with the DataBase but still doesn't help much when we have a Stored Procedure that returns more than one resultset. The solution takes some of out-of-the-box thinking; read on!
In an interesting question (https://www.experts-exchange.com/questions/29008360/) here at Experts Exchange, a member asked how to split a single image into multiple images. The primary usage for this is to place many photographs on a flatbed scanner…

829 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question