Solved

Need some string searhcing help

Posted on 2006-10-22
2
202 Views
Last Modified: 2010-04-16
Hi,

I am creating a string match scoring algorithm to help match street names, in spite of user mis-spellings.  And I need a way to find a score a letter based on its location.

For example:

If I'm looking for Maple St, and am testing Mapple St

The sheer number of matching chracters is important, and I check for that.  But it's not enough.  Too many false positives.  So I added a scoring methodology that checks for word chunks, twl letters in size.

MA, AP, PL, LE versus MA, AP, PP, PL, LE

and that helps.  But believe it or not, I still have problems.

My algorithm of course picks Mapple St as the best match if the correcly spelled one is not in the list.  The problem I have is when Maple St is not in the list and there is nothing close.  I need a way to differentiate from words that have lots of letter matches and word chunk matches, and words that clearly are different.

WOODLAND versus DOGWOOD, say.

So I want to search based on position as well. Let's go back to Maple...

I would like to iterate through every letter in Maple St, one at a time.

I would like to get a score back from Mapple St for each letter, based on each matching character's proximity to the position in the original street name.

'e', in position 4, would get a score 1

Maple St              3 - it's in the exactly right position
Mapple St            2 - it's off by one
Maapple St          1 - it's off by two

otherise there's no score allowed for the 'e'

If you have any better ideas, I'd love to hear them.  But I'm more interested in how to execute this with C#.

Thanks,
Bob

0
Comment
Question by:ba272
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
2 Comments
 
LVL 35

Accepted Solution

by:
Raynard7 earned 500 total points
ID: 17786181
Hi,

To do this what most search engines use is a combination of sundex and levenshtein distance calculations.

The first (soundex) evaluates the similarity in sounds using the consonants between two strings. There are very many different ways of doing this - giving weighting to repeated letters, some bring it to 4 digits others are unlimited - below is a working example of a soundex algorithm using c#.  Many databases have these functions built in.

Soundex
http://www.csharphelp.com/archives2/archive394.html

Levenshtein Distance is the difference between two words - by adding , deleting, changing characters.  Using this you can measure the similarity between the words - if they are within a certain tolerance then you can suggest them as being similar.

Levenshtein Distance
http://www.merriampark.com/ld.htm



There is plenty on google if you search for either of these terms which can give a much more detailed explanation than I can here.
0
 

Author Comment

by:ba272
ID: 17786252
Thanks.  I found a C# version and tested it.  It works okay.  But before I decide I'd like to find other aproached that may give me more control over the scoring methodology.

I'm still interested in hearing about other position searching functions there are built into C#.

Thanks.
0

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Article by: Ivo
C# And Nullable Types Since 2.0 C# has Nullable(T) Generic Structure. The idea behind is to allow value type objects to have null values just like reference types have. This concerns scenarios where not all data sources have values (like a databa…
Performance in games development is paramount: every microsecond counts to be able to do everything in less than 33ms (aiming at 16ms). C# foreach statement is one of the worst performance killers, and here I explain why.
Michael from AdRem Software outlines event notifications and Automatic Corrective Actions in network monitoring. Automatic Corrective Actions are scripts, which can automatically run upon discovery of a certain undesirable condition in your network.…
This is my first video review of Microsoft Bookings, I will be doing a part two with a bit more information, but wanted to get this out to you folks.
Suggested Courses

615 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question