
A Guide to Approximate String Matching

Awarded Community Pick
Okay. So what exactly is the problem here?

How often have we come across situations where we need to know if two strings are 'similar' but not necessarily the same? I have, plenty of times. Until recently, I thought any functionality like that would be impossible to get my head around. Fortunately, there was already a vast field of well-developed research to come to my rescue!

Whoa, hang on! I'm a beginner here. What kind of field is this?

This measure of 'similarity' between strings is known as 'Approximate String Matching'. The basic steps in comparing any two strings are as follows:

a) Analyze the two strings
b) Compute a 'metric' for these two strings
c) Check if the value of the metric is above or below a certain threshold.
d) Decide if the strings 'match'.
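To make those four steps concrete, here is a minimal sketch (in Python, for brevity) that wires them together. The metric used here, the relative difference in length, is only a placeholder; choosing a better metric is what the rest of this article is about:

```python
def strings_match(a, b, threshold):
    # (a) Analyze the two strings
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings trivially match
    # (b) Compute a 'metric' -- here, a placeholder: relative length difference
    metric = abs(len(a) - len(b)) / longest
    # (c) Check the metric against a threshold, and (d) decide
    return metric <= threshold

print(strings_match("John", "Johnny", 0.5))  # True: lengths differ by 2 out of 6
```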

Metrics? Approximate String matching? Steps? What??

Relax. The above 4 steps will become a little clearer the moment I elaborate on one of these metrics.

But wait. What metrics are these?

A metric is a quantity of measurement to put it simply. Here we need to choose an appropriate metric to represent how similar or different two strings are.

Oh okay. So if we're measuring the same thing, why do we need different metrics?

A person's weight can be measured in kilograms or pounds. Your height can be in feet or inches. Similarly, there are different approaches to measure how similar two strings are.

So there are more metrics available?

There are plenty of metrics to choose from. Some are:

Levenshtein edit distance
Hamming distance
Longest common substring
Longest common subsequence

Can you tell me a little more about each?

Of course! We'll be covering the Levenshtein Edit distance in more detail later, but here's a little bit of information about the other three:

Hamming Distance: The Hamming distance, to put it simply, counts the positions at which the characters of two equal-length strings differ. For example, the distance between "John" and "Joan" is 1. Note that this metric is only defined for strings of the same length, and it treats the position of each character as the major aspect of the comparison. This means "John" and "Jonh" have a Hamming distance of 2. While we can see, at first glance, that the second string is probably just a typographical error of the first, this metric concludes that they are quite different.
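The Hamming distance is only a few lines of code. A quick sketch in Python (the article's own snippets are C#, but the idea is language-agnostic):

```python
def hamming(a, b):
    # The Hamming distance is only defined for strings of equal length
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    # Count the positions at which the characters differ
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming("John", "Joan"))  # 1
print(hamming("John", "Jonh"))  # 2 -- a simple transposition costs double
```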

Longest Common Substring: As the name suggests, this method finds the length of the longest substring present in all the strings being compared. For example, in the strings "BON", "RIBBON" and "BONAFIDE", the longest common substring is "BON". Clearly, the longer the common substring, the more similar the strings being compared. Again, as with the Hamming distance, the position of each character is essential in determining the degree of similarity.
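Here is a brute-force sketch in Python, fine for short strings like these (production code would use dynamic programming or a suffix tree instead):

```python
def longest_common_substring(*strings):
    # Try every substring of the shortest string, longest first,
    # and return the first one found in all the other strings.
    shortest = min(strings, key=len)
    n = len(shortest)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

print(longest_common_substring("BON", "RIBBON", "BONAFIDE"))  # BON
```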

Longest Common Subsequence: It's important to note that while it is easy to confuse this method with the previous one, they are in fact very different. A substring is a sequence of characters exactly as they appear in the original string. That is to say, a substring has to form a directly mappable, contiguous part of the original string and contain characters in exactly the same order as the parent string. A subsequence, however, has no such restriction. Characters in a subsequence only need to have the same relative order as in the parent string.

Let's take an example. We have the string "TURTLE". Some valid substrings of this string are "URTLE", "TUR" and "RTL". Some valid subsequences are "TE", "TUE" and "RLE". Notice that in a subsequence the absolute positions of the characters are irrelevant, but their relative order must match the parent string. "RLE" is valid because, even though these characters don't appear consecutively in "TURTLE", the character 'R' still appears before 'L', which appears before 'E'.

We can now understand what the longest common subsequence searches for. If we compare "PENNY","ENTITY" and "EPIPHANY", we will observe that the longest common subsequence is "ENY". While position in this method is not as critical as in the other two methods mentioned above, relative position still reduces this method's effectiveness when confronted with typos.
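The standard two-string solution is a dynamic-programming table, sketched here in Python. (Chaining it pairwise is a shortcut that happens to reproduce the three-string example above; the true multi-string version generalizes the same table to more dimensions.)

```python
def lcs(a, b):
    # dp[i][j] holds a longest common subsequence of a[:i] and b[:j]
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters extend the diagonal solution
                dp[i][j] = dp[i - 1][j - 1] + a[i - 1]
            else:
                # Otherwise keep the longer of the two neighbors
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=len)
    return dp[m][n]

# Applying it pairwise reproduces the article's example:
print(lcs(lcs("PENNY", "ENTITY"), "EPIPHANY"))  # ENY
```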

So which one can I use?

My preference and, indeed, the one I always use is the Levenshtein distance. This is mainly because the Levenshtein distance also accounts for typos and does not, necessarily, weigh the actual positions of characters in the strings as strictly as some of the other metrics. You can get more information on each of these, and more, on their individual Wikipedia pages.

Okay great. So now I know that I'm going to calculate the Levenshtein distance for my strings. Can you tell me something more about this?

The basic idea of the Levenshtein distance is to find the number of single-character changes (insertions, deletions or substitutions) you need to make to turn one string into the other. For example, if I have "John" and "Johnny", I need to either add two characters to the first to get the second, or remove two characters from the second to get the first. That gives me a distance of 2. Similarly, if I have "John" and "Jones", I need to remove the character 'h' from "John" and add the two characters 'e' and 's' to make it "Jones". That gives me a distance of 3.

Ah okay, that makes sense.

See, I told you!

Writing code for that is a whole other ball game.

Yes, I know. Fortunately, there is code available on the internet to compute the Levenshtein distance, free for unrestricted personal and commercial use. Below is one such function in C#:

http://dotnetperls.com/levenshtein

public static int GetLevenshteinDistance(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    // If either string is empty, the distance is the other string's length
    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // First column and row: transforming to or from an empty prefix
    // costs one operation per character
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }

    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }

    // Each cell is the cheapest of a deletion, an insertion,
    // or a substitution (free when the characters match)
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1,     // deletion
                         d[i, j - 1] + 1),    // insertion
                d[i - 1, j - 1] + cost);      // substitution
        }
    }

    return d[n, m];
}



So what do I do with this now?

This code returns a number like the ones we arrived at earlier: 2 for "John" and "Johnny", and 3 for "John" and "Jones". Now we can try to derive, as a percentage, how much one string has to be altered to become the other.

Consider A = "John" and B = "Johnny".

Since we're checking two strings, we can either change A to B or B to A. Either one will need the same number of operations.

A sensible question to ask here is: which string would need to be altered the least to obtain the other?

I didn't understand that.

Well, here we need to compare A and B. We know now that the distance between them is 2. How significant is that change for each string? Changing A would involve a net change of 2 out of 4 characters. Changing B would involve a net change of 2 out of 6 characters.

Clearly, 2/4 (50%) is more than 2/6 (33.33%).

Okay. What do I do with this information?

The purpose here is to find out a 'score' which we can then use to compare any two strings.

So this score is basically the lowest percentage of change in either string?

Exactly. In the case of A and B, the score will be 2/6, since it is less than 2/4. Similarly, for any two strings A and B the score can be obtained using:

double score = (double)GetLevenshteinDistance(A, B) / Math.Max(A.Length, B.Length); // cast to double: plain integer division would truncate the result to 0



This is exactly what we did earlier. We just divide the distance by the larger length to get a score which, when multiplied by 100, gives the percentage of change.

And how does getting this score help?

This means you can now programmatically specify a condition like: 'Only consider two strings equal if one of them can be turned into the other by changing no more than 30% of its characters.'

How exactly?

Using this code:

String A = "John";
String B = "Johnny";
int distance = GetLevenshteinDistance(A, B);
double score = (double)distance / Math.Max(A.Length, B.Length);

if (score <= 0.3) // 0.3 corresponds to 30%
{
    Console.WriteLine("Bingo. These are almost the same.");
}
else
{
    Console.WriteLine("Nope. Not close enough.");
}



Cool. What if I want them to be considered as equal only if the change percentage is, say, less than or equal to 22%?

You just change the '0.3' in the code to '0.22'. This works for any percentage you want.

Can you show me some place I can actually use this?

Let's see. Suppose you need to make a program to get the scores of students from a database based on a teacher entering the name as input.

Teachers are human, last time I checked. Suppose the teacher entered 'Jonh' instead of 'John'. You obviously will not find a record for 'Jonh' in the database. But if you're using this method of approximate matching and you compare 'John' with the input 'Jonh', it can register a match and seamlessly return John's scores. (Note that the Levenshtein distance between 'John' and 'Jonh' is 2, a change score of 0.5, so your threshold has to be generous enough to allow it.)
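Here is a sketch of that lookup in Python. The levenshtein() function mirrors the C# code above (keeping only the previous row of the matrix instead of the full table), and the student records are made-up sample data; note the generous 0.5 threshold needed to let 'Jonh' match 'John':

```python
def levenshtein(s, t):
    # Same algorithm as the C# GetLevenshteinDistance above,
    # but storing only the previous row of the matrix
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[m]

# Hypothetical student records for the example
grades = {"John": 87, "Jane": 92, "Mary": 71}

def lookup(typed, threshold=0.5):
    # Pick the stored name with the lowest change score...
    best = min(grades,
               key=lambda name: levenshtein(typed, name) / max(len(typed), len(name)))
    change = levenshtein(typed, best) / max(len(typed), len(best))
    # ...and accept it only if the change is within the threshold
    return grades[best] if change <= threshold else None

print(lookup("Jonh"))  # 87 -- the typo still finds John's record
```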

Is there anything else I should know before using this?

Yes. There are always other metrics to choose from and different ones are suitable for different uses. Use this article as a starting point to research other ways of approximately matching strings.

Great. Anything else before I leave?

Yes! You can mark this article as helpful if you thought it was :-)

Edit:

As DanRollins pointed out in his comment, Soundex is another metric used for approximate string matching. It is used to determine whether two strings 'sound' the same. Soundex belongs to a family of algorithms called phonetic algorithms. You can read more about it on its Wikipedia page.
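For the curious, here is a compact Python sketch of classic American Soundex. It simplifies the official rules slightly (H and W are dropped outright, and single-word, non-empty ASCII input is assumed), but it reproduces the textbook examples:

```python
def soundex(name):
    # Map consonants to their Soundex digit groups
    mapping = {}
    for chars, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in chars:
            mapping[ch] = digit
    name = name.upper()
    first = name[0]
    # Drop H and W, then code each remaining letter ('' for vowels)
    coded = [mapping.get(ch, "") for ch in name if ch not in "HW"]
    # Collapse adjacent identical digits; a vowel ('') breaks the run
    digits, prev = [], None
    for code in coded:
        if code and code != prev:
            digits.append(code)
        prev = code
    # The first letter is kept literally, so drop its own digit if present
    if digits and digits[0] == mapping.get(first, ""):
        digits = digits[1:]
    # Pad with zeros and truncate to the standard 4-character code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("John") == soundex("Jonh"))    # True -- they 'sound' the same
```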
Author: nikhilmenon
6 Comments
 

Expert Comment by CCHP_GRP:
Awesome article, thanks for the great insight!
 
Expert Comment by Mark Wills:
Good Article, thanks... And yes, I did mark this article helpful :)
 
Expert Comment by GwynforWeb:
Nice job, well done.

Expert Comment by DanRollins:
Another commonly-used metric is SOUNDEX and it has the advantage of being directly available (no manual calculations) in SQL Server and perhaps other DBMSes.
 
Expert Comment by aikimark:
I spotted a typo:
>>We can now understand what the longest common subsequence searches for. If we compare "PENNY","ENTITY" and "EPIPHANY", we will observe that the longest common substring is "ENY".

ENY is a sub-sequence, not a substring.

===========
Here is a T-SQL implementation:
http://www.merriampark.com/ldtsql.htm
 
Expert Comment by ozo:
A nice algorithm to search for approximate string matches is agrep:
http://en.wikipedia.org/wiki/Agrep

A nice way to find longest common substrings is to use a suffix tree.

A nice way to find longest common subsequences is Hirschberg's algorithm.
