R is considered the predominant language for data scientist and statisticians. Learn how to use R for your own data science projects.

Published on

22,031 Points

How often have we come across situations where we need to know if two strings are 'similar' but not necessarily the same? I have, plenty of times. Until recently, I thought any functionality like that would be impossible to get my head around. Fortunately, there was already a vast field of well developed research to come to my rescue!

This measure of 'similarity' between strings is known as 'Approximate String Matching'. The basic steps followed in comparing any two strings are as followed:

a) Analyze the two strings

b) Compute a 'metric' for these two strings

c) Check if the value of the metric is above or below a certain threshold.

d) Decide if the strings 'match'.

Relax. The above 4 steps will become a little clearer the moment I elaborate on one of these metrics.

A metric is a quantity of measurement to put it simply. Here we need to choose an appropriate metric to represent how similar or different two strings are.

A person's weight can be measured in kilograms or pounds. Your height can be in feet or inches. Similarly, there are different approaches to measure how similar two strings are.

There are plenty of metrics to choose from. Some are:

Levenshtein edit distance

Hamming distance

Longest common substring

Longest common subsequence

Of course! We'll be covering the Levenshtein Edit distance in more detail later, but here's a little bit of information about the other three:

Let's take an example here. We have the string "TURTLE". Some valid substrings of this string are - "URTLE","TUR","RTL", etc. Some valid subsequences of this string are - "TE","TUE","RLE". Notice, that in the subsequence, the absolute position of the characters is irrelevant, but their relative position must be the same as the parent string. "RLE" is valid because even though they don't appear consecutively in the string "TURTLE", the character 'R' still appears before 'L' which appears before 'E'.

We can now understand what the longest common subsequence searches for. If we compare "PENNY","ENTITY" and "EPIPHANY", we will observe that the longest common subsequence is "ENY". While position in this method is not as critical as in the other two methods mentioned above, relative position still reduces this method's effectiveness when confronted with typos.

My preference and, indeed, the one I always use is the Levenshtein distance. This is mainly because the Levenshtein distance also takes into account typos and does not, necessarily, take into account the actual positions of the string as strictly as some of the other metrics. You can get more information on each of these, and more in their individual Wikipedia pages.

The basic idea of the Levenshtein distance is to find the number of changes you need to make from one string , to get the other. For example, if I have "John" and "Johnny", I need to either add two characters to the first to get the second, or subtract two characters from the second, to get the first. That gives me a distance of 2. Similarly, if I have "John" and "Jones" I need to remove a character 'h' from "John" and add two characters 'e' and 's' to make it "Jones". That gives me a distance of 3.

See, I told you!

Yes, I know. Fortunately, to compute Levenshtein distance there is code available on the internet for unrestricted personal and commercial use. Below is one such function in C#:

http://dotnetperls.com/levenshtein

```
public static int GetLevenshteinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
```

This code is going to return a number similar to the ones we arrived at earlier. It would return 2 for "John" and "Johnny" and 3 for "John" and "Jones". Now we can try and derive (in percentage) how much the string has to be altered to be made equal with the other.

Consider A = "John" and B = "Johnny".

Since we're checking two strings, we can either change A to B or B to A. Either one will need the same number of operations.

An ideal thing to do here is find out, which string would be altered the least to obtain the other string?

Well, here we need to compare A and B. We know now that the distance between these is 2. How effective is the change on each of these strings? Changing A would involve a net change of 2 out of 4 character. Changing B would involve a net change of 2 out of 6 characters.

Clearly, 2/4 (50%) is more than 2/6 (33.33%).

The purpose here is to find out a 'score' which we can then use to compare any two strings.

Exactly. In the case of A and B, the score will be 2/6 since it is lesser than 2/4. Similarly for any two strings A and B the score can be obtained by using:

```
double score = GetLevenshteinDistance(A,B)/Math.Max(A.Length,B.Length);
```

This is exactly what we did earlier. We just divide the distance by the largest length to get a score which when multiplied by 100, will give you the percentage of change.

This means you can now programmatically specify a condition like - 'Only consider two strings equal if either of the strings does not have to be changed more than 30% of what it originally is.

Using this code:

```
String A = "John";
String B = "Johnny";
int distance = GetLevenshteinDistance(A, B);
double score = (double)distance / Math.Max(A.Length, B.Length);
if (score < 0.3) //0.3 corresponds to 30%
{
Console.WriteLine("Bingo. These are almost the same.");
}
else
{
Console.WriteLine("Nope. Not close enough.");
}
```

you just change the '0.3' in the code to'0.22'. This works for any percentage you want.

Let's see. Suppose you need to make a program to get the scores of students from a database based on a teacher entering the name as input.

Teachers are human, last time I checked. Suppose the teacher entered 'Jonh' instead of 'John'. You obviously will not find a record for 'Jonh' in the database. But, in case you're using this method of approximate matching and you compare 'John' with the input 'Jonh', it will register a match and seamlessly return John's scores without any issues.

Yes. There are always other metrics to choose from and different ones are suitable for different uses. Use this article as a starting point to research other ways of approximately matching strings.

Yes! You can mark this article as helpful if you thought it was :-)

As DanRollins pointed out in his comment, Soundex is another metric used for approximate string matching. It is used to determine whether two strings 'sound' the same. Soundex belongs to a branch of algorithms called phonetic algorithms. You can read more about it in the wiki page here.

6 Comments

>>We can now understand what the longest common subsequence searches for. If we compare "PENNY","ENTITY" and "EPIPHANY", we will observe that the longest common substring is "ENY".

===========

Here is a T-SQL implementaion:

http://www.merriampark.com/ldtsql.htm

http://en.wikipedia.org/wiki/Agrep

A nice way to find longest common substrings is to use a suffix tree

A nice way to find find longest common subsequences is Hirschberg's algorithm

Title | Views | Activity |
---|---|---|

Binary Search | 840 | |

Finding GCD (Euclidean Algorithm) | 1,025 | |

Introduction to Recursion | 1,045 | |

Bloom Filters | 804 |

Next Article:Introduction to Recursion