Solved

# How to find the longest common substing from 2 strings?

Posted on 2011-05-02
844 Views
Hey,

I was trying to find the longest common substring from 2 different strings, like...the longest common substring between AAAAAAAAAACCCTCCCCCC and TTTTTTTAAAAACCCCGGGG is AAAAACCC.

The first thing that comes to my mind, is, I take the shorter string, and keep comparing the first letter in the shorter string, to the first letter in the larger string, and if the letters don't match, we can simply remove that letter from the larger string, and try again, but if they do match, we move on to the next string and go on.  At any point, if the n+1 letter doesn't match, we keep the string we found, and start fresh...

But... this seems to be a brute force method... is there a way we can do this in linear time?

One idea that comes to mind, is that we take the first letter in string1, and start from the first occurrence of that letter in string2, but a few problems that arise from that approach, is that I may accidentally delete the largest string, or, it might not help at all if both strings start with the same letter...

Another idea that comes to mind... would it be faster if I just cycled through all the permutations of the larger string, and just subtract the two strings and store the results in an integer array? Like...

a  b  c d d d d e f    f
a  d  u d d d d e l    f
0 !0 !0 0 0 0 0 0 !0 0

Basically, I'd just check for the longest string of 0's, and since I would keep one string constant, I would know exactly what the largest string is...

But I don't think either of these ways can get the job done in linear time... could anyone help me find an algorithm that performs this task in O(mn) time? Where m and n are the lengths of the two strings?

0
Question by:errang

LVL 31

Expert Comment

0

LVL 86

Expert Comment

These two should be a good starting point:

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring

You will find a nice implementation at http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_subsequence#C++
``````// Compute the longest common subsequence between X and Y
// On return, C will contain the LCS table.
// C[m][n] will contain the length of the longest common subsequence.
template<typename RanIt> size_t **
LCSLength(RanIt X, RanIt Xend, RanIt Y, RanIt Yend)
{
size_t m = std::distance(X, Xend);
size_t n = std::distance(Y, Yend);
size_t **C = Allocate2DArray<size_t>(m+1, n+1);

for (size_t i=0; i<=m; ++i)
C[i][0] = 0;

for (size_t j=0; j<=n; ++j)
C[0][j] = 0;

for (size_t i=0; i<m; ++i)
for (size_t j=0; j<n; ++j)
if (X[i] == Y[j])
C[i+1][j+1] = C[i][j] + 1;
else
C[i+1][j+1] = std::max(C[i+1][j], C[i][j+1]);

return C;
}
``````
0

Author Comment

Hm... is RanIT a datatype? I've never seen it before.
0

Author Comment

Sorry, meant to ask what type of a datatype RanIt is, a brief google search tells me its a "protected object" used in iterator and combination classes.
0

LVL 86

Expert Comment

'RanIT' is short for 'random iterator' - the 'Usage' section helps to understand that:
``````char X[] = "XMJYAUZ";
char Y[] = "MZJAWXU";
size_t **C = LCSLength(X, X+strlen(X), Y, Y+strlen(Y));
``````
0

LVL 31

Accepted Solution

Because the author said in other question,  http://www.experts-exchange.com/Programming/Languages/C_Sharp/Q_26986121.html , that this site http://marknelson.us/1996/08/01/suffix-trees/ has viruses, I included the contents of this link in the question.
0

Author Comment

Hm... so the idea is, I create trees of both strings, and find the biggest "subtree" present in both trees, and then we have the biggest substring?

One more question about this approach though... I know binary trees are quick to search through, but how do you search through a tree, when you don't know what you are searching for?
0

LVL 86

Assisted Solution

Well, see the pages I linked above, they explain that quite well...
0

Author Comment

Ah, alright, thanks =), guess I'll have to read those sites more closely to figure that out.
0

LVL 31

Expert Comment

These videos may provide additional insight.

MIT:
Lecture 15: Dynamic Programming, Longest Common Subsequence
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-15-dynamic-programming-longest-common-subsequence/

India Institute of Technology:
Lecture - 17 Case Study: Searching for Patterns

Lecture - 18 Tries

0

LVL 31

Expert Comment

>> is there a way we can do this in linear time?
"You can find the lengths and starting positions of the longest common substrings of S and T in O(n+m).  Finding them by dynamic programming costs O(nm)."
http://en.wikipedia.org/wiki/Longest_common_substring_problem#Algorithms

I wasn't able to find code online for this LCS/suffix solution. But, if you are interested, I'll try over the weekend to provide you with more references to the suffix implementation.

BTW - references to longest common subsequence do not appear to apply to your OP, as you are asking for the longest common substring. Note that the number of substrings in a string ~ O(n^2); whereas the number of subsequences in a string ~ O(2^n).

The memory requirements in jkr's post is O(mn). But in his link, you will find a C++ version that only requires O(2n) = O(n) memory here. I modified this latter version to bring back to the caller the largest common substring.
0

Author Comment

Ah, sweet, thanks =)
0

LVL 31

Expert Comment

The previous lecture 18 on Tries gives good background on tries, compressed tries, and decent introduction to suffix trees (although some parts were not intelligible - either speaking too fast, or stopped speaking English). But that lecture should help in reading the article posted in the other Question.

Here are slides on Edward McCreight's 1976 algorithm
http://www.math.tau.ac.il/~haimk/seminar00/iddo-McCreight_slides.ppt

I found this video lecture given by Esko Ukkonen, University of Helsinki. His algorithm builds the suffix tree in O(n) time.
http://videolectures.net/aop05_ukkonen_sthmt/

Before considering longest common substring, consider string searching times. KMP finds a string in O(m+n) time, but if n is huge (e.g., complete works of Shakespeare), then O(m+n) is too long especially if there are multiple queries.

Using suffix tree, the search time for each query is O(m) after the tree is built. But the limiting factor is how long does it take to build the suffix tree. It was Esko Ukkonen who succeeded in building the tree in O(n) time, even though the number of substrings is O(n^2).

So, using the Ukkonen algorithm, then the first query to find a matching string takes O(m+n) time, but subsequent queries takes only O(m) times once the suffix tree representing the string of n symbols has been built.

I haven't worked out the fine details for the longest common substring, but consider..
You noticed that when building the suffix tree that there may be identical substrings, in which case you record the starting position of each occurrence. After building the first suffix tree for str1, you could then build suffix tree for str2. In fact, you could build it continuing on the first tree. And when you find an identical substring, you could keep track of its length; and if that length is the current max length, then set max length and the positions in the two input strings.
0

## Featured Post

### Suggested Solutions

UPD maximums on Red Hat 6 96
Which IDE to use to begin C++ training? 5 54
FMX enumerated colours 2 54
c++ syntax question 9 25
Templates For Beginners Or How To Encourage The Compiler To Work For You Introduction This tutorial is targeted at the reader who is, perhaps, familiar with the basics of C++ but would prefer a little slower introduction to the more ad…
Windows programmers of the C/C++ variety, how many of you realise that since Window 9x Microsoft has been lying to you about what constitutes Unicode (http://en.wikipedia.org/wiki/Unicode)? They will have you believe that Unicode requires you to use…
The goal of this video is to provide viewers with basic examples to understand and use conditional statements in the C programming language.
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.