http://www.experts-exchang

Solved

Posted on 2011-05-02

Hey,

I was trying to find the longest common substring from 2 different strings, like...the longest common substring between AAAAAAAAAACCCTCCCCCC and TTTTTTTAAAAACCCCGGGG is AAAAACCC.

The first thing that comes to my mind, is, I take the shorter string, and keep comparing the first letter in the shorter string, to the first letter in the larger string, and if the letters don't match, we can simply remove that letter from the larger string, and try again, but if they do match, we move on to the next string and go on. At any point, if the n+1 letter doesn't match, we keep the string we found, and start fresh...

But... this seems to be a brute force method... is there a way we can do this in linear time?

One idea that comes to mind, is that we take the first letter in string1, and start from the first occurrence of that letter in string2, but a few problems that arise from that approach, is that I may accidentally delete the largest string, or, it might not help at all if both strings start with the same letter...

Another idea that comes to mind... would it be faster if I just cycled through all the permutations of the larger string, and just subtract the two strings and store the results in an integer array? Like...

a b c d d d d e f f

a d u d d d d e l f

0 !0 !0 0 0 0 0 0 !0 0

Basically, I'd just check for the longest string of 0's, and since I would keep one string constant, I would know exactly what the largest string is...

But I don't think either of these ways can get the job done in linear time... could anyone help me find an algorithm that performs this task in O(mn) time? Where m and n are the lengths of the two strings?

Thanks in advance!

I was trying to find the longest common substring from 2 different strings, like...the longest common substring between AAAAAAAAAACCCTCCCCCC and TTTTTTTAAAAACCCCGGGG is AAAAACCC.

The first thing that comes to my mind, is, I take the shorter string, and keep comparing the first letter in the shorter string, to the first letter in the larger string, and if the letters don't match, we can simply remove that letter from the larger string, and try again, but if they do match, we move on to the next string and go on. At any point, if the n+1 letter doesn't match, we keep the string we found, and start fresh...

But... this seems to be a brute force method... is there a way we can do this in linear time?

One idea that comes to mind, is that we take the first letter in string1, and start from the first occurrence of that letter in string2, but a few problems that arise from that approach, is that I may accidentally delete the largest string, or, it might not help at all if both strings start with the same letter...

Another idea that comes to mind... would it be faster if I just cycled through all the permutations of the larger string, and just subtract the two strings and store the results in an integer array? Like...

a b c d d d d e f f

a d u d d d d e l f

0 !0 !0 0 0 0 0 0 !0 0

Basically, I'd just check for the longest string of 0's, and since I would keep one string constant, I would know exactly what the largest string is...

But I don't think either of these ways can get the job done in linear time... could anyone help me find an algorithm that performs this task in O(mn) time? Where m and n are the lengths of the two strings?

Thanks in advance!

13 Comments

http://www.experts-exchang

http://en.wikipedia.org/wi

http://en.wikibooks.org/wi

You will find a nice implementation at http://en.wikibooks.org/wi

```
// Compute the longest common subsequence between X and Y
// On return, C will contain the LCS table.
// C[m][n] will contain the length of the longest common subsequence.
template<typename RanIt> size_t **
LCSLength(RanIt X, RanIt Xend, RanIt Y, RanIt Yend)
{
size_t m = std::distance(X, Xend);
size_t n = std::distance(Y, Yend);
size_t **C = Allocate2DArray<size_t>(m+1, n+1);
for (size_t i=0; i<=m; ++i)
C[i][0] = 0;
for (size_t j=0; j<=n; ++j)
C[0][j] = 0;
for (size_t i=0; i<m; ++i)
for (size_t j=0; j<n; ++j)
if (X[i] == Y[j])
C[i+1][j+1] = C[i][j] + 1;
else
C[i+1][j+1] = std::max(C[i+1][j], C[i][j+1]);
return C;
}
```

```
char X[] = "XMJYAUZ";
char Y[] = "MZJAWXU";
size_t **C = LCSLength(X, X+strlen(X), Y, Y+strlen(Y));
```

One more question about this approach though... I know binary trees are quick to search through, but how do you search through a tree, when you don't know what you are searching for?

Lecture 15: Dynamic Programming, Longest Common Subsequence

http://ocw.mit.edu/courses

Lecture - 17 Case Study: Searching for Patterns

http://www.youtube.com/wat

Lecture - 18 Tries

http://www.youtube.com/wat

http://en.wikipedia.org/wi

I wasn't able to find code online for this LCS/suffix solution. But, if you are interested, I'll try over the weekend to provide you with more references to the suffix implementation.

BTW - references to longest common subsequence do not appear to apply to your OP, as you are asking for the longest common substring. Note that the number of substrings in a string ~ O(n^2); whereas the number of subsequences in a string ~ O(2^n).

The memory requirements in jkr's post is O(mn). But in his link, you will find a C++ version that only requires O(2n) = O(n) memory here. I modified this latter version to bring back to the caller the largest common substring.

Here are slides on Edward McCreight's 1976 algorithm

http://www.math.tau.ac.il/

I found this video lecture given by Esko Ukkonen, University of Helsinki. His algorithm builds the suffix tree in O(n) time.

http://videolectures.net/a

Before considering longest common substring, consider string searching times. KMP finds a string in O(m+n) time, but if n is huge (e.g., complete works of Shakespeare), then O(m+n) is too long especially if there are multiple queries.

Using suffix tree, the search time for each query is O(m) after the tree is built. But the limiting factor is how long does it take to build the suffix tree. It was Esko Ukkonen who succeeded in building the tree in O(n) time, even though the number of substrings is O(n^2).

So, using the Ukkonen algorithm, then the first query to find a matching string takes O(m+n) time, but subsequent queries takes only O(m) times once the suffix tree representing the string of n symbols has been built.

I haven't worked out the fine details for the

You noticed that when building the suffix tree that there may be identical substrings, in which case you record the starting position of each occurrence. After building the first suffix tree for str1, you could then build suffix tree for str2. In fact, you could build it continuing on the first tree. And when you find an identical substring, you could keep track of its length; and if that length is the current max length, then set max length and the positions in the two input strings.

Title | # Comments | Views | Activity |
---|---|---|---|

UPD maximums on Red Hat | 6 | 96 | |

Which IDE to use to begin C++ training? | 5 | 54 | |

FMX enumerated colours | 2 | 54 | |

c++ syntax question | 9 | 25 |

Join the community of 500,000 technology professionals and ask your questions.

Connect with top rated Experts

**23** Experts available now in Live!