Link to home
Start Free TrialLog in
Avatar of errang
errangFlag for Afghanistan

asked on

How to find the longest common substing from 2 strings?

Hey,

        I was trying to find the longest common substring from 2 different strings, like...the longest common substring between AAAAAAAAAACCCTCCCCCC and TTTTTTTAAAAACCCCGGGG is AAAAACCC.

The first thing that comes to my mind, is, I take the shorter string, and keep comparing the first letter in the shorter string, to the first letter in the larger string, and if the letters don't match, we can simply remove that letter from the larger string, and try again, but if they do match, we move on to the next string and go on.  At any point, if the n+1 letter doesn't match, we keep the string we found, and start fresh...

But... this seems to be a brute force method... is there a way we can do this in linear time?

One idea that comes to mind, is that we take the first letter in string1, and start from the first occurrence of that letter in string2, but a few problems that arise from that approach, is that I may accidentally delete the largest string, or, it might not help at all if both strings start with the same letter...

Another idea that comes to mind... would it be faster if I just cycled through all the permutations of the larger string, and just subtract the two strings and store the results in an integer array? Like...

a  b  c d d d d e f    f
a  d  u d d d d e l    f
0 !0 !0 0 0 0 0 0 !0 0

Basically, I'd just check for the longest string of 0's, and since I would keep one string constant, I would know exactly what the largest string is...

But I don't think either of these ways can get the job done in linear time... could anyone help me find an algorithm that performs this task in O(mn) time? Where m and n are the lengths of the two strings?

Thanks in advance!
Avatar of phoffric
phoffric

These two should be a good starting point:

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring

You will find a nice implementation at http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_subsequence#C++
// Compute the longest common subsequence between X and Y
// On return, C will contain the LCS table.
// C[m][n] will contain the length of the longest common subsequence.
template<typename RanIt> size_t **   
LCSLength(RanIt X, RanIt Xend, RanIt Y, RanIt Yend)
{
        size_t m = std::distance(X, Xend);
        size_t n = std::distance(Y, Yend);
        size_t **C = Allocate2DArray<size_t>(m+1, n+1);        
 
        for (size_t i=0; i<=m; ++i)
          C[i][0] = 0;
 
        for (size_t j=0; j<=n; ++j)
          C[0][j] = 0;
 
        for (size_t i=0; i<m; ++i)
          for (size_t j=0; j<n; ++j)
            if (X[i] == Y[j])
              C[i+1][j+1] = C[i][j] + 1;
            else
              C[i+1][j+1] = std::max(C[i+1][j], C[i][j+1]);
 
        return C;
}

Open in new window

Avatar of errang

ASKER

Hm... is RanIT a datatype? I've never seen it before.
Avatar of errang

ASKER

Sorry, meant to ask what type of a datatype RanIt is, a brief google search tells me its a "protected object" used in iterator and combination classes.
'RanIT' is short for 'random iterator' - the 'Usage' section helps to understand that:
char X[] = "XMJYAUZ";
        char Y[] = "MZJAWXU";
        size_t **C = LCSLength(X, X+strlen(X), Y, Y+strlen(Y));

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of phoffric
phoffric

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of errang

ASKER

Hm... so the idea is, I create trees of both strings, and find the biggest "subtree" present in both trees, and then we have the biggest substring?

One more question about this approach though... I know binary trees are quick to search through, but how do you search through a tree, when you don't know what you are searching for?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of errang

ASKER

Ah, alright, thanks =), guess I'll have to read those sites more closely to figure that out.
These videos may provide additional insight.

MIT:
Lecture 15: Dynamic Programming, Longest Common Subsequence
    http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-15-dynamic-programming-longest-common-subsequence/

India Institute of Technology:
Lecture - 17 Case Study: Searching for Patterns
    http://www.youtube.com/watch?v=Zj_er99KMb8&feature=related

Lecture - 18 Tries
    http://www.youtube.com/watch?v=uhAUk63tLRM&p=BF3763AF2E1C572F

>> is there a way we can do this in linear time?
     "You can find the lengths and starting positions of the longest common substrings of S and T in O(n+m).  Finding them by dynamic programming costs O(nm)."
    http://en.wikipedia.org/wiki/Longest_common_substring_problem#Algorithms

I wasn't able to find code online for this LCS/suffix solution. But, if you are interested, I'll try over the weekend to provide you with more references to the suffix implementation.

BTW - references to longest common subsequence do not appear to apply to your OP, as you are asking for the longest common substring. Note that the number of substrings in a string ~ O(n^2); whereas the number of subsequences in a string ~ O(2^n).

The memory requirements in jkr's post is O(mn). But in his link, you will find a C++ version that only requires O(2n) = O(n) memory here. I modified this latter version to bring back to the caller the largest common substring.
Avatar of errang

ASKER

Ah, sweet, thanks =)
The previous lecture 18 on Tries gives good background on tries, compressed tries, and decent introduction to suffix trees (although some parts were not intelligible - either speaking too fast, or stopped speaking English). But that lecture should help in reading the article posted in the other Question.

Here are slides on Edward McCreight's 1976 algorithm
     http://www.math.tau.ac.il/~haimk/seminar00/iddo-McCreight_slides.ppt

I found this video lecture given by Esko Ukkonen, University of Helsinki. His algorithm builds the suffix tree in O(n) time.
    http://videolectures.net/aop05_ukkonen_sthmt/

Before considering longest common substring, consider string searching times. KMP finds a string in O(m+n) time, but if n is huge (e.g., complete works of Shakespeare), then O(m+n) is too long especially if there are multiple queries.

Using suffix tree, the search time for each query is O(m) after the tree is built. But the limiting factor is how long does it take to build the suffix tree. It was Esko Ukkonen who succeeded in building the tree in O(n) time, even though the number of substrings is O(n^2).

So, using the Ukkonen algorithm, then the first query to find a matching string takes O(m+n) time, but subsequent queries takes only O(m) times once the suffix tree representing the string of n symbols has been built.

I haven't worked out the fine details for the longest common substring, but consider..
You noticed that when building the suffix tree that there may be identical substrings, in which case you record the starting position of each occurrence. After building the first suffix tree for str1, you could then build suffix tree for str2. In fact, you could build it continuing on the first tree. And when you find an identical substring, you could keep track of its length; and if that length is the current max length, then set max length and the positions in the two input strings.