This course will introduce you to C++ 11 and teach you about syntax fundamentals.

I need algorithm for this problem:

For given two strings of length m and n respectively, find maximum substring of both strings (maximal length), where character place in substring is not important.

example:

string 1 = abcde

string 2 = ercdbth

resulting string = bcd since it is contained in both strings (aBCDe, erCDBth - uppercase letters show resulting substring).

I would prefer algorithm that makes use of graph theory.

For given two strings of length m and n respectively, find maximum substring of both strings (maximal length), where character place in substring is not important.

example:

string 1 = abcde

string 2 = ercdbth

resulting string = bcd since it is contained in both strings (aBCDe, erCDBth - uppercase letters show resulting substring).

I would prefer algorithm that makes use of graph theory.

String 1: qwertp

String 2: serwghtp

Result: wer (qWERt, sERWght)

String 1: qwert

String 2: qawsedrftg

Result: any character from string 1, because maximal length of substring is 1.

Given two strings, S of length m and T of length n, find the longest strings which are a substrings of both S and T.

Example: The string ana is equal to substrings (and subsequences) of banana at two different offsets:

banana

| | | | |

ana| |

| | |

ana

"

This is definition from wikipedia article you have submitted.

What I need is longest common substring where the order of elements don't have to be preserved, just need to be continues. I hope it is clearer now.

I don't immediately see a way to do it other than the brute-force O(n^2 * log(n)) method of sorting and comparing all the multisets of substrings. There very well could be a mapping onto a well-known graph theory problem, but I don't see it yet.

the example below, demo the simplicity of it, the problem is efficiency in very long strings.

1. take shorter str. len(shorter) = short_len

2. set match_len = 1

3. foreach substr of shorter, length match_len (i.e. short_orig_str = substr( shorter , i , match_len ); [i = 0 .. (short_len - match_len)] )

4. foreach all permutations of substr of longer length match_len look for equality

5. increment match_len until reached short_len

algo. length:

short_len*long_len*1! + (short_len-1)*(long_len-1)

example:

shorter = abcde , short_len = 5

longer = ercdbth , long_len = 7

for match_len = 3

short_orig_str = {abc , bcd , cde}

long_str = {erc , rcd , cdb , dbt , bth}

long_permute_str = { {erc,ecr,cer,cre,rec,rce} , ... , {bth,bht,hbt,htb,tbh,thb} }

tal

Step1. find minimum lengh string

Step2. make all PnC string

Step3. take each string and find maximum mathcing substring

then divide the longer string into pieces between those delimiters.

Since the rule is the substrings must be contiguous, any non-member character will break a contiguous string.

for example...

String 1: qwertp

String 2: serwghtp

s,g,h not in String1

that leaves "erw" and "tp"

since we want the largest string, start with erw, and apply the same rule back to string1.

qtp are not in erw, so... that leaves.

String1a = wer

String2a = erw search those to see if they are the same

In this example, we would be done, but lets say they don't match

then go to string2b = tp, strip string1 of non-member charcters and that leaves string1b= tp

compare tp to tp and we have match.

This back-and-forth stripping will lead to relatively small strings of nothing but common charactrers in relatively short order and, since permutation of characters is not important, once you have stripped the strings as far as they will go, what is left will be your common substrings. Start with the largest and work down so you find the longest string first. Rather than finding a bunch of small strings and building up.

Here's my brute-force algorithm, running on strings s1 and s2. It takes O(n^2) space, where n is the longer of the two strings, and something like O(n^2 * log(n)) time. If the number of distinct symbols in the alphabet is small compared to the string length, it would make sense to use a more compact representation for the substring multiset, like an array of integers.

```
for len = max(length(s1), length(s2))-1 downto 1 {
create two arrays of strings, one of all substrings of s1 with length len, and the other the same for s2.
sort the characters in each individual substring, i.e. turn "band" into "abdn".
sort each array alphabetically
Look for a common entry in both arrays.
if you found a common entry:
search s1 and s2 for a substring that matches the pattern. This is the result.
}
no common substring was found (there must be no common characters between the two strings)
```

I have abcdefg and gfedcba they are identical in one pass because they have no unshared characters.

Can you provide an example that shows my algorithm failing?

Maybe I didn't explain it well enough, or maybe I just don't see the hole in it that you do.

Adnan, where did you find this problem, and what makes you think that graph theory will help? If it was in the midst of some info about a particular area of graph theory, then that would probably give a good hint about where to look for an efficient algorithm.

$\=$/;

print ms('qwertp','serwghtp');

sub ss{

return join'',map{chr($_) x $_[0]->[$_]}0..$#{$_[0]};

}

sub ms{

my ($string1,$string2)=@_;

($string1,$string2) = ($string2,$string1) if (length $string2) < (length $string1);

my @string1=unpack"C*",$strin

my @string2=unpack"C*",$strin

my @result;

my@c1=([]);

$c1[0][$_]++ for @string1;

$c1{pack"w*",@{$c1[0]}}=0;

my @c2=();

for $o ( 0..$#string2-$#string1 ){

@{$c2[$o]}=();

$c2[$o][$_]++ for @string2[$o..$o+$#string1]

}

for $l ( reverse 1..length($string1) ){

%c1=();

$c1{pack"w*",@{$_}}++ for @c1;

$c1{pack"w*",@{$_}} && return ss $_ for @c2;

push @c1,[@{$c1[-1]}];

$c1[-1][$string1[-$l]]--;

$c1[$_][$string1[$_+$l-1]]

push @c2,[@{$c2[-1]}];

$c2[-1][$string2[-$l]]--;

$c2[$_][$string2[$_+$l-1]]

}

}

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.

The syntax should be fairly readable as pseudo-code for other languages.

varchar2(32767) is just a string (up to 32K long)

DBMS_SQL.varchar2a is equivalent to an array of strings

Here are my timings (in seconds) to find the substrings of 100 randomly generated pairs of strings of ranging in length 1 to 30, with each test repeated 10 times.

You'll see, even with some shortcuts, it still ramps up very quickly in execution time.

For example, it took 0.172 seconds to search 100 pairs of random strings 2-characters long, 10 times

it took 213 seconds (3.5 minutes) to search 100 pairs of random strings 64-character long 10 times

02 .172

04 .184

08 .254

16 .642

32 1.943

64 213.375

Open in new window