Solved

Posted on 2008-11-10

I need algorithm for this problem:

For given two strings of length m and n respectively, find maximum substring of both strings (maximal length), where character place in substring is not important.

example:

string 1 = abcde

string 2 = ercdbth

resulting string = bcd since it is contained in both strings (aBCDe, erCDBth - uppercase letters show resulting substring).

I would prefer algorithm that makes use of graph theory.

For given two strings of length m and n respectively, find maximum substring of both strings (maximal length), where character place in substring is not important.

example:

string 1 = abcde

string 2 = ercdbth

resulting string = bcd since it is contained in both strings (aBCDe, erCDBth - uppercase letters show resulting substring).

I would prefer algorithm that makes use of graph theory.

23 Comments

String 1: qwertp

String 2: serwghtp

Result: wer (qWERt, sERWght)

String 1: qwert

String 2: qawsedrftg

Result: any character from string 1, because maximal length of substring is 1.

Given two strings, S of length m and T of length n, find the longest strings which are a substrings of both S and T.

Example: The string ana is equal to substrings (and subsequences) of banana at two different offsets:

banana

| | | | |

ana| |

| | |

ana

"

This is definition from wikipedia article you have submitted.

What I need is longest common substring where the order of elements don't have to be preserved, just need to be continues. I hope it is clearer now.

I don't immediately see a way to do it other than the brute-force O(n^2 * log(n)) method of sorting and comparing all the multisets of substrings. There very well could be a mapping onto a well-known graph theory problem, but I don't see it yet.

the example below, demo the simplicity of it, the problem is efficiency in very long strings.

1. take shorter str. len(shorter) = short_len

2. set match_len = 1

3. foreach substr of shorter, length match_len (i.e. short_orig_str = substr( shorter , i , match_len ); [i = 0 .. (short_len - match_len)] )

4. foreach all permutations of substr of longer length match_len look for equality

5. increment match_len until reached short_len

algo. length:

short_len*long_len*1! + (short_len-1)*(long_len-1)

example:

shorter = abcde , short_len = 5

longer = ercdbth , long_len = 7

for match_len = 3

short_orig_str = {abc , bcd , cde}

long_str = {erc , rcd , cdb , dbt , bth}

long_permute_str = { {erc,ecr,cer,cre,rec,rce} , ... , {bth,bht,hbt,htb,tbh,thb} }

tal

Step1. find minimum lengh string

Step2. make all PnC string

Step3. take each string and find maximum mathcing substring

then divide the longer string into pieces between those delimiters.

Since the rule is the substrings must be contiguous, any non-member character will break a contiguous string.

for example...

String 1: qwertp

String 2: serwghtp

s,g,h not in String1

that leaves "erw" and "tp"

since we want the largest string, start with erw, and apply the same rule back to string1.

qtp are not in erw, so... that leaves.

String1a = wer

String2a = erw search those to see if they are the same

In this example, we would be done, but lets say they don't match

then go to string2b = tp, strip string1 of non-member charcters and that leaves string1b= tp

compare tp to tp and we have match.

This back-and-forth stripping will lead to relatively small strings of nothing but common charactrers in relatively short order and, since permutation of characters is not important, once you have stripped the strings as far as they will go, what is left will be your common substrings. Start with the largest and work down so you find the longest string first. Rather than finding a bunch of small strings and building up.

Here's my brute-force algorithm, running on strings s1 and s2. It takes O(n^2) space, where n is the longer of the two strings, and something like O(n^2 * log(n)) time. If the number of distinct symbols in the alphabet is small compared to the string length, it would make sense to use a more compact representation for the substring multiset, like an array of integers.

```
for len = max(length(s1), length(s2))-1 downto 1 {
create two arrays of strings, one of all substrings of s1 with length len, and the other the same for s2.
sort the characters in each individual substring, i.e. turn "band" into "abdn".
sort each array alphabetically
Look for a common entry in both arrays.
if you found a common entry:
search s1 and s2 for a substring that matches the pattern. This is the result.
}
no common substring was found (there must be no common characters between the two strings)
```

I have abcdefg and gfedcba they are identical in one pass because they have no unshared characters.

Can you provide an example that shows my algorithm failing?

Maybe I didn't explain it well enough, or maybe I just don't see the hole in it that you do.

Adnan, where did you find this problem, and what makes you think that graph theory will help? If it was in the midst of some info about a particular area of graph theory, then that would probably give a good hint about where to look for an efficient algorithm.

The syntax should be fairly readable as pseudo-code for other languages.

varchar2(32767) is just a string (up to 32K long)

DBMS_SQL.varchar2a is equivalent to an array of strings

Here are my timings (in seconds) to find the substrings of 100 randomly generated pairs of strings of ranging in length 1 to 30, with each test repeated 10 times.

You'll see, even with some shortcuts, it still ramps up very quickly in execution time.

For example, it took 0.172 seconds to search 100 pairs of random strings 2-characters long, 10 times

it took 213 seconds (3.5 minutes) to search 100 pairs of random strings 64-character long 10 times

02 .172

04 .184

08 .254

16 .642

32 1.943

64 213.375

```
PROCEDURE find_shared_substrings(
p_a IN varchar2,
p_b IN varchar2,
p_a_list IN OUT DBMS_SQL.varchar2a,
p_b_list IN OUT DBMS_SQL.varchar2a
)
IS
v_not_a varchar2(32767);
v_not_b varchar2(32767);
v_a varchar2(32767) := NULL;
v_b varchar2(32767) := NULL;
BEGIN
v_not_a := p_b;
v_not_b := p_a;
FOR i IN 1 .. LENGTH(p_b)
LOOP
v_not_b := REPLACE(v_not_b, SUBSTR(p_b, i, 1), NULL);
END LOOP;
FOR i IN 1 .. LENGTH(p_a)
LOOP
v_not_a := REPLACE(v_not_a, SUBSTR(p_a, i, 1), NULL);
END LOOP;
FOR i IN 1 .. LENGTH(p_a)
LOOP
IF NVL(INSTR(v_not_b, SUBSTR(p_a, i, 1)), 0) = 0
THEN
v_a := v_a || SUBSTR(p_a, i, 1);
ELSE
IF v_a IS NOT NULL
THEN
p_a_list(p_a_list.COUNT + 1) := v_a;
v_a := NULL;
END IF;
END IF;
END LOOP;
IF v_a IS NOT NULL
THEN
p_a_list(p_a_list.COUNT + 1) := v_a;
END IF;
FOR i IN 1 .. LENGTH(p_b)
LOOP
IF NVL(INSTR(v_not_a, SUBSTR(p_b, i, 1)), 0) = 0
THEN
v_b := v_b || SUBSTR(p_b, i, 1);
ELSE
IF v_b IS NOT NULL
THEN
p_b_list(p_b_list.COUNT + 1) := v_b;
v_b := NULL;
END IF;
END IF;
END LOOP;
IF v_b IS NOT NULL
THEN
p_b_list(p_b_list.COUNT + 1) := v_b;
END IF;
END find_shared_substrings;
FUNCTION max_common_substring(p_a IN varchar2, p_b IN varchar2)
RETURN varchar2
DETERMINISTIC
IS
v_alist DBMS_SQL.varchar2a;
v_blist DBMS_SQL.varchar2a;
v_temp varchar2(32767) := NULL;
v_max varchar2(32767) := NULL;
v_shortstr varchar2(32767);
v_longstr varchar2(32767);
x integer;
y integer;
z integer;
BEGIN
/*
Look for end cases to stop the recursion.
*/
IF p_a IS NULL OR p_b IS NULL
THEN
RETURN NULL;
ELSIF INSTR(p_a, p_b) > 0
THEN
RETURN p_b;
ELSIF INSTR(p_b, p_a) > 0
THEN
RETURN p_a;
ELSIF LENGTH(p_a) = 1 OR LENGTH(p_a) = 1
THEN
RETURN NULL;
ELSIF LENGTH(p_a) = LENGTH(p_b) AND sort_string(p_a) = sort_string(p_b)
THEN
RETURN p_a;
END IF;
find_shared_substrings(p_a, p_b, v_alist, v_blist);
FOR i IN 1 .. v_alist.COUNT
LOOP
FOR j IN 1 .. v_blist.COUNT
LOOP
-- only need to recurse into smaller substrings
-- if there is something to do. If the existing substrings
-- are equal to what we started with then we've divided as far as we can.
IF v_alist(i) != p_a OR v_blist(j) != p_b
THEN
v_temp := max_common_substring(v_alist(i), v_blist(j));
ELSE -- the substrings have been as reduced as they can be.
IF LENGTH(v_alist(i)) > LENGTH(v_blist(j))
THEN
v_shortstr := v_blist(j);
v_longstr := v_alist(i);
ELSE
v_shortstr := v_alist(i);
v_longstr := v_blist(j);
END IF;
x := LENGTH(v_shortstr);
v_temp := NULL;
WHILE x >= 1 AND v_temp IS NULL
LOOP
y := 1;
WHILE y <= LENGTH(v_shortstr) - x + 1 AND v_temp IS NULL
LOOP
z := 1;
WHILE z <= LENGTH(v_longstr) - x + 1 AND v_temp IS NULL
LOOP
IF sort_string(SUBSTR(v_shortstr, y, x)) =
sort_string(SUBSTR(v_longstr, z, x))
THEN
v_temp := SUBSTR(v_shortstr, y, x);
ELSE
z := z + 1;
END IF;
END LOOP;
y := y + 1;
END LOOP;
x := x - 1;
END LOOP;
END IF;
IF v_max IS NULL OR LENGTH(v_temp) > LENGTH(v_max)
THEN
v_max := v_temp;
END IF;
END LOOP;
END LOOP;
RETURN v_max;
END max_common_substring;
```

$\=$/;

print ms('qwertp','serwghtp');

sub ss{

return join'',map{chr($_) x $_[0]->[$_]}0..$#{$_[0]};

}

sub ms{

my ($string1,$string2)=@_;

($string1,$string2) = ($string2,$string1) if (length $string2) < (length $string1);

my @string1=unpack"C*",$strin

my @string2=unpack"C*",$strin

my @result;

my@c1=([]);

$c1[0][$_]++ for @string1;

$c1{pack"w*",@{$c1[0]}}=0;

my @c2=();

for $o ( 0..$#string2-$#string1 ){

@{$c2[$o]}=();

$c2[$o][$_]++ for @string2[$o..$o+$#string1]

}

for $l ( reverse 1..length($string1) ){

%c1=();

$c1{pack"w*",@{$_}}++ for @c1;

$c1{pack"w*",@{$_}} && return ss $_ for @c2;

push @c1,[@{$c1[-1]}];

$c1[-1][$string1[-$l]]--;

$c1[$_][$string1[$_+$l-1]]

push @c2,[@{$c2[-1]}];

$c2[-1][$string2[-$l]]--;

$c2[$_][$string2[$_+$l-1]]

}

}

By clicking you are agreeing to Experts Exchange's Terms of Use.

Title | # Comments | Views | Activity |
---|---|---|---|

Adjust number proportionately | 12 | 59 | |

coolers to wear for eyes in workplace (8-10 hours of computer starring!) | 10 | 81 | |

fizzArray3 challenge | 1 | 47 | |

Core java. Create the class with method that do logic for all elements of collection. | 1 | 19 |

Join the community of 500,000 technology professionals and ask your questions.

Connect with top rated Experts

**8** Experts available now in Live!