[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1410
  • Last Modified:

Find maximum substring

I need algorithm for this problem:

For given two strings of length m and n respectively, find maximum substring of both strings (maximal length), where character place in substring is not important.

example:
string 1 = abcde
string 2 = ercdbth

resulting string = bcd since it is contained in both strings (aBCDe, erCDBth - uppercase letters show resulting substring).

I would prefer algorithm that makes use of graph theory.
0
AdnanKurtovic
Asked:
AdnanKurtovic
  • 8
  • 6
  • 4
  • +3
4 Solutions
 
sdstuberCommented:
just curious,  why is the algorithm method important?
0
 
AdnanKurtovicAuthor Commented:
It is not important, just preferred :)
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
AdnanKurtovicAuthor Commented:
Similar, but this is not my problem. Please read example. Here is another one:
String 1: qwertp
String 2: serwghtp
Result: wer (qWERt, sERWght)

String 1: qwert
String 2: qawsedrftg
Result: any character from string 1, because maximal length of substring is 1.


0
 
ozoCommented:
How is that different from
Given two strings, S of length m and T of length n, find the longest strings which are a substrings of both S and T.
0
 
AdnanKurtovicAuthor Commented:
"A subsequence, substring, prefix or suffix of a string is a subset of the symbols in a string, where the order of the elements is preserved.

Example: The string ana is equal to substrings (and subsequences) of banana at two different offsets:
banana
   | | | | |
  ana| |
         | | |
        ana
"

This is definition from wikipedia article you have submitted.

What I need is longest common substring where the order of elements don't have to be preserved, just need to be continues. I hope it is clearer now.
0
 
NovaDenizenCommented:
So, the problem is to take two strings A and B, and find the longest respective substrings of A and B that are permutations of each other.

I don't immediately see a way to do it other than the brute-force O(n^2 * log(n)) method of sorting and comparing all the multisets of substrings.  There very well could be a mapping onto a well-known graph theory problem, but I don't see it yet.
0
 
TalmashCommented:
my algo. is not optimal, and not graph based:
the example below, demo the simplicity of it, the problem is efficiency in very long strings.

1. take shorter str. len(shorter) = short_len
2. set match_len = 1
3. foreach substr of shorter, length match_len (i.e. short_orig_str = substr( shorter , i , match_len ); [i = 0 .. (short_len - match_len)] )
4. foreach all permutations of substr of longer length match_len look for equality
5. increment match_len until reached short_len

algo. length:

short_len*long_len*1! + (short_len-1)*(long_len-1)*2! + ... + 1*(long_len-short_len)*(short_len)!

example:
shorter = abcde , short_len = 5
longer = ercdbth , long_len = 7

for match_len = 3
short_orig_str = {abc , bcd , cde}
long_str = {erc , rcd , cdb , dbt , bth}
long_permute_str = { {erc,ecr,cer,cre,rec,rce} , ... , {bth,bht,hbt,htb,tbh,thb} }

tal
0
 
jindalankushCommented:
last algo is also good, but reduce some time follow this steps
Step1.  find minimum lengh string
Step2.  make all PnC string
Step3.  take each string and find maximum mathcing substring
0
 
sdstuberCommented:
another help,  find each character in longer string that is not in shorter string.
then divide the longer string into pieces between those delimiters.
Since the rule is the substrings must be contiguous, any non-member character will break a contiguous string.

for example...
String 1: qwertp
String 2: serwghtp

s,g,h not in String1

that leaves  "erw" and "tp"

since we want the largest string,  start with erw, and apply the same rule back to string1.

qtp are not in erw,  so... that leaves.

String1a =  wer
String2a =  erw  search those to see if they are the same

In this example, we would be done, but lets say they don't match

then go to string2b = tp,  strip string1 of non-member charcters and that leaves string1b= tp

compare tp to tp and we have match.

This back-and-forth stripping will lead to relatively small strings of nothing but common charactrers in relatively short order and, since permutation of characters is not important, once you have stripped the strings as far as they will go, what is left will be your common substrings.  Start with the largest and work down so you find the longest string first.  Rather than finding a bunch of small strings and building up.



0
 
NovaDenizenCommented:
It would be nice if all the pairs of strings had unshared characters, but you really can't rely on that assumption.  Also, any algorithm that requires generating all permutations of a substring will take a ludicrously long time to execute.

Here's my brute-force algorithm, running on strings s1 and s2.  It takes O(n^2) space, where n is the longer of the two strings, and something like O(n^2 * log(n)) time.  If the number of distinct symbols in the alphabet is small compared to the string length, it would make sense to use a more compact representation for the substring multiset, like an array of integers.

   
for len = max(length(s1), length(s2))-1 downto 1 {
    create two arrays of strings, one of all substrings of s1 with length len, and the other the same for s2.
    sort the characters in each individual substring, i.e. turn "band" into "abdn".
    sort each array alphabetically
    Look for a common entry in both arrays.
    if you found a common entry:
        search s1 and s2 for a substring that matches the pattern.  This is the result.
}
no common substring was found (there must be no common characters between the two strings)
 
        
    

Open in new window

0
 
sdstuberCommented:
that's the nice thing about splitting on unshared characters.  You don't have to determine any permutations at all.

I have abcdefg  and gfedcba  they are identical in one pass because they have no unshared characters.  

0
 
NovaDenizenCommented:
I think you're assuming that no characters are repeated.
0
 
sdstuberCommented:
I'm afraid I don't understand your counter argument.

Can you provide an example that shows my algorithm failing?

Maybe I didn't explain it well enough, or maybe I just don't see the hole in it that you do.
0
 
AdnanKurtovicAuthor Commented:
Characters can be repeated.
0
 
AdnanKurtovicAuthor Commented:
I am testing sdstuber's algo and I like it this far. NovaDenizen's algo is brute force like he stated, so I can not consider it as a solution, but I will test it as well and if runing time is reasonable ... :)
0
 
sdstuberCommented:
oh, I get it now!

aaaa  and aaaaaaaaaaaaaaaaaaaa would be a false positive  ok, I'll work on that
0
 
NovaDenizenCommented:
More like "aabbb" and "aaabb" would be a false 5-character match, but yes.

Adnan, where did you find this problem, and what makes you think that graph theory will help?  If it was in the midst of some info about a particular area of graph theory, then that would probably give a good hint about where to look for an efficient algorithm.
0
 
sdstuberCommented:
ok, I've already got the duplicate characters fixed in my code, but I've got another problem somewhere in my string splitting.  I'll get that worked out sometime tonight when I get a few minutes and I'll post the code.  I'm mostly an oracle db guy, so it'll be in pl/sql but it's easy to ready.
0
 
sdstuberCommented:
Below is the pl/sql code I used to solve this...
The syntax should be fairly readable as pseudo-code for other languages.
varchar2(32767) is just a string (up to 32K long)
DBMS_SQL.varchar2a  is equivalent to an array of strings



Here are my timings (in seconds) to find the substrings of 100 randomly generated pairs of strings of ranging in length 1 to 30, with each test repeated 10 times.
You'll see, even with some shortcuts, it still ramps up very quickly in execution time.


For example,  it took 0.172 seconds to search 100 pairs of random strings 2-characters long, 10 times
it took 213 seconds (3.5 minutes) to search 100 pairs of random strings 64-character long 10 times

02  .172
04  .184
08  .254
16  .642
32  1.943
64  213.375
   PROCEDURE find_shared_substrings(
        p_a        IN     varchar2,
        p_b        IN     varchar2,
        p_a_list   IN OUT DBMS_SQL.varchar2a,
        p_b_list   IN OUT DBMS_SQL.varchar2a
    )
    IS
        v_not_a   varchar2(32767);
        v_not_b   varchar2(32767);
        v_a       varchar2(32767) := NULL;
        v_b       varchar2(32767) := NULL;
    BEGIN
        v_not_a   := p_b;
        v_not_b   := p_a;
 
        FOR i IN 1 .. LENGTH(p_b)
        LOOP
            v_not_b   := REPLACE(v_not_b, SUBSTR(p_b, i, 1), NULL);
        END LOOP;
 
        FOR i IN 1 .. LENGTH(p_a)
        LOOP
            v_not_a   := REPLACE(v_not_a, SUBSTR(p_a, i, 1), NULL);
        END LOOP;
 
        FOR i IN 1 .. LENGTH(p_a)
        LOOP
            IF NVL(INSTR(v_not_b, SUBSTR(p_a, i, 1)), 0) = 0
            THEN
                v_a   := v_a || SUBSTR(p_a, i, 1);
            ELSE
                IF v_a IS NOT NULL
                THEN
                    p_a_list(p_a_list.COUNT + 1)   := v_a;
                    v_a                            := NULL;
                END IF;
            END IF;
        END LOOP;
 
        IF v_a IS NOT NULL
        THEN
            p_a_list(p_a_list.COUNT + 1)   := v_a;
        END IF;
 
        FOR i IN 1 .. LENGTH(p_b)
        LOOP
            IF NVL(INSTR(v_not_a, SUBSTR(p_b, i, 1)), 0) = 0
            THEN
                v_b   := v_b || SUBSTR(p_b, i, 1);
            ELSE
                IF v_b IS NOT NULL
                THEN
                    p_b_list(p_b_list.COUNT + 1)   := v_b;
                    v_b                            := NULL;
                END IF;
            END IF;
        END LOOP;
 
        IF v_b IS NOT NULL
        THEN
            p_b_list(p_b_list.COUNT + 1)   := v_b;
        END IF;
    END find_shared_substrings;
 
    FUNCTION max_common_substring(p_a IN varchar2, p_b IN varchar2)
        RETURN varchar2
        DETERMINISTIC
    IS
        v_alist      DBMS_SQL.varchar2a;
        v_blist      DBMS_SQL.varchar2a;
        v_temp       varchar2(32767) := NULL;
        v_max        varchar2(32767) := NULL;
 
        v_shortstr   varchar2(32767);
        v_longstr    varchar2(32767);
        x            integer;
        y            integer;
        z            integer;
    BEGIN
        /*
          Look for end cases to stop the recursion.
        */
        IF p_a IS NULL OR p_b IS NULL
        THEN
            RETURN NULL;
        ELSIF INSTR(p_a, p_b) > 0
        THEN
            RETURN p_b;
        ELSIF INSTR(p_b, p_a) > 0
        THEN
            RETURN p_a;
        ELSIF LENGTH(p_a) = 1 OR LENGTH(p_a) = 1
        THEN
            RETURN NULL;
        ELSIF LENGTH(p_a) = LENGTH(p_b) AND sort_string(p_a) = sort_string(p_b)
        THEN
            RETURN p_a;
        END IF;
 
        find_shared_substrings(p_a, p_b, v_alist, v_blist);
 
        FOR i IN 1 .. v_alist.COUNT
        LOOP
            FOR j IN 1 .. v_blist.COUNT
            LOOP
                -- only need to recurse into smaller substrings
                -- if there is something to do.  If the existing substrings
                -- are equal to what we started with then we've divided as far as we can.
                IF v_alist(i) != p_a OR v_blist(j) != p_b
                THEN
                    v_temp   := max_common_substring(v_alist(i), v_blist(j));
                ELSE                          -- the substrings have been as reduced as they can be.
                    IF LENGTH(v_alist(i)) > LENGTH(v_blist(j))
                    THEN
                        v_shortstr   := v_blist(j);
                        v_longstr    := v_alist(i);
                    ELSE
                        v_shortstr   := v_alist(i);
                        v_longstr    := v_blist(j);
                    END IF;
 
                    x        := LENGTH(v_shortstr);
 
                    v_temp   := NULL;
 
                    WHILE x >= 1 AND v_temp IS NULL
                    LOOP
                        y   := 1;
 
                        WHILE y <= LENGTH(v_shortstr) - x + 1 AND v_temp IS NULL
                        LOOP
                            z   := 1;
 
                            WHILE z <= LENGTH(v_longstr) - x + 1 AND v_temp IS NULL
                            LOOP
                                IF sort_string(SUBSTR(v_shortstr, y, x)) =
                                       sort_string(SUBSTR(v_longstr, z, x))
                                THEN
                                    v_temp   := SUBSTR(v_shortstr, y, x);
                                ELSE
                                    z   := z + 1;
                                END IF;
                            END LOOP;
 
                            y   := y + 1;
                        END LOOP;
 
                        x   := x - 1;
                    END LOOP;
                END IF;
 
                IF v_max IS NULL OR LENGTH(v_temp) > LENGTH(v_max)
                THEN
                    v_max   := v_temp;
                END IF;
            END LOOP;
        END LOOP;
 
        RETURN v_max;
    END max_common_substring;

Open in new window

0
 
ozoCommented:
#!/usr/bin/perl
$\=$/;
print ms('qwertp','serwghtp');
sub ss{
   return join'',map{chr($_) x $_[0]->[$_]}0..$#{$_[0]};
}
sub ms{
    my ($string1,$string2)=@_;
    ($string1,$string2) = ($string2,$string1) if (length $string2) < (length $string1);
    my @string1=unpack"C*",$string1;
    my @string2=unpack"C*",$string2;
    my @result;
    my@c1=([]);
    $c1[0][$_]++ for @string1;
    $c1{pack"w*",@{$c1[0]}}=0;

    my @c2=();
    for $o ( 0..$#string2-$#string1 ){
        @{$c2[$o]}=();
        $c2[$o][$_]++ for @string2[$o..$o+$#string1];
    }
    for $l ( reverse 1..length($string1) ){
        %c1=();
        $c1{pack"w*",@{$_}}++ for @c1;

        $c1{pack"w*",@{$_}} && return ss $_ for @c2;

        push @c1,[@{$c1[-1]}];
        $c1[-1][$string1[-$l]]--;
        $c1[$_][$string1[$_+$l-1]]-- for 0..$#c1-1;

        push @c2,[@{$c2[-1]}];
        $c2[-1][$string2[-$l]]--;
        $c2[$_][$string2[$_+$l-1]]--  for 0..$#c2-1;
   }
}
0
 
sdstuberCommented:
ozo, I don't read perl please explain that code.  I believe you that it probably works, but I have no idea what it says.
0
 
AdnanKurtovicAuthor Commented:
ozo please explain code I don't read perl either :(

NovaDenizen, this is problem that arise in research. I can only say project is about genom and identification of diseases. We thought using some kind of trees would benefit in speed (if search is involved).
0

Featured Post

How to Use the Help Bell

Need to boost the visibility of your question for solutions? Use the Experts Exchange Help Bell to confirm priority levels and contact subject-matter experts for question attention.  Check out this how-to article for more information.

  • 8
  • 6
  • 4
  • +3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now