Don't understand the shift-or algorithm for string matching.

Posted on 2006-06-06
Last Modified: 2008-01-09
Hi Experts,
I don't seem to understand the shift-or algorithm, which is used to find one string within another.
Here are two descriptions and implementations:

I would be very happy if someone could break down the code into easily understandable fragments.

Question by:allmer
    LVL 12

    Expert Comment

    We are not allowed to do homework questions. Can you please try it yourself and post the code if you find any difficulty.
    LVL 4

    Assisted Solution

    LVL 4

    Expert Comment

    A tip: first try to understand how the algorithm works, doing it by hand a few times for some simple sequences. Then analyse the example codes, then write your own code.
    LVL 22

    Expert Comment

    Here's somebody's somewhat garbled explanation:

    "The Shift-Or algorithm uses matrices Rl, 0  l  k of size m  (n + 1) and
    mask matrix D of size m  jAj. Each element rl
    j;i, 0 < j  m, 0  i  n,
    contains 0, if the edit distance between string p1 : : : pj and string ending at
    position i in text T is  l, or 1, otherwise. Each element dj;x, 0 < j  m,
    x 2 A, contains 0, if pj = x, or 1, otherwise. The matrices are implemented
    as tables of bit vectors as follows:
    i =
    and D[x] =
    ; 0  i  n; 0  l  k; x 2 A: (13)
    6.3.2 String matching Exact string matching In the exact string matching, vectors
    i , 0  i  n, are computed as follows [BYG92]:
    j;0 := 1; 0 < j  m
    i := shl(R0
    i&#1048576;1) or D[ti]; 0 < i  n (14)
    In Formula 14 operation shl() is the bitwise operation left shift that
    inserts 0 at the beginning of vector and operation or is the bitwise operation
    or. Term shl(R0
    i&#1048576;1) or D[ti] represents matching|position i in text T is
    increased, position in pattern P is increased by operation shl(), and the
    positions corresponding to the input symbol ti are selected by term or D[ti].
    Pattern P is found at position ti&#1048576;m+1 : : : ti if r0
    m;i = 0, 0 < i  n.
    An example of mask matrix D for pattern P = adbbca is shown in
    Table 6.4 and an example of matrix R0 for exact searching for pattern P =
    adbbca in text T = adcabcaabadbbca is shown in Table 6.5.
    D a b c d A n fa; b; c; dg
    a 0 1 1 1 1
    d 1 1 1 0 1
    b 1 0 1 1 1
    b 1 0 1 1 1
    c 1 1 0 1 1
    a 0 1 1 1 1
    Table 6.4: Matrix D for pattern P = adbbca
    R0 - a d c a b c a a b a d b b c a
    a 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
    d 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
    b 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
    b 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
    c 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
    a 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
    Table 6.5: Matrix R0 for the exact string matching (for pattern P = adbbca
    and text T = adcabcaabadbbca)
    Theorem 6.13
    Shift-Or algorithm described by Formula 14 simulates a run of the NFA for
    the exact string matching.
    In Shift-Or algorithm for the exact string matching there is one bit vector
    i , 0  i  n, which represents the set of active states of NFA. In the vector,
    0 represents active state and 1 represents non-active state of the simulated
    NFA. So in Shift-Or algorithm, the set of active states from Section 6.1 is
    implemented by bit vector.
    In Formula 14, term shl(R0
    i&#1048576;1) or D[ti] represents matching transition|
    each active state is moved to the next position in the right2 in the same level.
    All active states are moved at once and only the transitions corresponding
    to read symbol ti are selected by mask vector D[ti], which changes 0 to 1
    in each such state that its incoming matching transition is not labeled by
    ti. The initial state of NFA is not in vector R0 and it is implemented by
    inserting 0 at the beginning of the vector in operation shl()|initial state is
    always active because of its self loop.
    At the beginning only the initial state is active therefore R0
    0 = 1(m).
    If r0
    m;i = 0, 0 < i  n, then we report that the nal state is active and
    thus the pattern is found ending at position i in text T. 2
    The basic idea is to precompute a few hints-- for example, given what you know about the current matching state, what does your precomputed table tell you about the next possible match?  

    LVL 5

    Author Comment

    @rajeev_devin: This is homework I assigned to myself.
    @e_tadeu I liked the powerpoint :)
    @grg a little cryptic :(

    Well the problem is not the understanding of the shift-or algorithm itself,
    but the implementations of the algorithm on the pages I posted above.
    I am not a programmer by training and the code looks quite cryptic to me.
    So I would like to be walked through it ;)
    Anyway here it is:
      int preSo(char *x, int m, unsigned int S[]) {
        unsigned int j, lim;
        int i;
        for (i = 0; i < ASIZE; ++i)
          S[i] = ~0;
        for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {
          S[x[i]] &= ~j;
          lim |= j;
        lim = ~(lim>>1);
    Thanks so far,
    LVL 12

    Expert Comment

    >> @rajeev_devin: This is homework I assigned to myself.
    Then you try it yourself :)
    LVL 22

    Accepted Solution

    Okay, here's  astab at explaining it:

     int preSo(char *x, int m, unsigned int S[]) { // x is the input string of characters, S is a bitmask array
        unsigned int j, lim;     int i;

        for (i = 0; i < ASIZE; ++i)      S[i] = ~0;   // initialize the S array to all 1 bits

        for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {  // repeat m times, with j starting as 1, then shifting left one bit each time
          S[ x[i] ] &= ~j;  //turn off the j'th bit of S indexed by the i'th character of the input string
          lim |= j;          // set the i'th bit of lim
    // at this point lim has "m" bits, right justified

        lim = ~(lim>>1);  // shift lim right one, flip all the bits
    // now lim has 32-m-+ bits set, left justified

        return(lim);  // return that bit pattern

    // S has been changed to be:
    //S[ 'A' ] has all bits ON, except if the input string had an 'A' in the q'th position, bit 'q' of S[ 'A' ] is now turned OFF

    //So you can use this 'S' array to quickly answer the question: is there an 'A' anywhere in the string?
    //If S[ 'A' ] == ~0, then there are no A's in the string.
    //If it equals anything else, the bits that are ON tell you where in the string the A's exist (up to 32, on a 32-bit int machine)
    // You can even use the whole S word in a quick masking operation to check for multiple occurrences in one swoop.
    //  clever


    Featured Post

    How your wiki can always stay up-to-date

    Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
    - Increase transparency
    - Onboard new hires faster
    - Access from mobile/offline

    Join & Write a Comment

    Errors will happen. It is a fact of life for the programmer. How and when errors are detected have a great impact on quality and cost of a product. It is better to detect errors at compile time, when possible and practical. Errors that make their wa…
    Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
    The goal of the video will be to teach the user the difference and consequence of passing data by value vs passing data by reference in C++. An example of passing data by value as well as an example of passing data by reference will be be given. Bot…
    The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.

    728 members asked questions and received personalized solutions in the past 7 days.

    Join the community of 500,000 technology professionals and ask your questions.

    Join & Ask a Question

    Need Help in Real-Time?

    Connect with top rated Experts

    17 Experts available now in Live!

    Get 1:1 Help Now