Link to home
Start Free TrialLog in
Avatar of allmer
allmerFlag for Türkiye

asked on

Don't understand the shift-or algorithm for string matching.

Hi Experts,
I don't seem to understand the shift-or algorithm, which is used to find one string within another.
Here are two descriptions and implementations:
http://en.wikipedia.org/wiki/Shift_Or_Algorithm
http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

I would be very happy if someone could break down the code into easily understandable fragments.

Thanks,
Jens
Avatar of rajeev_devin
rajeev_devin

We are not allowed to do homework questions. Can you please try it yourself and post the code if you find any difficulty.
SOLUTION
Avatar of e_tadeu
e_tadeu

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
A tip: first try to understand how the algorithm works, doing it by hand a few times for some simple sequences. Then analyse the example codes, then write your own code.
Here's somebody's somewhat garbled explanation:

"The Shift-Or algorithm uses matrices Rl, 0  l  k of size m  (n + 1) and
mask matrix D of size m  jAj. Each element rl
j;i, 0 < j  m, 0  i  n,
contains 0, if the edit distance between string p1 : : : pj and string ending at
position i in text T is  l, or 1, otherwise. Each element dj;x, 0 < j  m,
x 2 A, contains 0, if pj = x, or 1, otherwise. The matrices are implemented
as tables of bit vectors as follows:
Rl
i =
266664
rl
1;i
rl
2;i
...
rl
m;i
377775
and D[x] =
266664
d1;x
d2;x
...
dm;x
377775
; 0  i  n; 0  l  k; x 2 A: (13)
211
6.3.2 String matching
6.3.2.1 Exact string matching In the exact string matching, vectors
R0
i , 0  i  n, are computed as follows [BYG92]:
r0
j;0 := 1; 0 < j  m
R0
i := shl(R0
i&#1048576;1) or D[ti]; 0 < i  n (14)
In Formula 14 operation shl() is the bitwise operation left shift that
inserts 0 at the beginning of vector and operation or is the bitwise operation
or. Term shl(R0
i&#1048576;1) or D[ti] represents matching|position i in text T is
increased, position in pattern P is increased by operation shl(), and the
positions corresponding to the input symbol ti are selected by term or D[ti].
Pattern P is found at position ti&#1048576;m+1 : : : ti if r0
m;i = 0, 0 < i  n.
An example of mask matrix D for pattern P = adbbca is shown in
Table 6.4 and an example of matrix R0 for exact searching for pattern P =
adbbca in text T = adcabcaabadbbca is shown in Table 6.5.
D a b c d A n fa; b; c; dg
a 0 1 1 1 1
d 1 1 1 0 1
b 1 0 1 1 1
b 1 0 1 1 1
c 1 1 0 1 1
a 0 1 1 1 1
Table 6.4: Matrix D for pattern P = adbbca
R0 - a d c a b c a a b a d b b c a
a 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Table 6.5: Matrix R0 for the exact string matching (for pattern P = adbbca
and text T = adcabcaabadbbca)
Theorem 6.13
Shift-Or algorithm described by Formula 14 simulates a run of the NFA for
the exact string matching.
212
Proof
In Shift-Or algorithm for the exact string matching there is one bit vector
R0
i , 0  i  n, which represents the set of active states of NFA. In the vector,
0 represents active state and 1 represents non-active state of the simulated
NFA. So in Shift-Or algorithm, the set of active states from Section 6.1 is
implemented by bit vector.
In Formula 14, term shl(R0
i&#1048576;1) or D[ti] represents matching transition|
each active state is moved to the next position in the right2 in the same level.
All active states are moved at once and only the transitions corresponding
to read symbol ti are selected by mask vector D[ti], which changes 0 to 1
in each such state that its incoming matching transition is not labeled by
ti. The initial state of NFA is not in vector R0 and it is implemented by
inserting 0 at the beginning of the vector in operation shl()|initial state is
always active because of its self loop.
At the beginning only the initial state is active therefore R0
0 = 1(m).
If r0
m;i = 0, 0 < i  n, then we report that the nal state is active and
thus the pattern is found ending at position i in text T. 2
---------------------------------------------------
The basic idea is to precompute a few hints-- for example, given what you know about the current matching state, what does your precomputed table tell you about the next possible match?  


"
Avatar of allmer

ASKER

@rajeev_devin: This is homework I assigned to myself.
@e_tadeu I liked the powerpoint :)
@grg a little cryptic :(

Well the problem is not the understanding of the shift-or algorithm itself,
but the implementations of the algorithm on the pages I posted above.
I am not a programmer by training and the code looks quite cryptic to me.
So I would like to be walked through it ;)
Anyway here it is:
  int preSo(char *x, int m, unsigned int S[]) {
    unsigned int j, lim;
    int i;
    for (i = 0; i < ASIZE; ++i)
      S[i] = ~0;
    for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {
      S[x[i]] &= ~j;
      lim |= j;
    }
    lim = ~(lim>>1);
    return(lim);
  }
Thanks so far,
Jens
>> @rajeev_devin: This is homework I assigned to myself.
Then you try it yourself :)
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial