Solved

# Don't understand the shift-or algorithm for string matching.

Posted on 2006-06-06
257 Views
Hi Experts,
I don't seem to understand the shift-or algorithm, which is used to find one string within another.
Here are two descriptions and implementations:
http://en.wikipedia.org/wiki/Shift_Or_Algorithm
http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

I would be very happy if someone could break down the code into easily understandable fragments.

Thanks,
Jens
0
Question by:allmer

LVL 12

Expert Comment

We are not allowed to do homework questions. Can you please try it yourself and post the code if you find any difficulty.
0

LVL 4

Assisted Solution

0

LVL 4

Expert Comment

A tip: first try to understand how the algorithm works, doing it by hand a few times for some simple sequences. Then analyse the example codes, then write your own code.
0

LVL 22

Expert Comment

Here's somebody's somewhat garbled explanation:

"The Shift-Or algorithm uses matrices Rl, 0  l  k of size m  (n + 1) and
mask matrix D of size m  jAj. Each element rl
j;i, 0 < j  m, 0  i  n,
contains 0, if the edit distance between string p1 : : : pj and string ending at
position i in text T is  l, or 1, otherwise. Each element dj;x, 0 < j  m,
x 2 A, contains 0, if pj = x, or 1, otherwise. The matrices are implemented
as tables of bit vectors as follows:
Rl
i =
266664
rl
1;i
rl
2;i
...
rl
m;i
377775
and D[x] =
266664
d1;x
d2;x
...
dm;x
377775
; 0  i  n; 0  l  k; x 2 A: (13)
211
6.3.2 String matching
6.3.2.1 Exact string matching In the exact string matching, vectors
R0
i , 0  i  n, are computed as follows [BYG92]:
r0
j;0 := 1; 0 < j  m
R0
i := shl(R0
i&#1048576;1) or D[ti]; 0 < i  n (14)
In Formula 14 operation shl() is the bitwise operation left shift that
inserts 0 at the beginning of vector and operation or is the bitwise operation
or. Term shl(R0
i&#1048576;1) or D[ti] represents matching|position i in text T is
increased, position in pattern P is increased by operation shl(), and the
positions corresponding to the input symbol ti are selected by term or D[ti].
Pattern P is found at position ti&#1048576;m+1 : : : ti if r0
m;i = 0, 0 < i  n.
An example of mask matrix D for pattern P = adbbca is shown in
Table 6.4 and an example of matrix R0 for exact searching for pattern P =
D a b c d A n fa; b; c; dg
a 0 1 1 1 1
d 1 1 1 0 1
b 1 0 1 1 1
b 1 0 1 1 1
c 1 1 0 1 1
a 0 1 1 1 1
Table 6.4: Matrix D for pattern P = adbbca
R0 - a d c a b c a a b a d b b c a
a 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Table 6.5: Matrix R0 for the exact string matching (for pattern P = adbbca
Theorem 6.13
Shift-Or algorithm described by Formula 14 simulates a run of the NFA for
the exact string matching.
212
Proof
In Shift-Or algorithm for the exact string matching there is one bit vector
R0
i , 0  i  n, which represents the set of active states of NFA. In the vector,
0 represents active state and 1 represents non-active state of the simulated
NFA. So in Shift-Or algorithm, the set of active states from Section 6.1 is
implemented by bit vector.
In Formula 14, term shl(R0
i&#1048576;1) or D[ti] represents matching transition|
each active state is moved to the next position in the right2 in the same level.
All active states are moved at once and only the transitions corresponding
to read symbol ti are selected by mask vector D[ti], which changes 0 to 1
in each such state that its incoming matching transition is not labeled by
ti. The initial state of NFA is not in vector R0 and it is implemented by
inserting 0 at the beginning of the vector in operation shl()|initial state is
always active because of its self loop.
At the beginning only the initial state is active therefore R0
0 = 1(m).
If r0
m;i = 0, 0 < i  n, then we report that the nal state is active and
thus the pattern is found ending at position i in text T. 2
---------------------------------------------------
The basic idea is to precompute a few hints-- for example, given what you know about the current matching state, what does your precomputed table tell you about the next possible match?

"
0

LVL 5

Author Comment

@rajeev_devin: This is homework I assigned to myself.
@e_tadeu I liked the powerpoint :)
@grg a little cryptic :(

Well the problem is not the understanding of the shift-or algorithm itself,
but the implementations of the algorithm on the pages I posted above.
I am not a programmer by training and the code looks quite cryptic to me.
So I would like to be walked through it ;)
Anyway here it is:
int preSo(char *x, int m, unsigned int S[]) {
unsigned int j, lim;
int i;
for (i = 0; i < ASIZE; ++i)
S[i] = ~0;
for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {
S[x[i]] &= ~j;
lim |= j;
}
lim = ~(lim>>1);
return(lim);
}
Thanks so far,
Jens
0

LVL 12

Expert Comment

>> @rajeev_devin: This is homework I assigned to myself.
Then you try it yourself :)
0

LVL 22

Accepted Solution

Okay, here's  astab at explaining it:

int preSo(char *x, int m, unsigned int S[]) { // x is the input string of characters, S is a bitmask array
unsigned int j, lim;     int i;

for (i = 0; i < ASIZE; ++i)      S[i] = ~0;   // initialize the S array to all 1 bits

for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {  // repeat m times, with j starting as 1, then shifting left one bit each time
S[ x[i] ] &= ~j;  //turn off the j'th bit of S indexed by the i'th character of the input string
lim |= j;          // set the i'th bit of lim
}
// at this point lim has "m" bits, right justified

lim = ~(lim>>1);  // shift lim right one, flip all the bits
// now lim has 32-m-+ bits set, left justified

return(lim);  // return that bit pattern

// S has been changed to be:
//S[ 'A' ] has all bits ON, except if the input string had an 'A' in the q'th position, bit 'q' of S[ 'A' ] is now turned OFF

//So you can use this 'S' array to quickly answer the question: is there an 'A' anywhere in the string?
//If S[ 'A' ] == ~0, then there are no A's in the string.
//If it equals anything else, the bits that are ON tell you where in the string the A's exist (up to 32, on a 32-bit int machine)
// You can even use the whole S word in a quick masking operation to check for multiple occurrences in one swoop.
//  clever

}
0

## Featured Post

Errors will happen. It is a fact of life for the programmer. How and when errors are detected have a great impact on quality and cost of a product. It is better to detect errors at compile time, when possible and practical. Errors that make their wa…
Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
The goal of the video will be to teach the user the difference and consequence of passing data by value vs passing data by reference in C++. An example of passing data by value as well as an example of passing data by reference will be be given. Bot…
The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.