Solved

Posted on 2006-06-06

Hi Experts,

I don't seem to understand the shift-or algorithm, which is used to find one string within another.

Here are two descriptions and implementations:

http://en.wikipedia.org/wiki/Shift_Or_Algorithm

http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

I would be very happy if someone could break down the code into easily understandable fragments.

Thanks,

Jens

I don't seem to understand the shift-or algorithm, which is used to find one string within another.

Here are two descriptions and implementations:

http://en.wikipedia.org/wi

http://www-igm.univ-mlv.fr

I would be very happy if someone could break down the code into easily understandable fragments.

Thanks,

Jens

7 Comments

www.mis.nsysu.edu.tw/~syhw

"The Shift-Or algorithm uses matrices Rl, 0 l k of size m (n + 1) and

mask matrix D of size m jAj. Each element rl

j;i, 0 < j m, 0 i n,

contains 0, if the edit distance between string p1 : : : pj and string ending at

position i in text T is l, or 1, otherwise. Each element dj;x, 0 < j m,

x 2 A, contains 0, if pj = x, or 1, otherwise. The matrices are implemented

as tables of bit vectors as follows:

Rl

i =

266664

rl

1;i

rl

2;i

...

rl

m;i

377775

and D[x] =

266664

d1;x

d2;x

...

dm;x

377775

; 0 i n; 0 l k; x 2 A: (13)

211

6.3.2 String matching

6.3.2.1 Exact string matching In the exact string matching, vectors

R0

i , 0 i n, are computed as follows [BYG92]:

r0

j;0 := 1; 0 < j m

R0

i := shl(R0

i􀀀1) or D[ti]; 0 < i n (14)

In Formula 14 operation shl() is the bitwise operation left shift that

inserts 0 at the beginning of vector and operation or is the bitwise operation

or. Term shl(R0

i􀀀1) or D[ti] represents matching|position i in text T is

increased, position in pattern P is increased by operation shl(), and the

positions corresponding to the input symbol ti are selected by term or D[ti].

Pattern P is found at position ti􀀀m+1 : : : ti if r0

m;i = 0, 0 < i n.

An example of mask matrix D for pattern P = adbbca is shown in

Table 6.4 and an example of matrix R0 for exact searching for pattern P =

adbbca in text T = adcabcaabadbbca is shown in Table 6.5.

D a b c d A n fa; b; c; dg

a 0 1 1 1 1

d 1 1 1 0 1

b 1 0 1 1 1

b 1 0 1 1 1

c 1 1 0 1 1

a 0 1 1 1 1

Table 6.4: Matrix D for pattern P = adbbca

R0 - a d c a b c a a b a d b b c a

a 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0

d 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1

b 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1

b 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1

c 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1

a 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

Table 6.5: Matrix R0 for the exact string matching (for pattern P = adbbca

and text T = adcabcaabadbbca)

Theorem 6.13

Shift-Or algorithm described by Formula 14 simulates a run of the NFA for

the exact string matching.

212

Proof

In Shift-Or algorithm for the exact string matching there is one bit vector

R0

i , 0 i n, which represents the set of active states of NFA. In the vector,

0 represents active state and 1 represents non-active state of the simulated

NFA. So in Shift-Or algorithm, the set of active states from Section 6.1 is

implemented by bit vector.

In Formula 14, term shl(R0

i􀀀1) or D[ti] represents matching transition|

each active state is moved to the next position in the right2 in the same level.

All active states are moved at once and only the transitions corresponding

to read symbol ti are selected by mask vector D[ti], which changes 0 to 1

in each such state that its incoming matching transition is not labeled by

ti. The initial state of NFA is not in vector R0 and it is implemented by

inserting 0 at the beginning of the vector in operation shl()|initial state is

always active because of its self loop.

At the beginning only the initial state is active therefore R0

0 = 1(m).

If r0

m;i = 0, 0 < i n, then we report that the nal state is active and

thus the pattern is found ending at position i in text T. 2

--------------------------

The basic idea is to precompute a few hints-- for example, given what you know about the current matching state, what does your precomputed table tell you about the next possible match?

"

@e_tadeu I liked the powerpoint :)

@grg a little cryptic :(

Well the problem is not the understanding of the shift-or algorithm itself,

but the implementations of the algorithm on the pages I posted above.

I am not a programmer by training and the code looks quite cryptic to me.

So I would like to be walked through it ;)

Anyway here it is:

int preSo(char *x, int m, unsigned int S[]) {

unsigned int j, lim;

int i;

for (i = 0; i < ASIZE; ++i)

S[i] = ~0;

for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {

S[x[i]] &= ~j;

lim |= j;

}

lim = ~(lim>>1);

return(lim);

}

Thanks so far,

Jens

int preSo(char *x, int m, unsigned int S[]) { // x is the input string of characters, S is a bitmask array

unsigned int j, lim; int i;

for (i = 0; i < ASIZE; ++i) S[i] = ~0; // initialize the S array to all 1 bits

for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) { // repeat m times, with j starting as 1, then shifting left one bit each time

S[ x[i] ] &= ~j; //turn off the j'th bit of S indexed by the i'th character of the input string

lim |= j; // set the i'th bit of lim

}

// at this point lim has "m" bits, right justified

lim = ~(lim>>1); // shift lim right one, flip all the bits

// now lim has 32-m-+ bits set, left justified

return(lim); // return that bit pattern

// S has been changed to be:

//S[ 'A' ] has all bits ON, except if the input string had an 'A' in the q'th position, bit 'q' of S[ 'A' ] is now turned OFF

//So you can use this 'S' array to quickly answer the question: is there an 'A' anywhere in the string?

//If S[ 'A' ] == ~0, then there are no A's in the string.

//If it equals anything else, the bits that are ON tell you where in the string the A's exist (up to 32, on a 32-bit int machine)

// You can even use the whole S word in a quick masking operation to check for multiple occurrences in one swoop.

// clever

}

Title | # Comments | Views | Activity |
---|---|---|---|

Finding a good hash function | 4 | 109 | |

c++ error C2501: 'accessMSdb::create2DArray |
12 | 58 | |

FMX StringGrid1->Canvas->FillR |
3 | 77 | |

How to copy an image file into clipboard C/C++? | 1 | 107 |

The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.

Join the community of 500,000 technology professionals and ask your questions.

Connect with top rated Experts

**17** Experts available now in Live!