?
Solved

Don't understand the shift-or algorithm for string matching.

Posted on 2006-06-06
7
Medium Priority
?
266 Views
Last Modified: 2008-01-09
Hi Experts,
I don't seem to understand the shift-or algorithm, which is used to find one string within another.
Here are two descriptions and implementations:
http://en.wikipedia.org/wiki/Shift_Or_Algorithm
http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

I would be very happy if someone could break down the code into easily understandable fragments.

Thanks,
Jens
0
Comment
Question by:allmer
  • 2
  • 2
  • 2
  • +1
7 Comments
 
LVL 12

Expert Comment

by:rajeev_devin
ID: 16841134
We are not allowed to do homework questions. Can you please try it yourself and post the code if you find any difficulty.
0
 
LVL 4

Assisted Solution

by:e_tadeu
e_tadeu earned 600 total points
ID: 16841452
0
 
LVL 4

Expert Comment

by:e_tadeu
ID: 16841467
A tip: first try to understand how the algorithm works, doing it by hand a few times for some simple sequences. Then analyse the example codes, then write your own code.
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 22

Expert Comment

by:grg99
ID: 16841910
Here's somebody's somewhat garbled explanation:

"The Shift-Or algorithm uses matrices Rl, 0  l  k of size m  (n + 1) and
mask matrix D of size m  jAj. Each element rl
j;i, 0 < j  m, 0  i  n,
contains 0, if the edit distance between string p1 : : : pj and string ending at
position i in text T is  l, or 1, otherwise. Each element dj;x, 0 < j  m,
x 2 A, contains 0, if pj = x, or 1, otherwise. The matrices are implemented
as tables of bit vectors as follows:
Rl
i =
266664
rl
1;i
rl
2;i
...
rl
m;i
377775
and D[x] =
266664
d1;x
d2;x
...
dm;x
377775
; 0  i  n; 0  l  k; x 2 A: (13)
211
6.3.2 String matching
6.3.2.1 Exact string matching In the exact string matching, vectors
R0
i , 0  i  n, are computed as follows [BYG92]:
r0
j;0 := 1; 0 < j  m
R0
i := shl(R0
i&#1048576;1) or D[ti]; 0 < i  n (14)
In Formula 14 operation shl() is the bitwise operation left shift that
inserts 0 at the beginning of vector and operation or is the bitwise operation
or. Term shl(R0
i&#1048576;1) or D[ti] represents matching|position i in text T is
increased, position in pattern P is increased by operation shl(), and the
positions corresponding to the input symbol ti are selected by term or D[ti].
Pattern P is found at position ti&#1048576;m+1 : : : ti if r0
m;i = 0, 0 < i  n.
An example of mask matrix D for pattern P = adbbca is shown in
Table 6.4 and an example of matrix R0 for exact searching for pattern P =
adbbca in text T = adcabcaabadbbca is shown in Table 6.5.
D a b c d A n fa; b; c; dg
a 0 1 1 1 1
d 1 1 1 0 1
b 1 0 1 1 1
b 1 0 1 1 1
c 1 1 0 1 1
a 0 1 1 1 1
Table 6.4: Matrix D for pattern P = adbbca
R0 - a d c a b c a a b a d b b c a
a 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Table 6.5: Matrix R0 for the exact string matching (for pattern P = adbbca
and text T = adcabcaabadbbca)
Theorem 6.13
Shift-Or algorithm described by Formula 14 simulates a run of the NFA for
the exact string matching.
212
Proof
In Shift-Or algorithm for the exact string matching there is one bit vector
R0
i , 0  i  n, which represents the set of active states of NFA. In the vector,
0 represents active state and 1 represents non-active state of the simulated
NFA. So in Shift-Or algorithm, the set of active states from Section 6.1 is
implemented by bit vector.
In Formula 14, term shl(R0
i&#1048576;1) or D[ti] represents matching transition|
each active state is moved to the next position in the right2 in the same level.
All active states are moved at once and only the transitions corresponding
to read symbol ti are selected by mask vector D[ti], which changes 0 to 1
in each such state that its incoming matching transition is not labeled by
ti. The initial state of NFA is not in vector R0 and it is implemented by
inserting 0 at the beginning of the vector in operation shl()|initial state is
always active because of its self loop.
At the beginning only the initial state is active therefore R0
0 = 1(m).
If r0
m;i = 0, 0 < i  n, then we report that the nal state is active and
thus the pattern is found ending at position i in text T. 2
---------------------------------------------------
The basic idea is to precompute a few hints-- for example, given what you know about the current matching state, what does your precomputed table tell you about the next possible match?  


"
0
 
LVL 5

Author Comment

by:allmer
ID: 16843289
@rajeev_devin: This is homework I assigned to myself.
@e_tadeu I liked the powerpoint :)
@grg a little cryptic :(

Well the problem is not the understanding of the shift-or algorithm itself,
but the implementations of the algorithm on the pages I posted above.
I am not a programmer by training and the code looks quite cryptic to me.
So I would like to be walked through it ;)
Anyway here it is:
  int preSo(char *x, int m, unsigned int S[]) {
    unsigned int j, lim;
    int i;
    for (i = 0; i < ASIZE; ++i)
      S[i] = ~0;
    for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {
      S[x[i]] &= ~j;
      lim |= j;
    }
    lim = ~(lim>>1);
    return(lim);
  }
Thanks so far,
Jens
0
 
LVL 12

Expert Comment

by:rajeev_devin
ID: 16848953
>> @rajeev_devin: This is homework I assigned to myself.
Then you try it yourself :)
0
 
LVL 22

Accepted Solution

by:
grg99 earned 1400 total points
ID: 16851079
Okay, here's  astab at explaining it:

 int preSo(char *x, int m, unsigned int S[]) { // x is the input string of characters, S is a bitmask array
    unsigned int j, lim;     int i;

    for (i = 0; i < ASIZE; ++i)      S[i] = ~0;   // initialize the S array to all 1 bits



    for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {  // repeat m times, with j starting as 1, then shifting left one bit each time
      S[ x[i] ] &= ~j;  //turn off the j'th bit of S indexed by the i'th character of the input string
      lim |= j;          // set the i'th bit of lim
    }
// at this point lim has "m" bits, right justified


    lim = ~(lim>>1);  // shift lim right one, flip all the bits
// now lim has 32-m-+ bits set, left justified

    return(lim);  // return that bit pattern


// S has been changed to be:
//S[ 'A' ] has all bits ON, except if the input string had an 'A' in the q'th position, bit 'q' of S[ 'A' ] is now turned OFF

//So you can use this 'S' array to quickly answer the question: is there an 'A' anywhere in the string?
//If S[ 'A' ] == ~0, then there are no A's in the string.
//If it equals anything else, the bits that are ON tell you where in the string the A's exist (up to 32, on a 32-bit int machine)
// You can even use the whole S word in a quick masking operation to check for multiple occurrences in one swoop.
//  clever

  }
0

Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

When writing generic code, using template meta-programming techniques, it is sometimes useful to know if a type is convertible to another type. A good example of when this might be is if you are writing diagnostic instrumentation for code to generat…
  Included as part of the C++ Standard Template Library (STL) is a collection of generic containers. Each of these containers serves a different purpose and has different pros and cons. It is often difficult to decide which container to use and …
The viewer will learn how to pass data into a function in C++. This is one step further in using functions. Instead of only printing text onto the console, the function will be able to perform calculations with argumentents given by the user.
The viewer will be introduced to the technique of using vectors in C++. The video will cover how to define a vector, store values in the vector and retrieve data from the values stored in the vector.

749 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question