# Can't understand this - bitwise ops

I'm just not making any sense out of this code and that really bothers me :) See if you can understand why it works. If you can please explain it to me!

This is basically the core loop of microsoft's memchr crt function.

the char we're looking for (rememeber it's ASCII so only 7 bits are used) is stored in every byte of the dword N

IE if your looking for 'C' (43 in hex), then  N = 0x43434343

read 4 bytes as a dword into A  ((7e432e5a, 7e3c2e5a) see below)

I worked this out with hex numbers on a calculator for two values of A, in brackets next to the ops. First value is 4 bytes, one of which is 0x43, the second value is 4 unmatching bytes.

xor A with N. A becomes (3D006D19, 3D7F6D19). The matching byte of the dword is completely cleared now.

Set a second dword, B equal to the constant 7EFEFEFF (why this value?)

xor B with -1 (0xFFFFFFFF) B becomes (C2FF92E6, C28092E6)

xor B with A, B becomes (C2FF92E6, 7EFEFEFE)

and (&) with 0x81010100. If the result is zero, the character is not in the dword. If it is zero, the character may* be in the dword.

So that's it. Why those values for the constants? I need to modify this code to look for an entire byte (not just the first 7 bits) in 4 bytes of a dword. I suspect it's not just a 'plug and play' thing. It might not even be possible to use this code to find an enitre byte. It might be making use of the fact that the 8th bit will be cleared.

*there is some further processing, which may reject A as containing the char, included in assembler below. ecx is A. bl is 'C', the character we're looking for.

char_is_found:
xor     cl,bl           ; is it byte 0
je      short byte_0
xor     ch,bl           ; is it byte 1
je      short byte_1
shr     ecx,10h         ; is it byte 2
xor     cl,bl
je      short byte_2
xor     ch,bl           ; is it byte 3
je      short byte_3
jmp     short main_loop ; taken if bits 24-30 are clear and bit
; 31 is set

LVL 3
###### Who is Participating?

Commented:
// if looking or 0x43

unsigned int v = ...; // 32-bit word to check

v ^= mask; // will set any byte containing 0x43 to 0x00

// true if any of the bytes are 0
bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

This works for 8 bit bytes.

I got this from a web page called "Bit Twiddling Hacks", it's very entertaining: http://graphics.stanford.edu/~seander/bithacks.html

0

Commented:
i thought though ASCII is 7-bit, it is common that we uses 8-bits, where the eighth bit is the parity bit?

>> Why those values for the constants?

anyway, notice that 7EFEFEFF AND FFFFFFFF = 81010100

let's number the bytes from right to left. i.e. right-most byte is first byte.
as you noticed, of the four bytes in A, if there is a matching byte, that byte (after A is processed, i.e. xor with N) is always zero.

assuming the matching byte is in the first byte,
notice that the first bit of FE (right to left again) is zero. If the first bit of the second byte of A is 1, the
addition of A into B will give a byte with the first bit 1, which when xor-ed with FF will become 0. Addition with A again
will gives you back the 1, and therefore you can AND it with 01 in the end to get a non-zero result. Similarly if the first
bit of the A is 0, xor-ing with FF will have given it a 1. So this property of FE makes it desirable to be in the constant.

the more important property is the ability to carry forward/overflow. notice the first 3 bytes are FF, FE and FE. considering
the case where the first byte does not match, means there will always be a overflow to FE from the FF byte. In the case
where the matching byte is at the FE position, this would mean that the FE byte will become FF. Notice how this is similar
to the above case, now we only consider the last 3 bytes i.e. 7EFEFF (the 2nd byte had become FF). So following from the above result, this will also give a non-zero result.

this however cannot be continued up to the last byte (because there is no more bytes after the last byte to make use of this property). So you can't have FE at the last byte. My guess on why it chose 7E is one it just happens that 7FFFFFFF is the largest signed long value. Also notice 7E xor FF = 81 = 10000001, lotsa zeros not many ones. Which means when you AND it,
there is a higher chance that it will filter out those cases where it does not match (though not all). Notice this is
probably the best you can get because you need a 1 at the first bit (due to the previous property we made use of). And you
can't have 00000001 because that coincides with FE.

note that the 81010100 is obtained from the 7EFEFEFF.

0

Commented:
come to think of it, those guys that did this are geniuses!
btw, this is interesting, do you have the entire source? if you don't mind I would be happy if you can share with me
(through email) at coolbreeze at phreaker.net

0

Commented:
Hi CoolBreeze,

i just wrote little program

int main (int c, char **v) {
memchr ("abcdefgh", 'c', 8) ;
return 0 ;
}

compiled, debuged and opened the asm window (Alt-8) this is the source i got:

--- intel\memchr.asm  -----------------------------------------------------------------------------------------------------
memchr:
00401060   mov         eax,dword ptr [esp+0Ch]
00401064   push        ebx
00401065   test        eax,eax
00401067   je          retnull (004010b3)
00401069   mov         edx,dword ptr [esp+8]
0040106D   xor         ebx,ebx
0040106F   mov         bl,byte ptr [esp+0Ch]
00401073   test        edx,3
00401079   je          main_loop_start (0040108d)
str_misaligned:
0040107B   mov         cl,byte ptr [edx]
0040107D   inc         edx
0040107E   xor         cl,bl
00401080   je          found (004010ee)
00401082   dec         eax
00401083   je          retnull (004010b3)
00401085   test        edx,3
0040108B   jne         str_misaligned (0040107b)
main_loop_start:
0040108D   sub         eax,4
00401090   jb          tail_less_then_4 (004010a4)
00401092   push        edi
00401093   mov         edi,ebx
00401095   shl         ebx,8
0040109A   mov         edi,ebx
0040109C   shl         ebx,10h
004010A1   jmp         main_loop_entry (004010ba)
return_from_main:
004010A3   pop         edi
tail_less_then_4:
004010A7   je          retnull (004010b3)
tail_loop:
004010A9   mov         cl,byte ptr [edx]
004010AB   inc         edx
004010AC   xor         cl,bl
004010AE   je          found (004010ee)
004010B0   dec         eax
004010B1   jne         tail_loop (004010a9)
retnull:
004010B3   pop         ebx
004010B4   ret
main_loop:
004010B5   sub         eax,4
004010B8   jb          return_from_main (004010a3)
main_loop_entry:
004010BA   mov         ecx,dword ptr [edx]
004010BC   xor         ecx,ebx
004010BE   mov         edi,7EFEFEFFh
004010C5   xor         ecx,0FFh
004010C8   xor         ecx,edi
004010CD   and         ecx,81010100h
004010D3   je          main_loop (004010b5)
char_is_found:
004010D5   mov         ecx,dword ptr [edx-4]
004010D8   xor         cl,bl
004010DA   je          byte_0 (004010ff)
004010DC   xor         ch,bl
004010DE   je          byte_1 (004010f9)
004010E0   shr         ecx,10h
004010E3   xor         cl,bl
004010E5   je          byte_2 (004010f3)
004010E7   xor         ch,bl
004010E9   je          byte_3 (004010ed)
004010EB   jmp         main_loop (004010b5)
byte_3:
004010ED   pop         edi
found:
004010EE   lea         eax,[edx-1]
004010F1   pop         ebx
004010F2   ret
byte_2:
004010F3   lea         eax,[edx-2]
004010F6   pop         edi
004010F7   pop         ebx
004010F8   ret
byte_1:
004010F9   lea         eax,[edx-3]
004010FC   pop         edi
004010FD   pop         ebx
004010FE   ret
byte_0:
004010FF   lea         eax,[edx-4]
00401102   pop         edi
00401103   pop         ebx
00401104   ret
--- No source file  ----------------------------------------------------------------------------------------------------

Cheers! S.
0

Commented:
oh i see. thanks.
0

Commented:
i studied the code and the principles and i must say that microsofts are real bit twiddling gurus :).

lets take some 4 byte sequence

"xx aa xx xx"   our byte is there, but we don't know it yet

xor "aa aa aa aa"   this is our byte 4 times
"xx 00 xx xx"   00 appears in the position of the byte .. let's find it

add "fe fe fe ff"   adding ff is like subtracting 1 with overflowing (that why those fe's are there"
c "-2 ff -1 -1"   here you see, that -2 appeared just above the 00 byte (overflow didnt appear) (00 -1=ff)

now you can test the overflow. if 0, fourth byte is what we're looking for
other way is to add "fe fe fe fe" and then adc 0 (that way the carry would overflow to the least significant bit)
same effect would come if you subtract "01 01 01 01"

"xx 00 xx xx"   just xor the found sequence with added sequence
xor "-2 ff -1 -1"
"x0 x1 x1 x1"   least bit in each byte tells us which byte was -2
and "01 01 01 01"   just clear unsignificant bits
"00 01 01 01"
xor "01 01 01 01"   check if any byte in the sequence is 00 (the least bit of it)
"01 00 00 00"   if this is not 0 we found our byte, now we can check the original 4byte sequence for which byte is it

or you can calculate the exact position from the bits using some more bit twiddling methods :)

simple, isn't it?

S.
0

Commented:
who's correct here? from the code given by skypalae:

main_loop_entry:
004010BA   mov         ecx,dword ptr [edx]
004010BC   xor         ecx,ebx
004010BE   mov         edi,7EFEFEFFh
004010C5   xor         ecx,0FFh
004010C8   xor         ecx,edi
004010CD   and         ecx,81010100h
004010D3   je          main_loop (004010b5)

especially the line : xor ecx, 0FFh

sandra-24 had described that it is xor-ed with 0xFFFFFFFF, and furthermore it is xor-ed with B not A?

so who's correct here? or are both equivalent?
0

Commented:
CoolBreeze,

i think both, because of size of ecx, the 0ffh will be extended to 32bit form (with sign bit)

S.
0

Commented:
skypalae, did you got the explanation you got from the code you posted?
I couldn't find the FEFEFEFF you are talking about in the code.

furthermore,

"xx 00 xx xx"   just xor the found sequence with added sequence
xor "-2 ff -1 -1"

the added sequence is not "-2 ff -1 -1" according to the code. It is the addition of "xx 00 xx xx" and "fe fe fe ff"
in other words, you might not get this "x0 x1 x1 x1"

so the following is not applicable:
"x0 x1 x1 x1"   least bit in each byte tells us which byte was -2
0

Commented:
CoolBreeze, ok, ok, ok,

maybe my explanations were bit unclear. I just wanted to explain how to determine the equal byte from the quadruple. Check out this code, it demonstrates my thoughts (it doesn't do the pre-alignment and post-alignment .. thie microsofts "str_misaligned" and "tail_loop" parts .. that is entered data has to be placed on address aligned to 4 bytes and it's length has to be divisible by 4 too)

void *mymemchr (const void *buf, int c, size_t count) {
__asm {
mov            ebx, buf
jz            l_ret

xor            eax, eax
mov            al, byte ptr c
mov            ecx, eax
shl            eax, 8
mov            eax, ecx
shl            ecx, 16

mov            ecx, count

l_loop:
mov            edx, dword ptr [ebx]
xor            edx, eax
mov            esi, edx

sub            edx, 01010101h
sbb            edx, 0

xor            edx, esi
and            edx, 01010101h
xor            edx, 01010101h
jne            l_found

sub            ecx, 4
jnz            l_loop

l_found:
ror            edx, 9
ror            edx, 8
ror            edx, 8

inc            ebx
inc            ebx
inc            ebx

l_ret:
mov            eax, ebx
}
}

of course it can be optimized (at least by putting 01010101h into edi), but it is hust for demonstration of the principles.

S.
0

Commented:
i tested out your function using the below program, making only the changes:
jz          l_ret
changes to:
jz           not_found

and also:
l_ret:
mov          eax, ebx
changes to:
l_ret:
mov          eax, ebx

not_found:
mov          eax, 0

------------------------------------------- START OF CODE ------------------------------------------------------
#include <stdio.h>

void* mymemchr (const void *buf, int c, size_t count) {
__asm {
mov          ebx, buf
jz           not_found

xor          eax, eax
mov          al, byte ptr c
mov          ecx, eax
shl          eax, 8
mov          eax, ecx
shl          ecx, 16

mov          ecx, count

l_loop:
mov          edx, dword ptr [ebx]
xor          edx, eax
mov          esi, edx

sub          edx, 01010101h
sbb          edx, 0

xor          edx, esi
and          edx, 01010101h
xor          edx, 01010101h
jne          l_found

sub          ecx, 4
jnz          l_loop

l_found:
ror          edx, 9
ror          edx, 8
ror          edx, 8

inc          ebx
inc          ebx
inc          ebx

l_ret:
mov          eax, ebx

not_found:
mov          eax, 0
}
}

int main() {
char *a = "abcd";
char *x;
char *y;

x = mymemchr(a, 'c', 4);
y = memchr(a, 'c', 4);
printf("a = %d\n", a);
printf("x = %d\n", x);
printf("y = %d\n", y);
}

------------------------------------------- END OF CODE -----------------------------------------------------------

and the output I obtained is:

a = 4235560
x = 0
y = 4235562

there is something wrong...
0

Commented:
>> that is entered data has to be placed on address aligned to 4 bytes and it's length has to be divisible by 4 too

just noticed this. how can we ensure this is so? so that I can try the code out properly.
0

Commented:
what a folly, i forgot that l_ret falls through to not_found
0

Commented:
Yup .. i'm working on new version

also if ebx (base) == 0, you can just assign that to eax (return), so that piece of code is useless

S.

0

Commented:
okay, fixed that not_found thing. but here's comes another problem:

#include <stdio.h>

void* mymemchr (const void *buf, int c, size_t count) {
__asm {
mov          ebx, buf
cmp          ebx, 0
jz           not_found

xor          eax, eax
mov          al, byte ptr c
mov          ecx, eax
shl          eax, 8
mov          eax, ecx
shl          ecx, 16

mov          ecx, count

l_loop:
mov          edx, dword ptr [ebx]
xor          edx, eax
mov          esi, edx

sub          edx, 01010101h
sbb          edx, 0

xor          edx, esi
and          edx, 01010101h
xor          edx, 01010101h
jne          l_found

sub          ecx, 4
jnz          l_loop
jmp          not_found

l_found:
ror          edx, 9
ror          edx, 8
ror          edx, 8

inc          ebx
inc          ebx
inc          ebx

l_ret:
mov          eax, ebx
jmp          the_end

not_found:
mov          eax, 0

the_end:
}
}

int main() {
char b[] = { 0x07, 0x04, 0x06, 0x07 };
char *a = b;
char *x;
char *y;

x = mymemchr(a, 0x04, 4);
y = memchr(a, 0x04, 4);
printf("a = %d\n", a);
printf("x = %d\n", x);
printf("y = %d\n", y);

printf("\n");

x = mymemchr(a, 0x06, 4);
y = memchr(a, 0x06, 4);
printf("a = %d\n", a);
printf("x = %d\n", x);
printf("y = %d\n", y);
}

the output is:

a = 1245064
x = 1245065
y = 1245065

a = 1245064
x = 1245064
y = 1245066

as you can see, the first search is okay (x == y), the second one isn't. what's wrong?
0

Commented:
I've (hopefully) fixed the code. Now it does the same as MS memchr (misaligned memory and length not divisible by 4) but it uses my process. Just to be correct, the algorithm I've developped is just exactly the same as original MS, but it uses more understandable technique (as i described above). The MS version doesn't count on the overflow and thus it implements the guessing part. My algo finds the position correctly (that's the difference between 0xfefefeff and 0x8efefeff that you asked first)

Sandra, I hope that you understand the algorithm. You can just check it similar to what CoolBreeze did.

void *mymemchr (const void *buf, int c, size_t count) {
__asm {
mov            edi, 01010101h

mov            ebx, buf
test      ebx, ebx
jz            l_ret

xor            edx, edx
mov            dl, byte ptr c
mov            ecx, edx
shl            ecx, 8
mov            ecx, edx
shl            ecx, 16

mov            ecx, count

l_misaligned:
test      ebx, 3
jz            l_aligned
l_tail:
cmp            dl, byte ptr [ebx]
jz            l_ret
inc            ebx
loop      l_misaligned
l_aligned:

test      ecx, ecx
jz            l_not_found

l_loop:
test      ecx, -4
jz            l_tail

mov            eax, dword ptr [ebx]
xor            eax, edx
mov            esi, eax

sub            eax, edi
sbb            al, 0

xor            eax, esi
and            eax, edi
xor            eax, edi
jne            l_found

sub            ecx, 4
jnz            l_loop

l_not_found:
xor            ebx, ebx
jmp            l_ret

l_found:
ror            eax, 9
ror            eax, 8
ror            eax, 8

inc            ebx
inc            ebx
inc            ebx

l_ret:
mov            eax, ebx
}
}

S.
0

Commented:
LIttle Historical Note:

The guys are Microsoft are NOT to be given credit for this bit of clever code!

This kind of stuff has been done on long word-length machines since at least 1966.

The Control Data 6600 operating systems ( 60-bit words ) used code like this to find and replace characters since around that time frame.  particularly clever were routines like SFN (Space Fill Name) which would space fill a filename out to 10 characters, all with just logical operations, done 60-bits at a time.

0

Commented:
most good algorithms were allready invented, we are left here to copy them *sigh*
0

Author Commented:
Actually I think wayside's algorithim is pretty damn close from an implementation standpoint. The memchr algo might be slightly faster due to their use of -1 (which will fold nicely into a 16 bit signed integer in the trace cache). The one above on the the hand will require a second entry most likely in the trace cache to store 0x7F7F7F7F. Still it doesn't have the odd property of occasionally mistakenly identifying a match where there isn't one like memchr. This makes analysis of worst case scenarios hard.

Both require one move, one add, and three bitwise ops, but Microsoft's requires two conditional jumps for a memichr implementation, while the other requires only one.

'This kind of stuff has been done on long word-length machines since at least 1966.'

We won't ask how you know about programming on computers in 1966 ;)

Sorry for the late response, my ISP has been up and down in the past two days, and I had to ask a mod to delete my previous response.

Thanks for your help, this was quite a bit of clever code!

-Sandra

0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.