Solved

Can't understand this - bitwise ops

Posted on 2004-04-29
19
948 Views
Last Modified: 2007-12-19
I'm just not making any sense out of this code and that really bothers me :) See if you can understand why it works. If you can please explain it to me!

This is basically the core loop of microsoft's memchr crt function.

the char we're looking for (rememeber it's ASCII so only 7 bits are used) is stored in every byte of the dword N

IE if your looking for 'C' (43 in hex), then  N = 0x43434343

read 4 bytes as a dword into A  ((7e432e5a, 7e3c2e5a) see below)

I worked this out with hex numbers on a calculator for two values of A, in brackets next to the ops. First value is 4 bytes, one of which is 0x43, the second value is 4 unmatching bytes.

xor A with N. A becomes (3D006D19, 3D7F6D19). The matching byte of the dword is completely cleared now.

Set a second dword, B equal to the constant 7EFEFEFF (why this value?)

Add A into B.

xor B with -1 (0xFFFFFFFF) B becomes (C2FF92E6, C28092E6)

xor B with A, B becomes (C2FF92E6, 7EFEFEFE)

and (&) with 0x81010100. If the result is zero, the character is not in the dword. If it is zero, the character may* be in the dword.

So that's it. Why those values for the constants? I need to modify this code to look for an entire byte (not just the first 7 bits) in 4 bytes of a dword. I suspect it's not just a 'plug and play' thing. It might not even be possible to use this code to find an enitre byte. It might be making use of the fact that the 8th bit will be cleared.

*there is some further processing, which may reject A as containing the char, included in assembler below. ecx is A. bl is 'C', the character we're looking for.

char_is_found:
      xor     cl,bl           ; is it byte 0
      je      short byte_0
      xor     ch,bl           ; is it byte 1
      je      short byte_1
      shr     ecx,10h         ; is it byte 2
      xor     cl,bl
      je      short byte_2
      xor     ch,bl           ; is it byte 3
      je      short byte_3
      jmp     short main_loop ; taken if bits 24-30 are clear and bit
      ; 31 is set





0
Comment
Question by:Sandra-24
19 Comments
 
LVL 3

Assisted Solution

by:CoolBreeze
CoolBreeze earned 150 total points
ID: 10956138
i thought though ASCII is 7-bit, it is common that we uses 8-bits, where the eighth bit is the parity bit?

>> Why those values for the constants?

anyway, notice that 7EFEFEFF AND FFFFFFFF = 81010100

let's number the bytes from right to left. i.e. right-most byte is first byte.
as you noticed, of the four bytes in A, if there is a matching byte, that byte (after A is processed, i.e. xor with N) is always zero.

assuming the matching byte is in the first byte,
notice that the first bit of FE (right to left again) is zero. If the first bit of the second byte of A is 1, the
addition of A into B will give a byte with the first bit 1, which when xor-ed with FF will become 0. Addition with A again
will gives you back the 1, and therefore you can AND it with 01 in the end to get a non-zero result. Similarly if the first
bit of the A is 0, xor-ing with FF will have given it a 1. So this property of FE makes it desirable to be in the constant.

the more important property is the ability to carry forward/overflow. notice the first 3 bytes are FF, FE and FE. considering
the case where the first byte does not match, means there will always be a overflow to FE from the FF byte. In the case
where the matching byte is at the FE position, this would mean that the FE byte will become FF. Notice how this is similar
to the above case, now we only consider the last 3 bytes i.e. 7EFEFF (the 2nd byte had become FF). So following from the above result, this will also give a non-zero result.

this however cannot be continued up to the last byte (because there is no more bytes after the last byte to make use of this property). So you can't have FE at the last byte. My guess on why it chose 7E is one it just happens that 7FFFFFFF is the largest signed long value. Also notice 7E xor FF = 81 = 10000001, lotsa zeros not many ones. Which means when you AND it,
there is a higher chance that it will filter out those cases where it does not match (though not all). Notice this is
probably the best you can get because you need a 1 at the first bit (due to the previous property we made use of). And you
can't have 00000001 because that coincides with FE.

note that the 81010100 is obtained from the 7EFEFEFF.

0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10956156
come to think of it, those guys that did this are geniuses!
btw, this is interesting, do you have the entire source? if you don't mind I would be happy if you can share with me
(through email) at coolbreeze at phreaker.net

0
 
LVL 4

Expert Comment

by:skypalae
ID: 10956439
Hi CoolBreeze,

i just wrote little program

int main (int c, char **v) {
    memchr ("abcdefgh", 'c', 8) ;
    return 0 ;
}

compiled, debuged and opened the asm window (Alt-8) this is the source i got:

--- intel\memchr.asm  -----------------------------------------------------------------------------------------------------
memchr:
00401060   mov         eax,dword ptr [esp+0Ch]
00401064   push        ebx
00401065   test        eax,eax
00401067   je          retnull (004010b3)
00401069   mov         edx,dword ptr [esp+8]
0040106D   xor         ebx,ebx
0040106F   mov         bl,byte ptr [esp+0Ch]
00401073   test        edx,3
00401079   je          main_loop_start (0040108d)
str_misaligned:
0040107B   mov         cl,byte ptr [edx]
0040107D   inc         edx
0040107E   xor         cl,bl
00401080   je          found (004010ee)
00401082   dec         eax
00401083   je          retnull (004010b3)
00401085   test        edx,3
0040108B   jne         str_misaligned (0040107b)
main_loop_start:
0040108D   sub         eax,4
00401090   jb          tail_less_then_4 (004010a4)
00401092   push        edi
00401093   mov         edi,ebx
00401095   shl         ebx,8
00401098   add         ebx,edi
0040109A   mov         edi,ebx
0040109C   shl         ebx,10h
0040109F   add         ebx,edi
004010A1   jmp         main_loop_entry (004010ba)
return_from_main:
004010A3   pop         edi
tail_less_then_4:
004010A4   add         eax,4
004010A7   je          retnull (004010b3)
tail_loop:
004010A9   mov         cl,byte ptr [edx]
004010AB   inc         edx
004010AC   xor         cl,bl
004010AE   je          found (004010ee)
004010B0   dec         eax
004010B1   jne         tail_loop (004010a9)
retnull:
004010B3   pop         ebx
004010B4   ret
main_loop:
004010B5   sub         eax,4
004010B8   jb          return_from_main (004010a3)
main_loop_entry:
004010BA   mov         ecx,dword ptr [edx]
004010BC   xor         ecx,ebx
004010BE   mov         edi,7EFEFEFFh
004010C3   add         edi,ecx
004010C5   xor         ecx,0FFh
004010C8   xor         ecx,edi
004010CA   add         edx,4
004010CD   and         ecx,81010100h
004010D3   je          main_loop (004010b5)
char_is_found:
004010D5   mov         ecx,dword ptr [edx-4]
004010D8   xor         cl,bl
004010DA   je          byte_0 (004010ff)
004010DC   xor         ch,bl
004010DE   je          byte_1 (004010f9)
004010E0   shr         ecx,10h
004010E3   xor         cl,bl
004010E5   je          byte_2 (004010f3)
004010E7   xor         ch,bl
004010E9   je          byte_3 (004010ed)
004010EB   jmp         main_loop (004010b5)
byte_3:
004010ED   pop         edi
found:
004010EE   lea         eax,[edx-1]
004010F1   pop         ebx
004010F2   ret
byte_2:
004010F3   lea         eax,[edx-2]
004010F6   pop         edi
004010F7   pop         ebx
004010F8   ret
byte_1:
004010F9   lea         eax,[edx-3]
004010FC   pop         edi
004010FD   pop         ebx
004010FE   ret
byte_0:
004010FF   lea         eax,[edx-4]
00401102   pop         edi
00401103   pop         ebx
00401104   ret
--- No source file  ----------------------------------------------------------------------------------------------------

Cheers! S.
0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10956505
oh i see. thanks.
anyway, did my earlier explanation answers your question?
0
 
LVL 4

Assisted Solution

by:skypalae
skypalae earned 100 total points
ID: 10956900
i studied the code and the principles and i must say that microsofts are real bit twiddling gurus :).

lets take some 4 byte sequence

    "xx aa xx xx"   our byte is there, but we don't know it yet

xor "aa aa aa aa"   this is our byte 4 times
    "xx 00 xx xx"   00 appears in the position of the byte .. let's find it

add "fe fe fe ff"   adding ff is like subtracting 1 with overflowing (that why those fe's are there"
  c "-2 ff -1 -1"   here you see, that -2 appeared just above the 00 byte (overflow didnt appear) (00 -1=ff)

now you can test the overflow. if 0, fourth byte is what we're looking for
other way is to add "fe fe fe fe" and then adc 0 (that way the carry would overflow to the least significant bit)
same effect would come if you subtract "01 01 01 01"

    "xx 00 xx xx"   just xor the found sequence with added sequence
xor "-2 ff -1 -1"
    "x0 x1 x1 x1"   least bit in each byte tells us which byte was -2
and "01 01 01 01"   just clear unsignificant bits
    "00 01 01 01"
xor "01 01 01 01"   check if any byte in the sequence is 00 (the least bit of it)
    "01 00 00 00"   if this is not 0 we found our byte, now we can check the original 4byte sequence for which byte is it

or you can calculate the exact position from the bits using some more bit twiddling methods :)

simple, isn't it?

S.
0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10956953
who's correct here? from the code given by skypalae:

main_loop_entry:
004010BA   mov         ecx,dword ptr [edx]
004010BC   xor         ecx,ebx
004010BE   mov         edi,7EFEFEFFh
004010C3   add         edi,ecx
004010C5   xor         ecx,0FFh
004010C8   xor         ecx,edi
004010CA   add         edx,4
004010CD   and         ecx,81010100h
004010D3   je          main_loop (004010b5)

especially the line : xor ecx, 0FFh

sandra-24 had described that it is xor-ed with 0xFFFFFFFF, and furthermore it is xor-ed with B not A?

so who's correct here? or are both equivalent?
0
 
LVL 4

Expert Comment

by:skypalae
ID: 10956994
CoolBreeze,

i think both, because of size of ecx, the 0ffh will be extended to 32bit form (with sign bit)

S.
0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10957006
skypalae, did you got the explanation you got from the code you posted?
I couldn't find the FEFEFEFF you are talking about in the code.

furthermore,

    "xx 00 xx xx"   just xor the found sequence with added sequence
xor "-2 ff -1 -1"

the added sequence is not "-2 ff -1 -1" according to the code. It is the addition of "xx 00 xx xx" and "fe fe fe ff"
in other words, you might not get this "x0 x1 x1 x1"

so the following is not applicable:
    "x0 x1 x1 x1"   least bit in each byte tells us which byte was -2
0
 
LVL 4

Expert Comment

by:skypalae
ID: 10957233
CoolBreeze, ok, ok, ok,

maybe my explanations were bit unclear. I just wanted to explain how to determine the equal byte from the quadruple. Check out this code, it demonstrates my thoughts (it doesn't do the pre-alignment and post-alignment .. thie microsofts "str_misaligned" and "tail_loop" parts .. that is entered data has to be placed on address aligned to 4 bytes and it's length has to be divisible by 4 too)

void *mymemchr (const void *buf, int c, size_t count) {
      __asm {
            mov            ebx, buf
            jz            l_ret

            xor            eax, eax
            mov            al, byte ptr c
            mov            ecx, eax
            shl            eax, 8
            add            ecx, eax
            mov            eax, ecx
            shl            ecx, 16
            add            eax, ecx

            mov            ecx, count

      l_loop:
            mov            edx, dword ptr [ebx]
            xor            edx, eax
            mov            esi, edx

            sub            edx, 01010101h
            sbb            edx, 0

            xor            edx, esi
            and            edx, 01010101h
            xor            edx, 01010101h
            jne            l_found

            add            ebx, 4
            sub            ecx, 4
            jnz            l_loop

      l_found:
            ror            edx, 9
            jc            l_add_0
            ror            edx, 8
            jc            l_add_1
            ror            edx, 8
            jc            l_add_2

            inc            ebx
      l_add_2:
            inc            ebx
      l_add_1:
            inc            ebx
      l_add_0:

      l_ret:
            mov            eax, ebx
      }
}

of course it can be optimized (at least by putting 01010101h into edi), but it is hust for demonstration of the principles.

S.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10957436
i tested out your function using the below program, making only the changes:
             jz          l_ret
changes to:
             jz           not_found

and also:
     l_ret:
          mov          eax, ebx
changes to:
     l_ret:
          mov          eax, ebx

     not_found:
          mov          eax, 0


so that if it is not found, 0 (NULL) is returned.

------------------------------------------- START OF CODE ------------------------------------------------------
#include <stdio.h>

void* mymemchr (const void *buf, int c, size_t count) {
     __asm {
          mov          ebx, buf
          jz           not_found

          xor          eax, eax
          mov          al, byte ptr c
          mov          ecx, eax
          shl          eax, 8
          add          ecx, eax
          mov          eax, ecx
          shl          ecx, 16
          add          eax, ecx

          mov          ecx, count

     l_loop:
          mov          edx, dword ptr [ebx]
          xor          edx, eax
          mov          esi, edx

          sub          edx, 01010101h
          sbb          edx, 0

          xor          edx, esi
          and          edx, 01010101h
          xor          edx, 01010101h
          jne          l_found

          add          ebx, 4
          sub          ecx, 4
          jnz          l_loop

     l_found:
          ror          edx, 9
          jc          l_add_0
          ror          edx, 8
          jc          l_add_1
          ror          edx, 8
          jc          l_add_2

          inc          ebx
     l_add_2:
          inc          ebx
     l_add_1:
          inc          ebx
     l_add_0:

     l_ret:
          mov          eax, ebx

     not_found:
          mov          eax, 0
     }
}

int main() {
  char *a = "abcd";
  char *x;
  char *y;

  x = mymemchr(a, 'c', 4);
  y = memchr(a, 'c', 4);
  printf("a = %d\n", a);
  printf("x = %d\n", x);
  printf("y = %d\n", y);
}

------------------------------------------- END OF CODE -----------------------------------------------------------

and the output I obtained is:

a = 4235560
x = 0
y = 4235562

there is something wrong...
0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10957444
>> that is entered data has to be placed on address aligned to 4 bytes and it's length has to be divisible by 4 too

just noticed this. how can we ensure this is so? so that I can try the code out properly.
0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10957490
what a folly, i forgot that l_ret falls through to not_found
0
 
LVL 4

Expert Comment

by:skypalae
ID: 10957566
Yup .. i'm working on new version

also if ebx (base) == 0, you can just assign that to eax (return), so that piece of code is useless

S.

0
 
LVL 3

Expert Comment

by:CoolBreeze
ID: 10957692
okay, fixed that not_found thing. but here's comes another problem:

#include <stdio.h>

void* mymemchr (const void *buf, int c, size_t count) {
     __asm {
          mov          ebx, buf
          cmp          ebx, 0
          jz           not_found

          xor          eax, eax
          mov          al, byte ptr c
          mov          ecx, eax
          shl          eax, 8
          add          ecx, eax
          mov          eax, ecx
          shl          ecx, 16
          add          eax, ecx

          mov          ecx, count

     l_loop:
          mov          edx, dword ptr [ebx]
          xor          edx, eax
          mov          esi, edx

          sub          edx, 01010101h
          sbb          edx, 0

          xor          edx, esi
          and          edx, 01010101h
          xor          edx, 01010101h
          jne          l_found

          add          ebx, 4
          sub          ecx, 4
          jnz          l_loop
          jmp          not_found

     l_found:
          ror          edx, 9
          jc          l_add_0
          ror          edx, 8
          jc          l_add_1
          ror          edx, 8
          jc          l_add_2

          inc          ebx
     l_add_2:
          inc          ebx
     l_add_1:
          inc          ebx
     l_add_0:

     l_ret:
          mov          eax, ebx
          jmp          the_end

     not_found:
          mov          eax, 0

     the_end:
     }
}

int main() {
  char b[] = { 0x07, 0x04, 0x06, 0x07 };
  char *a = b;
  char *x;
  char *y;

  x = mymemchr(a, 0x04, 4);
  y = memchr(a, 0x04, 4);
  printf("a = %d\n", a);
  printf("x = %d\n", x);
  printf("y = %d\n", y);

  printf("\n");

  x = mymemchr(a, 0x06, 4);
  y = memchr(a, 0x06, 4);
  printf("a = %d\n", a);
  printf("x = %d\n", x);
  printf("y = %d\n", y);
}



the output is:

a = 1245064
x = 1245065
y = 1245065

a = 1245064
x = 1245064
y = 1245066

as you can see, the first search is okay (x == y), the second one isn't. what's wrong?
0
 
LVL 4

Expert Comment

by:skypalae
ID: 10957943
I've (hopefully) fixed the code. Now it does the same as MS memchr (misaligned memory and length not divisible by 4) but it uses my process. Just to be correct, the algorithm I've developped is just exactly the same as original MS, but it uses more understandable technique (as i described above). The MS version doesn't count on the overflow and thus it implements the guessing part. My algo finds the position correctly (that's the difference between 0xfefefeff and 0x8efefeff that you asked first)

Sandra, I hope that you understand the algorithm. You can just check it similar to what CoolBreeze did.

void *mymemchr (const void *buf, int c, size_t count) {
      __asm {
            mov            edi, 01010101h

            mov            ebx, buf
            test      ebx, ebx
            jz            l_ret

            xor            edx, edx
            mov            dl, byte ptr c
            mov            ecx, edx
            shl            ecx, 8
            add            edx, ecx
            mov            ecx, edx
            shl            ecx, 16
            add            edx, ecx

            mov            ecx, count

      l_misaligned:
            test      ebx, 3
            jz            l_aligned
      l_tail:
            cmp            dl, byte ptr [ebx]
            jz            l_ret
            inc            ebx
            loop      l_misaligned
      l_aligned:

            test      ecx, ecx
            jz            l_not_found

      l_loop:
            test      ecx, -4
            jz            l_tail

            mov            eax, dword ptr [ebx]
            xor            eax, edx
            mov            esi, eax

            sub            eax, edi
            sbb            al, 0

            xor            eax, esi
            and            eax, edi
            xor            eax, edi
            jne            l_found

            add            ebx, 4
            sub            ecx, 4
            jnz            l_loop

      l_not_found:
            xor            ebx, ebx
            jmp            l_ret

      l_found:
            ror            eax, 9
            jc            l_add_0
            ror            eax, 8
            jc            l_add_1
            ror            eax, 8
            jc            l_add_2

            inc            ebx
      l_add_2:
            inc            ebx
      l_add_1:
            inc            ebx
      l_add_0:

      l_ret:
            mov            eax, ebx
      }
}


S.
0
 
LVL 22

Expert Comment

by:grg99
ID: 10958872
LIttle Historical Note:

    The guys are Microsoft are NOT to be given credit for this bit of clever code!

This kind of stuff has been done on long word-length machines since at least 1966.

The Control Data 6600 operating systems ( 60-bit words ) used code like this to find and replace characters since around that time frame.  particularly clever were routines like SFN (Space Fill Name) which would space fill a filename out to 10 characters, all with just logical operations, done 60-bits at a time.

0
 
LVL 14

Accepted Solution

by:
wayside earned 250 total points
ID: 10958968
// if looking or 0x43
unsigned int mask = 0x43434343;

unsigned int v = ...; // 32-bit word to check

v ^= mask; // will set any byte containing 0x43 to 0x00

// true if any of the bytes are 0
bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

This works for 8 bit bytes.

I got this from a web page called "Bit Twiddling Hacks", it's very entertaining: http://graphics.stanford.edu/~seander/bithacks.html




0
 
LVL 4

Expert Comment

by:skypalae
ID: 10958997
most good algorithms were allready invented, we are left here to copy them *sigh*
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10968925
Actually I think wayside's algorithim is pretty damn close from an implementation standpoint. The memchr algo might be slightly faster due to their use of -1 (which will fold nicely into a 16 bit signed integer in the trace cache). The one above on the the hand will require a second entry most likely in the trace cache to store 0x7F7F7F7F. Still it doesn't have the odd property of occasionally mistakenly identifying a match where there isn't one like memchr. This makes analysis of worst case scenarios hard.

Both require one move, one add, and three bitwise ops, but Microsoft's requires two conditional jumps for a memichr implementation, while the other requires only one.

'This kind of stuff has been done on long word-length machines since at least 1966.'

We won't ask how you know about programming on computers in 1966 ;)

Sorry for the late response, my ISP has been up and down in the past two days, and I had to ask a mod to delete my previous response.

Thanks for your help, this was quite a bit of clever code!

-Sandra

0

Featured Post

Find Ransomware Secrets With All-Source Analysis

Ransomware has become a major concern for organizations; its prevalence has grown due to past successes achieved by threat actors. While each ransomware variant is different, we’ve seen some common tactics and trends used among the authors of the malware.

Join & Write a Comment

Templates For Beginners Or How To Encourage The Compiler To Work For You Introduction This tutorial is targeted at the reader who is, perhaps, familiar with the basics of C++ but would prefer a little slower introduction to the more ad…
Introduction This article is a continuation of the C/C++ Visual Studio Express debugger series. Part 1 provided a quick start guide in using the debugger. Part 2 focused on additional topics in breakpoints. As your assignments become a little more …
The viewer will learn how to user default arguments when defining functions. This method of defining functions will be contrasted with the non-default-argument of defining functions.
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

24 Experts available now in Live!

Get 1:1 Help Now