Solved

Help optimising this routine

Posted on 2004-04-05
30
579 Views
Last Modified: 2008-01-16
memmem function, finds first instance (case-sensitive) of needle in haystack.

Not very good at assembler, so you will probably see lot's of ways. Try to keep it x86 compatible. If anyone can think of a way to get this running at 15% faster (if it's possible, that's the speed of strstr) they will get tripple points!

Hoping to learn from seeing the experts in action, don't hesitate to post snippets that may not be faster, different is good to learn from also.

The code:
__declspec(naked) void * __cdecl memmem(const void * buf, size_type count, const void * needle, size_type needle_len)
{
      __asm
      {
            ;//Prologue, preserve registers
            push      ebp
            push    ebx  
            push    esi  
            push    edi

            mov            edi, [esp+14h]                  ;// Put buffer in destination register
            mov            ecx, [esp+18h]                  ;// Put count in ecx register
            mov            esi, [esp+1Ch]                  ;// Put needle in source register
            mov            ebx, [esp+20h]                  ;// Put needle length in ebx register            

            test      ebx, ebx                        ;// Is needle empty?
            jz            short found                        ;// Return buffer

            xor            edx, edx                        ;//      Clear edx, because we may use the whole thing later (in to_memchr)

            mov            dl, [esi]                        ;// Put first byte of needle into dl (we know it has atleast 1 byte

            sub            ebx, 2                              ;// Subtract two from needle length (since we have two bytes of it in registers already)
            jb            short to_memchr                  ;// Needle length is one byte, call memchr instead

            inc            esi
            mov            dh, [esi]                        ;// Put second byte in dh
            inc            esi

            jmp            short main_loop                  ;// Needle is greater than or equal to 2 bytes

compare_cleanup:
            pop            esi                                    ;// Restore needle
            pop            edi                                    ;// Restore buffer

main_loop:
            cmp            [edi], dl                        ;// Compare first byte of needle with byte of buffer
            je            short get_second_byte      ;// Match, check the rest of needle against the rest of buffer

            dec            ecx                                    ;// Decrement counter (doesn't set carry flag)
            jz            short not_found                  ;// Counter went to zero, needle wasn't found (atleast 2 bytes)

            inc            edi                                    ;// Increment buffer pointer by one byte

            jmp            short main_loop                  ;// Back to main loop

get_second_byte:
            dec            ecx                                    ;// Decrement counter (doesn't set carry flag)
            jz            short not_found                  ;// Counter went to zero, needle wasn't found (needle is atleast length 2)

            inc            edi                                    ;// Increment buffer pointer
            cmp            [edi], dh                        ;// Check if second byte matches
            je            short compare_init            ;// Match, compare rest of needle against buffer

            jmp            short main_loop                  ;// We've fetched the next byte, decremented the counter, just continue on as normal, comparing it with the start of the pattern

compare_init:
            cmp            ecx,ebx                              ;// ebx must be greater than or equal to length of needle - 2
            jb            short not_found                        ;// If count is < needle length, needle cannot be in the buffer

            mov            eax, ebx                        ;// Take a new copy of needle length for use as a counter

            push      edi                                    ;// Preserve buffer pointer
            push      esi                                    ;// Preserve needle

            inc            edi                                    ;// edi will be restored, but this is to ensure we don't recheck this byte

compare_loop:
            test      eax, eax                        ;// Zero, we're compared this byte already, we've found it!
            jz            short found

            cmpsb                                          ;// Compare next byte in each
            jne            short compare_cleanup      ;// No match, keep looking

            dec            eax                                    ;// Decrement counter (checked on next iteration for zero)

            jmp            short compare_loop            ;// So far, so good, keep checking the needle against the buffer

to_memchr:
            push      ecx                                    ;// Count
            push      edx                                    ;// Push needle (must be passed as an int)
            push      edi                                    ;// Buffer
            call      memchr                              ;// Call memchr(buffer,value,count)
            add            esp, 0Ch                        ;// Cleanup stack (__cdecl)
            jmp            short finished

not_found:
            xor            eax, eax                        ;// Return zero
            jmp            short finished

found:
            pop            esi                                    ;// Restore needle
            pop            edi                                    ;// Restore buffer

            dec            edi                                    ;// We're scrolled one ahead (by compare second byte)

            mov            eax, edi                        ;// Return current location in buffer

finished:
            ;// Epilogue
            pop     edi  
            pop     esi  
            pop     ebx  
            pop     ebp  
            ret
      }
}

(might want to copy and paste into dev env, tabs got messed up a little)

Cheers,
-Sandra
0
Comment
Question by:Sandra-24
  • 13
  • 8
  • 3
  • +4
30 Comments
 
LVL 6

Assisted Solution

by:joghurt
joghurt earned 330 total points
ID: 10755541
Instead of
          xor          edx, edx                    ;//     Clear edx, because we may use the whole thing later (in to_memchr)

          mov          dl, [esi]                    ;// Put first byte of needle into dl (we know it has atleast 1 byte

          sub          ebx, 2                         ;// Subtract two from needle length (since we have two bytes of it in registers already)
          jb          short to_memchr               ;// Needle length is one byte, call memchr instead

          inc          esi
          mov          dh, [esi]                    ;// Put second byte in dh
          inc          esi

try
; // No xor-ing is required.
    movzx edx, word ptr [esi]  
    add esi, 2
    sub ebx, 2
    jb to_memchr                  ; // The compiler
instead. It has two advantages: 1. You'll have less instructions, and they will take less clock cycles (because they remain simple instructions). 2. You won't get a penalty for reading from non-4-aligned addresses and for dealing with 8-byte data.

Using the same idea, you can do the same for the data at [edi].

Rethink register usage also, the push-pop pairs can be removed. (For example you can use ebp since it's saved across your function.)

Btw, if this algorithm should do that I think it should, it has a few bugs. Anyway, it's clearly written more complicated than it should be.
0
 
LVL 8

Expert Comment

by:manish_regmi
ID: 10755874
hi,
 You can also use the optimising feature of  compiler.
-O1 -O2 -O3 in gcc

your compiler might also have similar switch. I think VC++ has it in options in its ide.
If i remember correctly it is /O1 /O2 switch in vc++

regards manish
0
 
LVL 8

Expert Comment

by:manish_regmi
ID: 10755917
hi,
 See the source code in code project. They are really good and optimised.
http://www.codeproject.com/string/
http://www.codeproject.com/string/xmemstring.asp

regards manish
0
 
LVL 22

Expert Comment

by:grg99
ID: 10756645

replace the first loop that looks for the first character match with a rep scansb


On all x86's a rep scansb beats an explicit loop (although not by much on the later models)

0
 
LVL 6

Expert Comment

by:joghurt
ID: 10758580
grg99, if you meant "scasb" (without 'n'), it is beaten by an explicit loop on newer models because the atomic operations (movzx, inc, dec) can be paired very nicely.
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 10758808
I suggest an algorithmic change if you are really looking for >10% improvement. I cant remember what the method is called but there's a technique that builds a table of skip lengths for each possible last-character.

E.G.

search for "needle"

table is full of 6 except for:

'n' = 5
'e' = 0
'd' = 2
'l' = 1

The algorithm is to compare the LAST character of the needle against the x + n character of the buffer (n = length of search string). Actually, just lookup the x+n character in table.

E.G.:

Look for "needle" as above in "haystack haystack need needle heystack".

Start at x = 0.
"haystack haystack need needle heystack"
"needle"
Lookup 'a'. Contains 6 so n += 6.

"haystack haystack need needle heystack"
"      needle"
Lookup 'y'. Contains 6 so n += 6.

"haystack haystack need needle heystack"
"            needle"
Lookup ' '. Contains 6 so n += 6.

"haystack haystack need needle heystack"
"                       needle"
Lookup 'e'. Contains 0 so check for match.

This is REALLY fast. We've done 4 table lookups where the normal grunt approach would have hit 33 comparisons by now!

There are well documented variations for case insensitive versions.




0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10759525
jogurt:

 movzx edx, word ptr [esi]  
 add esi, 2
 sub ebx, 2
 jb to_memchr

won't this crash if esi is only one byte long? That's reading past the end of a buffer by one byte.

perhaps:

sub  ebx, 2
jb to_memchr    ; why do we leave off short here? They add it in microsofts asm files (but then they also use add 1 instead of inc)

;this is gonna be atleast 2 bytes now, it's safe
movzx edx, word ptr [esi]
add esi, 2

This is only ever executed once, so I don't think we'll even notice the effect of optimizing it. It is the loops that will really show.

However, you gave me a very good idea. If needle length is < 4 && > 1 I can do the comparisons with words. If it is > 4 I can use dwords! I think that will run much faster, I'm going to try that!

Also I will make use of EBP but it saves me only a stack push/pop. (in exchange for two mov commands) I really need 2 additional registers to eliminate the stack usage in that compare loop. Or I need to change the algorithim.

manish_regmi:
I checked out those functions at code project before I started doing this. They are written in C and run over 200% slower. Also VC++ says that it doesn't optimize assembly code, although it may possibly have an option that will do that, It would have to be smarter than I give it credit for.

PaulCaswell:
Yes, an algorithim improvement will speed things up, but there is a catch. The algorithim you are thinking of is boyer-moore. It creates two tables used for skips and is the algorithim all string searching worth anything is based upon. However, here's the catch. It is much faster, but only for large haystacks and large needles. Wtih small needles, the skips are very few. With small haystacks, the overhead of creating the skip tables is very large. With A C implementation I find it is faster than strstr only for haystacks over 4000 chars, and needles above about 8 chars. Not something that could be used by itself. Still for case insensitive searching, a modification of the boyer-moore gets much closer (since the character comparisons are doubled, and it skips much of these). The break even point is around 150 for haystack length, and around 4 for needle length. A version of the algorithim coded into asm would possibly be much faster.

Thanks,

I'm going to make some of the improvements suggested, and bring back the next version for your review.

-Sandra





0
 
LVL 6

Expert Comment

by:joghurt
ID: 10760202
What I've written about "Using the same idea, you can do the same for the data at [edi]." is exactly applies to the main loop that is executed most of the time.

And yes, if you check 2 or 4 bytes at the same time, you have to take care about the remaining 1 or 1 to 3 bytes so not to overrun the buffer.

Regarding PaulCaswell's idea, yes again, it's the algoritm that can be optimized most. Your algorithm is an O(N*M) algorithm, meaning that you'll need to do a multiply of N*M [N and M are the length of the two buffers] comparisons in the worst case. The best pattern-search algorithm I know is the Knurr-Morris-Pratt algorithm, that solves this problem in O(N+M) steps that is much less. You can also check the Rabin-Karp algorithm in the literature.
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10760294
The knurr-morris-pratt algorithim is faster, but still not as fast as boyer-moore in most cases (was one of the first improvements in string searching algorithims though). Rabin-Karp is a hashing algorithim that is generally regarded as easier to implement than boyer-moore, but still slower. 64bit systems may change that however with the ability to compare twice as much data in a similar amount of time. I think knurr-morris-pratt will have less inital overhead (only one skip table), and maybe be faster than boyer-moore for smaller strings/needles but I think it's too much effort to implement 3 different algorithims :) I may do that eventually, but for now I'll be happy with the brute force algo and the boyer-moore for select circumstances.

I do have a question about alignment though. I know that memchr uses dword comparisons, but it also has a str_misaligned loop. I'm not sure what that's for, or what alignment really is. Could someone explain this to me?

Thanks,
-Sandra
0
 
LVL 6

Expert Comment

by:joghurt
ID: 10760645
memchr, memcpy, strchr, etc. al., all have three parts:

1. Copy/search 1..3 bytes if the pointer isn't aligned to 4.
2. Copy/search dwords
3. Copy/search 1..3 bytes if the buffer's end isn't divisible by 4.

But: As there are two buffers for memcpy and strcpy, both pointers may be misaligned so you can align only one of them.

OFF
Gee, you seem to know/read a lot about algorithms. ;-) It's always a pleasure to find people here who don't want to have us solve their homeworks so that they wouldn't need to learn it.
/OFF
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10761020
How do I know if a pointer is aligned to 4? Address evenly divisible by 4? I don't fully understand this alignment stuff. Here's my current conception of it, and then maybe you will see hwo to set me straight.

Thinking of memory as more of a collection of dwords than of bytes, if you have a byte array it is possibke to have it aligned like so:

| = dword boundary
x = byte

|x x x x|x x x x|x x x x|

|x s t r|i n g x |x x x x|

In which case the starting address is at 1 instead of 0, and fetching a dword will actually step over the boundary and perhaps require the fetching of 2 dwords instead of 1 from memory?

Very grey area for me anyway, it's a totally new concept. I've done some searchign with google but haven't turned up much yet, I'll continue to look some more. Do you happen to know of any websites I might visit to read about this?

Yes, actually I did read a fair amount about string searching algorithms. It's a much deeper topic that I first thought. Very interesting, but one can spend a lot of time on it. To me, learning about computers is something I never tire of. Heck, last week I didn't know what a register was. The week before that I'd never known that there was such a thing as string searching algorithms.
0
 
LVL 6

Assisted Solution

by:joghurt
joghurt earned 330 total points
ID: 10761114
Yes, alignment means that the address is divisible by a number (4 here). And yes, your example is OK.

And the literature:
First of all, the Pentium family programmer's manual (from http://developer.intel.com/). However, it's too huge for right now.
Check out Agner Fog's "How to optimize for the Pentium family of microprocessors". Its chapter 3.6 deals with alignment problems.
You can get this (and other useful stuff) from http://www.agner.org/assem/
0
 
LVL 3

Assisted Solution

by:Dancie
Dancie earned 100 total points
ID: 10761228
Try this :
I can give some comments later

             push     ebp
          push    ebx  
          push    esi  
          push    edi

          mov          edi, [esp+14h]               ;// Put buffer in destination register
          mov          ecx, [esp+18h]               ;// Put count in ecx register
          mov          esi, [esp+1Ch]               ;// Put needle in source register
          mov          ebx, [esp+20h]               ;// Put needle length in ebx register          

          test     ebx, ebx                    ;// Is needle empty?
          jz          short found1                    ;// Return buffer  !!!!! found would cause a stack fault
          mov      edx, 0                    ;//     Clear edx, because we may use the whole thing later (in to_memchr)
          mov      dl, [esi]                    ;// Put first byte of needle into dl (we know it has atleast 1 byte
          sub      ebx, 2                         ;// Subtract two from needle length (since we have two bytes of it in registers already)
          jb       short to_memchr               ;// Needle length is one byte, call memchr instead
          mov      dh, [esi+1]                    ;// Put second byte in dh
          add      esi,2
main_loop:
          cmp     [edi], dl                    ;// Compare first byte of needle with byte of buffer
          je      short get_second_byte     ;// Match, check the rest of needle against the rest of buffer
          inc     edi                              ;// Increment buffer pointer by one byte
          dec     ecx                              ;// Decrement counter (doesn't set carry flag)
          jnz     main_loop               ;// Back to main loop
not_found:
          xor     eax, eax                    ;// Return zero
          pop     edi                           ;just finish
          pop     esi  
          pop     ebx  
          pop     ebp  
          ret

      ALIGN      4

get_second_byte:
          inc     edi                              ;// Increment buffer pointer
          dec     ecx                              ;// Decrement counter (doesn't set carry flag)
          jz      short not_found               ;// Counter went to zero, needle wasn't found (needle is atleast length 2)
          cmp     [edi], dh                    ;// Check if second byte matches
          jne     short main_loop               ;// We've fetched the next byte, decremented the counter, just continue on as normal, comparing it with the start of the pattern
          cmp     ecx,ebx                         ;// ebx must be greater than or equal to length of needle - 2
          jb      short not_found                    ;// If count is < needle length, needle cannot be in the buffer
          mov     eax, ebx                    ;// Take a new copy of needle length for use as a counter
          push    edi                              ;// Preserve buffer pointer
          push    esi                              ;// Preserve needle
compare_loop:
          inc     edi                              ;// edi will be restored, but this is to ensure we don't recheck this byte
          test    eax, eax                    ;// Zero, we're compared this byte already, we've found it!
          jz      short found
         mov        dh,[edi]                ;much faster than a cmpsb
        inc     esi
          dec     eax                              ;// Decrement counter (checked on next iteration for zero)
          cmp     dh,[esi-1]                           ;// Compare next byte in each
          je      short compare_loop          ;// So far, so good, keep checking the needle against the b
compare_cleanup:
          pop     esi                              ;// Restore needle
          pop     edi                              ;// Restore buffer
        jmp        short main_loop

      ALIGN      4

to_memchr:
          push     ecx                              ;// Count
          push     edx                              ;// Push needle (must be passed as an int)
          push     edi                              ;// Buffer
          call     memchr                         ;// Call memchr(buffer,value,count)
          add     esp, 0Ch                    ;// Cleanup stack (__cdecl)
          pop     edi                          ;just finish
          pop     esi  
          pop     ebx  
          pop     ebp  
          ret

      ALIGN      4

found:
          pop          esi                              ;// Restore needle
          pop          edi                              ;// Restore buffer
          dec          edi                              ;// We're scrolled one ahead (by compare second byte)
found1:
          mov     eax, edi                    ;// Return current location in buffer
          ;// Epilogue
          pop     edi  
          pop     esi  
          pop     ebx  
          pop     ebp  
          ret

This is mostly aligning code and using more effecent code . It probably can be a bit better. But give it it a try.

Alignment for data is to access data on boundaries which corrispond to the data size
Byte data is always aligned
word(or 2byte data) should be only in addresses which are a multible of 2.
dword(or 4byte data) should be only in addresses which are a multible of 4;
qword(or 8byte data) should be only in addresses which are a multible of 8;
oword(or 16byte data) should be only in addresses which are a multible of 16;
It is because of the way the processor access memory.
If these alignment rules are not met then the processor needs two memory fetches
to retrieve one unit of data.
There are many other rules of data accessing but this is a very importante one.
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10761573
jogurt, so couldn't I do this to use aligned reads?

| x s t r | n full, aligned dwords | i n g x |

check if starting address is evenly divisible by 4, in this case false

compare the first three bytes of string byte by byte against needle

proceed to fetch dwords until we reach | i n g x | at which point we do things byte by byte unless the string we're looking for is > 3 bytes, in which case we ignore the last part.

I have to leave now, but Dancie:

compare_loop:
          inc     edi                              ;// edi will be restored, but this is to ensure we don't recheck this byte
          test    eax, eax                    ;// Zero, we're compared this byte already, we've found it!
          jz      short found
        mov       dh,[edi]                ;much faster than a cmpsb
       inc     esi
          dec     eax                              ;// Decrement counter (checked on next iteration for zero)
          cmp     dh,[esi-1]                           ;// Compare next byte in each
          je      short compare_loop          ;// So far, so good, keep checking the needle against the b

this modifies dh, which doesn't seem to be preserved. dl is first byte of needle, dh is second byte. Change that and the main_loop won't work anymore.

But what I get from you is that fetching, incrementing, comparing etc is faster than cmpsb. Also you see to have made a few more changes. Don't really understand the align stuff though. Since we fetch everything one byte at a time, isn't alignment irrelevent?
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10763803
Thanks Dancie, there was a bug with jumping to found right in the beginning.

I think I've got a really optimized function now, it's almost trippled in length though, and the complexity has increased exponentially. Trade off is instead of fetching 4 bytes from memory one at a time, I fetch a whole dword (aligned) and instead of 1 to 4 comparisons with some possible stack usage and jumping, I do only 1 comparison. There is also less additional bytes fetched in the compare loop, which will make the run time faster for strings where the needle is 'almost' found frequently. I think that will more than make up for the code bloat. It's sure been a learning experience though. I'll get it all working nicely tomorrow and post it here for you guys to review again. Thanks for the help so far!

-Sandra
0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 16

Assisted Solution

by:PaulCaswell
PaulCaswell earned 50 total points
ID: 10763897
Sandra,

You seem to be learning really fast so perhaps this is a bit too late but:

>> ... |x s t r|i n g x |x x x x|

you are exactly right but remember that either part can be misaligned:

Haystack:

 |a s t r|i n g b |c d e f |

Needle:

 |a b s t | r i n g | ...

I bet you've thought of this already though.

Then you've got to remember that you are running on a little-endian machine. I.E. "string" is actually stored | i r t s | . . n g | if you pick it up as a dword.
0
 
LVL 22

Assisted Solution

by:grg99
grg99 earned 20 total points
ID: 10764862
It might be worthwhile also to check first off that the pattern is <= as long as the string.

0
 
LVL 2

Expert Comment

by:rogerperkins
ID: 10767006
Try changing this:

          jmp          short main_loop               ;// Needle is greater than or equal to 2 bytes

compare_cleanup:
          pop          esi                              ;// Restore needle
          pop          edi                              ;// Restore buffer


main_loop:
          cmp          [edi], dl                    ;// Compare first byte of needle with byte of buffer
          je          short get_second_byte     ;// Match, check the rest of needle against the rest of buffer

          dec          ecx                              ;// Decrement counter (doesn't set carry flag)
          jz          short not_found               ;// Counter went to zero, needle wasn't found (atleast 2 bytes)

          inc          edi                              ;// Increment buffer pointer by one byte

          jmp          short main_loop               ;// Back to main loop


To this:

          add         ecx,edi

          jmp          short main_loop               ;// Needle is greater than or equal to 2 bytes

compare_cleanup:
          pop          esi                              ;// Restore needle
          pop          edi                              ;// Restore buffer


main_loop:
          cmp          [edi], dl                    ;// Compare first byte of needle with byte of buffer
          je          short get_second_byte     ;// Match, check the rest of needle against the rest of buffer

          inc          edi                              ;// Increment buffer pointer by one byte

          cmp        edi,ecx
          jnz        short main_loop              ;// Back to main loop

          jmp        short not_found               ;// Counter went to zero, needle wasn't found (atleast 2 bytes)
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10767299
"Then you've got to remember that you are running on a little-endian machine. I.E. "string" is actually stored | i r t s | . . n g | if you pick it up as a dword."

I knew it seemed too simple, thanks, that will save me some debugging time. *rewrites algo*.

-Sandra
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10780755
Well it was a learning experience, but all I managed to do was make it slower. I fetched everything with dwords, then used a shr 8 to move the next byte into dl. I also did the comparisons one dword at a time where possible. I was working with aligned memory too.

There is some optimizations that could be made still, perhaps, but I don't think anything that would make it worthwhile.

If anything this made me very, very thankfull for C++ ! I never thought of C++ as a 'high-level' language, but I sure do now :) I'm going to go with the old code I posted above. 15% slower is pretty good considering it has to use two counters that strstr doesn't. There's a lot to say for simplicity.

Thanks for your help everyone, it's been a pleasure learning with you
-Sandra
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 10781560
Sandra,

It was refreshing to help someone keen to understand rather than the more common 'I dunno what I want but I want it NOW!".

Paul
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10788814
Well I'm not giving up yet. I woke up this morning and decided to give it another go. No way microsoft is going to have more efficient code than me :) I think I've got a better design this time, and I'll really be surprised if it runs slower again!

But I do have a quick question. What is faster? cmp ax, dx or cmp eax, edx. I'm just using half of each register but if I zero them first the comparisons will be the same no matter which I use. So I'm wondering if there's a performance trade off in choosing one over the other.

Just one of many little things you'd only know if you've been doing this for a while.

Thanks,
-Sandra
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10789421
Also I think that I read somewhere that sub eax, 2 and add eax, 2 is slower than two times dec eax or inc eax. I couldn't find that website again. Have you heard anything like that?
0
 
LVL 6

Assisted Solution

by:joghurt
joghurt earned 330 total points
ID: 10789606
Don't be much surprised that the Microsoft compiler can generate a faster code than you. It has all the knowledge wired in from Intel docs that you don't have in your mind. I've put once a question here about my code that *should* (theoretically) run faster than the Microsoft-generated one but runs 20% slower than that. Noone was able to point out exactly why my code is slower, where I got some penalty that makes Microsoft's solution faster. Yes, we like to understand the things behind all this (and joining to PaulCaswell I'm glad to meet someone wanting to understand things) but present CPUs are starting to be much too complex to be understood in full detail.

On 32-bit processors operations with 32-bit registers are faster than with 16-bit registers. And if you have to clear the upper 16 bits only once, before the loop, or if you use a zeroing instruction within the loop such as movzx, you can use 32-bit comparisons.

I'm afraid you should refresh the infos which instruction is slower. Yes, there were times when your statement was true but from appr. P6 they are not necessarily. You can read more on optimization at http://www.agner.org/assem/

0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10789649
Bah, another day's effort and it performs about the same lol!

I suppose that's why you should use a profiler! Still atleast it works very well. I couldn't find anymore bugs in it. See if you can spot any way to make it run faster. I'm through for the night with it.

(warning 400 lines of x86 assembler follow)

__declspec(naked) void * __cdecl memmem(const void * buf, unsigned long count, const void * needle, unsigned long needle_len)
{
      __asm
      {
            ;//Prologue, preserve registers
            push            ebp                                          ;
            push            ebx                                          ;
            push            esi                                          ;
            push            edi                                          ;

            mov         edi, [esp+14h]              ;// Put buffer in destination register
            mov         ecx, [esp+18h]              ;// Put count in ecx register
            mov         esi, [esp+1Ch]              ;// Put needle in source register
            mov         ebx, [esp+20h]              ;// Put needle length in ebx register

            test            ebx, ebx                              ;// Is needle empty?
            jz          short return_buffer         ;// Return buffer

            sub                  ecx, ebx
            jb                  short not_found                        ;// needle can't be in buffer if it is larger than buffer

            sub         ebx, 4                      ;// Subtract four from needle length, is it atleast 4 bytes?
            jb          short byte_init             ;// Needle length is 1-3 bytes, handle it with a byte loop instead

            mov                  ebp, dword ptr [esi]            ;
            add                  esi, 4                                    ;

            bswap            ebp                                          ;// Make big-endian

            mov                  eax, dword ptr [edi]            ;// Safe cause we know edi is >= esi (may require two fetches)
            add                  edi, 4                                    ;

            bswap            eax                                          ;

            sub                  ecx, 4                                    ;
            jb                  near_end                              ;// Go to a byte loop, skip alignment phase

            mov                  edx, dword ptr[edi]                  ;// Put a dword into edx (may require two fetches)
            add                  edi, 4                                    ;

            test            edi, 3                                    ;// Test if string is aligned on 32 bits
            jz                  short dword_loop_has_edx      ;// Needle is greater than or equal to 4 bytes and perfectly aligned

            ;// we've established that ebx > 4 and that ecx >= ebx, we also know that edi is misaligned

            mov                  dh, dl                                    ;// mov dl (first byte) into dh
            inc                  edi                                          ;
            dec                  ecx                                          ;

            test            edi, 3                                    ;// Test if string is aligned on 32 bits
            jz                  short pipe_quarter                  ;

            ;// do nothing, first two bytes are where we want them, dl and dh respectively
            inc                  edi                                          ;
            dec                  ecx                                          ;

            test            edi, 3                                    ;// Test if string is aligned on 32 bits
            jz                  short pipe_half_no_shift      ;

            ;// string is aligned now, could not have been more than 24 bits off a 32 bit boundary
            shl                  edx, 8                                    ;// need first byte in dh, not dl
            inc                  edi                                          ;
            dec                  ecx                                          ;            

            jmp                  short pipe_three_quarter      ;

byte_init:
            add                  ebx, 2                                    ;// we're -1, 0, or 1 now (unsigned), it is also representative of the number of characters to compare after matching a word
            jnc                  short to_memchr                        ;// if there was no carry, needle is a single byte, call memchr

            mov                  ax, word ptr [esi]                  ;// Put first byte into al, second into ah
            xchg            al, ah                                    ;// Make big-endian
            add                  esi, 2

            ;// we've ensured there's room in haystack for needle (haystack >= needle && needle >= 2
end_to_word:
            mov                  dl,      byte ptr [edi]                  ;
            inc                  edi                                          ;
            dec                  ecx                                          ;

word_loop:
            mov                  dh, dl                                    ;// dh is first byte, dl is second
            mov                  dl, byte ptr [edi]                  ;// Simulate placing another word into dx

            cmp                  ax, dx                                    ;
            je                  compare_byte                        ;

            inc                  edi                                          ;
            dec                  ecx                                          ;
            jz                  short not_found                        ;// There must be zero bytes left in edi

            jmp                  short word_loop                        ;

near_end:
            add                  ecx, 8                                    ;
            add                  ecx, ebx                              ;// now ecx is back to length of string

            sub                  edi, 4                                    ;// go back to before last dword fetch
            sub                  esi, 2                                    ;// we added 4 to esi, but now we must decrease that to 2
            add                  ebx, 2                                    ;// we subtracted 4 from ebx, now we must decrease that to 2 also

            mov                  eax, ebp                              ;// place first two bytes of needle into ax
            shr                  eax, 16                                    ;

            ;// go into a word loop (optimized byte loop)
            jmp                  short end_to_word                  ;

dword_loop:
            ;// This has to work with aligned memory
            sub                  ecx, 4                                    ;
            jb                  near_end                              ;// Go to a byte loop

            mov                  edx, dword ptr[edi]                  ;// Put a dword into edx (may require two fetches)
            add                  edi, 4                                    ;

dword_loop_has_edx:
            cmp                  eax, ebp                              ;
            je                  compare4                              ;

pipe_full:
            shl                  eax, 8                                    ;
            mov                  al, dl                                    ;// Place first byte from pipe into eax

            cmp                  eax, ebp                              ;
            je                  compare3                              ;// found at edi-3

pipe_three_quarter:
            shl                  eax, 8                                    ;
            mov                  al, dh                                    ;// Place second byte from pipe into eax

            cmp                  eax, ebp                              ;
            je                  compare2                              ;// found at edi-2

pipe_half:
            ;// Move next word up
            shr                  edx, 16                                    ;

pipe_half_no_shift:
            shl                  eax, 8                                    ;
            mov                  al, dl                                    ;// Place third byte from pipe into eax

            cmp                  eax, ebp                              ;
            je                  compare1                              ;// found at edi-1

pipe_quarter:
            shl                  eax, 8                                    ;
            mov                  al, dh                                    ;// Place fourth byte from pipe into eax

            jmp                  short dword_loop                  ;

;// Dword Loop Comparer #1
compare_done1:
            pop                  edi                                          ;
            pop                  esi                                          ;
            pop                  eax                                          ;
            pop                  ebx                                          ;

            jmp                  short pipe_quarter                  ;// go back to our search

compare_found1:
            pop                  edi                                          ;// pop buffer
            add                  esp, 0Ch                              ;// restore stack
found_minus5:
            lea                  eax, [edi-5]                        ;// found at -(1+4)
            jmp                  short finished                        ;

compare1:            
            test            ebx, ebx                              ;
            jz                  found_minus5                        ;// found it already                  

            push            ebx                                          ;// backup counter
            push            eax                                          ;
            push            esi                                          ;
            push            edi                                          ;

            sub                  edi, 1                                    ;// we're 1 ahead

compare_loop1:
            ;// edi points at the next character to compare, so does esi
            mov                  al, [esi]                              ;

            cmp                  al, [edi]                              ;
            jne                  short compare_done1                  ;// not found, go back to word_loop

            inc                  esi                                          ;
            inc                  edi                                          ;
            dec                  ebx                                          ;
            jz                  short compare_found1            ;// found it

            jmp                  short compare_loop1                  ;

;// Dword Loop Comparer #2
compare_done2:
            pop                  edi                                          ;
            pop                  esi                                          ;
            pop                  eax                                          ;
            pop                  ebx                                          ;

            jmp                  short pipe_half                        ;// go back to our search

compare_found2:
            pop                  edi                                          ;// pop buffer
            add                  esp, 0Ch                              ;// restore stack
found_minus6:
            lea                  eax, [edi-6]                        ;// found at -(2+4)
            jmp                  short finished                        ;

compare2:            
            test            ebx, ebx                              ;
            jz                  found_minus6                        ;// found it already                  

            push            ebx                                          ;// backup counter
            push            eax                                          ;
            push            esi                                          ;
            push            edi                                          ;

            sub                  edi, 2                                    ;// we're 2 ahead

compare_loop2:
            ;// edi points at the next character to compare, so does esi
            mov                  al, [esi]                              ;

            cmp                  al, [edi]                              ;
            jne                  short compare_done2                  ;// not found, go back to word_loop

            inc                  esi                                          ;
            inc                  edi                                          ;
            dec                  ebx                                          ;
            jz                  short compare_found2            ;// found it

            jmp                  short compare_loop2                  ;

            ;// Dword Loop Comparer #3
compare_done3:
            pop                  edi                                          ;
            pop                  esi                                          ;
            pop                  eax                                          ;
            pop                  ebx                                          ;

            jmp                  short pipe_three_quarter      ;// go back to our search

compare_found3:
            pop                  edi                                          ;// pop buffer
            add                  esp, 0Ch                              ;// restore stack
found_minus7:
            lea                  eax, [edi-7]                        ;// found at -(3+4)
            jmp                  short finished                        ;

compare3:            
            test            ebx, ebx                              ;
            jz                  found_minus7                        ;// found it already                  

            push            ebx                                          ;// backup counter
            push            eax                                          ;
            push            esi                                          ;
            push            edi                                          ;

            sub                  edi, 3                                    ;// we're 3 ahead

compare_loop3:
            ;// edi points at the next character to compare, so does esi
            mov                  al, [esi]                              ;

            cmp                  al, [edi]                              ;
            jne                  short compare_done3                  ;// not found, go back to word_loop

            inc                  esi                                          ;
            inc                  edi                                          ;
            dec                  ebx                                          ;
            jz                  short compare_found3            ;// found it

            jmp                  short compare_loop3                  ;

            ;// Dword Loop Comparer #4
compare_done4:
            pop                  edi                                          ;
            pop                  esi                                          ;
            pop                  eax                                          ;
            pop                  ebx                                          ;

            jmp                  short pipe_full                        ;// go back to our search

compare_found4:
            pop                  edi                                          ;// pop buffer
            add                  esp, 0Ch                              ;// restore stack
found_minus8:
            lea                  eax, [edi-8]                        ;// found at -(4+4)
            jmp                  short finished                        ;

compare4:            
            test            ebx, ebx                              ;
            jz                  found_minus8                        ;// found it already                  

            push            ebx                                          ;// backup counter
            push            eax                                          ;
            push            esi                                          ;
            push            edi                                          ;

            sub                  edi, 4                                    ;// we're 4 ahead

compare_loop4:
            ;// edi points at the next character to compare, so does esi
            mov                  al, [esi]                              ;

            cmp                  al, [edi]                              ;
            jne                  short compare_done4                  ;// not found, go back to word_loop

            inc                  esi                                          ;
            inc                  edi                                          ;
            dec                  ebx                                          ;
            jz                  short compare_found4            ;// found it

            jmp                  short compare_loop4                  ;


;// Word Loop Comparer
compare_done:
            pop                  edi                                          ;
            pop                  esi                                          ;
            pop                  eax                                          ;

            jmp                  short word_loop                        ;// go back to our search

compare_found:
            pop                  edi                                          ;// pop buffer (we found it at -2)
            add                  esp, 8                                    ;// restore stack
found_word:
            lea                  eax, [edi-1]                        ;// found at -1 from current position of edi
            jmp                  short finished                        ;

compare_byte:            
            test            ebx, ebx                              ;
            jz                  found_word                              ;// found it already                  

            ;// we don't use ebp from word_loop
            mov                  ebp, ebx                              ;// our counter
            push            eax                                          ;
            push            esi                                          ;
            push            edi                                          ;

            inc                  edi                                          ;

compare_loop:
            ;// edi points at the next character to compare, so does esi
            mov                  al, [esi]                              ;

            cmp                  al, [edi]                              ;
            jne                  short compare_done                  ;// not found, go back to word_loop

            inc                  esi                                          ;
            inc                  edi                                          ;
            dec                  ebp                                          ;
            jz                  short compare_found                  ;// found it

            jmp                  short compare_loop                  ;

to_memchr:
            movzx            eax, byte ptr [esi]                  ;// Zero extend needle (one byte) into eax
            push            ecx                                          ;// Count
            push            eax                                          ;// Push needle (must be passed as an int)
            push            edi                                          ;// Buffer
            call            memchr                                    ;// Call memchr(buffer,value,count)
            add                  esp, 0Ch                              ;// Cleanup stack (__cdecl)
            jmp                  short finished                        ;

not_found:
            xor                  eax, eax                              ;// Return zero
            jmp                  short finished                        ;

return_buffer:
            mov                  eax, edi                              ;// Return current location of needle in buffer

finished:
            ;// Epilogue
            pop                  edi                                          ;
            pop                  esi                                          ;
            pop                  ebx                                          ;
            pop                  ebp                                          ;
            ret                                                            ;
      }
}
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10789719
Wow, whoever made that pdf went through an awful lot of work! Thanks, there's a lot of good material in there.
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10792882
I moved to the 32 bit register comparisons in word loop, and now 2 and 3 length needles (what word loop is mostly used for) are much faster. Where it used to be about twice as slow as strstr for those, it's now 20% slower, like it is for needle lengths of >= 4 (dword loop). I think there's a lot of stuff that can be optimised still, I'm sure this code can run faster than Microsoft's. And that's not the compiler btw. For low-level memory routines like strlen, strstr, memchr, or memmem like this one the compiler generates very slow code (more than twice as slow for an optmized version of strstr) That's why microsoft hand coded all of those string routines into assembler. If you have visual studio go into the visual c++ directory / crt / src / intel (off the top of my head) and you'll see all of the string/memory routines there.

For higher level things I think that site you linked me to has good advice, code it in C++ and let the compiler deal with it, except for the innermost loop or most intensive part of the code. That's where any speed increases will be realized anyway.

-Sandra
0
 
LVL 6

Expert Comment

by:joghurt
ID: 10793939
A final note. If you want it to be faster than Microsoft's, you may need to use VTune. A 7-day trial can be downloaded from http://developer.intel.com/
It tells you which instructions take the most time and VTune can even tell you what's the problem.
0
 
LVL 3

Author Comment

by:Sandra-24
ID: 10795353
well, I'll settle for 20% slower for now, that's pretty good. Considering it is more complex than microsofts. And it gains a big boost in case insensitive comparisons, microsoft doesn't have a stristr and mine is much faster than creating a totally new lower case copy of the buffer to search. And of course my function is not limited to strings, it will work for binary data or strings without a null-terminator.

Now I just have to split up the points. Thanks a lot for helping me learn folks.
0
 
LVL 6

Accepted Solution

by:
joghurt earned 330 total points
ID: 10796186
As a sidenote: If you want to beat Microsoft's version (and not only by 10-20% but by a magnitude), search for the following document [I don't remember the URL]: gdc_2002_amd.pdf "Optimizing memory bandwidth - Don't settle for just a byte or two. Grab a whole fistful of cache". It's a superb article about prefetching data to the cache. Their version is three times faster than the "naive" version of memcpy.
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Suggested Solutions

A safe way to clean winsxs folder from your windows server 2008 R2 editions
Is your Office 365 signature not working the way you want it to? Are signature updates taking up too much of your time? Let's run through the most common problems that an IT administrator can encounter when dealing with Office 365 email signatures.
Here's a very brief overview of the methods PRTG Network Monitor (https://www.paessler.com/prtg) offers for monitoring bandwidth, to help you decide which methods you´d like to investigate in more detail.  The methods are covered in more detail in o…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now