Link to home
Start Free TrialLog in
Avatar of DanRollins
DanRollinsFlag for United States of America

asked on

How fast is your ASM Base64 Encoder?

I was just reading over an old Q here from several years ago...
      How fast is your ASM hex converter?
      http:/Assembly/Q_20272901.html
which was more of a fun challenge than a real question.  I'm not sure how many ASM Experts are participating in this TA these days, but I thought I'd throw out the gauntlet again.  

The challenge is to write -- in 80x86 ASM -- an ultra-fast Base64 Encoder.

A Base64 Encoder converts binary data into ASCII codes that can be transmitted/processed by systems that do not normally process raw binary... for instance, in email attachments and in HTTP headers and as values in HTTP URL.  The encoder outputs only the following characters:
       ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
plus, the output may end in zero, one, or two equal signs (=)

To encode the raw binary, you access it three bytes at a time, taking the data in groups of six  bits.  For instance:
Hex:      0x07 0xFF 0x02
Binary:  0000 0111 1111 1111 0000 0010 (groups of 4 bits)
Binary:  000001 111111 111100 000010   (groups of 6 bits)
            \____/  \____/  \____/  \____/
decimal      1        63        60        2    
lookup       B         /          8         C

The length of the output must always be an even multiple of four, because when you DEcode the data, you will be taking groups of four base64 "digits" and turning them back into three binary bytes of output.  Therefore, when ENcoding, if your input data is not an even multiple of three, you will need to pad the end of the output with one or two equal signs (=)

If you need more info, then (as always) RTFR! (read the friggin RFC :-) http://www.ietf.org/rfc/rfc3548.txt 

Also, to verify your output, you can use an Encoder at this URL:
   http://www.opinionatedgeek.com/dotnet/tools/Base64Encode/Default.aspx

=-=-=-=-=-=-=-=-=-=-=
I'll be speed-testing the submissions using code like this:

//---------------------------------------- (assume dest buf is large enough)
void ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )  // {{modified 9/13/2006 --DR, PE}}
{
       _asm{
        // your code here!
        }      
}
So if you write your code in a format that can be pasted into an _asm block and compiled by Visual Studio 6, it will make it a bit easier.

If there is any interest in this, I'll post a C-language version for use as comparison (there is one here: http://research.microsoft.com/invisible/src/crypto/cyphers/base64/base64.c.htm -- though it does not look particularly efficient) and I'll also time the ATL functions Base64Encode for reference (my guess is that it is pretty fast... unbeatable?).  The CryptoAPI also provides a function, CryptBinaryToString, that will do the work and it might be interesting to see how fast it runs.  If this works out, I'll start a second challenge for Base64 ASM Decoder.

Are you game?  500 points and your reputation as an ASM programmer are on the line :-)

-- Dan
Avatar of Craig Wardman
Craig Wardman
Flag of United Kingdom of Great Britain and Northern Ireland image

how do we know the # of bytes pointed to by pSrc?
ok, well based on having the length passed in as a parameter I have done the following code:

-------------------------------------------------
void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
      char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

       __asm{
            //      pushad                                          (removed for speed - uncomment if necessary)
                  xor ecx, ecx                              //byte idx

                  
src_byteLoop:
                  xor edx, edx

                  mov esi, pSrc                              //bytes from source
                  add esi, ecx
                  xor eax, eax                              

                  //read in 3 bytes seperately (to allow for little endian layout)
                  mov ah, byte ptr[esi]
                  mov al, byte ptr[esi+1]
                  shl eax, 16
                  mov ah, byte ptr[esi+2]
                  
padded_byteLoop:

                  mov ebx, 4                                    //4 sets (of 6 bits) 24bits (3bytes)
                  
next_bitset:
                  xor edx, edx
                  mov edi, 6                                    //6bit groups
bitLoop:
                  
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  dec edi                                          //bit--
                  jnz bitLoop                        

                  shr edx, 1                                    //div 2 (final bit should have sig 1)
                  
                  mov esi, chr_table
                  add esi, edx                              //set the pointer to correct char
                  mov al, byte ptr [esi]                  //al is spare, use it!
                  mov edi, pszOutBuf
                  mov byte ptr [edi], al                  //put char in buffer
                  add pszOutBuf,1                              //next buf
                  
                  dec ebx                                          //set--
                  jnz next_bitset

                  add ecx,3
                  mov esi, len
                  sub esi, ecx                              //len - done = remaining
                  jle finished
                  
                  sub esi,3
                  jge src_byteLoop                        //still got some groups of 3 bytes
                  
                  //need to pad out some extra bytes
                  xor edx, edx

                  mov edi, pSrc                              //bytes from source
                  add edi, ecx
                  xor eax, eax                              

                  //read in 3 bytes regardless of junk data following pSrc
                  mov ah, byte ptr[edi]
                  mov al, byte ptr[edi+1]
                  shl eax, 16
                  mov ah, byte ptr[edi+2]

                  //as per the RFC, any padded bytes should be 0s
                  mov ebx, 0xFFFFFF
                  lea esi, dword ptr[esi*8+8]
                  xchg esi, ecx
                  shl ebx, cl
                  xchg esi, ecx
                  and eax, ebx
                  jmp padded_ByteLoop

finished:
                  neg esi
                  jz end
                  //some bytes were padding, put them as =
padChars:
                  mov edi, pszOutBuf
                  sub edi, esi
                  mov byte ptr[edi], 0x3d
                  dec esi
                  jnz padChars


end:
                  
            //      popad                              (removed for speed - uncomment if necessary)
        }      
}


----------------

some notes:

1) I could unroll the predetermined loops for faster performance if necessary - I have left them in for now for readability and to see how quickly it runs before unrolling

2) If this was in pure asm and not in an __asm block then it would be quicker because there would be no need to double dereference the variables in order to get at their data..

3) I am a novice asm programmer so the above was written to just 'do' what it should, I have given some thought to speed (such as not using the stack, using registers instead of memory locations and avoiding length mnemonics such as div), but there may be better ways of achieving what my code does.. please let me know I am always eager to learn!!
for the records, is the version with loops unrolled, which after a small amount of testing seems considerably quicker..

-----------------------------

void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
      char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

       __asm{

                  xor ecx, ecx                              //byte idx

                  
src_byteLoop:
                  xor edx, edx

                  mov esi, pSrc                              //bytes from source
                  add esi, ecx
                  xor eax, eax                              

                  //read in 3 bytes seperately (to allow for little endian layout)
                  mov ah, byte ptr[esi]
                  mov al, byte ptr[esi+1]
                  shl eax, 16
                  mov ah, byte ptr[esi+2]
                  
padded_byteLoop:

                  //4 sets (of 6 bits) 24bits (3bytes)
      //set1
                  
                  xor edx, edx
                  //6bit groups

                  //1
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //2
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //3
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //4
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //5
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //6
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  shr edx, 1                                    //div 2 (final bit should have sig 1)
                  
                  mov esi, chr_table
                  add esi, edx                              //set the pointer to correct char
                  mov al, byte ptr [esi]                  //al is spare, use it!
                  mov edi, pszOutBuf
                  mov byte ptr [edi], al                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
      //set2

                  xor edx, edx

                  //1
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //2
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //3
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //4
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //5
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //6
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  shr edx, 1                                    //div 2 (final bit should have sig 1)
                  
                  mov esi, chr_table
                  add esi, edx                              //set the pointer to correct char
                  mov al, byte ptr [esi]                  //al is spare, use it!
                  mov edi, pszOutBuf
                  mov byte ptr [edi], al                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      //set3                  

                  xor edx, edx

                  //1
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //2
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance
                  
                  //3
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //4
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //5
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //6
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  shr edx, 1                                    //div 2 (final bit should have sig 1)
                  
                  mov esi, chr_table
                  add esi, edx                              //set the pointer to correct char
                  mov al, byte ptr [esi]                  //al is spare, use it!
                  mov edi, pszOutBuf
                  mov byte ptr [edi], al                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
      //set4

                  xor edx, edx

                  //1
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //2
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //3
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //4
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //5
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  //6
                  shl eax, 1                                    //get MSB
                  adc edx,0                                    //add 1 if carry
                  shl edx, 1                                    //keep significance

                  shr edx, 1                                    //div 2 (final bit should have sig 1)
                  
                  mov esi, chr_table
                  add esi, edx                              //set the pointer to correct char
                  mov al, byte ptr [esi]                  //al is spare, use it!
                  mov edi, pszOutBuf
                  mov byte ptr [edi], al                  //put char in buffer
                  add pszOutBuf,1                              //next buf

                  add ecx,3
                  mov esi, len
                  sub esi, ecx                              //len - done = remaining
                  jle finished
                  
                  sub esi,3
                  jge src_byteLoop                        //still got some groups of 3 bytes
                  
                  //need to pad out some extra bytes
                  xor edx, edx

                  mov edi, pSrc                              //bytes from source
                  add edi, ecx
                  xor eax, eax                              

                  //read in 3 bytes regardless of junk data following pSrc
                  mov ah, byte ptr[edi]
                  mov al, byte ptr[edi+1]
                  shl eax, 16
                  mov ah, byte ptr[edi+2]

                  //as per the RFC, any padded bytes should be 0s
                  mov ebx, 0xFFFFFF
                  lea esi, dword ptr[esi*8+8]
                  xchg esi, ecx
                  shl ebx, cl
                  xchg esi, ecx
                  and eax, ebx
                  jmp padded_ByteLoop

finished:
                  neg esi
                  jz end
                  //some bytes were padding, put them as =
padChars:
                  mov edi, pszOutBuf
                  sub edi, esi
                  mov byte ptr[edi], 0x3d
                  dec esi
                  jnz padChars


end:
                  
        }      
}

------------------------

now ill stop flooding the thread :)
Avatar of DanRollins

ASKER

Hi craigwardman!
Thanks for popping in.   You are right!  The source length must be an input parameter.  Using my PageEditor tools, I've modified the question text to reflect that (and fixed my link to the RFC).

>>...please let me know I am always eager to learn!!

I hope that this can be a learning experience for everybody involved :-)

I'll be looking at your code soon.  First I want to show my first shots at a C-language version... I've found that coding it up in C is often a good first step... to make sure I understand the algorithm.

My first shot at an inner loop looked like this:

for ( int j=0; j< nSrcLen; j+=3 ) {
      d = *pSrc++; d <<= 8;         // load three bytes ... in the right order for processing
      d|= *pSrc++; d <<= 8;
      d|= *pSrc++;

      BYTE b1= BYTE((d & 0x00FC0000) >> 18);
      BYTE b2= BYTE((d & 0x0003F000) >> 12);
      BYTE b3= BYTE((d & 0x00000FC0) >> 6 );
      BYTE b4= BYTE((d & 0x0000003F) >> 0 );
      
      *p++ = abDigitLookup[ b1 ];
      *p++ = abDigitLookup[ b2 ];
      *p++ = abDigitLookup[ b3 ];
      *p++ = abDigitLookup[ b4 ];
}

I'm omitting the end-of-source buffer padding logic, but that shows the general operation.  The endian thing threw me at first... I thought I could just do a DWORD load to get the source bytes, but they end up in the wrong order for subsequent processing via bit shifts.  A DWORD load should be much faster -- I think that it should be possible to avoid the three separate byte loads in the ASM version by using XCHG and/or BSWAP.

Before optimization, that sequence of C code generates 337 bytes of machine language opcodes.
After optimization for speed or size it is 130 or 125 bytes, respectively.

In my next version, I tried a slight variation... calculating the outputs taking the data one byte at a time.  

for (int j=0; j< nSrcLen; j+=3 ) {
      d = *pSrc++;                                 // 0000 0000 aaaa aaaa
      b1= BYTE((d & 0x00000FC) >> 2);   // 0000 0000 AAAA AA00
      d <<= 8;                                        // aaaa aaaa 0000 0000
      d |= *pSrc++;                                // aaaa aaaa bbbb bbbb
      b2= BYTE((d & 0x00003F0) >> 4);    // 0000 00AA BBBB 0000
      d <<= 8;                                        // bbbb bbbb 0000 0000
      d |= *pSrc++;                                 // bbbb bbbb cccc cccc
      b3= BYTE((d & 0x0000FC0) >> 6 );   // bbbb BBBB CCcc cccc
      b4= BYTE((d & 0x000003F) >> 0 );   // bbbb bbbb ccCC CCCC

      *p++ = abDigitLookup[b1];
      *p++ = abDigitLookup[b2];
      *p++ = abDigitLookup[b3];
      *p++ = abDigitLookup[b4];
}
The "apparent" advantage would be using shorter shifts (eg, SHR 2 instead of SHR 18) but I think I'm wasting my time... the CPU can shifts 18 positiions as fast as it does one.   This version of the code actually generates more opcodes (by about 10).  I could save a few (about 5) by avoiding the temp values (b1,b2,b3,b4), using code like:

      *p++ = abDigitLookup[BYTE((d & 0x00000FC) >> 2)];

...but i HATE it when I see code like that -- the compiler basically optimizes away the temp variables, so the advantage of readability usually wins.  Of course, if a few nanoseconds (or bytes of opcode) are important, then can might be worth doing.

BTW, by comparison, craigwardman's first submission is 120 bytes -- smaller and probably faster (measurements to be taken soon) than even the most optimized of my compiled C versions.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
My end-of-src processing takes a different tack from that of craigwardman.  After I'm all done, I do:

      *p= 0;     // null terminate the output
      p--;         // back up to the output char before the end
      while ( j-- > nSrcLen ) {
            *(p--) = '=';
      }

In other words, I go ahead and process the one or two bytes of junk at the end of the source in the main loop, but at the end, go back and overwrite my extra output with the padding character.  I observed that if the nSrcLen MOD 3 is 1, then there will be two output characters that need to be overwritten and if nSrcLen MOD 3 is 2 then there will be one.  I arranged a while loop to use the main loop index (j) and nSrcLen to handle that.

Note that in both my code and craigwardman's submission, we both ASSUME that we have a legal access to up to two bytes of source beyond nSrcLen.   Assuming anything like that can be dangerous... Imagine if the caller's buffer has been allocated in such a way that reading even one byte beyond the end of the input buffer would casue an illegal access.

Query:  
Should we require that there be no access beyond the specified length of the input buffer?  I think that doing so will slow down the final code...

An alterntive would be that we pretend that we have documented a requirement that the source buffer be a multiple of 3 (though nSrcLen and be anything -- an exact count of the length of the data to encode).  The problem with that option is that it's ugly -- there's no way it would be a reasonable requirement for production code.  

I personally don't mind it if we decide to allow access beyond the end of the buffer for this speed contest, but I think the issue should be mentioned, in any case.

Thoughts?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
I'm going to post a link to this thread the the C++ TA to see if we can get some more interest.  
I'll be back with some timing/speed data as soon as I can.

-- Dan
Hi Dan,

I think for the purpose of this thread, allowing the reading beyond the pSrc is ok - as you say in a proper implementation it may need some extra code to prevent it attempting this operation, but I imagine this would only be a problem if pSrc pointed to an area in memory that aligned itself right next to a protected page..

----

Using XCHG is a good idea, i didnt realise this would automatically account for little endianness.

---


regarding the padding at the end of the stream..

>>if the nSrcLen MOD 3 is 1, then there will be two output characters that need to be overwritten

this is true, but as groups of 3 are converted into 4 chars you will also have 2 chars that are not '=' signs..

The only way to ensure these two are correct is to follow the RFC of zeroing out any incomplete 24bit triplet.. as you could end up encoding junk data that is after the src.. (unless of course you are only reading upto the end of the src).. It must have been this text that made me decide to read in junk bytes:

>>Special processing is performed if fewer than 40 bits are available
   at the end of the data being encoded.  A full encoding quantum is
   always completed at the end of a body.  When fewer than 40 input bits
   are available in an input group, zero bits are added (on the right)
   to form an integral number of 5-bit groups.  Padding at the end of
   the data is performed using the "=" character.

I suppose, on 2nd reading they expect you to 'add' the bits, not overwrite the junk..

im not sure without seeing the whole of your source, but what would happen in the following situation:

BYTE lpSrc[]={0x01, 0x02,0x03,0x04};                imagine the junk that follows this is 0xFF 0xFF

the last byte's first 6 bits will encode ok, the next 2 bits will be combined with 4 bits of the junk so if you dont zero these out then the last char will be wrong..

Anyway, better go now - need to set off out!!

I'll look into it and comment on that shortly.  

In the meantime, the initial timings are in, and your "Unrolled loops" version is fastest! (so far :-).
I've posted the timing methodolgies and the current results at:
      http:/Cplusplus/Q_21988706.html#17524209
wow, those figures should remove any doubt that unrolling loops improves performance!

10MB/sec quicker than the original :)

hmm, your 'other' C function sounds interesting.. I cant say that im a very strong reader of C unless ive written the code myself, too many long and complex statements - will you be releasing your compiler generated asm for this function?

also.. when you compile your code, are you enabling compiler optimisations? - this point may be worth mentioning in the C thread, as different compilers might optimise the code in different ways, so the generated asm may be much different to the C code and it becomes a contest of which compiler has the best optimisations..
Re...
>> regarding the padding at the end of the stream..

Your quote was from the part of the RFC that describes base32 encoding (a minor oversight, since the principle is the same).

You are correct... i need to change my end-of-source handling.   The 'junk' at the end of the source buffer can and does affect the final base64 'digit.'

My encoding of 0x01 0x02 0x03 0x04 0xFF 0xFF is
   AQIDBP==

   0x01        0x02          0x03         0x04         0xFF          0xFF
  0000 0001 0000 0010 0000 0011 0000 0100 1111 1111 1111 1111    // 4-bit groups
  000000  010000  001000  000011  000001  001111  111111  111111  // 6-bit groups
  0(A)      32(Q)    16(I)      3(D)      1(B)      15(P)      63(/)     63(/)     // decimal(base64)

...whereas, if the 'junk' values at the end are 0...

   0x01        0x02          0x03         0x04         0x00          0x00
  0000 0001 0000 0010 0000 0011 0000 0100 0000 0000 0000 0000    // 4-bit groups
  000000  010000  001000  000011  000001  000000  000000  000000  // 6-bit groups
  0(A)      32(Q)    16(I)      3(D)      1(B)      0(A)       0(A)      0(A)      // decimal(base64)

meaning the final base64 digit should be 'A'
   AQIDBP==     (mine)
   AQIDBA==     (correct per the RFC)

It's interesting that my (incorrect) encoding actually DEcodes correctly:

   AQID BP==     (mine, incorrect)
             A          Q          I            D
   AQID:  000000  010000  001000  000011   // 6-bit groups
          :  0000 0001  0000 0010  0000 0011   // 4-bit groups
          :  0x01         0x02          0x03            // fine so far...
             B           P           =         =    
   BP==: 000001  001111  ??????  ??????    // 6-bit groups
           : 0000 0100  0011 ????  ???? ????   // 4-bit groups
           : 0x04           3?             ??             // also fine, since only the 0x04 is output

That's because the DEcoder, seeing two padding chars at the end, will decode only the first six bits of the final 24 (and discard the final 18 bits).  If it sees just the one padding char, it will decode the first 12 bits and discard the final six.  The values of the discarded bits do not affect the output.

Empirically testing at that base64 decoding website bears it out... when there are two padding chars, the final base64 digit (before the padding) can have a variety of values without affecting the binary output.

=-=-=-=-=-=-=-=-=
All that said... the RFC (page 4) is clear:
>>  When fewer than 24 input bits are available in an input group, zero bits
      are added (on the right) to form an integral number of 6-bit groups.

Thus, to be within spec, a valid base64 ENcoder function must "encode" up to two extra bytes of 0 if the buffer is not an even multiple of 3.

I'll work on my function to see if I can find an optimal way to do it right.

Nice catch, by the way :-)
Here is the compiler-generated ASM of "Dan's standard' function when optimized for speed:

>> void zToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf ) {
      push        ebp
      mov         ebp,esp
>> char* p= pszOutBuf; DWORD d; BYTE b1, b2, b3, b4;)

>> for ( int j=0; j< nSrcLen; j+=3 ) {
      mov        eax,dword ptr [ebp+0Ch]
      mov        ecx,dword ptr [ebp+10h]
      push       edi
      xor         edi,edi
      test        eax,eax
      jle          zToBase64+81h ;; to loop exit
      push       ebx ;;
      push       esi
      push       3              
      add         eax,2
      xor         edx,edx
      pop         esi
      div          eax,esi
      mov        esi,dword ptr [pSrc]
      push       3Fh
      mov        dword ptr [pszOutBuf],eax
      lea         edi,[eax+eax*2]
      pop        eax
>> d= *pSrc++; d <<= 8;
      movzx    edx,byte ptr [esi]
>> d|= *pSrc++; d <<= 8;
      movzx    ebx,byte ptr [esi+1]
      inc         esi
      shl         edx,8
      or          edx,ebx
      inc         esi
      shl         edx,8
>> d|= *pSrc++;
      movzx       ebx,byte ptr [esi]
      or          edx,ebx
      inc         esi
>> b1= BYTE((d & 0x00FC0000) >> 18);
      mov         ebx,edx
      shr         ebx,12h
      and         ebx,eax
>> b2= BYTE((d & 0x0003F000) >> 12);
>> b3= BYTE((d & 0x00000FC0) >> 6 );
>> b4= BYTE((d & 0x0000003F) >> 0 );
>> *p = abDigitLookup[ b1 ]; p++;
      mov         bl,byte ptr abDigitLookup[ebx]
      mov         byte ptr [ecx],bl
      mov        ebx,edx
      shr         ebx,0Ch
      and        ebx,eax
      inc         ecx
>> *p = abDigitLookup[ b2 ]; p++;
      mov        bl,byte ptr abDigitLookup[ebx]
      mov        byte ptr [ecx],bl
      mov        ebx,edx
      shr         ebx,6
      and        ebx,eax
      inc         ecx
      and        edx,eax
>> *p = abDigitLookup[ b3 ]; p++;
      mov        bl,byte ptr abDigitLookup[ebx]
      mov        byte ptr [ecx],bl
>> *p = abDigitLookup[ b4 ]; p++;
      mov        dl,byte ptr abDigitLookup[edx]
      inc          ecx
      mov        byte ptr [ecx],dl
      inc          ecx
      dec        dword ptr [pszOutBuf]

      jne         zToBase64+28h ;; top of loop
      pop        esi
      pop        ebx
>> }
>> // note: TBD:  I have not fixed this end-of-source logic
>>*p= 0;
      and         byte ptr [ecx],0
>> p--;
      dec         ecx
>> while ( j-- > nSrcLen ) {
      mov         eax,edi
      dec         edi
      cmp         eax,dword ptr [nSrcLen]
      jle         zToBase64+98h (004016f0)
      sub         edi,dword ptr [nSrcLen]
      inc         edi
>> *(p--) = '=';
      mov         byte ptr [ecx],3Dh
      dec         ecx
      dec         edi
      jne         zToBase64+91h (004016e9)
      pop         edi
>> }
>> }
      pop         ebp
      ret

=-=-=-=-=-=-=-=--==--=
I have to confess that some of what it's doing seems pretty obscure.  It seems to prepare the AND masks in advance -- way in advance -- and doing various things in a rather non-sequential manner.

Looking at this, it did pop out at me that this sequence:
   b1= BYTE((d & 0x00FC0000) >> 18);
   b2= BYTE((d & 0x0003F000) >> 12);
   b3= BYTE((d & 0x00000FC0) >> 6 );
   b4= BYTE((d & 0x0000003F) >> 0 );

...is inefficient.  The correct way would be to populate B4, then shift d, then populate b3, then shift d, etc.  That would ue the same mask in the same bit positions.  I think that may be what the compiler is doing... reorganizing my clumsy code into a lean, mean base64-encoding machine!
>>Your quote was from the part of the RFC that describes base32 encoding

I thought the text seemed different :)

Ill have to take some time to fully analyse the code the compiler has produced.. Compiler optimizations can pretty well obscure the code well away from the original source, especially if it has interleaved instructions (this is to allow the processor to process two instructions simulatanously without affecting the logic of the code) this would be one optimization that would be very difficult to do by hand! - this would account for the non-sequential matter, although I have only scanned the code..

A couple of cringy things though, use the 'div' command - this is a slow command and also use of the stack (and not just setting up the stack frame)

some of it looks like its optimized for size..

push       3Fh
mov        dword ptr [pszOutBuf],eax
lea         edi,[eax+eax*2]
pop        eax

why not just

mov        dword ptr [pszOutBuf],eax
lea         edi,[eax+eax*2]
movzx     eax, 0x3F


I would imagine that 'movzx r32, imm32' is quicker than pushing and popping the stack!?


Anway, its that time again!
You're right.  That was set for minimize size.  Optimized for speed, the main code looks like

              >> for ( int j=0; j< nSrcLen; j+=3 ) {
     mov   ebx,dword ptr [esp+0Ch]
     push   ebp
     xor     ebp,ebp
     test    ebx,ebx
     jle      zToBase64+88h (exit loop)
     lea     edx,[ebx+2]
     mov   eax,0AAAAAAABh
     mul    eax,edx
     push   esi
     mov    esi,dword ptr [esp+10h]
     push   edi
     mov    edi,edx
     shr     edi,1
     lea     ebp,[edi+edi*2]
              >> d = *pSrc++; d <<= 8;
again:    
     xor    eax,eax
              >> d|= *pSrc++; d <<= 8;
     xor    edx,edx
     mov   al,byte ptr [esi]
     mov   dl,byte ptr [esi+1]
     inc     esi
     shl     eax,8
     or      eax,edx
     inc     esi
              >> d|= *pSrc++;
     xor    edx,edx
     mov   dl,byte ptr [esi]
     shl     eax,8
     or      eax,edx
     inc     esi
             >> b1= BYTE((d & 0x00FC0000) >> 18);
     mov    edx,eax
     shr     edx,12h
     and    edx,3Fh
             >> b2= BYTE((d & 0x0003F000) >> 12);
             >> b3= BYTE((d & 0x00000FC0) >> 6 );
             >> b4= BYTE((d & 0x0000003F) >> 0 );
             >> *p = abDigitLookup[ b1 ]; p++;
     inc    ecx
     mov   dl,byte ptr abDigitLookup[edx]
     mov   byte ptr [ecx-1],dl
     mov   edx,eax
     shr    edx,0Ch
     and   edx,3Fh
             >> *p = abDigitLookup[ b2 ]; p++;
     inc    ecx
     mov   dl,byte ptr abDigitLookup[edx]
     mov    byte ptr [ecx-1],dl
     mov    edx,eax
     shr     edx,6
     and    edx,3Fh
     and    eax,3Fh
              >> *p = abDigitLookup[ b3 ]; p++;
     inc     ecx
     mov   dl,byte ptr abDigitLookup[edx]
     mov   byte ptr [ecx-1],dl
              >> *p = abDigitLookup[ b4 ]; p++;
     mov   al,byte ptr abDigitLookup[eax]
     mov    byte ptr [ecx],al
     inc     ecx
     dec    edi
     jne     zToBase64+27h (label "again")
     pop    edi
     pop    esi

=-=-=--==-=-=-=-=-
There are many difference, including the ones you mentioned.  For instance, the digit lookup code

optimized for size:
     mov    bl,byte ptr abDigitLookup[ebx]
     mov     byte ptr [ecx],bl
     mov     ebx,edx
     shr      ebx,0Ch
     and     ebx,eax        ; must be smaller (not faster)
     inc       ecx

optimized for speed:
     inc    ecx
     mov   dl,byte ptr abDigitLookup[edx]
     mov   byte ptr [ecx-1],dl          ; interesting! offset-1 after INC?
     mov   edx,eax
     shr    edx,0Ch
     and   edx,3Fh      ; must be faster (not smaller)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
There is still a MUL (now instead of a DIV) in the lead-in code.  One would think that the optimizer would avoid either one.  I think I recall from a previous discussion that MUL and DIV are not as inefficient as they used to be back in the day (of programming the 8086 with your teeth :-)

Anyway, after changing the optimization, the timing for the C code is much faster (faster even than the  "unrolled loops" ASM version!).  See http:/Cplusplus/Q_21988706.html#17532671

-- Dan
I had a look at using the XCHG command to load the DWORD into a register in correct byte order.. According to my tests this is slower than have the three seperate byte loading commands!

I have re-written my code,since it was so easily beaten by your mystery code and the Crypt32.dll - and now seemingly the standard C function! I dont think my manual attempts at optimizing my code will ever match that of the compiler!

Anywhere, here is my new submission ready for speed testing.. This one comes ready with the loops unrolled...

----------------------------

void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
      char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

       __asm{
                  mov ecx, len
                  mov esi, pSrc                              //bytes from source
                  
src_byteLoop:
            
                  xor eax, eax                              

                  //read 3 bytes
                  mov ah, byte ptr[esi]
                  mov al, byte ptr[esi+1]
                  shl eax, 16
                  mov ah, byte ptr[esi+2]


                  //manipulate in edx bitset1
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]            
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
                  //manipulate in edx bitset2
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf

                  //manipulate in edx bitset3
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
                  //manipulate in edx bitset4
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
                  //done these bytes
                  add esi, 3
                  sub ecx, 3

                  cmp ecx, 3
                  jge src_byteLoop                        //still got src bytes
                  
                  xor ebx, ebx                              //set to zero
                  cmp ecx, 0
                  jz finished
                  
                  //need to pad out some extra bytes

                  xor eax, eax                              //zero

                  //read in 3 bytes regardless of junk data following pSrc
                  mov ah, byte ptr[esi]
                  mov al, byte ptr[esi+1]
                  shl eax, 16
                  mov ah, byte ptr[esi+2]
      
                  sub ecx, 3                                    //bytes just read
                  neg ecx                                          //+ve inverse
                  mov esi, ecx                              //save how many bytes need padding

                  //as per the RFC, any padded bytes should be 0s
                  mov ebx, 0xFFFFFF
                  lea ecx, dword ptr[ecx*8+8]            //calculate bitmask to shift
                  shl ebx, cl
                  and eax, ebx                              //mask out the junk bytes
                  
                  //manipulate in edx byte 1
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
                  //manipulate in edx byte 2
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]            
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf
      
                  //manipulate in edx byte 3
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf

                  //manipulate in edx byte 3
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov edi, pszOutBuf
                  mov byte ptr [edi], dl                  //put char in buffer
                  add pszOutBuf,1                              //next buf

finished:
                  test esi, esi
                  jz end
                  //some bytes were padding, put them as =
padChars:
                  mov edi, pszOutBuf
                  sub edi, esi                        //move ptr back for num bytes to pad
                  mov byte ptr[edi], 0x3d            //=
                  dec esi

                  jnz padChars


end:
                  
        }      
}

----------------------------------

I have avoided my original working with 1 bit at a time, which avoids a lot of instructions!

My local testing shows this code is approximately 2x faster than my original unrolled loop code :)
Good work!  That code is, indeed over twice as fast as your previous code:

MB/sec  Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
  61.81  yToBase64 / craigwardman (ASM, unrolled loops)
131.68   y2ToBase64 / craigwardman (ASM, unrolled loops, later version)

but Alas! still not as fast as the optimized C.  Just checking the raw size of your code, it is 286 bytes vs 154 bytes for the optimized C.

One thing I noticed is that you are continually updating a memory variable, e.g.,
       mov edi, pszOutBuf
       mov byte ptr [edi], dl               //put char in buffer
       add pszOutBuf,1                      //next buf

and that's something the compiler avoids... it just uses a register pointer:

     mov    byte ptr [ecx],dl
     inc     ecx
 
I think you can shave some time by avoiding all of those load/stores.  Also, as I recall
   ADD mem, 1
looks tight, but it is rather slow... My understanding is that (internally) the CPU must read the value from memory into a temporary register, add 1, then write the data back out -- that can be a killer since you need to do that sequence four times for every input group.

-- Dan
Ah right, thanks for that tip!

I will jig the registers around see if I can free one up for this purpose, if I see a substancial improvement in my test results here (since they seem to agree with yours) I will re-post..

Craig,.
in fact, there is a bug in the above code (near the end i set ebx to zero as a flag but then test esi - too much jigging!).. but I will fix it in the new version..

ok here it is!! damn CPUs should have more registers!

This version seems to be another 30-40% faster than my current fastest version:

------------------------------

void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
      char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

       __asm{
                  mov ecx, len
                  mov esi, pSrc                              //bytes from source
                  mov ebx, pszOutBuf

src_byteLoop:
            
                  xor eax, eax                              

                  //read 3 bytes
                  mov ah, byte ptr[esi]
                  mov al, byte ptr[esi+1]
                  shl eax, 16
                  mov ah, byte ptr[esi+2]


                  //manipulate in edx bitset1
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]            
                  mov byte ptr [ebx], dl                  //put char in buffer
                  inc ebx                                          //next buf
      
                  //manipulate in edx bitset2
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov byte ptr [ebx], dl                  //put char in buffer
                  inc ebx                                          //next buf

                  //manipulate in edx bitset3
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov byte ptr [ebx], dl                  //put char in buffer
                  inc ebx                                          //next buf
      
                  //manipulate in edx bitset4
                  mov edx, eax
                  shr edx, 26

                  //done first 6 bits
                  shl eax, 6
                  
                  mov edi, chr_table
                  add edi, edx                              //set the pointer to correct char
                  mov dl, byte ptr [edi]                  
                  mov byte ptr [ebx], dl                  //put char in buffer
                  inc ebx                                          //next buf
      
                  //done these bytes
                  add esi, 3
                  sub ecx, 3

                  cmp ecx, 3
                  jge src_byteLoop                        //still got src bytes
                  
                  xor eax, eax                              //set to zero (pad count)
                  cmp ecx, 0
                  jz finished
                  
                        //need to pad out some extra bytes

                        //read in 3 bytes regardless of junk data following pSrc - already zero from above)
                        mov ah, byte ptr[esi]
                        mov al, byte ptr[esi+1]
                        shl eax, 16
                        mov ah, byte ptr[esi+2]
            
                        sub ecx, 3                                    //bytes just read
                        neg ecx                                          //+ve inverse
                        mov edx, ecx                              //save how many bytes need padding

                        //as per the RFC, any padded bytes should be 0s
                        mov esi, 0xFFFFFF
                        lea ecx, dword ptr[ecx*8+8]            //calculate bitmask to shift
                        shl esi, cl
                        and eax, esi                              //mask out the junk bytes

                        mov ecx, edx                              //restore pad count
                        
                        //manipulate in edx byte 1
                        mov edx, eax
                        shr edx, 26

                        //done first 6 bits
                        shl eax, 6
                        
                        mov edi, chr_table
                        add edi, edx                              //set the pointer to correct char
                        mov dl, byte ptr [edi]                  
                        mov byte ptr [ebx], dl                  //put char in buffer
                        inc ebx                                          //next buf
            
                        //manipulate in edx byte 2
                        mov edx, eax
                        shr edx, 26

                        //done first 6 bits
                        shl eax, 6
                        
                        mov edi, chr_table
                        add edi, edx                              //set the pointer to correct char
                        mov dl, byte ptr [edi]            
                        mov byte ptr [ebx], dl                  //put char in buffer
                        inc ebx                                          //next buf
            
                        //manipulate in edx byte 3
                        mov edx, eax
                        shr edx, 26

                        //done first 6 bits
                        shl eax, 6
                        
                        mov edi, chr_table
                        add edi, edx                              //set the pointer to correct char
                        mov dl, byte ptr [edi]                  
                        mov byte ptr [ebx], dl                  //put char in buffer
                        inc ebx                                          //next buf

                        //manipulate in edx byte 3
                        mov edx, eax
                        shr edx, 26

                        //done first 6 bits
                        shl eax, 6
                        
                        mov edi, chr_table
                        add edi, edx                              //set the pointer to correct char
                        mov dl, byte ptr [edi]                  
                        mov byte ptr [ebx], dl                  //put char in buffer
                        inc ebx                                          //next buf

                        mov eax, ecx                              //'return' pad count

finished:
                  test eax, eax
                  jz end
                  //some bytes were padding, put them as =

                        sub ebx, eax                        //move ptr back for num bytes to pad
padChars:
                        mov byte ptr[ebx], 0x3d            //=
                        inc ebx
                        dec eax

                        jnz padChars


end:
                  
        }      
}


----------------------------


let me know how the MB/s goes!

Thanks..
That does look more efficient.  
It seems to me that the end-of-source logic could be done more elegantly (than basically repeating the logic of the main loop) but I haven't worked it out myself, so that is just a hunch -- anyway, that portion of the code is executed just once, so it probably does not have much effect on the overall performance.

I'll run the speed test tommorrow on my computer at the office.

-- Dan
I suspect that the final part could be only be improved for size/elegance, but I think the fastest way is to repeat the code, this avoids the need for one if not two conditional jumps in the code..

although its not very good coding practice as it would be harder to maintain, I could imagine the compiler doing this as an optimisation if it knew it could.. (i.e. if the repeated code where put into a function my guess is that the compiler would compile it inline - as I have)
Good work!  That code is 25% faster.

MB/sec  Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
  61.81  yToBase64 / craigwardman (ASM, unrolled loops)
131.68   y2ToBase64 / craigwardman (ASM, unrolled loops, later version)
175.34   y3ToBase64 / craigwardman (ASM, unrolled loops, later version -- fewer load/stores)

You're right -- the compiler will unroll loops (reusing previously-generated code) if it recognizes that doing so will avoid jumps and "special casing" within the all-important inner loop.  I see nothing at all wrong with repeating the previous code to handle the special cases at the end.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Have you been following the C++ version of this?
    https://www.experts-exchange.com/questions/21988706/How-fast-is-your-C-C-Base64-Encoder.html

C++ Expert infinity08 employed a novel multi-table lookup algorithm to squeeze out some more throughput.  As of now, his function and yours are ouputting at very nearly the same rate.
Hi Dan,

Yes I am keeping an eye on the C thread too, although i dont think I will be able to contribute much to it!

Well having the fastest ASM version is ok for now :) I will be having a think about some other ways that I could speed up the algo..

There may be a couple more enhancements I can make to the code, but they are really superficial and im not sure how much difference they will make, I think a re-think is required for a major performance gain..

----

looking at some of your previous posts..

Im not sure how fast the MUL is these days I just know it and DIV are renowned for being slow, but that MUL is faster than DIV..

I suppose if the compiler cant find any other way to do the divide then it will have to settle for MUL (using the multiplicative inverse) as this
is still faster than DIV.. (unless the multiplicative inverse has to be worked out on the fly, but in the code its hard coded as 0xAAAAAAAB)

also,

     inc    ecx
     mov   dl,byte ptr abDigitLookup[edx]
     mov   byte ptr [ecx-1],dl          ; interesting! offset-1 after INC?

Im not sure why it has done this, perhaps this was the best option for code interleaving..


I have noticed that when optimized for size the function does not use a stack frame, this frees up the EBP register.. which could be useful in my code - as I have a few possible ideas that may or may not improve performance.. (could you confirm that no stack frame is set up when you compile using my ASM version?)

Thanks.
ASKER CERTIFIED SOLUTION
Avatar of Craig Wardman
Craig Wardman
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
correction to my original post of todays date...

>>I have noticed that when optimized for size the function does not use a stack frame,

should be

I have noticed that when optimized for SPEED the function does not use a stack frame,
Just a note:
I tried using the XLATB opcode for the table lookup...
It seems ideal:  Just point EBX to the start of the table (one time only), set a byte to AL, execute XLAT and AL contains the looked up value.  Although the code ends up tight and clean, I saw  no significant performance increase.  My guess is that the microcode for that instruction is not particularly fast (I seem to recall verifying that from the Intel docs a while back).

Speed-testing your new code now...
hmm, an interesting mnemonic, I havent seen this one used before..

When you say you saw no increase, was there any decrease? or do you think the CPU just translates the opcode into its sub parts..

i.e.

mov al, byte ptr[ebx+al]

..

I have noticed on several occasions now, how XCHG is slower at copying a DWORD to/from memory than manually copying it a byte at a time (using multiple mov instructions)..

It appears this may only be quicker when you just XCHG r32, r32  (as it is fewer instructions - might even resolve to the same in the CPU as doing it manually!) - so I dont see why Intel have implemented the memory option, as there is a significant loss of performance, which might be overlooked by some developers..

Excellent!  Another 10%+ speed increase!

MB/sec  Internal name // Description
--------  ---------------------------------------------------------------------------------------
  61.81  yToBase64 / craigwardman (ASM, unrolled loops)
131.68   y2ToBase64 / craigwardman (ASM, later version)
175.34   y3ToBase64 / craigwardman (ASM, later version -- fewer load/stores)

196.92   y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])

I was about to suggest a PUSH/POP of EBP to give you another register, but it looks like you figured that out :-)

It's clear that using...

         mov bl, byte ptr [edi+edx]          // indirect addressing base+offset
         mov byte ptr[ebp], bl

...is a big win over a sequence like..

          mov edi, chr_table
          add edi, edx                                  
          mov dl, byte ptr [edi]              
          mov byte ptr [ebx], dl  

=-=-=-=-=-=-=-=-=-=
I didn't notice your sweeeeeet trick with the...
           mov edx, eax
           shl eax, 6  
           shr edx, 26          
...sequence in your previous code... it took me a while to see what's going on there.  That lets you avoid an AND operation (you will always have exactly 6 bits in your lookup table index).  Shifts are *very* fast, so that should end up being at least as fast as the technique of using a 256-byte lookup table with duplicated values (as in http:/Cplusplus/Q_21988706.html#17554449).

Nyeece :-)
thanks for your compliments!

Just to clarify to anyone else reading the code, if you look at my first generation code, I was shifting out one bit at a time from EAX, then slowly shifting them left to keep the significance in EDX, I then realised I may as well copy the whole DWORD at once EAX->EDX, lose the 6 MSBs in the EAX, and keep only the 6 MSBs in EDX (but re-aligned to the correct significance), which of course is acheived by shifting right 26 times...

that 196.92 is slower than expected.. My tests here were into the 230MB/s range on my 1.7GHz AMD Sempron - was that an average, or a first run?

using the pasted copy of your code for testing:


It took 4265 ms to process 1000 MB of data
That is 234.47 MB per second
Press any key to continue


But I suppose there could be other factors involved and all code submitted will be affected..
That timing was the best of three runs.  
In all the tests, I've found that the second and third are usually a few MB/s faster than the first, but usually only 1 or 2 MB/s faster.  I sometimes click the button a few more times as a sanity check (in case some high-priority background task suddenly kicked in) but I've never seen it vary more and a few MB/s.

True, your times will be different from what I post, but if you use your computer to test one or two of the other submissions, you should see proportional values.   Your computer is faster than mine, so higher values are expected.  But if you measure your previous tries, those, too, will be faster.   For instance, I calculated about a 10% speed increase between your last two submissions.  I'd expect you to see (about) the same 10% difference -- even though both actual results would he higher than what I measured.

=-=-=-=-=-=-=-=-=
Regarding XCHG:
>> I have noticed on several occasions now, how XCHG is slower at copying a DWORD to/from memory than manually copying it a byte at a time

An XCHG between register and memory is bound to be slow... it means a read from, and a write to memory.   One possible use for that opcode  would be in a multi-tasking semaphore system -- it "atomically" obtains a value form memory and stores a new value there without any chance that some other process could interrupt the sequence.

What I was recommending with a register-to-register XCHG was something like
   mov BX, WORD PTR [ESI]   ; bytes in wrong order
   xchg  BL,BH  ; now in correct order..

One thing I've experimented with is the BSWAP opcode.  I tried replacing a sequence like:

      mov    bh, BYTE PTR [esi]
      mov    bl, BYTE PTR [esi+1]
      shl      ebx,8                        
      mov    bl, BYTE PTR [esi+2]

...with something like..
      mov     ebx, DWORD PTR [esi]   ;; gets four bytes at once
      bswap  ebx       // endian adjust
      shr       ebx, 8   // discard the extra and (inefficiently?) re-read it next pass

...with the idea of avoiding three separate one-byte reads by using a single four-byte read plus an adjustment to get the bytes into the desired order for subsequent processing.

My timing tests indicate that the two options work at nearly identical speeds.  Most likely, after the first one-byte access, all three bytes are in high-speed cache, so the second and third bytes are instantly available -- nullifying any advantage.  I know for certain that on a 4.77 MHz 8086 IBM PC, two separate 8-bit accesses took nearly twice as long as a single 16-bit access... but a LOT of what I used to know back in 1985 is only marginally useful information these days :-)

-- Dan
hehe, well i was 1 year old in 1985 so im sure you have more useful info than me :)

i have run the speed test with all of my versions, here are the results:

ver1     - rolled loops      - 46.18 MB per second
ver1.2  - unrolled loops   - 52.03 MB per second
ver2     - ready unrolled  - 112.47 MB per second
ver2.2  - no mem access - 163.69 MB per second
ver2.3  - EBP version       - 231.91 MB per second


As you can see, there seems to be a bigger jump at my end, i wonder if there is something specific to my CPU/mobo that it likes my register only (ebp version) code so much ...
Hey!  Making me feel OLD will NOT improve your timings <G>

>> i wonder if there is something specific to my CPU/mobo...
That's a reasonable assumption.  Things like DDR RAM and front-side bus speed would probably have a proportionally greater impact as you eliminate more memory accesses, while raw opcode count would be less important.

I'm also wary of the timings for another reason...  I might have more or fewer simultaneous tasks time-sharing the CPU.  If the other tasks do a lot of work, they would tend to overwrite the L2/L1 cache, forcing a reload when my timeslice comes back around.   Also, the faster the code runs, the fewer the number of chances it has to be interrupted by other tasks.  So there could be an accellerating upward curve, where even a slight efficiency gain yields significant throughput increase...
Yes that is true..

I suppose it really depends on the architecture and the maximum performance of the PC you run the code on now, rather than the code itself..

For example, the Infinity08_lookup_b that he measured > 300MB/s I have tested on my machine and got average of 187MB/s.. Again my processor is AMD so this appears to be entirely dependant on architecure..

On reading the C thread, it seems as though you have also looked into this and got some confirmative (is that a word?) results...
Hey Dan, are you still taking submissions?
Absolutely!
I think that there are a lot of avenues that can still be explored.  I hope that this is as much (or more of) a learning experience as a contest.  Even little changes can have a measurable effect.   I don't think we've come close to wringing this thing out :-)  

I have some new ideas myself, but I've been waiting for some more Experts to join in!
I'm just going to start where craigwardman has left off and try to optimize his code further. I'll hopefully have something by monday.


Brian
Hi Dan,

I've finally got around to having a tinker with this.

I've got a working algorithm that should be quite quick when converted to assembler. Unfortunately my assembler is rusty enough to make this a very slow process. Meanwhile here's the algorithm.

// Includes.
#include <stdio.h>

// The digits of the number.
char * digits = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void PrintTable(int * c, int * carry, bool cAsChars)
{
    int i;
    // The characters.
    if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
    else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
    printf ("\n");
    // The carries.
    for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
    printf ("\n");
}

void GenerateTables ( void )
{
    int i;
    int c [256];
    int carry [256];
    // I want tables of character and carry for each byte of the three.
    /*
    The carry from the previous calculation will be added to whatever is in
    the c array to lookup in the digits array.
    */
    // Table for first byte.
    // Top 6 bits is character.
    // Bottom 2 is carry to top 2 of next 6.
    for ( i = 0; i < 256; i++ )
    {
        c[i] = i>>2;
        carry[i] = (i & 0x03) << 4;
    }
    PrintTable(c,carry,true);    
    // Table for second byte.
    // Top 4 bits is character.
    // Bottom 4 is carry to top 4 of next 6.
    for ( i = 0; i < 256; i++ )
    {
        c[i] = i>>4;
        carry[i] = (i & 0x0f) << 2;
    }
    PrintTable(c,carry,false);
    // Table for third byte.
    // Top 2 bits is character.
    // Bottom 6 is fourth character.
    for ( i = 0; i < 256; i++ )
    {
        c[i] = i>>6;
        carry[i] = (i & 0x3f);
    }
    PrintTable(c,carry,false);
}

// The tables generated by the above code.
//int c0 [] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
int carry0 [] = {0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
int c1 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
int carry1 [] = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
int c2 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
int carry2 [] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};

// Here for debugging.
char output [1024];

void ToBase64 (unsigned char * data, int bytes, char * s)
{
    int remainder [] = {0,2,3};
    // How many chars to generate. 4 for every 3 bytes plus a remainder.
    int chars = ((bytes / 3) * 4) + remainder[bytes%3];

    while ( chars > 0 )
    {
        int carry = 0;
        // First character.
        *s++ = c0[*data];
        carry = carry0[*data];
        data += 1; bytes -= 1; chars -= 1;
        // Second character.
        if ( chars > 0 )
        {
            if ( bytes > 0 )
            {
                *s = digits[carry+c1[*data]];
                carry = carry1[*data];
                data += 1; bytes -= 1;
            } else {
                // We must consume the bottom two bits of the first byte.
                *s = digits[carry];
            }
            chars -= 1;
        } else {
            *s = '=';
        }
        s += 1;
        // Third character.
        if ( chars > 0 )
        {
            if ( bytes > 0 )
            {
                *s = digits[carry+c2[*data]];
                carry = carry2[*data];
                data += 1; bytes -= 1;
            } else {
                // Consume the carry from the last byte.
                *s = digits[carry];
            }
            chars -= 1;
        } else {
            *s = '=';
        }
        s += 1;
        // Fourth character.
        if ( chars > 0 )
        {
            *s = digits[carry];
            chars -= 1;
        } else {
            *s = '=';
        }
        s += 1;
    }
}

void atest ( void )
{
    unsigned char test [] = "12345678";
    //unsigned char test [] = {0x00};
    // Generate my tables.
    //GenerateTables();
    memset(output,0,sizeof(output));
    // Convert it.
    //ToBase64asm(test,sizeof(test)-1,output);
    ToBase64(test,sizeof(test)-1,output);
    // Print it.
    printf("%s\n",output);


}

Given enough registers this would be pretty optimal I believe. However, register juggling seems to be impeding the implementation. Join in if you wish.

Paul

PS: I am already considering holding *data in a register but I am short as it is.
Hi DanRollins,

How about this? Just slap it in the above code and call it.

void ToBase64asm(unsigned char * data, int bytes, char * s)
{
    // Insignificant bit, easier done in C.
    char remainder [] = {0,2,3};
    int chars = ((bytes / 3) * 4) + remainder[bytes%3];
    // Safety harness.
    if ( bytes == 0 ) return;
    if ( bytes > 0xff ) return; // Cover this with an outer loop later.

    _asm {
    // NOTE: I am assuming ds == es!
    // This can be catered for easily if not the case.
        // si = data would be nice but I need si for table lookups.
        mov esi,data
        // di = s
        mov edi,s
        // bh = bytes
        mov ebx,bytes
        mov bh,bl
        // cx = chars
        mov ecx,chars
charsLoop:
        // bl = carry
        // edx spare

        // First character.
        //*s++ = c0[*data];
        xor eax,eax
        mov al,[esi]
        mov edx,esi
            mov esi,eax // si = *data
            mov al,c0[esi]
            stosb
            //carry = carry0[*data];
            mov bl,carry0[esi]
        mov esi,edx // si = data
        //data += 1; bytes -= 1; chars -= 1;
        inc esi
        dec bh
        dec cx

        // Second character.
        //if ( chars > 0 )
        //{
        jcxz secondIsEquals
        //    chars -= 1;
        dec cx;
        //    if ( bytes > 0 )
        //    {
        test bh,bh
        jz secondBytesIsZero
        //        *s = digits[carry+c1[*data]];
        xor eax,eax
        mov al,[esi]
        mov edx,esi
            mov esi,eax // si = *data
            mov al,c1[esi]
            add al,bl
            //        carry = carry1[*data];
            mov bl,carry1[esi]
            //continued        *s = digits[carry+c1[*data]];
            mov esi,eax // si = carry+c1[*data]
            mov al,digits[esi];
            stosb
        mov esi,edx // si = data
        //        data += 1; bytes -= 1;
        inc esi
        dec bh
        jmp doneSecond

secondBytesIsZero:
        //        *s = digits[carry];
    mov edx,esi
        xor ax,ax
        mov al,bl
        mov esi,eax
        mov al,digits[esi]
        stosb
    mov esi,edx
        jmp doneSecond

secondIsEquals:
        mov al,'='
        stosb

doneSecond:
        // Third character.
        //if ( chars > 0 )
        //{
        jcxz thirdIsEquals
        //    chars -= 1;
        dec cx;
        //    if ( bytes > 0 )
        //    {
        test bh,bh
        jz thirdBytesIsZero

        //        *s = digits[carry+c2[*data]];
        xor eax,eax
        mov al,[esi]
        mov edx,esi
            mov esi,eax // si = *data
            mov al,c2[esi]
            add al,bl
            //        carry = carry2[*data];
            mov bl,carry2[esi]
            //continued        *s = digits[carry+c2[*data]];
            mov esi,eax // si = carry+c2[*data]
            mov al,digits[esi];
            stosb
        mov esi,edx // si = data
        //        data += 1; bytes -= 1;
        inc esi
        dec bh
        jmp doneThird

thirdBytesIsZero:
        //        *s = digits[carry];
    mov edx,esi
        xor ax,ax
        mov al,bl
        mov esi,eax
        mov al,digits[esi]
        stosb
    mov esi,edx
        jmp doneThird

thirdIsEquals:
        mov al,'='
        stosb

doneThird:
        // Fourth character.
        //if ( chars > 0 )
        //{
        jcxz fourthIsEquals
        //    chars -= 1;
        dec cx
        //        *s = digits[carry];
    mov edx,esi
        xor eax,eax
        mov al,bl
        mov esi,eax
        mov al,digits[esi]
        stosb
    mov esi,edx
        jmp doneFourth

fourthIsEquals:
        //    *s = '=';
        mov al,'='
        stosb

doneFourth:
        // Next char.
        //loop charsLoop
        jcxz done
        jmp charsloop

done:

    }
}

How does that measure up? Some of the lookups could be calculated to improve things but I think the algorithm is sound.

Paul
Hi DanRollins,

Here's a polished up version.

// Includes.
#include <stdio.h>
#include <conio.h>

// The digits of the number.
char digits [] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void PrintTable(int * c, int * carry, bool cAsChars)
{
    int i;
    // The characters.
    if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
    else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
    printf ("\n");
    // The carries.
    for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
    printf ("\n");
}

void GenerateTables ( void )
{
    int i;
    int c [256];
    int carry [256];
    // I want tables of character and carry for each byte of the three.
    /*
    The carry from the previous calculation will be added to whatever is in
    the c array to lookup in the digits array.
    */
    // Table for first byte.
    // Top 6 bits is character.
    // Bottom 2 is carry to top 2 of next 6.
    for ( i = 0; i < 256; i++ )
    {
        c[i] = i>>2;
        carry[i] = (i & 0x03) << 4;
    }
    PrintTable(c,carry,true);    
    // Table for second byte.
    // Top 4 bits is character.
    // Bottom 4 is carry to top 4 of next 6.
    for ( i = 0; i < 256; i++ )
    {
        c[i] = i>>4;
        carry[i] = (i & 0x0f) << 2;
    }
    PrintTable(c,carry,false);
    // Table for third byte.
    // Top 2 bits is character.
    // Bottom 6 is fourth character.
    for ( i = 0; i < 256; i++ )
    {
        c[i] = i>>6;
        carry[i] = (i & 0x3f);
    }
    PrintTable(c,carry,false);
}

// The tables generated by the above code.
//int c0 [] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
char carry0 [] = {0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
char c1 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
char carry1 [] = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
char c2 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
char carry2 [] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};

// Here for debugging.
char output [1024];

void ToBase64 (unsigned char * data, int bytes, char * s)
{
    int remainder [] = {0,2,3};
    // How many chars to generate. 4 for every 3 bytes plus a remainder.
    int chars = ((bytes / 3) * 4) + remainder[bytes%3];

    while ( chars > 0 )
    {
        int carry = 0;
        // First character.
        *s++ = c0[*data];
        carry = carry0[*data];
        data += 1; bytes -= 1; chars -= 1;
        // Second character.
        if ( chars > 0 )
        {
            if ( bytes > 0 )
            {
                *s = digits[carry+c1[*data]];
                carry = carry1[*data];
                data += 1; bytes -= 1;
            } else {
                // We must consume the bottom two bits of the first byte.
                *s = digits[carry];
            }
            chars -= 1;
        } else {
            *s = '=';
        }
        s += 1;
        // Third character.
        if ( chars > 0 )
        {
            if ( bytes > 0 )
            {
                *s = digits[carry+c2[*data]];
                carry = carry2[*data];
                data += 1; bytes -= 1;
            } else {
                // Consume the carry from the last byte.
                *s = digits[carry];
            }
            chars -= 1;
        } else {
            *s = '=';
        }
        s += 1;
        // Fourth character.
        if ( chars > 0 )
        {
            *s = digits[carry];
            chars -= 1;
        } else {
            *s = '=';
        }
        s += 1;
    }
}

void ToBase64asm(unsigned char * data, int bytes, char * s)
{
    // Insignificant bit, easier done in C.
    char remainder [] = {0,2,3};
    int chars = ((bytes / 3) * 4) + remainder[bytes%3];
    // Safety harness.
    if ( bytes == 0 ) return;
    if ( bytes > 0xff ) return; // Cover this with an outer loop later.

    _asm {
    // NOTE: I am assuming ds == es!
    // This can be catered for easily if not the case.
        // si = data would be nice but I need si for table lookups.
        mov esi,data
        // di = s
        mov edi,s
        // bh = bytes
        mov ebx,bytes
        mov bh,bl
        // cx = chars
        mov ecx,chars
charsLoop:
        // bl = carry
        // edx spare

        // First character.
        //*s++ = c0[*data];
        xor eax,eax
        mov al,[esi]
        mov edx,esi
            mov esi,eax // si = *data
            mov al,c0[esi]
            stosb
            //carry = carry0[*data];
            mov bl,carry0[esi]
        mov esi,edx // si = data
        //data += 1; bytes -= 1; chars -= 1;
        inc esi
        dec bh
        dec ecx

        // Second character.
        //if ( chars > 0 )
        //{
        jcxz secondIsEquals
        //    chars -= 1;
        dec ecx;
        //    if ( bytes > 0 )
        //    {
        test bh,bh
        jz secondBytesIsZero
        //        *s = digits[carry+c1[*data]];
        xor eax,eax
        mov al,[esi]
        mov edx,esi
            mov esi,eax // si = *data
            mov al,c1[esi]
            add al,bl
            //        carry = carry1[*data];
            mov bl,carry1[esi]
            //continued        *s = digits[carry+c1[*data]];
            mov esi,eax // si = carry+c1[*data]
            mov al,digits[esi];
            stosb
        mov esi,edx // si = data
        //        data += 1; bytes -= 1;
        inc esi
        dec bh
        jmp doneSecond

secondBytesIsZero:
        //        *s = digits[carry];
    mov edx,esi
        xor eax,eax
        mov al,bl
        mov esi,eax
        mov al,digits[esi]
        stosb
    mov esi,edx
        jmp doneSecond

secondIsEquals:
        mov al,'='
        stosb

doneSecond:
        // Third character.
        //if ( chars > 0 )
        //{
        jcxz thirdIsEquals
        //    chars -= 1;
        dec ecx;
        //    if ( bytes > 0 )
        //    {
        test bh,bh
        jz thirdBytesIsZero

        //        *s = digits[carry+c2[*data]];
        xor eax,eax
        mov al,[esi]
        mov edx,esi
            mov esi,eax // si = *data
            mov al,c2[esi]
            add al,bl
            //        carry = carry2[*data];
            mov bl,carry2[esi]
            //continued        *s = digits[carry+c2[*data]];
            mov esi,eax // si = carry+c2[*data]
            mov al,digits[esi];
            stosb
        mov esi,edx // si = data
        //        data += 1; bytes -= 1;
        inc esi
        dec bh
        jmp doneThird

thirdBytesIsZero:
        //        *s = digits[carry];
    mov edx,esi
        xor ax,ax
        mov al,bl
        mov esi,eax
        mov al,digits[esi]
        stosb
    mov esi,edx
        jmp doneThird

thirdIsEquals:
        mov al,'='
        stosb

doneThird:
        // Fourth character.
        //if ( chars > 0 )
        //{
        jcxz fourthIsEquals
        //    chars -= 1;
        dec ecx
        //        *s = digits[carry];
    mov edx,esi
        xor eax,eax
        mov al,bl
        mov esi,eax
        mov al,digits[esi]
        stosb
    mov esi,edx
        jmp doneFourth

fourthIsEquals:
        //    *s = '=';
        mov al,'='
        stosb

doneFourth:
        // Next char.
        //loop charsLoop
        jcxz done
        jmp charsloop

done:

    }
}


void atest ( void )
{
    unsigned char test [] = "1234567";
    // Generate my tables.
    //GenerateTables();
    // Convert it.
    memset(output,0,sizeof(output)); ToBase64(test,sizeof(test)-1,output); printf("%s\n",output);
    memset(output,0,sizeof(output)); ToBase64asm(test,sizeof(test)-1,output); printf("%s\n",output);

    _getch();
}

I've often found that it is worth the attempt to pre-calculate as much as possible. However, in this case it may end up being significantly less efficient. I would be interested in the outcome.

Note the limitation of a maximum of 255 bytes to encode. A simple outer loop may be necessary if you need it to process larger chunks. Obviously each chunk must be a multiple of 3 bytes so 255 bytes per chunk would probably be the most efficient..

Paul
I'll test it on Tuesday (when I have access to the PC I've been using) so the timings will be comparable.

>>Note the limitation of a maximum of 255 bytes to encode....

The test harness creates a 1MB buffer and it calls the encoding function it 1000 times... I haven't had time to look at your submission yet, so forgive me if I'm misunderstanding, but are you saying that your function will not encode the entire 1MB buffer?

-- Dan
Hi Dan,

Here's a testable version with an outer loop that handles the 255 byte limitation:

// Includes.
#include <stdio.h>
#include <conio.h>

// The digits of the number.
char digits [] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void PrintTable(int * c, int * carry, bool cAsChars)
{
      int i;
      // The characters.
      if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
      else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
      printf ("\n");
      // The carries.
      for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
      printf ("\n");
}

void GenerateTables ( void )
{
      int i;
      int c [256];
      int carry [256];
      // I want tables of character and carry for each byte of the three.
      /*
      The carry from the previous calculation will be added to whatever is in
      the c array to lookup in the digits array.
      */
      // Table for first byte.
      // Top 6 bits is character.
      // Bottom 2 is carry to top 2 of next 6.
      for ( i = 0; i < 256; i++ )
      {
            c[i] = i>>2;
            carry[i] = (i & 0x03) << 4;
      }
      PrintTable(c,carry,true);      
      // Table for second byte.
      // Top 4 bits is character.
      // Bottom 4 is carry to top 4 of next 6.
      for ( i = 0; i < 256; i++ )
      {
            c[i] = i>>4;
            carry[i] = (i & 0x0f) << 2;
      }
      PrintTable(c,carry,false);
      // Table for third byte.
      // Top 2 bits is character.
      // Bottom 6 is fourth character.
      for ( i = 0; i < 256; i++ )
      {
            c[i] = i>>6;
            carry[i] = (i & 0x3f);
      }
      PrintTable(c,carry,false);
}

// The tables generated by the above code.
//int c0 [] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
char carry0 [] = {0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
char c1 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
char carry1 [] = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
char c2 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
char carry2 [] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};

// Here for debugging.
char output [(1024*1024)/3*4+1];

void ToBase64 (unsigned char * data, int bytes, char * s)
{
      int remainder [] = {0,2,3};
      // How many chars to generate. 4 for every 3 bytes plus a remainder.
      int chars = ((bytes / 3) * 4) + remainder[bytes%3];

      while ( chars > 0 )
      {
            int carry = 0;
            // First character.
            *s++ = c0[*data];
            carry = carry0[*data];
            data += 1; bytes -= 1; chars -= 1;
            // Second character.
            if ( chars > 0 )
            {
                  if ( bytes > 0 )
                  {
                        *s = digits[carry+c1[*data]];
                        carry = carry1[*data];
                        data += 1; bytes -= 1;
                  } else {
                        // We must consume the bottom two bits of the first byte.
                        *s = digits[carry];
                  }
                  chars -= 1;
            } else {
                  *s = '=';
            }
            s += 1;
            // Third character.
            if ( chars > 0 )
            {
                  if ( bytes > 0 )
                  {
                        *s = digits[carry+c2[*data]];
                        carry = carry2[*data];
                        data += 1; bytes -= 1;
                  } else {
                        // Consume the carry from the last byte.
                        *s = digits[carry];
                  }
                  chars -= 1;
            } else {
                  *s = '=';
            }
            s += 1;
            // Fourth character.
            if ( chars > 0 )
            {
                  *s = digits[carry];
                  chars -= 1;
            } else {
                  *s = '=';
            }
            s += 1;
      }
}

void _ToBase64asm(unsigned char * data, int bytes, char * s, int chars)
{
      _asm {
      // NOTE: I am assuming ds == es!
      // This can be catered for easily if not the case.
            // si = data would be nice but I need si for table lookups.
            mov esi,data
            // di = s
            mov edi,s
            // bh = bytes
            mov ebx,bytes
            mov bh,bl
            // cx = chars
            mov ecx,chars
charsLoop:
            // bl = carry
            // edx spare

            // First character.
            //*s++ = c0[*data];
            xor eax,eax
            mov al,[esi]
            mov edx,esi
                  mov esi,eax // si = *data
                  mov al,c0[esi]
                  stosb
                  //carry = carry0[*data];
                  mov bl,carry0[esi]
            mov esi,edx // si = data
            //data += 1; bytes -= 1; chars -= 1;
            inc esi
            dec bh
            dec ecx

            // Second character.
            //if ( chars > 0 )
            //{
            jcxz secondIsEquals
            //      chars -= 1;
            dec ecx;
            //      if ( bytes > 0 )
            //      {
            test bh,bh
            jz secondBytesIsZero
            //            *s = digits[carry+c1[*data]];
            xor eax,eax
            mov al,[esi]
            mov edx,esi
                  mov esi,eax // si = *data
                  mov al,c1[esi]
                  add al,bl
                  //            carry = carry1[*data];
                  mov bl,carry1[esi]
                  //continued            *s = digits[carry+c1[*data]];
                  mov esi,eax // si = carry+c1[*data]
                  mov al,digits[esi];
                  stosb
            mov esi,edx // si = data
            //            data += 1; bytes -= 1;
            inc esi
            dec bh
            jmp doneSecond

secondBytesIsZero:
            //            *s = digits[carry];
      mov edx,esi
            xor eax,eax
            mov al,bl
            mov esi,eax
            mov al,digits[esi]
            stosb
      mov esi,edx
            jmp doneSecond

secondIsEquals:
            mov al,'='
            stosb

doneSecond:
            // Third character.
            //if ( chars > 0 )
            //{
            jcxz thirdIsEquals
            //      chars -= 1;
            dec ecx;
            //      if ( bytes > 0 )
            //      {
            test bh,bh
            jz thirdBytesIsZero

            //            *s = digits[carry+c2[*data]];
            xor eax,eax
            mov al,[esi]
            mov edx,esi
                  mov esi,eax // si = *data
                  mov al,c2[esi]
                  add al,bl
                  //            carry = carry2[*data];
                  mov bl,carry2[esi]
                  //continued            *s = digits[carry+c2[*data]];
                  mov esi,eax // si = carry+c2[*data]
                  mov al,digits[esi];
                  stosb
            mov esi,edx // si = data
            //            data += 1; bytes -= 1;
            inc esi
            dec bh
            jmp doneThird

thirdBytesIsZero:
            //            *s = digits[carry];
      mov edx,esi
            xor ax,ax
            mov al,bl
            mov esi,eax
            mov al,digits[esi]
            stosb
      mov esi,edx
            jmp doneThird

thirdIsEquals:
            mov al,'='
            stosb

doneThird:
            // Fourth character.
            //if ( chars > 0 )
            //{
            jcxz fourthIsEquals
            //      chars -= 1;
            dec ecx
            //            *s = digits[carry];
      mov edx,esi
            xor eax,eax
            mov al,bl
            mov esi,eax
            mov al,digits[esi]
            stosb
      mov esi,edx
            jmp doneFourth

fourthIsEquals:
            //      *s = '=';
            mov al,'='
            stosb

doneFourth:
            // Next char.
            //loop charsLoop
            jcxz done
            jmp charsloop

done:

      }
}

void ToBase64asm(unsigned char * data, int bytes, char * s)
{
      // Insignificant bit, easier done in C.
      char remainder [] = {0,2,3};
      int chars = ((bytes / 3) * 4) + remainder[bytes%3];
      // Safety harness.
      if ( bytes == 0 ) return;
      // Do all the 255byte blocks.
      while ( bytes > 0xff )
      {
            _ToBase64asm(data, 0xff, s, (0xff / 3) * 4);
            data += 0xff;
            bytes -= 0xff;
            s += (0xff / 3) * 4;
            chars -= (0xff / 3) * 4;
      }
      // And the last block.
      if ( bytes > 0 ) _ToBase64asm(data, bytes, s, chars);
}


void atest ( void )
{
      unsigned char test [] = "1234567";
      unsigned char * test2 = (unsigned char *)malloc(1024 * 1024);
      // Generate my tables.
      //GenerateTables();
      // Convert it.
      //memset(output,0,sizeof(output)); ToBase64(test,1024 * 1024,output); printf("%s\n",output);
      memset(output,0,sizeof(output)); ToBase64asm(test2,1024 * 1024,output); printf("%s\n",output);

      memset(output,0,sizeof(output)); ToBase64(test,sizeof(test)-1,output); printf("%s\n",output);
      memset(output,0,sizeof(output)); ToBase64asm(test,sizeof(test)-1,output); printf("%s\n",output);

      _getch();
}

PS: Nice to see your pic on this page again. :-)

Paul
Hi Paul!
I finally grabbed a moment to look at your code.  That's a clever use of lookup tables and (interestingly) avoids the use of binary shift operations during the main loop -- prefering to precalculate the values and store them into lookup tables.  Let me see if I understand the algorithm:

Examine the data three bytes at-a-time and output four bytes per triplet.  For each triplet:

1) The first input byte can be looked up in a table that just repeats each output character four times (co[]).  That effectively ignores the lowest two bits (same output regardless of bits 0 and 1).  That yields the first output.

2) Remember the part of the data that was ignored (you refer to it as "the carry") This will be one of the values 0,16,32,48 (0x00,0x10,0x20,0x30) looked up from the carry0[] table.

3) Calculate the sum of the carry from step #2 plus a lookup from a another table( c1[]).  The c1[] table is 16 repeats of each possible value; that is, it outputs the same value regardless of the low four bits of the input.   By adding that looked-up value to the saved carry, we have an index into the standard Base64 lookup table (digits[]).  That yields the second output.

4) Remember the part of the second-byte data that was ignored in step #3 (the low four bits).  Use that as an index into the carry1[] table (which contains 0,4,8,...,52,56,60, repeated 16 times).  That is, this table is those low four bits shifted left by 4.

5) Again calculate an index into the standard Base64 lookup table, from the combined value of result of the carry1[] lookup and a lookup that effectively uses the top two bits from the next input byte.

6 (... repeat a similar procedure for the last byte...)

That is:  For each byte except the first, you lookup a value from a table that ignores certain bits and you combine that with a value from another table that ignores other bits, resulting in an index into the standard lookup table.

I'm impressed :-)  This is one of the reasons I so enjoy these challenge questions... that is an an approach that would not have occurred to me.  But in retrospect, makes perfect sense.

Now as to the timings:

MB/sec  Internal name // Description
--------  ---------------------------------------------------------------------------------------
131.24   pcToBase64 / PaulCaswell's C (7 lookup tables)
140.66   pcToBase64as / PaulCaswell's ASM conversion (7 lookup tables)

196.92   y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])

Alas, clever as it is, it is not as fast as some of the other techniques.  Also, the C-to-ASM conversion was only slightly faster.  I think part of the reason for that is the ASM version uses that 256-byte chunking technique and thus requires the need for an outer loop.  Each of the 1000 times through the test loop, the outer loop of your function needs to make 4,000 function calls; that is a killer  -- there is a lot of 'hidden' overhead in each CALL to _ToBase64asm().  

Another thing that holds it back (I believe) is that at each input byte, you are checking for end-of-input and jumping around the end-of-input logic.  The faster algorithms avoid millions of those compare-and-jump sequences by just bulling through and doing all three bytes in the all-important inner-loop logic and then doing a one-time adjustment at the end.

Don't be too dissapointed! craigwardman's first submissions needed some tweaks here and there before he reached high-speed nirvana :-)  Also, your code is already *way* faster than either of the Microsoft "submissions" CryptBinaryToStringA in Crypt32.dll or the CBase64Coder object :-)

-- Dan
Hi Dan,

Thanks for the feedback. As you can see this is just the first hit.

I take the point of the outer loop. That will go soon enough. That was a hack to get something that functions at the last minute when I noticed your test mechanism.

My next stage is to find a balance between shift/mask and table lookup. I am unsure how many cycles I can cut that way but it's certainly central to the algorithm so anything I can squeeze out of that will be a bonus.

That is an excellent idea to stick the '=' padding in at the end. I think that will be my first change.

The table technique goes back to my old 6502 days when there were no maths functions at all except add/sub. I have to admit that taking a wole K for tables would not have been acceptable, especially when you only have 64 of them. Precalculating is an excellent optimisation as I am sure you are aware. I am sure many of the 'old' algorithms would benefit a little.

I'll be back with the next iteration soon I hope.

Paul
That seems to have improved things a bit.

There may be still some small improvements but the extra register seems to help.

void ToBase64asm(unsigned char * data, int bytes, char * s)
{
      _asm {
      // NOTE: I am assuming ds == es!
      // This can be catered for easily if not the case.
            // si = data would be nice but I need si for table lookups.
            mov esi,data
            // di = s
            mov edi,s
            // cx = bytes
            mov ecx,bytes
            // I will only use al. Empty all of ax.
            xor eax,eax
bytesLoop:
            // bl = carry
            // edx spare, used for si push.

            // First character.
            //*s++ = c0[*data];
            mov al,[esi]
            mov edx,esi
                  mov esi,eax // si = *data
                  mov al,c0[esi]
                  stosb
                  //carry = carry0[*data];
                  mov bl,carry0[esi]
            mov esi,edx // si = data
            //data += 1; bytes -= 1;
            inc esi
            dec ecx

            //if ( bytes > 0 )
            jcxz noMoreBytes
            //{
            //      // Second character.
            //      *s++ = digits[carry+c1[*data]];
            mov al,[esi]
            mov edx,esi
                  mov esi,eax // si = *data
                  mov al,c1[esi]
                  add al,bl
                  //            carry = carry1[*data];
                  mov bl,carry1[esi]
                  //continued            *s = digits[carry+c1[*data]];
                  mov esi,eax // si = carry+c1[*data]
                  mov al,digits[esi];
                  stosb
            mov esi,edx // si = data
            //            data += 1; bytes -= 1;
            inc esi
            dec ecx
            //      // Third character.
            //      if ( bytes > 0 )
            jcxz noMoreBytes
            //      {
            //            *s++ = digits[carry+c2[*data]];
            mov al,[esi]
            mov edx,esi
                  mov esi,eax // si = *data
                  mov al,c2[esi]
                  add al,bl
                  //            carry = carry2[*data];
                  mov bl,carry2[esi]
                  //continued            *s = digits[carry+c2[*data]];
                  mov esi,eax // si = carry+c2[*data]
                  mov al,digits[esi];
                  stosb
            mov esi,edx // si = data
            //            data += 1; bytes -= 1;
            inc esi
            //dec ecx // Done by loop

noMoreBytes:
            //// Consume the last carry.
            //*s++ = digits[carry];
            mov edx,esi
                  mov al,bl
                  mov esi,eax
                  mov al,digits[esi]
                  stosb
            mov esi,edx
            // Next char.
            jcxz done
            loop bytesLoop
done:
      }
}

I hope you've kept the previous run so I dont have to post all the tables again. Note that there is no outer loop now.

Paul
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Not sure how the final figures will come out by I've just tested craigwardmans's final code against this one and I get:

Paul 298ms
Craig 3,603ms

for 1000 * 1mb.

I look forward to the final figures.

I wonder if it would be worth running your test on different cpu/mobo setups to see if any of them are particularly good in a certain environment.

 I suspect mine makes more use of data cache than instruction cache.

Paul
My first timing test on your last submission really threw me... it was so much faster than any other, that I had to check it carefully.  Alas, I did find a problem with it...

When you use the opcode,
    JCXZ
you are only testing the 16-bit CX register.  Since the nSrcLen starts out at
   1024*1024= 1048576 = 0x00100000
then the first
    DEC ECX
sets the 16-bit CX register to 0xFFFF and the subsequent
    JCXZ
opcodes will end up exiting the loop after processing only about 65,000 characters (of the original 1MB).  When I replaced the several JCXZ opcodes with
    JECXZ
I got valid timings that are more in line with what I expected.  Please verify my result, but I'm pretty sure that this is what happened.  

Please note that I would have made the exact same mistake (coming from an 8088 background and having last done "actual production ASM" coding in about 1989 :-) and I had to look up the opcode in the Intel literature to be sure... so don't feel too bad :-)  

BTW, the LOOP opcode works on ECX (unless the CPU is set to globally use 16-bit mode) -- I had to look that up because I was not at all certain.

With the changes you made, I did see a slight improvement over your previous submission.

MB/sec  Internal name // Description
--------  ---------------------------------------------------------------------------------------
140.66   pcToBase64asm / PaulCaswell's ASM conversion (7 lookup tables)
142.17   pc2ToBase64asm / PaulCaswell's (7 lookup tables) later revision after JECXZ

196.92   y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])

=-=-=-=-=-=-=-=-=-=-=
>> I wonder if it would be worth running your test on different cpu/mobo setups...
>> I suspect mine makes more use of data cache than instruction cache.

In fact, it turns out that Genuine Intel processors do end up processing signifiantly higher volumes for (what is purported yo be "an equivallent" GHz rating).  I show some results for running on a Celeron in the C++ version of this Q:
       https://www.experts-exchange.com/questions/21988706/How-fast-is-your-C-C-Base64-Encoder.html#17581593

In that thread, we also talked a bit about the L1/L2 caching situation.  The timing test, as designed, is certain to "thrash the cache" since it processes 1MB chunks while most caches are only 512K or less.  In my next two posts, I'll illustrate that rather dramatically.

-- Dan
This is my "mystery algorithm" -- it employs a 8096-byte lookup table... I look up a 16-bit output value for each 12 bits of input.  One of the tricks of this is that the table has the byte-pairs reversed to avoid fix-up before storing the output

Another thing about this is that it reads four bytes at-a-time, but discards one of them for each 3-byte triplet.  So it is reading a lot of bytes twice.  I thought there would be a big improvlent when I used that technique, but for various reasons, it's not nuch faster (if any) than just reading the three bytes one-at-a-time.

I also used the "scaled indexed indirect adressing mode" which is a first for me... an opcode like
      mov    bx, WORD PTR Lookup12bits[edx*2]
...is faster than the two-opcode sequence...
      shl    edx,2          
      mov   bx, WORD PTR Lookup12bits[edx]
and it also preserves the value of edx (though I don't use that feature).

Note that i'm "cheating" because I've omitted the end-of-input fixup that really needs to be done, but since that is an insignificant fraction of the overall time (conpared to processing the entire MB of data) I don't feel bad about omitting it.

//----------------------------------
// 8096-byte lookup table; not using MMX
//
//char Lookup12bits[ ((64*64)*2)+1 ]=  // 8096+1
//"AAABACADAEAFAGAHAIAJAKALAMANAOAPAQARASATA ... A6A7A8A9A+A/"
//"BABBBCBDBEBFBGBHBIBJBKBLBMBNBOBPBQBRBSBTB ... B6B7B8B9B+B/"
//...
//
//"9A9B9C9D9E9F9G9H9I9J9K9L9M9N9O9P9Q9R9S9T9...  5969798999+9/"
//"+A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+...  5+6+7+8+9+++/"
//"/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/...  5/6/7/8/9/+//"
//;

// 8096-byte lookup table; not using MMX
// timed at 219.92 MB/s
//
void q5ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
      _asm {
            mov   edi, pszOutBuf;
            mov   esi, pSrc;
            mov   ecx, esi;
            add   ecx, nSrcLen     ;; all done when esi > ecx

            again:
            mov    eax, DWORD PTR [esi]
            bswap  eax         // big-endian adjust
            shr    eax,8        // discard byte n+3 (using only first 3)

            mov    edx,eax
            and    edx, 0x00000fff
            mov    bx, WORD PTR Lookup12bits[edx*2]

            shl    ebx,16  // save the first two bytes in bits 31-16

            shr    eax,12   // no need to mask here
            mov    bx, WORD PTR Lookup12bits[eax*2] // next two bytes in bits 15-0

            mov    DWORD PTR [edi], ebx

            add    edi, 4
            add    esi, 3

            cmp  esi, ecx
            jle  again
      }
      //      end-of-input adjustment code omitted
}
This is a variation of the above... It uses a 12-bit index into the 8096-byte table.  The difference is that in this code, I accumulate EIGHT bytes of output into an MMX register and then when I store it to memory, I'm using a special version of the MOVQ opcode that prevents the CPU from caching the output.  

The MOVNTQ opcode is designed for "streaming output" ... Normally, if you store a variable to memory, it is very likely that you will need to access it again soon -- so it's smart that the CPU puts it in the fastest cache.  But in situations like this one, I'd rather not clutter the cache with data that I'll not need to access any time soon.  As best as I can tell, this lets the lookup table and (perhaps part of) the source data stay in L1 cache.  The timing results show that changing just that one SINGLE opcode make a significant difference:

MB/sec  Internal name // Description
--------  ---------------------------------------------------------------------------------------
140.66   pcToBase64asm / PaulCaswell's ASM conversion (7 lookup tables)
142.17   pc2ToBase64asm / PaulCaswell's (7 lookup tables) later revision after JECXZ

196.92   y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])

219.92   q5ToBase64 / DanRollins's "mystery algo" -- 8096-byte lookup table (no MMX)
219.92   q6ToBase64 / Dan's "mystery algo" -- accumulating 8 bytes to output in an MMX register

307.69   q6xToBase64 / same as q6ToBase64 but using MOVNTQ rather than MOVQ

=--=-=-=-=-=-=-=-=-=-=-=-=-=

//--------------------------
// 8096-byte lookup table; using MMX for output (and using cache-managlemnet)
// timed at 307.96 MB/s
//
// PLUS, it collects 8 bytes into an MMX register for output
// Note that using MOVQ, it runs at "only*" 216,
// but witn MOVNTQ, it zooms in at 307
//
void q6ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
      _asm {
            EMMS
            mov   edi, pszOutBuf;
            mov   esi, pSrc;
            mov   ecx, esi;
            add   ecx, nSrcLen     ;; all done when esi > ecx

       again:
            mov    eax, DWORD PTR [esi]
            bswap  eax         // endian adjust

            shr    eax,8       // discard byte n+3 (using only first 3)

            mov    edx,eax
            and    edx, 0x00000fff
            mov    bx, WORD PTR Lookup12bits[edx*2]

            shl    ebx,16

            shr    eax,12
            mov    bx,  WORD PTR Lookup12bits[eax*2]

            add    esi, 3     // point to next three bytes
            movd   mm0, ebx
                  
            //------------------------------ repeat above for next three bytes
            mov    eax, DWORD PTR [esi]
            bswap  eax
            shr    eax,8      

            mov    edx,eax
            and    edx, 0x00000fff
            mov    bx, WORD PTR Lookup12bits[edx*2]

            shl    ebx,16

            shr    eax,12
            mov    bx,  WORD PTR Lookup12bits[eax*2]

            movd   mm1, ebx
            psll   mm1, 32
            por    mm0, mm1

            //movq   [edi],mm0      // runs at 219 MB/s

            //movntq  [edi],mm0  // runs at 307 MB/s
            _emit 0x0F
            _emit 0xE7
            _emit 0x07

            add    edi, 8
            add    esi, 3

            cmp  esi, ecx
            jle  again
      }
      //      end-of-input adjustment code omitted
}
Dan,

Wow! That's some amazing stuff. Pretty impressive performance too.

I have read in many places that there is a performance hit when reading on odd byte boundaries. I wonder if there is a way of keeping the fourth byte you read for the next time around. Perhaps read it 4 and 2 instead, preserving the the fourth for next time around. It certainly would make it fidddly but your discovery that movntq makes such a significant difference suggests that the main performance bottleneck is memory I/O.

Thanks for the jecxz thing, I really couldnt understand why it was so much faster. :-(

Good Challenge! :-)

Paul
Coming from an 8088 background and having written code for the Mac II (Motorolla 68020), I too, was worried about those odd-address accesses.  But in some time trials, I saw no particular performance loss.  I formed a theory as to why:

My theory is that accessing data that's in L1 cache is nearly as fast as accessing a register.  If so, then once a cache line is filled (typically 32 bytes, I think), you can get to any byte -- regardless of its physical address -- without an odd-byte penalty.

In this app, we "stream" through the input... touching each byte sequentially.  As soon as we access a byte, the group of bytes next to it get cached.  Since every byte in the 1MB sample will be accessed, there is no time lost to cache misses -- there would be exactly the same number of cach misses in either case.

It might be fun to try to look for evidence that the theory is correct... write a test that randomly accesses all over the place in say, a 10 MB buffer.  On the first test, take care that all WORD accesses start on even boundaries and on the next test, make sure that they all start on odd boundaries.  In that test, nearly every access would be a cache miss.... I'd wager that the timing for even-byte reads would be similar to those for odd-byte reads.  

A special case would be a test contrived to access WORDs in which each and every access was odd and also straddled a cache-line boundary.  That would casue TWO cache misses on each DWORD access and the performance would be terrible.

=-=-=-=-=-=-=-=
I also thought of optimizing the reads by grabbing 8-byte chunks into an MMX register.  But I spent so many opcodes shuffling the bytes into the right order for processing that the results were no good.  That technique might show promise using the sort of multi-table pre-calculated lookup techniques you used...

I read that the latest processors now support a PSHUFL opcode that could handle the endian-rearranging all at once... Using SSE, I could read 128 bits (16 bytes, 5 source triplets) into a single SSE register and grind out 20 bytes of output at a time. :-)
Avatar of aib_42
aib_42

My first idea was to use XLAT as well, though that may just be because I never had a chance to do so :).

I'm not feeling crazy enough to implement it right now (3 AM), but here's an idea:

How about a 16M look-up table?
Yea, it's not wise to talk assembly at this hour...

What I meant was a LUT with 2^24 entries, one for each byte triplet... With 4 bytes per entry, which would be... 2^24*4 = 16M*4 = 64M

I guess it comes down to how fast RAMs are these days...
hi aib_42,
Thanks for joining in!

In the C++ version of this thread, I tried the 64MB lookup table (http:/Cplusplus/Q_21988706.html#17579860).  It seemed like a wothwhile idea, but the performance was terrible.  As best as I can tell, it is because the table can't remain in cache, so the time lost to cache misses will far outweigh the advantage of having a tiny "code footprint."

Bits     Bits       Table
Input   Output   Size (Bytes)
 6        8                 64     (64)*1
12      16             8096     (64*64)*2
18      24         786,432     (64*64*64)*3
24      32     66,060,288    (64*64*64*64)*4

So far, the 8096-byte lookup table mechanism (two output bytes for each possible pair of 6-bit values) seems to be the best.  The next larger lookup table would be 786K which might work on a CPU that had 1MB of L1 cache.

There might be a clever variation that uses two or more different smaller lookup tables (Infinity08 came up with one such system in the C++ thread) but IMO any system using a table larger than the cache will perform pooorly.

-- Dan
Funny enough, I had to write a BASE64 encoding routine in C today for an SMTP client.

I didn't like any of the snippets I found on the web. Perhaps I should start a "how _readable_ is your base64 encoder" challenge? :)

This is how I did it:
loop on every byte of the source, keep track of the previous byte, too
keep a counter modulo 4, this will be called "cycle"
switch on 'cycle':
0 -> read the first 6 bits of the byte
1 -> read the last 2 bits of the previous byte | first 4 of the current
2 -> read the last 4 bits of the previous byte | first 2 of the current
3 -> read the last 6 bits of the previous byte, roll back the input one place

padding is easy, too, it's another switch on 'cycle'.

I think you can call this "de-optimization by loop rolling".
Well, I guss it *is* time to cliose this one up.

There were only two "official" submissions in the ASM category.  By my measure, craigwardman's was somewhat faster than that of PaulCaswell, so I sent more points his way.  FWIW, all of the subissions were far faster than the generic utilities available from standard Microsoft libraries :-)

My own submission http:#17711344 ended up fastest -- by using MMX opcodes and using some underhanded tweaks of the caching control...

The C++ version of this is here:   http:/Cplusplus/Q_21988706.html and it included very interesting discussion about CPU (AMD vs Intel) and compiler differences etc.

In the end, this thread was more about playing around with opcodes and learning a few things than it was a contest.

Thanks to all who participated!

-- Dan