# How fast is your C/C++ Base64 Encoder?

Are you up for a challenge?
Several of us C and ASM programmers had a fun learning experience a few years ago when we tried to outdo each other by writing "The World's Fastest" binary-to-hex converter and output-formatting function.  Now I'm throwing out the gauntlet again:

    See if you can write a C/C++ function that encodes binary data into a
    base64 output string -- and see if yours is faster than everybody else's.

For additional reference, the ASM version of this question is here: http:Q_21983447.html

A Base64 Encoder converts binary data into ASCII codes that can be transmitted/processed by systems that do not normally process raw binary... for instance, in email attachments, in HTTP headers, and as values in HTTP URLs.  The encoder outputs only the following characters:
       ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
plus, the output may end in zero, one, or two equal signs (=).

To encode the raw binary, you access it three bytes at a time, taking the data in groups of six bits.  For instance:
Hex:      0x07 0xFF 0x02
Binary:   0000 0111 1111 1111 0000 0010   (groups of 4 bits)
Binary:   000001 111111 111100 000010     (groups of 6 bits)
          \____/ \____/ \____/ \____/
decimal      1      63     60      2
lookup       B      /       8      C

The length of the output must always be a multiple of four, because when you DEcode the data, you will be taking groups of four base64 "digits" and turning them back into three binary bytes of output.  Therefore, when ENcoding, if your input data is not a multiple of three bytes, you will need to pad the end of the output with one or two equal signs (=).
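To make the grouping and padding rules concrete, here is a minimal, unoptimized reference encoder -- a sketch for checking correctness, not a speed entry. The name ToBase64Ref and the std::string return type are my own choices, not part of any submission:

```cpp
#include <cassert>
#include <string>

// The 64 output digits, indexed 0..63.
static const char kDigits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Reference encoder: 3 input bytes -> 4 output digits, '=' padding at the end.
std::string ToBase64Ref(const unsigned char* src, int len)
{
    std::string out;
    out.reserve(((len + 2) / 3) * 4);       // output is always a multiple of 4
    int i = 0;
    for (; i + 3 <= len; i += 3) {          // full 3-byte groups
        unsigned d = (src[i] << 16) | (src[i + 1] << 8) | src[i + 2];
        out += kDigits[(d >> 18) & 0x3F];
        out += kDigits[(d >> 12) & 0x3F];
        out += kDigits[(d >> 6) & 0x3F];
        out += kDigits[d & 0x3F];
    }
    int rem = len - i;                      // 0, 1, or 2 leftover bytes
    if (rem == 1) {
        out += kDigits[src[i] >> 2];
        out += kDigits[(src[i] << 4) & 0x30];
        out += "==";
    } else if (rem == 2) {
        out += kDigits[src[i] >> 2];
        out += kDigits[((src[i] << 4) & 0x30) | (src[i + 1] >> 4)];
        out += kDigits[(src[i + 1] << 2) & 0x3C];
        out += '=';
    }
    return out;
}
```

For the worked example above, the bytes 0x07 0xFF 0x02 encode to "B/8C".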

Also, to verify your output, you can use an Encoder at this URL:
   http://www.opinionatedgeek.com/dotnet/tools/Base64Encode/Default.aspx

=-=-=-=-=-=-=-=-=-=-=
I'll be speed-testing the submissions using code like this:

//---------------------------------------- (assume dest buf is large enough)
void ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
        // your code here!
}

I've already posted one C function here: http:/Assembly/Q_21983447.html#17516191

=-=-=-=-=-=-=-=-=-=-=
I'm betting that the ASM code will out-perform even the most optimized C code.  But as we saw in the Hex-converter challenge, a clever algorithm can often win over even the best hand-optimized ASM code (of course the ASM programmers can then implement the same algorithm and shave a few milliseconds and will surely win :-)

So submit your function and I'll post timing/speed comparisons.  The timings will measure thousands of passes at a large buffer of random input data, so if that changes the way you write your code, then keep it in mind.

Submit your C/C++ function and I'll measure it against everybody else's as well as the ASM versions and any standard library calls and API functions that I can find.  If this works out, I'll start a second challenge for a Base64 DEcoder.

Are you game?  500 points and your reputation as a hotshot C programmer are on the line :-)

-- Dan

Infinity08

Just a question: is this algorithm-only, or can we use a lookup table?
grg99

Hate to be a killjoy, but this is an academic exercise only -- in the real world, CPUs are tens to hundreds of times faster than any I/O device, even gigabit ethernet, so doing base64 conversion is not a significant bottleneck.

Still it's a fun challenge-- go to it!

DanRollins

grg99,
It's not entirely pointless.  Imagine that your server must send thousands of 4MB video clips as email attachments.  The encoding *might* become a throughput bottleneck if the CPU gets maxed out by multiple threads doing various related activities.  Decoding might be more of an issue -- the time spent in transit is a constant, but once it is on your computer, an efficient decoder could prevent delay between clicking the attachment and watching the video.

But, as you say, the practicality of the challenge is not the issue :-)

Infinity08,
There is likely to be at least one lookup table employed by most algorithms.  In fact, it would be interesting to see if there is a way to convert *without* a lookup table.  For instance, binary-to-hex conversion *could* use a simple lookup in a table like

    char table[]= "0123456789ABCDEF";
    c= table[b];   // where b starts as a 4-bit nybble and c ends up as an ASCII char

but most binary-to-hex algorithms do it without a lookup; e.g.,
    if (b>9) b += 7;    // where b is a 4-bit nybble
    c= b + '0';         // c is now a hex digit
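The two hex schemes can be written side by side and cross-checked for all 16 nybbles. This is just an illustrative sketch; the function names are mine:

```cpp
#include <cassert>

// Scheme 1: table lookup.
char HexByTable(unsigned b)             // b is a 4-bit nybble (0..15)
{
    static const char table[] = "0123456789ABCDEF";
    return table[b & 0x0F];
}

// Scheme 2: no table -- adding 7 skips the ":;<=>?@" gap between '9' and 'A'
// in the ASCII chart.
char HexByMath(unsigned b)
{
    b &= 0x0F;
    if (b > 9) b += 7;
    return char(b + '0');
}
```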

-- Dan
Infinity08

>> There is likely to be at least one lookup table employed by most algorithms.  In fact, it would be interesting to see if there is a way to convert *without* a lookup table.
I should have clarified: I wasn't referring to without versus with a lookup table, but whether you can use a lookup table to replace your algorithm. Just a practical question :)
DanRollins

For instance, one calculation scheme:

      switch (b) {                        // b is a 6-bit value (0...63)
          case 0x3e: b='+'; break;        // special cases (not part of any sequence)
          case 0x3f: b='/'; break;
          default:
              if (b>25) b += 6;           // Not one of "ABC...XYZ"
              if (b>57) b -= 75;          // Not one of "abc...xyz" -- must be "012...789"
              b += 'A';
      }
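Wrapped in a function (my own wrapper, for illustration), the switch scheme can be checked against the plain 64-character digit string for every 6-bit input:

```cpp
#include <cassert>

// Dan's calculation scheme as a function: maps a 6-bit value to its base64 digit.
char Base64DigitBySwitch(unsigned b)        // b is a 6-bit value (0..63)
{
    switch (b) {
        case 0x3e: return '+';              // special cases (not part of any sequence)
        case 0x3f: return '/';
        default:
            if (b > 25) b += 6;             // not one of "ABC...XYZ"
            if (b > 57) b -= 75;            // not "abc...xyz": must be "012...789"
            return char(b + 'A');
    }
}
```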
DanRollins

>>... but whether you can use a lookup table to replace your algorithm.

My reply is that a lookup operation (whatever part of the algorithm it "replaces") is not only valid, but likely to be used in the fastest algorithm.  There are no particular constraints other than that the encoder needs to be fast if you want a chance to win the challenge :-)
Infinity08

Perfect :)
Infinity08

Oh, and how much time do we have?
grg99

Ahh, you might not want to use a lookup table.  If you're encoding data, that's going to be flowing thru the cache, very likely flushing out the lookup table all the time.  That means it's often going to take a whole memory access time, approx 3 nsec (like, 18 instructions) to get ONE table value!

Only way to tell for sure is to time the code.

DanRollins

Until we think we've wrung the last ounce of performance out of the code :-)

Here is my timing methodology:
I will generate a 1MB buffer of random data (the same data for everybody)
I will then call the ToBase64 function 1000 times (process 1000 MB of data)
I will calculate the MB-per-second.
I'll run it three times and list the highest score.

Here is my function that does all of that:

void CBase64Dlg::OnButton1()
{
      CWaitCursor c;  // show that the code is running

      int nSrcLen= 1024*1024;       // one true MB
      int nPasses= 1000;

      BYTE* pData=      new BYTE[ nSrcLen ];
      char* pszEncoded= new char[ nSrcLen*2 ]; // really only need * 1.5, but close enough

      //----------- generate the random binary data
      strcpy( (char*)pData, "Hi There Everybody" );  // put recognizable text at the start
      int nFirst= strlen( (char*)pData );

      srand(123);  // seed the random number generator the same for everybody
      for (int j=nFirst; j< nSrcLen; j++ ) {
            pData[j]= rand() & 0xFF;
      }

      Sleep(0);              // start on a new timeslice
      int nStartTick= GetTickCount();
      for (int k=0; k<nPasses; k++) {
            ToBase64( pData, nSrcLen, pszEncoded ); // encode 1MB of data
      }
      int nTicksTotal= GetTickCount() - nStartTick;  // duration of the test in ms (note: +- about 9ms)

      delete[] pData;
      delete[] pszEncoded;

      //--------- output the results
      CString sMsg; sMsg.Format(
            "\r\nIt took %d ms to process %d MB of data\r\n"
            "That is %.2f MB per second\r\n",
            (int) nTicksTotal,
            (int) nPasses,
            (double) nPasses/((double)nTicksTotal/1000)
      );
      m_ctlEdOutput.SetSel(32767, 32767);
      m_ctlEdOutput.ReplaceSel( sMsg );
}

=-=-=-=-==-=-=-=-=-=-=
Notes:
* The test loops enough times to make the inaccuracy of the GetTickCount() timer unimportant.
* The test is likely to be interrupted many times by other processes, but that should be
   relatively consistent.  By running it three times (and verifying *similar* results) that
   should not skew the results.
* I'm currently running on a 2.6 GHz Athlon.  All times will be taken from here.
   If you do your own timings, you should be able to obtain "relative" values (a faster algorithm
   should be faster, regardless of CPU speed).
* I'll eyeball the result output to verify that the first part is encoded correctly:
   Hi There Everybody          (source)
   SGkgVGhlcmUgRXZlcnlib2R5    (encoded)
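For anyone timing on a non-Windows box, the same methodology can be sketched portably with std::chrono in place of Sleep/GetTickCount. This is my own sketch, not Dan's harness; the encoder below handles whole 3-byte groups only and is just a stand-in for whichever candidate function you are measuring:

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <vector>

static const char kDigits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Stand-in encoder; replace with the function under test.
void ToBase64(const unsigned char* pSrc, int nSrcLen, char* pszOutBuf)
{
    for (int j = 0; j + 3 <= nSrcLen; j += 3) {
        unsigned d = (pSrc[j] << 16) | (pSrc[j + 1] << 8) | pSrc[j + 2];
        *pszOutBuf++ = kDigits[(d >> 18) & 0x3F];
        *pszOutBuf++ = kDigits[(d >> 12) & 0x3F];
        *pszOutBuf++ = kDigits[(d >> 6) & 0x3F];
        *pszOutBuf++ = kDigits[d & 0x3F];
    }
}

// Same plan as the dialog handler: random buffer, many passes, report MB/sec.
double MeasureMBPerSec(int nSrcLen, int nPasses)
{
    std::vector<unsigned char> data(nSrcLen);
    std::vector<char> out(nSrcLen * 2);      // only ~4/3 needed, but close enough
    srand(123);                              // same seed for everybody
    for (int j = 0; j < nSrcLen; j++) data[j] = rand() & 0xFF;

    auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < nPasses; k++)
        ToBase64(data.data(), nSrcLen, out.data());
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;
    return (double(nSrcLen) / (1024.0 * 1024.0)) * nPasses / secs.count();
}
```

steady_clock is monotonic, so the measurement cannot be skewed by wall-clock adjustments the way a tick counter can be by timer granularity.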
DanRollins

Timing results to date:
MB/sec    Description
--------  -----------------------------------------------------------------------------------------------------------
   58.59  Dan's C "standard" encoder  (inner loop at http:/Assembly/Q_21983447.html#17516191)
   53.42  craigwardman's ASM (rolled loops)
   63.36  craigwardman's ASM (unrolled loops)

I can also tell you that I have written an alternative C version that is running at
   97.86 MB/sec
which I'll reveal if nobody else comes up with the same technique.  I suspect that conversion to ASM will put it over 100.

Please!  Don't let these timings discourage you.  Until I've timed your function on my specific computer, there is no way to know how fast it is relative to the other submissions.  Also, there can be great benefit in comparing the merits and problems of slow functions.  It's all about the challenge and the learning experience.
DanRollins

Note: On the C functions, I set the compiler optimization to "maximum speed."  From past experience, that can have a significant advantage over "Debug (no optimization)" but ends up nearly identical to "minimum size."
DanRollins

MB/sec    Description
--------  -----------------------------------------------------------------------------------------------------------
   17.11  The Base64Coder object from http://support.microsoft.com/default.aspx?scid=kb;[LN];Q191239
  104.05  CryptBinaryToStringA (in Crypt32.dll)

The bar is raised!  A standard API function spins through at a perhaps unbeatable rate.
But I think we can beat it.
Infinity08

What's the deadline for submission, and how do we submit?
DanRollins

A submission should be source code of the one function, posted right here in this thread.  Submit when you have a working solution.  Or don't submit at all, just join the discussion and give us your ideas and thoughts.  I'm not stating a specific time limit -- I'll keep this and the ASM thread open as long as interest continues.
DanRollins

Note: I messed up the compiler optimization settings in previous time trials.  Current (corrected) results are:

MB/sec    Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
  54.42   xToBase64 / craigwardman (ASM, std loops)
  61.81   yToBase64 / craigwardman (ASM, unrolled loops)
  36.80   aToBase64 / Base64Coder http://support.microsoft.com/default.aspx?scid=kb;[LN];Q191239
 104.92   cToBase64 / CryptBinaryToString API call
 156.86   z2ToBase64 / Dan's 'standard' C code (modified, see below)
 188.78   zToBase64 / Dan's 'standard' C code
 198.13   mToBase64 / Dan's "mystery" algorithm (C code)

Note that the ASM and API function timings are consistent with previous measurements, but the C code got much faster (as expected if the compiler optimization was wrong).

It's interesting that both of the C functions are considerably faster than the ASM functions and that both of them are almost twice as fast as Microsoft's CryptoAPI code :-)  The ASM code generated by the compiler is shown in the ASM thread.

=-=-=-=-=-=-=-=-=-=
In my 'standard C' algorithm, I changed just these lines:
      b1= BYTE((d & 0x00FC0000) >> 18);
      b2= BYTE((d & 0x0003F000) >> 12);
      b3= BYTE((d & 0x00000FC0) >> 6 );
      b4= BYTE((d & 0x0000003F) >> 0 );
...to what I thought would be faster... this sequence:
      b4= BYTE(d & 0x0000003F);
      d >>= 6;
      b3= BYTE(d & 0x0000003F);
      d >>= 6;
      b2= BYTE(d & 0x0000003F);
      d >>= 6;
      b1= BYTE(d & 0x0000003F);

It turned out that the modified code was actually slower!  And not by a trivial amount... fully 17% slower!  It just goes to show that sometimes the intuitive approach is not the best/fastest.

Attempting to avoid the use of the intermediate variables b1,b2,b3,b4 I tried:
      p[3]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
      p[2]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
      p[1]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
      p[0]= abDigitLookup[ (d & 0x0000003F) ];
      p += 4;
...which gave a very slight boost.
But there was no additional gain from combining into longer statements like so:
      *p++ = abDigitLookup[ BYTE((d & 0x00FC0000) >> 18) ];
      *p++ = abDigitLookup[ BYTE((d & 0x0003F000) >> 12) ];
      *p++ = abDigitLookup[ BYTE((d & 0x00000FC0) >> 6 ) ];
      *p++ = abDigitLookup[ BYTE((d & 0x0000003F)      ) ];

-- Dan
DanRollins

grg99,
I missed your last post... sorry.

That is a significant point... as you cycle through the large source/input buffer, the CPU will continuously update the L1 cache, possibly to the detriment of ultra-fast access to the lookup table.

I wonder if there is a way to tell the CPU that a particular part of the cache (the lookup table) is more important than the data that you have already processed in the source buffer.  On the other hand, if it is watching the action (e.g., counting accesses), it could notice that the lookup table is used often and it might decide to keep it in cache (these modern CPUs are sometimes very smart about stuff like that :-)

-- Dan
Infinity08

DanRollins

Thanks for the submission!
That's a clever and interesting use of lookup tables, fer sure :-)

>> What are the L1 and L2 cache sizes of your CPU btw?

There is a sticker on my PC indicating an L2 cache of 512K.  I don't know for sure how large the L1 cache is, but a standard size for an AMD Athlon 2800 L1 cache is 64KB, so that's a good guess.  If so, then even these larger lookup tables should fit cleanly in the fastest cache.

The timing results:
MB/sec    Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
168.43  ToBase64_infinity08_standard / Infinity08 (std)
181.21  ToBase64_infinity08_lookup_b / Infinity08 (multi-table lookup, 6K of tables)
  79.59  ToBase64_infinity08_standard (no optimization)
  90.01  ToBase64_infinity08_lookup_b (no optimization)

188.78  zToBase64 / Dan's 'standard' C code
198.13  mToBase64 / Dan's "mystery" algorithm (C code)

That is, your second version is about 7% faster than your "standard" one, but still slightly slower than "Dan's std" and "Dan's mystery algorithm."

Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.

I also verified the optimization was set at "max speed" and I did a "sanity check" by re-running the other tests -- which came out similar to the previously-posted values.

=--=-=-=-=-=-=-=-=-=
It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor.  Your main loop is very tight, requiring only about 79 bytes of opcodes (compared to 93 for my "std" function).  In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables (lookup_table_b2 and lookup_table_b3).  Here is the ASM emitted by the compiler for the main loop of the ToBase64_infinity08_lookup_b function:

          >>  *pszOutBuf++ = lookup_table_b1[*pSrc];
    xor     edx,edx
    lea     edi,[esi+1]
    mov     dl,byte ptr [esi]
          >> *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
    xor     ebx,ebx
    inc     eax
    mov     dl,byte ptr lookup_table_b1 (0040766c)[edx]
    mov     byte ptr [eax-1],dl
    mov     dl,byte ptr [esi]
    mov     bl,byte ptr [edi]
    and     edx,3
    shl     edx,8
    lea     esi,[edi+1]
    inc     eax
    mov     dl,byte ptr [edx+ebx+40626Ch]
          >> *pSrc++;
          >> *pszOutBuf++ = lookup_table_b3[*pSrc & 0x0F][*(pSrc + 1)];
    xor     ebx,ebx
    mov     byte ptr [eax-1],dl
    mov     dl,byte ptr [edi]
    mov     bl,byte ptr [esi]
    and     edx,0Fh
    shl     edx,8
    inc     eax
    mov     dl,byte ptr [edx+ebx+40666Ch]
    mov     byte ptr [eax-1],dl
          >> *pSrc++;
          >> *pszOutBuf++ = lookup_table_b4[*pSrc];
    xor     edx,edx
    mov     dl,byte ptr [esi]
    inc     eax
          >> *pSrc++;
    inc     esi
    dec     ebp
    mov     dl,byte ptr lookup_table_b4 (0040776c)[edx]
    mov     byte ptr [eax-1],dl
    jne     ToBase64_infinity08_lookup_b+2Dh ; (first opcode above)
    pop     edi
    pop     ebp
    pop     ebx

=-=-=-=-=-=-=--=
Although there is out-of-sequence "obfuscation" by the compiler, a statement like:
     lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
is a lookup into a 2-d array, which means that two index values need to be calculated...

          >>  *pszOutBuf++ = lookup_table_b1[*pSrc];       // simple 1-d lookup
    xor     edx,edx
    mov     dl,byte ptr [esi]                   // get source byte
    mov     dl,byte ptr lookup_table_b1[edx]    // look it up
    mov     byte ptr [eax-1],dl                 // store the output

          >>  *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)]; ;; complex 2-d lookup
    xor     edx,edx
    mov     dl,byte ptr [esi]                   // get src byte
    and     edx,3                               // calc first part of the index
    shl     edx,8
    xor     ebx,ebx
    mov     bl,byte ptr [edi]                   // calc second part of the index
    mov     dl,byte ptr [edx+ebx+40626Ch]       // look up the value
    mov     byte ptr [eax-1],dl                 // store the output

Hoping to cut down some of the work, I tried this:  We all know that...
      char table[4][256];
      z= table[x][y];
...equates to...
      char table[1024];
      z= table[(x*256) + y];
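That equivalence is easy to verify directly; here is a standalone check (the fill pattern is an arbitrary choice of mine):

```cpp
#include <cassert>

char table2d[4][256];

// A [4][256] array and a flat pointer to its first element address the same
// bytes, so table2d[x][y] and flat[(x*256) + y] must always agree.
bool FlattenedIndexingMatches()
{
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 256; y++)
            table2d[x][y] = char((x * 31 + y) & 0x7F);   // arbitrary fill

    const char* flat = &table2d[0][0];
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 256; y++)
            if (flat[(x * 256) + y] != table2d[x][y])
                return false;
    return true;
}
```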

So I used this code:
    char* t2= &lookup_table_b2[0][0];
    char* t3= &lookup_table_b3[0][0];

    while (nSrcLen >= 3) {
          BYTE b1= *pSrc++;
          BYTE b2= *pSrc++;
          BYTE b3= *pSrc++;
          *pszOutBuf++ = lookup_table_b1[b1];
          *pszOutBuf++ = t2[((b1 & 0x03)*256)+b2];
          *pszOutBuf++ = t3[((b2 & 0x0F)*256)+b3];
          *pszOutBuf++ = lookup_table_b4[b3];
          nSrcLen -= 3;
    }

But the score came out the same.  The local variables b1,b2,b3 wind up as loads of registers (never getting saved to memory), so that shouldn't cause a slowdown.

After trying several alternatives, I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance.  Yet :-)
Infinity08

>> Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.

Even if you put it inside the timer, it shouldn't change a lot, because we're processing about 1GB of data.

The problem is that using lookup tables that way makes the algorithm very compiler dependent (memory architecture dependent). When I tested it on my laptop (Intel Pentium M 1.5 GHz - 64KB L1, 1MB L2), I got these results:

    My standard code (no optimisations): 80.92 MB/s
    My standard code (with optimisations): 95.74 MB/s
    My lookup code b (no optimisations): 84.84 MB/s
    My lookup code b (with optimisations): 313.09 MB/s

You notice the huuuge increase in performance?  My bet is that the size of my lookup tables is exactly right for my system, but not for yours. Another possibility is different compiler flags. For the record, I used these for g++:

    -Wall -ansi -fexpensive-optimizations -O3

>> It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor.

Yep, but a lot depends on the memory caching that happens. If after 1 pass, all 4 lookup tables are in the L1 cache completely, then the other 999 passes will be lightning fast. If however they get flushed out regularly, the advantage of the lookup tables disappears fast.
My code will be fast when encoding large amounts of data (like in this setup), but for smaller amounts of data, it doesn't win as much and will probably even be slower than regular algorithms.

>> In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables

I would love to eliminate those 2D tables indeed, and I've tried a bit, but everything I tried until now resulted in worse performance.

>> I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance.  Yet :-)

I'm not giving up :)
DanRollins

Wow, that is a rather dramatic difference for your _d version.
I'll see if I can find a "Genuine Intel" processor to see if that is the difference (I'm testing with an AMD).
Also, there may be a way to force the lookup tables into cache and keep them there (at least in the ASM versions).  I'll see if I can find out anything about that.

=-=-=-=-=-=-=
I love synergy... your code illustrated a possibility that I had not considered... a 256-byte table lets us avoid a mask operation on each lookup.

char chrTbl256[]=
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void z7ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
      char* p= pszOutBuf;
      DWORD d;
      for ( int j=0; j< nSrcLen; j+=3 ) {
            d = *pSrc++; d <<= 8;
            d|= *pSrc++; d <<= 8;
            d|= *pSrc++;

            p[3]= chrTbl256[ BYTE(d) ]; d >>= 6;   // note: avoids the & 0x0000003F
            p[2]= chrTbl256[ BYTE(d) ]; d >>= 6;
            p[1]= chrTbl256[ BYTE(d) ]; d >>= 6;
            p[0]= chrTbl256[ BYTE(d) ];

            p+= 4;
      }
      // fixup for final bytes (padding) not shown
}

That avoids 4 billion AND operations across the test and certainly has a measurable effect.
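The "fixup for final bytes" that z7ToBase64 omits could look something like this sketch of mine (not Dan's actual code): after the main loop consumes every full 3-byte group, a 1- or 2-byte tail still produces exactly four output characters.

```cpp
#include <cassert>

static const char kTail64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode the leftover bytes (nRem = nSrcLen % 3) and pad with '='.
// Returns the number of characters written: 0 or 4.
int Base64PadTail(const unsigned char* pSrc, int nRem, char* p)
{
    if (nRem == 1) {
        p[0] = kTail64[pSrc[0] >> 2];
        p[1] = kTail64[(pSrc[0] << 4) & 0x30];
        p[2] = '=';
        p[3] = '=';
        return 4;
    }
    if (nRem == 2) {
        p[0] = kTail64[pSrc[0] >> 2];
        p[1] = kTail64[((pSrc[0] << 4) & 0x30) | (pSrc[1] >> 4)];
        p[2] = kTail64[(pSrc[1] << 2) & 0x3C];
        p[3] = '=';
        return 4;
    }
    return 0;   // input length was a multiple of 3: nothing to do
}
```

Note that the main loop of z7ToBase64 as posted would also need its bound changed to stop before the partial group, e.g. `j+3 <= nSrcLen`, before handing the tail to a fixup like this.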

MB/sec    Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
188.78  zToBase64 / Dan's 'standard' C code
197.13  zToBase64 / Dan's 'standard' C code with 256-byte lookup table (thanks to Infinity08's idea)

I predict that in the ASM version, that very simple and obvious concept (simple and obvious once somebody shows it to you, lol :-) will have an even greater impact.  That's because in ASM, we can swap bytes around very easily (e.g., just clear EDX, then move, say, BL (source data) to DL and use EDX as the lookup index).

-- Dan
Infinity08

Ok, a small update: made the lookup tables smaller, and saved some operations in the loop:

    void ToBase64_infinity08_lookup_e(BYTE *pSrc, int nSrcLen, char *pszOutBuf) {
      BYTE *end = pSrc + nSrcLen - (nSrcLen % 3);
      while (pSrc < end) {
        *pszOutBuf++ = lookup_table_e1[*pSrc];
        *pszOutBuf++ = lookup_table_e2[(*pSrc & 0x03) | (*(pSrc + 1) & 0xF0)];
        pSrc++;
        *pszOutBuf++ = lookup_table_e3[(*pSrc & 0x0F) | (*(pSrc + 1) & 0xC0)];
        pSrc++;
        *pszOutBuf++ = lookup_table_e4[*pSrc];
        pSrc++;
      }
      if (nSrcLen %= 3) {
        *pszOutBuf++ = lookup_table_e1[*pSrc];
        if (--nSrcLen) {
          *pszOutBuf++ = lookup_table_e2[(*pSrc & 0x03) + (*(pSrc + 1) & 0xF0)];
          pSrc++;
          *pszOutBuf++ = lookup_base64[(*pSrc << 2) & 0x3C];
          *pszOutBuf++ = '=';
        }
        else {
          *pszOutBuf++ = lookup_base64[(*pSrc << 4) & 0x30];
          *pszOutBuf++ = '=';
          *pszOutBuf++ = '=';
        }
      }
      return;
    }

lookup tables are generated like this:

    char lookup_table_e1[256];
    char lookup_table_e2[256];
    char lookup_table_e3[256];
    char lookup_table_e4[256];

    void generate_lookup_table_e() {
      int i = 0;

      for (i = 0; i < 256; i++) {
        lookup_table_e1[i] = lookup_base64[(i >> 2) & 0x3F];
      }
      for (i = 0; i < 256; i++) {
        lookup_table_e2[i] = lookup_base64[((i << 4) & 0x30) | ((i >> 4) & 0x0F)];
      }
      for (i = 0; i < 256; i++) {
        lookup_table_e3[i] = lookup_base64[((i << 2) & 0x3C) | ((i >> 6) & 0x03)];
      }
      for (i = 0; i < 256; i++) {
        lookup_table_e4[i] = lookup_base64[i & 0x3F];
      }
      return;
    }
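As a sanity check (my own, not part of the submission), the four e-tables can be regenerated and compared against a plain shift-and-mask encoder for arbitrary 3-byte groups:

```cpp
#include <cassert>

static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

char e1[256], e2[256], e3[256], e4[256];

// Same generation rules as generate_lookup_table_e() above.
void GenerateETables()
{
    for (int i = 0; i < 256; i++) {
        e1[i] = lookup_base64[(i >> 2) & 0x3F];
        e2[i] = lookup_base64[((i << 4) & 0x30) | ((i >> 4) & 0x0F)];
        e3[i] = lookup_base64[((i << 2) & 0x3C) | ((i >> 6) & 0x03)];
        e4[i] = lookup_base64[i & 0x3F];
    }
}

// For one 3-byte group (b1,b2,b3), the four table lookups must reproduce the
// four digits a straightforward shift-and-mask encoder emits.
bool ETablesMatchDirect(unsigned b1, unsigned b2, unsigned b3)
{
    unsigned d = (b1 << 16) | (b2 << 8) | b3;
    return e1[b1] == lookup_base64[(d >> 18) & 0x3F]
        && e2[(b1 & 0x03) | (b2 & 0xF0)] == lookup_base64[(d >> 12) & 0x3F]
        && e3[(b2 & 0x0F) | (b3 & 0xC0)] == lookup_base64[(d >> 6) & 0x3F]
        && e4[b3] == lookup_base64[d & 0x3F];
}
```

The trick the e-tables exploit is that each output digit depends on at most 6 "interesting" bits, so those bits can be packed into a single-byte index with disjoint masks instead of a 2-d lookup.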

Just out of curiosity, what optimisation flags do you use?
Infinity08

>> Wow, that is a rather dramatic difference for your _d version.
I tested the same code on an AMD Athlon64 3700+, and didn't notice the same huge performance increase I had on the Pentium M. I think I'll do the rest of the tests on this one, as it should be closer to yours :)

>> your code illustrated a possibility that I had not considered... a 256-byte table lets us avoid a mask operation on each lookup.
Yep, a lookup table can be a very handy tool when optimizing for speed. But it only really works when there's a lot of data to process, and when you can ensure that the lookup table stays in fast memory (preferably the L1 cache) the whole time. That's the difficulty, especially since there are so many different architectures, and they all have different behaviour when it comes to caching.
DanRollins

I've set my VC++ 6.0 compiler to use /O2 (oh 2, "maximize speed"), which the book says is equivalent to:
   /Og /Oi /Ot /Oy /Ob1 /Gs /Gf /Gy
   /Og   global optimizations, automatic-register allocation, and loop optimization
   /Oi   replaces certain function calls with intrinsic (inline) forms
   /Ot   instructs the compiler to favor speed over size
   /Oy   suppresses creation of frame pointers on the call stack
   /Ob1  inline expansion of functions (irrelevant)
   /Gs   (irrelevant, has to do with stack probes)
   /Gf   puts a single copy of identical strings into the .EXE file
   /Gy   (irrelevant) packages individual functions (COMDATs)
DanRollins

MB/sec    Internal name // Description
--------  --------------------------------------------------------------------------------------
168.43  ToBase64_infinity08_standard / Infinity08 (std)
181.21  ToBase64_infinity08_lookup_b / Infinity08 (multi-table lookup, 6K of tables)

179.79  ToBase64_infinity08_lookup_e / Infinity08 (multi-table lookup, 1K of tables)

By my measurements, it's a bit slower than your previous try (within the margin of error).

It would be tremendously interesting to me to see the opcodes generated for the version that you measured as having such a high rate (for comparison with the opcodes I posted in http:#17547635).  Your other measurements were relatively proportional to mine, but that one was off the scale by comparison.  Can you post that ASM listing?  Whatever tricks your compiler is using would come in handy in the ASM version, at least.

I'd hate to jump to the conclusion that it all boils down to L1 cache-related issues if it is really something else.
Infinity08

The compiler I'm using : g++ (GCC) version 3.3.1 (MinGW32 special 20030804-1)

flags without optimization : -Wall
flags with optimization : -Wall -ansi -fexpensive-optimizations -O3

I've just listed the main loops.

NO OPTIMIZATION:
============

a) standard:
    ----------

L99:
      cmpl      $2, 12(%ebp)
      jg        L101
      jmp       L100
L101:
      movl      16(%ebp), %eax
      movl      %eax, %edx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      shrb      $2, %al
      movzbl    %al, %eax
      andl      $63, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%edx)
      leal      16(%ebp), %eax
      incl      (%eax)
      movl      16(%ebp), %eax
      movl      %eax, %ecx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      sall      $4, %eax
      movl      %eax, %edx
      andl      $48, %edx
      movl      8(%ebp), %eax
      incl      %eax
      movzbl    (%eax), %eax
      shrb      $4, %al
      movzbl    %al, %eax
      andl      $15, %eax
      orl       %edx, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%ecx)
      leal      16(%ebp), %eax
      incl      (%eax)
      incl      8(%ebp)
      movl      16(%ebp), %eax
      movl      %eax, %ecx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      sall      $2, %eax
      movl      %eax, %edx
      andl      $60, %edx
      movl      8(%ebp), %eax
      incl      %eax
      movzbl    (%eax), %eax
      shrb      $6, %al
      movzbl    %al, %eax
      andl      $3, %eax
      orl       %edx, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%ecx)
      leal      16(%ebp), %eax
      incl      (%eax)
      incl      8(%ebp)
      movl      16(%ebp), %eax
      movl      %eax, %edx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      andl      $63, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%edx)
      leal      16(%ebp), %eax
      incl      (%eax)
      incl      8(%ebp)
      leal      12(%ebp), %eax
      subl      $3, (%eax)
      jmp       L99

b) multi-table lookup, 6K of tables (b):
    ---------------------------------------

L115:
      cmpl    $2, 12(%ebp)
      jg      L117
      jmp     L116
L117:
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_b1(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      andl    $3, %eax
      sall    $8, %eax
      addl    %edx, %eax
      addl    $_lookup_table_b2, %eax
      movzbl  (%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      andl    $15, %eax
      sall    $8, %eax
      addl    %edx, %eax
      addl    $_lookup_table_b3, %eax
      movzbl  (%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_b4(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      leal    12(%ebp), %eax
      subl    $3, (%eax)
      jmp     L115

c) multi-table lookup, 1K of tables (e):
    ---------------------------------------

L136:
      movl    8(%ebp), %eax
      cmpl    -8(%ebp), %eax
      jb      L138
      jmp     L137
L138:
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_e1(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movl    %eax, %edx
      andl    $3, %edx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %eax
      andl    $240, %eax
      orl     %edx, %eax
      movzbl  _lookup_table_e2(%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movl    %eax, %edx
      andl    $15, %edx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %eax
      andl    $192, %eax
      orl     %edx, %eax
      movzbl  _lookup_table_e3(%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_e4(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      jmp     L136

WITH OPTIMIZATION:
==============

a) standard:
    ----------

L171:
      movzbl  (%ecx), %eax
      subl    $3, %esi
      shrb    $2, %al
      andl    $63, %eax
      movzbl  _lookup_base64(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      sall    $4, %edx
      shrb    $4, %al
      andl    $48, %edx
      andl    $15, %eax
      orl     %eax, %edx
      movzbl  _lookup_base64(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      sall    $2, %edx
      shrb    $6, %al
      andl    $60, %edx
      andl    $3, %eax
      orl     %eax, %edx
      movzbl  _lookup_base64(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      incl    %ecx
      andl    $63, %eax
      movzbl  _lookup_base64(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      cmpl    $2, %esi
      jg      L171

b) multi-table lookup, 6K of tables (b):
    ---------------------------------------

L195:
      movzbl  (%ecx), %eax
      subl    $3, %esi
      movzbl  _lookup_table_b1(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      movzbl  1(%ecx), %edx
      incl    %ecx
      andl    $3, %eax
      sall    $8, %eax
      movzbl  _lookup_table_b2(%edx,%eax), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      movzbl  1(%ecx), %edx
      incl    %ecx
      andl    $15, %eax
      sall    $8, %eax
      movzbl  _lookup_table_b3(%edx,%eax), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      incl    %ecx
      movzbl  _lookup_table_b4(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      cmpl    $2, %esi
      jg      L195

c) multi-table lookup, 1K of tables (e):
    ---------------------------------------

L228:
      movzbl  (%ecx), %eax
      movzbl  _lookup_table_e1(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      andl    $3, %edx
      andl    $240, %eax
      orl     %eax, %edx
      movzbl  _lookup_table_e2(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      andl    $15, %edx
      andl    $192, %eax
      orl     %eax, %edx
      movzbl  _lookup_table_e3(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      incl    %ecx
      movzbl  _lookup_table_e4(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      cmpl    %edi, %ecx
      jb      L228
Infinity08

>> I'd hate to jump to the conclusion that it all boils down to L1 cache-related issues if it is really something else.
Well, I've tested the exact same executable on an AMD Athlon64 and an Intel Pentium M, with the results I described earlier.
I know that the comparison is unfair, as they're completely different architectures - but on the other hand, that's exactly what's causing the different behavior imo.

I would be interested to see you try the g++ compiler with the same flags I used, to see what results you get, because there could also be a difference in optimization between Visual C++ and GCC.

These are the optimizations enabled by the -O3 flag :

  -falign-functions : Align the start of functions to the machine dependent byte boundary
  -falign-jumps : Align branch targets to the machine dependent byte boundary
  -falign-labels : Align all branch targets to the machine dependent byte boundary
  -falign-loops : Align loops to the machine dependent byte boundary
  -fcaller-saves : Enable values to be allocated in registers that will be clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls
  -fcprop-registers : Perform a copy-propagation pass to try to reduce scheduling dependencies
  -fcrossjumping : Perform cross-jumping transformation
  -fcse-follow-jumps : In common subexpression elimination, scan through jump instructions when the target of the jump is not reached by any other path
  -fcse-skip-blocks : This is similar to -fcse-follow-jumps, but causes CSE to follow jumps which conditionally skip over blocks
  -fdefer-pop : For machines which must pop arguments after a function call, the compiler normally lets arguments accumulate on the stack for several function calls and pops them all at once
  -fdelayed-branch : If supported for the target machine, attempt to reorder instructions to exploit instruction slots available after delayed branch instructions
  -fdelete-null-pointer-checks : Use global dataflow analysis to identify and eliminate useless checks for null pointers
  -fexpensive-optimizations : Perform a number of minor optimizations that are relatively expensive
  -fforce-mem : Force memory operands to be copied into registers before doing arithmetic on them
  -fgcse : Perform a global common subexpression elimination pass. This pass also performs global constant and copy propagation
  -fgcse-lm : When -fgcse-lm is enabled, global common subexpression elimination will attempt to move loads which are only killed by stores into themselves
  -fgcse-sm : When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination
  -fguess-branch-probability : Guess branch probabilities using a randomized model
  -fif-conversion : Attempt to transform conditional jumps into branch-less equivalents
  -fif-conversion2 : Use conditional execution (where available) to transform conditional jumps into branch-less equivalents
  -finline-functions : Integrate all simple functions into their callers
  -floop-optimize : Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well
  -fmerge-constants : Attempt to merge identical constants (string constants and floating point constants) across compilation units
  -foptimize-sibling-calls : Optimize sibling and tail recursive calls
  -fpeephole2 : Enable machine-specific peephole optimizations
  -fregmove : Attempt to reassign register numbers in move instructions and as operands of other simple instructions in order to maximize the amount of register tying
  -frename-registers : Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation
  -freorder-blocks : Reorder basic blocks in the compiled function in order to reduce the number of taken branches and improve code locality
  -freorder-functions : Reorder functions in the object file in order to improve code locality
  -frerun-cse-after-loop : Re-run common subexpression elimination after loop optimizations have been performed
  -frerun-loop-opt : Run the loop optimizer twice
  -fsched-interblock : Schedule instructions across basic blocks
  -fsched-spec : Allow speculative motion of non-load instructions
  -fschedule-insns : If supported for the target machine, attempt to reorder instructions to eliminate execution stalls due to required data being unavailable
  -fschedule-insns2 : Similar to -fschedule-insns, but requests an additional pass of instruction scheduling after register allocation has been done
  -fstrength-reduce : Perform the optimizations of loop strength reduction and elimination of iteration variables
  -fstrict-aliasing : Allows the compiler to assume the strictest aliasing rules applicable to the language being compiled
  -fthread-jumps : Perform optimizations where we check to see if a jump branches to a location where another comparison subsumed by the first is found
Infinity08

>> (1) Converting two bytes at once by using a 64x64 lookup table.
I've actually tried that (and some variations on it), but it creates other problems that bring down performance:
  a) the total size of the lookup tables would be 8 kB, which could be too big for general usage (or it might not);
  b) you're dealing with 2D tables, which require a lot of operations just to calculate the address;
  c) each entry in the lookup table will be two bytes instead of one, so you're not gaining a lot in that respect.
The code I tried using this "trick" was slower than my standard code.

>> (2) Keeping the conversion table in the cache by periodic uses of cache PREFETCH instructions. Or on the x64, read-specific prefetch.
Can you give a bit more info on those? They sound interesting. Although, I think, if the data to be encoded is sufficiently "random", we won't need to keep the tables in the cache this way, as they should be kept there by the encoding process itself.

>> (3) Unrolling the main loop a few times.
Certain compiler optimizations already do that. It's worth a shot, but in this particular case, I'm not sure it'll win us much.

>> (4) spinning off separate threads to take advantage of multiple cores or CPUs.
We're kind of limited by Dan's test system here :) (2.6 GHz Athlon).

>> (5) Carefully overlapping instructions to take into account the two to four instruction pipes available on Pentium and newer CPUs.
That's usually handled by the compiler (if it's doing a good job), and is very difficult to control from C (you really need assembler to do this well).
And it's system dependent, which we should avoid, imo.

>> (6) Using MMX, SSE, and other instructions that process two to eight operands at one gulp.
Worth a try, I guess. I don't know the MMX and SSE instruction sets too well though ... And again, it's hard to control using C instead of ASM.

Infinity08

It might be fast, but it's not encoding the string correctly (SA==BIAABIAABIAABIAABIAA...) :) A few things:

1) you're depending on the endianness of the machine (running it on a machine with different endianness will mess up the algorithm).
2) the masking operations in the while loop don't do what you expect them to do.
3) you're depending on the sizes of WORDs and LONGs.
4) the *pChar += 3 part will only work to a certain extent, until the byte overflows.
5) the if at the end uses pszOutBuf while that has never been incremented, so it overwrites the start of the string.

Points 1) and 3) are the exact reasons why I didn't use the tricks you did ...
grg99

BTW, all those "*pSrc++;" lines almost certainly should be "pSrc++;". You want to bump the pointer, not the thing it points to.

DanRollins

Infinity08,
Thanks for posting that ASM. Â I'll be looking at it closely soon...

grg99,
The "mystery algorithm" that I mentiond earlier uses the 8096-byte table with the two-byte lookup scheme and a 12-bit index...
Â
I'm glad you hit on it ! (I felt sure it would come up sooner or later :-)

=-=-=-=-=-=-=-=-=-=-=-=
char Lookup12bits[ ((64*64)*2)+1 ]=  // 8192+1
"BABBBCBDBEBFBGBHBIBJBKBLBMBNBOBPBQBRBSBTB ... B6B7B8B9B+B/"
...
"9A9B9C9D9E9F9G9H9I9J9K9L9M9N9O9P9Q9R9S9T9...  5969798999+9/"
"+A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+...  5+6+7+8+9+++/"
"/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/...  5/6/7/8/9/+//"
;

//------------------------- Dan's "mystery algorithm" (similar to grg99's code)
void mToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
    char* p= pszOutBuf;
    DWORD d;
    WORD* pw=     (WORD*)pszOutBuf;
    WORD* pawTbl= (WORD*)Lookup12bits;

    while ( nSrcLen >= 3 ) {
        d = *pSrc++; d <<= 8;
        d|= *pSrc++; d <<= 8;
        d|= *pSrc++;
        nSrcLen -= 3;

        *pw++ = pawTbl[ d >> 12 ];  // no need for a mask
        *pw++ = pawTbl[ d & 0x00000FFF ];
    }
    // ... code to fixup for end-of-src logic
}
=-=-=-=-=-=-=-=-=-=-=-=
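For completeness, the elided end-of-src fixup is where the equal-sign padding from the problem statement comes in. Here is one possible sketch (not the actual omitted code; it is written against the plain one-character lookup table rather than the 12-bit pair table, and base64_tail is a made-up helper name):

```c
static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Tail fixup after the main 3-byte loop: 0, 1 or 2 source bytes remain.
   One leftover byte yields two digits plus "==", two leftover bytes yield
   three digits plus "=".  Returns a pointer just past the written output. */
char *base64_tail(const unsigned char *pSrc, int nLeft, char *pOut)
{
    if (nLeft == 1) {
        *pOut++ = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        *pOut++ = lookup_base64[ (pSrc[0] << 4) & 0x30];
        *pOut++ = '=';
        *pOut++ = '=';
    } else if (nLeft == 2) {
        *pOut++ = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        *pOut++ = lookup_base64[((pSrc[0] << 4) & 0x30) | ((pSrc[1] >> 4) & 0x0F)];
        *pOut++ = lookup_base64[ (pSrc[1] << 2) & 0x3C];
        *pOut++ = '=';
    }
    *pOut = '\0';
    return pOut;
}
```

Either way, the output length stays a multiple of four, which is what the decoder relies on.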

Infinity08 is right... it's all about endianness... For this algorithm to work, the table has to be arranged with the pairs in the opposite of the expected order -- as built by generate_lookup_table_grg12() or as in my static table.

My version is running a little faster than the others (on my box, by my measure, I'm getting 208.50 MB/s), so it's the fastest of all except for grg99's...

However, I can't post an "official" timing for grg99's code because the output is not yet correct. It's sure to be fast code, but some of what's missing will require a bit of additional code to fix (for instance, the endianness of the INPUT bytes needs to be rectified, and the source pointer needs to be updated -- you are currently reading and re-reading the same bytes over and over). The required changes will affect the timing, of course.

-- Dan
DanRollins

Re: grg99's other suggestions:

(1) 8192-byte (64*64*2) lookup table
I think this will be the fastest technique short of a 1,048,576-byte (64*64*64*4) lookup table -- which seems like overkill, and due to cache thrashing, may not be faster, anyway.

(2) PREFETCH instructions.
I could try this out on the ASM side, but I'm not sure how much difference it will make. Most L1 caches are far larger than the 8192-byte lookup table, so once it's been fetched, it gets accessed constantly -- so it seems likely to stay there.
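From the C side, the experiment is cheap to set up. This is just an untimed sketch of mine, using GCC's __builtin_prefetch builtin (other compilers would need their own intrinsic); the hint here is on the *source* data one cache line ahead, since the table should already be resident:

```c
static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Standard 3-bytes-in / 4-digits-out loop, plus a software prefetch of the
   source data one cache line (64 bytes) ahead of the current position.
   __builtin_prefetch is a GCC builtin; the #if keeps the code portable. */
void to_base64_prefetch(const unsigned char *pSrc, int nSrcLen, char *pszOut)
{
    while (nSrcLen >= 3) {
#if defined(__GNUC__)
        __builtin_prefetch(pSrc + 64);   /* hint: this line is needed soon */
#endif
        *pszOut++ = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        *pszOut++ = lookup_base64[((pSrc[0] << 4) & 0x30) | ((pSrc[1] >> 4) & 0x0F)];
        *pszOut++ = lookup_base64[((pSrc[1] << 2) & 0x3C) | ((pSrc[2] >> 6) & 0x03)];
        *pszOut++ = lookup_base64[  pSrc[2]       & 0x3F];
        pSrc += 3;
        nSrcLen -= 3;
    }
    *pszOut = '\0';   /* end-of-src padding logic omitted, as elsewhere */
}
```

Whether the hint wins anything here is exactly the open question; a hardware prefetcher may already be doing the same job for a sequential read pattern.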

(3) Â Unrolling the main loop a few times.
This *could* have a significant effect. I know that processing 6 input bytes into 8 output bytes (and running the loop nSrcLen/6 times) gave me a boost in some ASM code I was trying...
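In C, that kind of 6-in/8-out unroll might be sketched like this (an illustrative sketch only, not the ASM code referred to above, and the function name is made up; the leftover 0..5 bytes still need the ordinary loop plus '=' padding):

```c
static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* One iteration consumes 6 source bytes and emits 8 digits -- the standard
   3->4 body written out twice, halving the per-iteration loop overhead. */
void to_base64_unroll2(const unsigned char *pSrc, int nSrcLen, char *pOut)
{
    while (nSrcLen >= 6) {
        pOut[0] = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        pOut[1] = lookup_base64[((pSrc[0] << 4) & 0x30) | ((pSrc[1] >> 4) & 0x0F)];
        pOut[2] = lookup_base64[((pSrc[1] << 2) & 0x3C) | ((pSrc[2] >> 6) & 0x03)];
        pOut[3] = lookup_base64[  pSrc[2]       & 0x3F];
        pOut[4] = lookup_base64[ (pSrc[3] >> 2) & 0x3F];
        pOut[5] = lookup_base64[((pSrc[3] << 4) & 0x30) | ((pSrc[4] >> 4) & 0x0F)];
        pOut[6] = lookup_base64[((pSrc[4] << 2) & 0x3C) | ((pSrc[5] >> 6) & 0x03)];
        pOut[7] = lookup_base64[  pSrc[5]       & 0x3F];
        pSrc += 6;  pOut += 8;  nSrcLen -= 6;
    }
    /* 0..5 remaining bytes: fall through to the ordinary 3->4 loop + padding */
    pOut[0] = '\0';
}
```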

(4) take advantage of multiple cores or CPUs.
I'd certainly explore this if there was a $1,000,000 prize... It would probably nearly double the throughput to have two cores working simultaneously... say one on the first half and the other on the last half.

(5) Â overlapping/pairing instructions (optimize for dual pipeline architecture)...
As Infinity08 mentioned, the compiler will probably do a pretty good job of this. I did see some modest gains in some ASM code I was playing around with. But the inner loop ends up being very tight, so there aren't a lot of opportunities... you pretty much need the result of each register as soon as you can set its value.

(6) Â Using MMX ... two to eight operands at one gulp.
My most recent ASM try (unpublished) builds an 8-byte output into an MMX register (from six bytes of input) and dumps a quadword at each pass. There is even a "non-temporal" version of the MOVQ opcode (MOVNTQ) that does not fill the cache with the output data (which never gets used again). This technique is, indeed, the fastest I've timed yet... I'll post it in the ASM thread in due time.
Infinity08

>> The "mystery algorithm" that I mentiond earlier uses the 8096-byte table with the two-byte lookup scheme and a 12-bit index...
The only problem with this kind of algorithm, is that you need to take into account machines with different endianness. That's still doable, just use different code on different machines. But it gets harder on machines that have 1byte words for example.
It was my first thought too, but I abandoned the idea for the reasons I mentioned.
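Picking the right table layout at run time only takes a trivial check, e.g. this sketch (is_little_endian is a made-up helper name):

```c
/* One common way to choose a table variant at startup: test how the machine
   lays out a multi-byte integer in memory.  Returns 1 on little-endian
   machines (the low-order byte comes first), 0 on big-endian ones. */
int is_little_endian(void)
{
    unsigned int one = 1;
    return *(unsigned char *)&one == 1;
}
```

The check runs once, so its cost is irrelevant next to the encoding loop; the remaining cost is maintaining two table-generation paths.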

>> BTW, all those "*pSrc++;" lines, they almost certainly should be "pSrc++;" You want to bump the pointer, not the thing it points to.
heh, that's a code editing problem (copy-paste). But the operator precedence still makes sure that the ++ is done before the *. So, the dereference operation is useless, but doesn't create a problem either, as it's optimized out by the compiler.
grg99

>But the operator precedence still makes sure that the ++ is done before the *.

Huh? To my knowledge, *p++ means, and has always meant, dereference p, then increment p.

But you're right, most compilers will optimize away the useless load, unless p is declared volatile.

And I was wrong, it doesn't increment what p points to. Dang language.

I will try modifying my code to handle both endians, hope that doesn't slow it down too much.
Also it would be fun to try a 3-hextet lookup table, although that's going to be wildly cache-unfriendly, all for a 50% gain at most. Oh, what fun!

Also an interesting point: if I comment out the line that stores the output data in the output array (two bytes at a time), the speed DOUBLES, which implies the code is going just about as fast as memory can accept 16-bit words of data. One possibility is to buffer up 32 bits of output at a time and see if that helps a bit.
Infinity08

>> Huh? To my knowledge, *p++ means, and has always meant, dereference p, then increment p.
I meant that the ++ has higher precedence, so *p++ is the same as *(p++). I might have formulated it a bit incorrectly by using the word "done", though :) (the "done" I used referred to the compilation, not the execution)
DanRollins

I saw the *p++ and (because it worked :-) figured the same thing: It increments the pointer, which now points to a new location, but since the code does not use the data at the pointer, the effect is the same as p++. I guess that if you want to increment the value that is AT the pointer, you need to use (*p)++ to override operator precedence.
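A tiny sketch makes the precedence behavior easy to verify:

```c
/* *p++ parses as *(p++): the pointer advances, and the dereferenced value is
   read and (here) discarded.  (*p)++ is the form that increments the pointee. */
int demo_ptr_increment(void)
{
    int a[2] = { 10, 20 };
    int *p = a;

    *p++;        /* advances p; a[0] is untouched           */
    (*p)++;      /* p now points at a[1]; increments it to 21 */

    return (p == a + 1) && a[0] == 10 && a[1] == 21;
}
```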

Other notes:
The Microsoft ATL Server includes a header file (atlenc.h) that contains a Base64 Encoder. In case you don't have Visual Studio 2003, I found the source code here:

The code does a number of things, such as inserting line breaks every 76 bytes (which is mentioned in the RFC, and might have made this a slightly different challenge, but it's trivial enough that I guess we can ignore it). But when you pare out such superfluous stuff, the Microsoft code has a main loop that looks like this:

int nLoopCnt(nSrcLen/3);

for ( int j=0; j< nLoopCnt; j++ ) {
    DWORD dwCurr(0);
    for (int n=0; n<3; n++) {
        dwCurr |= *pSrc++;
        dwCurr <<= 8;
    }
    for (int k=0; k<4; k++) {
        BYTE b = (BYTE)(dwCurr>>26);
        *pszOutBuf++ = abDigitLookup[b];
        dwCurr <<= 6;
    }
}
//--- end-of-source logic omitted

This comes in at 166 MB/s. When you unroll the loops, it goes to 188 MB/s (about the rate of "Dan's std" code).

Tell me if I'm wrong, but it looks like this will work regardless of endianness...

They used a clever technique to break out the 6-bit indexes, that interestingly enough, matches with the trick used in craigwardman's technique on the ASM side:

   1) Accumulate three source bytes, shifting to the left, 8 bits at a time
      (the first byte ends up in the top of the DWORD accumulator)
   2) The lookup index is the accumulator value, shifted right by 26 (the upper 6 bits)
      (this eliminates one masking operation)
   3) Look up the output byte and save it
   4) Move the remaining bits to the left by 6 bits
   5) ... rinse and repeat (steps 2-4, four times)

I think it's an interesting way to do it, mainly because it never would have occurred to me! In my heyday of ASM programming on the 8086 (circa 1985), a SHIFT operation took a specific number of clock cycles *plus* some cycles per bit position shifted. So a >> 26 still *looks like* an expensive operation to me :-)
DanRollins

It does seem that using a single DWORD read to grab the three bytes could be more efficient than three separate 1-BYTE reads... But the endian thing makes it a hassle. There are several easy solutions in ASM, but I can't work out an efficient way to code it in C...

// ---------- the 'standard' way to collect the three bytes
      DWORD d;
      d = *pSrc++; d <<= 8;
      d|= *pSrc++; d <<= 8;
      d|= *pSrc++;

// ---------- a try that reads a full DWORD and rearranges...
      DWORD d, e= *(DWORD*)pSrc;
      d = ((e & 0x0000FF00))
        | ((e & 0x000000FF ) << 16)
        | ((e & 0x00FF0000 ) >> 16);
      pSrc += 3;

The second version ran considerably slower than the first in my time trials.

Does anybody have any ideas on how to get a potential speedup by using full-DWORD reads without losing time rearranging the bytes for subsequent processing?
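One candidate idea (a sketch of mine, not timed on the test box): on x86 the rearrangement is exactly a byte swap, so the triple can be fetched as bswap(e) >> 8. Newer GCC versions expose __builtin_bswap32 for exactly this; the portable shift form below should compile to similar code on a good compiler. Note the assumptions: a little-endian load, and a read of one byte past the final triple, so the source buffer needs a byte of slack.

```c
#include <string.h>

/* Build d = b0<<16 | b1<<8 | b2 from one 32-bit load instead of three
   byte loads: byte-swap the word and drop the low byte.  On x86 the swap
   can be a single BSWAP instruction.  Assumes a little-endian machine and
   one readable byte of slack after the triple. */
unsigned int load_triple_le(const unsigned char *pSrc)
{
    unsigned int e;
    memcpy(&e, pSrc, 4);                        /* little-endian DWORD load */
    e = (e >> 24) | ((e >> 8) & 0x0000FF00)     /* portable 32-bit bswap    */
      | ((e << 8) & 0x00FF0000) | (e << 24);
    return e >> 8;                              /* b0<<16 | b1<<8 | b2      */
}
```

With the hex example from the top of the thread, bytes 0x07 0xFF 0x02 should come back as 0x0007FF02, ready for the 6-bit extraction.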
Infinity08

>> Tell me if I'm wrong, but it looks like this will work regardless of endianness...
It will, but really it's not much different than this (my standard code in this case - but any similar approach works too) :

      while (nSrcLen >= 3) {
        *pszOutBuf++ = lookup_base64[ (*pSrc >> 2) & 0x3F];
        *pszOutBuf++ = lookup_base64[((*pSrc << 4) & 0x30) | (((*(pSrc + 1)) >> 4) & 0x0F)];
        pSrc++;
        *pszOutBuf++ = lookup_base64[((*pSrc << 2) & 0x3C) | (((*(pSrc + 1)) >> 6) & 0x03)];
        pSrc++;
        *pszOutBuf++ = lookup_base64[ (*pSrc      ) & 0x3F];
        pSrc++;
        nSrcLen -= 3;
      }

When we purely count operations (without compiler optimizations), we get :

    lookup : 6
    assign : 4
    in/decrement : 10
    shift : 5
    and : 6
    or : 2
    array lookup : 4
    comparison : 1

And for this code :

    int nLoopCnt(nSrcLen/3);
    for ( int j=0; j< nLoopCnt; j++ ) {
      DWORD dwCurr(0);
      for (int n=0; n<3; n++) {
        dwCurr |= *pSrc++;
        dwCurr <<= 8;
      }
      for (int k=0; k<4; k++) {
        BYTE b = (BYTE)(dwCurr>>26);
        *pszOutBuf++ = abDigitLookup[b];
        dwCurr <<= 6;
      }
    }

we get :

    lookup : 3
    assign : 8
    in/decrement : 15
    shift : 11
    and : 0
    or : 3
    array lookup : 4
    comparison : 8

Comparing these two tables (I know they're not an exact measurement of performance, but they give a good idea), we can make these observations :

1) the lookups and assigns can be ignored, as the compiler will optimize them so it comes down to 3 lookups and 4 assigns anyway. (without optimization, we're looking at 10 vs. 11 lookups and assigns)

2) the increments and decrements can be easily optimized by the compiler, and are of similar magnitude anyway : 10 vs. 15

3) the number of "real" operations (shift, and, or) is about the same : 13 vs. 14

4) the array lookups are the same (4), which is to be expected, since the same lookup table is used

5) the comparisons are easily optimized by the compiler too, so they can be ignored

Which makes 3) the most important observation : both codes will take about the same time.

The thing is, if you want to keep your code endianness and word-size friendly, then you can't get much better code than the ones we already came up with, because you'll always end up with similar operation counts as I gave above.
The only real improvement I can see so far is to increase the size of the lookup table, and take advantage of that. But the problem is that the performance of code making use of that will change a lot between different architectures.

>> Does anybody have any ideas on how to get a potential speedup by using full-DWORD reads without losing time rearranging the bytes for subsequent processing?
I've been trying to look for a way since you posted this challenge, but I always end up with the same amount of ands, ors and shifts.

The only approach that would work in theory is to have a 4 * 2^24 byte lookup table, which maps 3 bytes to 4 bytes. But in practice that doesn't work, since the size of the L1 cache is too limited (I still have to find a 67 MB L1 cache heh). But just for argument's sake, this would allow code like this :

      unsigned int src = 0;
      while (nSrcLen >= 3) {
        src = *pSrc++ << 16;
        src |= *pSrc++ << 8;
        src |= *pSrc++;
        *((unsigned int*) pszOutBuf) = lookup_base64[src];
        pszOutBuf += 4;
        nSrcLen -= 3;
      }

Granted, it also depends on 4-byte unsigned ints, but it doesn't depend on endianness (as long as the lookup table is generated on the same machine), and it's as fast as you can get (minus some minor optimizations probably) IF you have a sufficiently large L1 cache ...
DanRollins

It's instructional to test the idea of using a Monster Table:

DWORD adwTbl[256*256*256];   // a 64MB lookup table!

void GenMonsterTbl() {
    BYTE* pb = (BYTE*) &adwTbl[0];
    for ( int j=0; j<256*256*256; j++ ) {
        DWORD d = j;
        *pb++ = abDigitLookup[ BYTE((d & 0x00FC0000) >> 18) ];
        *pb++ = abDigitLookup[ BYTE((d & 0x0003F000) >> 12) ];
        *pb++ = abDigitLookup[ BYTE((d & 0x00000FC0) >> 6 ) ];
        *pb++ = abDigitLookup[ BYTE((d & 0x0000003F)      ) ];
    }
}

void ToBase64_MonsterTbl( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
    DWORD* pdw = (DWORD*)pszOutBuf;
    for ( int j=0; j < nSrcLen; j+=3 ) {
        DWORD d;
        d = *pSrc++; d <<= 8;
        d |= *pSrc++; d <<= 8;
        d |= *pSrc++;
        *pdw++ = adwTbl[d];  // output four bytes
    }
}

As predicted, the speed is abysmal... around 13MB/s.

Since the code itself is so concise (a mere 42 bytes of opcodes in the loop), the difference can ONLY be memory accesses, and though it makes FAR FEWER such accesses (a third as many table lookups and a fourth as many output writes), virtually every lookup is a cache miss.  Since the accesses jump around so much, it's effectively like having no cache at all!

And please don't tell me "I told you so" ... I already knew it. Â But I'm a pragmatic type of guy and it's reassuring to see predicted results :-)

=-=-=-=-=-=-=-=-=
I'm still mystified by the ultra-fast times measured on Infinity08's "6K of tables" code... the ASM output will take some deciphering before I can understand it.  I'll take another look if I can find time later today.

-- Dan
DanRollins

OK!
I analyzed Infinity08's Xxx_b function's pseudo-ASM listing and converted it to an _asm block.
The GCC compiler used some opcode techniques that saved time... for instance, it uses
   movzx eax, (...any byte)
to avoid needing
   xor   eax, eax
   mov   al, (...any byte)
when it will need to use all of eax as an index.  Other than that, the low-level instruction streams are basically similar.

My timing results were similar to earlier:

MB/sec   Internal name // Description (on AMD Athlon 2800)
-------- --------------------------------------------------------------------------------------
180.45   ...infinity08_lookup_b / Infinity08 (6K of tables)  Compiled C
181.81   ...infinity08_lookup_b / Infinity08 (6K of tables)  My ASM conversion

208.43   mToBase64 / Dan's "mystery algorithm" (8096-byte lookup table)

HOWEVER!!!!!

I then ran the exact same executables on 2.5 GHz Celeron (genuine Intel) with these results:

MB/sec   Internal name // Description (on 2.5 GHz Celeron)
-------- --------------------------------------------------------------------------------------
306.53   ...infinity08_lookup_b / Infinity08 (6K of tables)  Compiled C
432.21   ...infinity08_lookup_b / Infinity08 (6K of tables)  My ASM conversion

453.01   mToBase64 / Dan's "mystery algorithm" (8096-byte lookup table)

The times are all significantly better on the Intel architecture, even though it's running a Celeron (which is known to have smallish caches).

The GCC-generated output (my ASM conversion) was significantly faster than the same algorithm compiled/optimized by the VC++ compiler.  I think that tells us that the GCC optimizer is smarter (at least when the output is executed on Intel).  The difference between the two (26%) is dramatic... just as Infinity08 said, and this confirms his measurements.

I really can't see enough difference in the opcodes to justify such a dramatic difference, but there it is.  The Intel architecture must be better at cache prediction and pipelining instructions -- a LOT better.

I also ran the 2-byte lookup version for comparison... it's still the fastest (on either architecture), but only by a negligible 0.5% on the Intel Celeron (1.5% on my Athlon).
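For anyone playing along at home, here's a bare-bones sketch of the 2-byte-lookup idea (a simplified reconstruction for illustration, NOT my actual mToBase64 source; padding is omitted and the names are my own): a 4096-entry table maps each 12-bit group straight to its two output characters, so every 3-byte input group costs two table hits instead of four single-digit lookups.

```cpp
#include <cstdint>
#include <cstring>

static const char b64digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// 4096 entries x 2 chars = 8 KB: each 12-bit value maps directly
// to its two base64 output characters.
static char twoCharTbl[4096][2];

static void GenTwoCharTbl() {
    for (int i = 0; i < 4096; ++i) {
        twoCharTbl[i][0] = b64digits[i >> 6];    // high 6 bits
        twoCharTbl[i][1] = b64digits[i & 0x3F];  // low 6 bits
    }
}

// Encodes whole 3-byte groups only ('=' padding left out for brevity):
// two table hits per group instead of four.
static void ToBase64_TwoCharTbl(const uint8_t* pSrc, int nSrcLen, char* pszOut) {
    while (nSrcLen >= 3) {
        uint32_t d = (pSrc[0] << 16) | (pSrc[1] << 8) | pSrc[2];
        memcpy(pszOut,     twoCharTbl[d >> 12],   2);  // top 12 bits
        memcpy(pszOut + 2, twoCharTbl[d & 0xFFF], 2);  // bottom 12 bits
        pSrc += 3; pszOut += 4; nSrcLen -= 3;
    }
    *pszOut = '\0';
}
```

Note the table is stored as char pairs rather than packed 16-bit words, so it works the same on either endianness.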

-- Dan
Infinity08

>> The times are all significantly better on the Intel architecture, even though it's running a Celeron (which is known to have smallish caches).
Well, we can't forget which CPUs we tried :

Intel : Pentium M and Pentium Celeron
AMD : Athlon XP and Athlon64

There's a serious difference in architecture between the two. But more importantly : both Intel processors are in a lower performance class than the AMD processors. It would be interesting to see how the code performs on a Pentium 4 or a Duron or something similar.

It could be that encoding algorithms that use small lookup tables have an advantage on systems in the low performance class (Pentium M, Celeron, Duron), however contradictory that may sound.
grg99

You have to watch out: some of the newer instructions like movzx cause pipeline stalls on the early Pentiums, so doing the two simpler instructions might be as fast or faster!

It's interesting that the 24-bit lookup table is so dang slow.  Shows how slow inexpensive memory really is compared to the CPU these days.

I tried some experiments with using MMX instructions to process EIGHT six-bit quantities at a time.  It only took about eight MMX instructions to process all eight, BUT that time is swamped by the SIXTEEN shifts it takes to position each six-bit quantity in a separate byte.  Sigh.
If there were an MMX instruction to pack/unpack 6-bit fields into 8-bit bytes, this would be a huge winner, but alas no, not even in the latest SSE4 or AMD extensions.

One last possible trick -- investigate non-sequential processing, such as processing all the first bytes, then the seconds, then the thirds.  That way you or the compiler could keep the mask and shift values in registers, AND position all the bits in 4 or 8 bytes with just ONE shift. Wow, gotta try this!

DanRollins

Innovative idea there, grg99... it would be "paradoxical" if taking multiple passes through the data actually saved time, but there might be some other advantage that overrides everything else.

In my MMX tests, I was able to avoid a lot of shift ops by moving back and forth between MMX registers and general registers (where BSWAP and XCHG are available).  The big win seems to be in using a non-temporal 8-byte move to output the results.  If you care to share, I'd love to see any MMX stuff you've tried in the ASM version of this thread (http:Q_21983447.html :-)
DanRollins

Â  Â https://www.experts-exchange.com//Assembly/Q_21983447.html#17711344
I wrote a version of the 8096-byte lookup algorithm that confirms the assertion that "caching is king" -- i.e., that all of the algorithms are losing time to the cache pollution that occurs when we plow through 1MB of output.

In that example, I changed just one opcode from MOVQ to MOVNTQ and saw the fastest results to date: a nearly 30% speed increase from changing that *one opcode.*  The MOVNTQ opcode tells the CPU to store the 8-byte MMX register directly to main memory -- bypassing the caches (and thus allowing other data to stay cached).

I think that it's safe to say that when processing large chunks of memory, such as streaming video or this 1MB block of input, this is an overriding concern.

It might be a more valid test of the encoding algorithms to try a variation of my timing benchmark... a variation in which a much smaller buffer (say 10K) is encoded 100,000,000 times.  In such a case, all of the input, all of the output, and all of the lookup tables would be in cache for the entire run.

But... I think this thread is running down.  I'll leave this open for a while longer to see if we can get some more submissions or even some general interest.  If nothing pops, I'll close up and award the points.

-- Dan
DanRollins

This has been an interesting and thoroughly enjoyable intellectual exercise.  I -- and I hope the other participants -- have learned a lot.  For one thing, the poor performance of the monster lookup table (cache thrashing) was enlightening.  For another, the significant speed advantage of a genuine Intel CPU (in this particular scenario) surprised me.  Infinity08's multiple-table scheme was innovative -- I'd never have thought of it :-)

We all put more into this than we would for a standard EE question, so the point awards are not true "compensation" -- the real reward here has been the participation itself.  Infinity08 posted the code that ran fastest with correct output.  grg99's contribution *showed the way* to the fastest algorithm measured here (the 8096-byte lookup table), but he did not submit a debugged version for actual timing comparison.  grg99's other contributions, including the no-table-at-all code, were interesting and his comments insightful, and I am choosing to award a share of the points to him.

For what it's worth: every submission beat all of the base64 encoders in Microsoft's APIs and other published utilities.  So that's a bonus :-)

Thanks to everyone who participated.  And whether you just "lurked" or are reading this at some time in the future, I hope that you enjoyed this as much as I did :-)

-- Dan
Infinity08

>> the real reward here has been the participation itself.
Wise words :)

>> For what it's worth: every submission beat all of the base64 encoders in Microsoft's APIs and other published utilities.  So that's a bonus :-)
Heh.

I enjoyed this discussion/challenge, Dan! Thanks for that!