Fast Base64 Encode and Decode

This article surveys and compares options for encoding and decoding base64 data.  It includes source code in C++ as well as examples of how to use standard Windows API functions for these tasks. We'll look at the algorithms — how encoding and decoding is usually done — and finally, I'll provide several examples of C++ source code, including a function that decodes base64 nearly ten times faster than commonly-used functions.


What is Base64?

Base64 is a way to represent binary data that avoids some of the problems binary data can cause.  When you handle binary data, you need to take special care with embedded nulls, commas, quotes, apostrophes, less-than signs, ampersands, spaces, and so forth.  Base64 data, on the other hand, is simple text; you can assign it to a string data type, copy it, scan it, and transfer it without worries.

You may see base64-encoded data in emails and in XML CDATA sections. If you have ever worked with HTTP headers, you have probably seen:
Content-Type: application/x-www-form-urlencoded
Authorization: Basic TXlVc2VySUQ6TXlQYXNzd29yZA==


The text after the word Basic is a base64 encoding of a username and password (specifically, that string decodes to MyUserID:MyPassword).

You may see base64 used to transport encrypted data, which often contains such otherwise "awkward" characters.  In an earlier article, Easy String Encryption Using CryptoAPI in C++, I recommended converting encrypted data to hexadecimal to avoid special-character handling problems. Base64 can be used similarly, and it has one advantage over hexadecimal encoding: It is more concise.  While hex-encoded data is twice the size of the original binary data, base64-encoded data is only 1.33 times the size. For instance, if a hexadecimal encoding is 5,000 bytes long, a base64 encoding of the same source data will be 3,336 bytes long (saving 1664 bytes of transmission bandwidth).

Binary uses two digits (01), base 10 — decimal — uses 10 digits (0123456789), base 16 — hexadecimal — uses 16 (0123456789ABCDEF). So, as you might expect, there are 64 "digits" used in base64 encoding:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

Each digit encodes six bits of source data.  So it takes four digits to encode three bytes of source data (4*6 = 3*8 = 24 bits).  This makes encoding and decoding much trickier than hex encoding.  Also, there is the issue that any base64-encoded data must end up as an exact multiple of four digits.  Since the source data is often not an exact multiple of three bytes in length, you need special end-of-string logic to pad the base64 output when encoding and to ignore that padding when decoding.


Encoding and Decoding Options

We will look at four options available to C++ programmers who need to encode and decode base64:

1. Windows CryptoAPI functions
2. The ATL (Active Template Library) functions
3. A pair of straightforward functions that I wrote
4. My ultimate high-speed functions

For each option, I wrote a small wrapper so that I could do testing with uniform parameters.  At the end of this article, I'll show a timing/performance comparison chart, and I'll provide links to documentation of the standard library functions.

For years, I've done my base64 work using a C++ class object written by PJ Naughter. I wanted to include his code in this article, but when I surfed around to find it, I learned that his programs no longer contain the base64-handling functions.  He switched to using the ATL functions a while back.  So I'll be showing that option.

The last one I'll show was a result of a "recreational programming" challenge to myself: Can I write code that will out-perform Microsoft? You might consider using this high-speed code if, for instance, your website must encode many large PDF or image files for transmission.


Windows CryptoAPI: CryptBinaryToString and CryptStringToBinary

The Windows CryptoAPI provides a set of general-purpose functions (CryptBinaryToString and CryptStringToBinary) that support base64 encoding and decoding. The following is a pair of functions that wrap that API:
#include <Wincrypt.h>
#pragma comment (lib, "Crypt32.lib")

int ToBase64Crypto( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   DWORD nLenOut= nLenDst;
   BOOL fRet= CryptBinaryToString(
      (const BYTE*)pSrc, nLenSrc,
      CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF,
      pDst, &nLenOut
   );
   if (!fRet) nLenOut=0;  // failed
   return( nLenOut );
}
int FromBase64Crypto( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   DWORD nLenOut= nLenDst;
   BOOL fRet= CryptStringToBinary(
      (LPCSTR)pSrc, nLenSrc,
      CRYPT_STRING_BASE64,
      (BYTE*)pDst, &nLenOut,
      NULL,        // pdwSkip (not needed)
      NULL         // pdwFlags (not needed)
   );
   if (!fRet) nLenOut=0;  // failed
   return( nLenOut );
}


I used these CryptoAPI functions to validate my own functions and for performance testing. These were, as I expected, poor performers.  But I really can't fault you if you use these functions.  They are reliable and bullet-proof.


ATL functions: Base64Encode and Base64Decode

The ATL header files that have come with VC++ since VS2005 provide a nice pair of base64 tools (though the decoder is rather slow). The following is my wrapper for those functions.
#include <atlenc.h>
int ToBase64ATL( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   int nDestLen= nLenDst;
   BOOL fRet= Base64Encode( pSrc, nLenSrc, pDst, &nDestLen,  ATL_BASE64_FLAG_NOCRLF );
   if (!fRet) nDestLen=0;
   return( nDestLen );
}
int FromBase64ATL( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   int nDestLen= nLenDst;
   BOOL fRet= Base64Decode( (LPCSTR)pSrc, nLenSrc, (BYTE*)pDst, &nDestLen );
   if (!fRet) nDestLen=0;
   return( nDestLen );
}


The encoder turned out to be quite speedy, but I was not surprised at how poorly the decoder performed — ATL (and STL) code is generally written with flexibility as a priority (various CPUs, different character widths, special options such as handling embedded CRLFs, and so forth).  Even so, these are a good choice for your production code — you can be certain that these functions are in use at many thousands of sites that depend upon them for mission-critical operations.

An advantage of these functions is that the source code is on your disk. If you want to see how it's done, this is some high-quality code for you to review.


My own straightforward functions

These functions illustrate the bit manipulation needed for encoding and decoding.


Simple Version: Encoding

The base64 encoding algorithm can be described simply:
Read three bytes of input (24 bits).  For each 6 bits, look up which base64 "digit" should be output, and write its 1-byte value; write four output bytes for every three bytes of input.  Repeat until finished.
//  simple version illustrates the basic encode algorithm
//
static char Base64Digits[] =
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
//
int ToBase64Simple( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   int nLenOut= 0;
   while ( nLenSrc > 0 ) {
      if (nLenOut+4 > nLenDst) return(0); // error

      // read three source bytes (24 bits) 
      BYTE s1= pSrc[0];   // (but avoid reading past the end)
      BYTE s2= 0; if (nLenSrc>1) s2=pSrc[1]; //------ corrected, thanks to  jprichey
      BYTE s3= 0; if (nLenSrc>2) s3=pSrc[2];

      DWORD n;
      n =  s1;    // xxx1
      n <<= 8;    // xx1x
      n |= s2;    // xx12  
      n <<= 8;    // x12x
      n |= s3;    // x123  

      //-------------- get four 6-bit values for lookups
      BYTE m4= n & 0x3f;  n >>= 6;
      BYTE m3= n & 0x3f;  n >>= 6;
      BYTE m2= n & 0x3f;  n >>= 6;
      BYTE m1= n & 0x3f;  
	
      //------------------ lookup the right digits for output
      BYTE b1 = Base64Digits[m1];
      BYTE b2 = Base64Digits[m2];
      BYTE b3 = Base64Digits[m3];
      BYTE b4 = Base64Digits[m4];

      //--------- end of input handling
      *pDst++ = b1;
      *pDst++ = b2;
      if ( nLenSrc >= 3 ) {  // 24 src bits left to encode, output xxxx
         *pDst++ = b3;
         *pDst++ = b4;
      }
      if ( nLenSrc == 2 ) {  // 16 src bits left to encode, output xxx=
         *pDst++ = b3;
         *pDst++ = '=';
         }
      if ( nLenSrc == 1 ) {  // 8 src bits left to encode, output xx==
         *pDst++ = '=';
         *pDst++ = '=';
      }
      pSrc    += 3;
      nLenSrc -= 3;
      nLenOut += 4;
   }
   // Could optionally append a NULL byte like so:
   // *pDst++= 0; nLenOut++;
   return( nLenOut );  
}


Once you have worked out the bit-handling (I tried a number of methods; I think that the one above is the easiest to follow), there is still the issue of what to do at the end of the input. There might NOT be a multiple of three bytes in the final pass through the loop.

This creates two problems: first, it would be improper (even fatal) to read even one byte beyond the end of the input buffer, so the guarded reads of s2 and s3 substitute a value of 0 when that would happen; second, when only one or two source bytes remain, the coded string must end with one or two padding characters.  Both exceptions require special handling.

If you have an eye for performance optimization, you will see that the above implementation is very inefficient.  The end-of-buffer tests run on every pass through the loop.  Imagine encoding a 3 MB buffer of data — those "if" statements must be executed a million times.  The above code also checks for overrun of the caller's output buffer each time through the loop.  In the ultra-fast version, we'll move all of the special handling outside of the loop (and we'll use a few other optimization tricks, to boot).


Simple Version: Decoding

The decoding algorithm can also be stated simply:
Read four bytes (32 bits) of input (they are base64 digits); look up the binary value of each digit (six bits, values 0 to 63); rearrange those bits into 24 bits (three bytes) of the original source data, and write them out in the correct order.
And once again, we have end-of-buffer issues: If we get an equal sign, we know not to write a byte of output.  More specifically, instead of three bytes, we must output just one or two.

The table below is used to convert base64 digits and padding (A-Z, a-z, 0-9, +, /, or =) into binary values.  The table is 256 bytes long so that the 8-bit character code from the input can be used directly as an index into the table.  Note that Unicode, UTF-8, and character-translation issues don't apply here... the table just holds an ordered sequence of 8-bit binary values.
static BYTE LookupDigits[] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, //gap: ctrl chars
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, //gap: ctrl chars
0,0,0,0,0,0,0,0,0,0,0,           //gap: spc,!"#$%&'()*
62,                   // +
 0, 0, 0,             // gap ,-.
63,                   // /
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, // 0-9
 0, 0, 0,             // gap: :;<
99,                   //  = (end padding)
 0, 0, 0,             // gap: >?@
 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,
17,18,19,20,21,22,23,24,25, // A-Z
 0, 0, 0, 0, 0, 0,    // gap: [\]^_`
26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,
43,44,45,46,47,48,49,50,51, // a-z    
 0, 0, 0, 0, 0,       // gap: {|}~ DEL
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
};


For instance, if we see a z in the input, we can use ...
    n= LookupDigits[ 'z' ];
... to obtain the 6-bit value, 51 (0x33, 00110011b), that z encodes.  If we see an equal sign (=), the looked-up value will be 99 which we can use as a flag during end-of-buffer handling.

So here's the relatively simple, though pretty fast, code:
//  simple version illustrates the basic decode algorithm
//
int FromBase64Simple( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   int nLenOut= 0;
   for( int j=0; j<nLenSrc; j+=4 ) {
      if ( nLenOut + 3 > nLenDst ) {
         return( 0 ); // error: need room for a full output triple
      }
      BYTE s1= LookupDigits[ *pSrc++ ];
      BYTE s2= LookupDigits[ *pSrc++ ];
      BYTE s3= LookupDigits[ *pSrc++ ];
      BYTE s4= LookupDigits[ *pSrc++ ];

      BYTE d1= ((s1 & 0x3f) << 2) | ((s2 & 0x30) >> 4);
      BYTE d2= ((s2 & 0x0f) << 4) | ((s3 & 0x3c) >> 2);
      BYTE d3= ((s3 & 0x03) << 6) | ((s4 & 0x3f) >> 0);

      *pDst++ = d1;  nLenOut++;
      if (s3==99) break;      // end padding found
      *pDst++ = d2;  nLenOut++;
      if (s4==99) break;      // end padding found
      *pDst++ = d3;  nLenOut++;
   }
   return( nLenOut );
}


Note: For simplicity, the above code assumes well-formed input data (an exact multiple of four in length, with trailing '=' bytes as padding).
As with the simple encoder, this simple decoder can also be improved by moving the end-of-buffer handling outside of the main loop.   I tried other tricks...

In an earlier article on hexadecimal conversion, I was able to squeeze out a bit more speed by reading and writing in 32-bit DWORD blocks.   I tried DWORD reads here, but perhaps because of CPU and L1 caching differences or perhaps because of the overhead of rearranging the "little-endian" bytes, I could not get a performance increase.  I went with the above since it is quite straightforward.

The incoming digits must be handled in a very specific way.  The four table lookups convert the input into a set of four 6-bit values (s1 through s4).  The next three statements pack that total of 24 bits into three bytes: the bits of the first digit are moved all the way to the left of the first byte, to which two bits of the second digit are added; the remaining four bits of the second digit are assembled with four bits of the third digit; and the two remaining bits of the third digit plus the entire last digit make up the third byte.  This is illustrated below.
// d1        d2      d3         s1       s2       s3       s4  
// 00000000 00000000 00000000   00aaaaaa 00bbbbbb 00cccccc 00dddddd 
// 00aaaaaa 00000000 00000000   s1 << 2          
// aaaaaa00 00000000 00000000            000000bb s2 >> 4
// aaaaaabb 00000000 00000000            
// aaaaaabb 00000000 00000000            bbbb???? s2 << 4
// aaaaaabb bbbb0000 00000000                     0000cccc s3 >> 2
// aaaaaabb bbbbcccc 00000000                     cc?????? s3 << 6
// aaaaaabb bbbbcccc cc000000                             00dddddd
// aaaaaabb bbbbcccc ccdddddd                              
 


Now we're getting into my favorite part of this type of "recreational programming":
— Is there a better way to do that last part?
— There is!
   DWORD n32;
   n32  = s1; n32 <<= 6;
   n32 |= s2; n32 <<= 6;
   n32 |= s3; n32 <<= 6;
   n32 |= s4;

   BYTE d3= ( n32 & 0x00ff ); n32 >>= 8;
   BYTE d2= ( n32 & 0x00ff ); n32 >>= 8;
   BYTE d1= ( n32 & 0x00ff ); 


This variation rolls the bits into a temporary DWORD variable and back out — all in a shorter series of steps, and that turns out to be very efficient.
// DWORD temp variable                  four 6-bit values;
// 00000000 00000000 00000000 00000000  00aaaaaa 00bbbbbb 00cccccc 00dddddd 
// 00000000 00000000 00000000 00aaaaaa  -------- 00bbbbbb 00cccccc 00dddddd 
// 00000000 00000000 0000aaaa aa??????          
// 00000000 00000000 0000aaaa aabbbbbb           -------- 00cccccc 00dddddd 
// 00000000 000000aa aaaabbbb bb??????                    
// 00000000 000000aa aaaabbbb bbcccccc                    -------- 00dddddd 
// 00000000 aaaaaabb bbbbcccc cc??????                    
// 00000000 aaaaaabb bbbbcccc ccdddddd                             --------


This has fewer shifting and masking operations, but perhaps more importantly, the C++ compiler optimizes some of the steps into simple 8-bit register loads and other fast operations.


My High-Performance Functions

The following functions use optimization tricks and techniques to improve performance.


High-Speed Version: Encoding

This is similar to the "simple" version earlier, but I've taken the end-of-buffer handling out of the inner loop. I also improved upon the "DWORD roll-in" technique, by writing the DWORD out as a single operation.

// fastest encoding variation I've come up with (Not really, see comments after the article!)
//
int ToBase64Fast( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   int nLenOut= ((nLenSrc+2)/3)*4; // 4 out for every 3 in, rounded up
   if ( nLenOut+1 > nLenDst ) {
         return( 0 ); // fail!
   }
   DWORD* pdwDst= (DWORD*)pDst;
   while( nLenSrc > 2 ) {
      DWORD n= pSrc[0];  // xxx1 // roll the data in...
      n <<= 8;           // xx1x
      n |= pSrc[1];      // xx12  
      n <<= 8;           // x12x
      n |= pSrc[2];      // x123  

      BYTE m4= n & 0x3f;  n >>= 6;  // roll it out, 6 bits at a time
      BYTE m3= n & 0x3f;  n >>= 6;
      BYTE m2= n & 0x3f;  n >>= 6;
      BYTE m1= n & 0x3f;  

      n  = Base64Digits[m4];  n <<= 8; 
      n |= Base64Digits[m3];  n <<= 8;
      n |= Base64Digits[m2];  n <<= 8;
      n |= Base64Digits[m1]; 

      *pdwDst++ = n;   // write four bytes in one operation
      nLenSrc -= 3;
      pSrc += 3;
   }
   // -------------- end of buffer special handling (see text)
   pDst= (char*)pdwDst;
   if ( nLenSrc > 0 ) {  // some left after last triple
      int n1= (*pSrc & 0xfc) >> 2;
      int n2= (*pSrc & 0x03) << 4;
      if (nLenSrc > 1 ) {  // corrected.  Thanks to jprichey
            pSrc++;
            n2 |= (*pSrc & 0xf0) >> 4;
      }
      *pDst++ = Base64Digits[n1]; // encode at least 2 outputs
      *pDst++ = Base64Digits[n2];
      if (nLenSrc == 2) {  // 2 src bytes left to encode, output xxx=
         int n3= (*pSrc & 0x0f) << 2;  // low nibble of the last byte; the
         *pDst++ = Base64Digits[n3];   // bottom two digit bits stay 0
      }
      if (nLenSrc == 1) {  // 1 src byte left to encode, output xx==
         *pDst++ = '=';
      }
      *pDst++ = '=';
   }
   // *pDst= 0; nLenOut++ // could terminate with NULL, here
   return( nLenOut );
}


The trick in the second roll-in sequence (where n is rebuilt from Base64Digits[m4] through Base64Digits[m1]) is to load the bytes into the DWORD in reverse, so that the "little endian" output comes out in the right order. It's easy once you've seen it done :-)

Also, note that the function is overall much longer. But all of the additional code is repeated logic from the inner loop code... removed from the inner loop to avoid if/then conditional jumps. The CPU can spin through the inner loop without pre-fetch flushing or other unwanted interruptions.

I wrote some assembly-language code to replace the inner loop, but I could not get a performance boost. The compiler is much smarter than me... it shuffles the code sequence in strange ways (to, I presume, avoid pipeline stalls) and delving into that would be a topic for a different day.


High-Speed Version: Decoding

Some will say that with this function, I've edged over from the sublime and stepped into a steaming stinky pile of the ridiculous.  That's because the following base64 decoding function uses a 128 KB lookup table.

Yes, you heard me right.  Before running the first time, this function allocates and builds a 131,072-byte lookup table.  The result is that the inner loop can look up the values of two base64 digits at the same time.  You can make the table smaller (as little as about 30 KB) and still do a 16-bit lookup, but that would involve doing some calculations on each byte of the input data — likely costing a significant fraction of the performance gain.

Here's the code:
// Ultra-fast base64 decoding, using a 128 KB lookup table
int FromBase64Fast( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
   if ( NULL==gpLookup16 ) SetupLookup16();  // see below
   int nLenOut= 0;
   if ( nLenDst < ((nLenSrc / 4)-1) * 3 ) {
      return (0); // (buffer too small)
   }
   int nLoopMax= (nLenSrc/4)-1;
   WORD* pwSrc= (WORD*)pSrc;
   for( int j=0; j<nLoopMax; j++ ) {
      WORD s1= gpLookup16[ pwSrc[0] ];  // translate two "digits" at once
      WORD s2= gpLookup16[ pwSrc[1] ];  // ... and two more

      DWORD n32;
      n32  = s1;         // xxxxxxxx xxxxxxxx xx111111 222222xx
      n32 <<= 10;        // xxxxxxxx 11111122 2222xxxx xxxxxxxx
      n32 |= s2 >> 2;    // xxxxxxxx 11111122 22223333 33444444

      BYTE b3= ( n32 & 0x00ff ); n32 >>= 8;  // in reverse (WORD order)
      BYTE b2= ( n32 & 0x00ff ); n32 >>= 8;
      BYTE b1= ( n32 & 0x00ff ); 

      // *pDst++ = b1;  *pDst++ = b2;  *pDst++ = b3;  // slightly slower

      pDst[0]= b1;  // slightly faster
      pDst[1]= b2;
      pDst[2]= b3; 

      pwSrc += 2;
      pDst += 3;
   }
   nLenOut= ((nLenSrc/4)-1) * 3;

   //-------------------- special handling outside of loop for end
   const BYTE* pTail= (const BYTE*)pwSrc;   // the last four source digits
   WORD s1= gpLookup16[ pwSrc[0] ];
   WORD s2= gpLookup16[ pwSrc[1] ];

   DWORD n32;
   n32  = s1;
   n32 <<= 10;
   n32 |= s2 >> 2;

   BYTE b3= ( n32 & 0x00ff ); n32 >>= 8;
   BYTE b2= ( n32 & 0x00ff ); n32 >>= 8;
   BYTE b1= ( n32 & 0x00ff );

   if (nLenOut >= nLenDst)      return(0); // error
   *pDst++ = b1;  nLenOut++;
   if (pTail[2] != '=') {       // test the raw digit for padding; the
      if (nLenOut >= nLenDst)   return(0); // pre-shifted table obscures the 99 flag
      *pDst++ = b2;  nLenOut++;
   }
   if (pTail[3] != '=') {
      if (nLenOut >= nLenDst)   return(0); // error
      *pDst++ = b3;  nLenOut++;
   }
   return( nLenOut );
} 

//----------------------------------------------------
// Using two-byte lookup table
// must call here before calling the above
//----------------------------------------------------
WORD* gpLookup16= 0;

void SetupLookup16()
{
   int nLenTbl= 256*256;            // yes, the table is 128 KB!
   if (NULL==gpLookup16) {
      gpLookup16= new WORD[nLenTbl];
   }
   WORD* p= gpLookup16;
   for (int j= 0; j<256; j++ ) {
      for (int k= 0; k<256; k++ ) {
         WORD w;
         w  = LookupDigits[k] << 8;
         w |= LookupDigits[j] << 2 ; // pre-shifted! See notes
         *p++ = w;
      }
   }
}


Note the simplicity of the inner loop.  This is partially thanks to an optimization I discovered in creating the lookup table: I pre-shifted the data values in the table to avoid a series of mask-and-shift operations in the loop.  Also, the one-byte variables (b1, b2, b3) end up being optimized into 8-bit register values by the compiler.  The entire inner loop is just 56 opcodes long!

The function calls SetupLookup16() the first time it is used.  If you were to wrap this into a C++ object, you would persist that table as an object member and initialize it in the constructor.  The time spent building the table is spread out across all invocations of the decoding function.

Is this at all practical? Probably not.

While I can imagine a web server needing an ultra-fast encoder to process gigabytes of outgoing PDF files, it is difficult to come up with a scenario where high-speed decoding is worth the price in memory usage.  On the other hand, with 6 GB of RAM coming standard on many home computers, 128 KB is only a tiny fraction of 1% of the system memory.  So, as Tom Cruise says in "Risky Business" — What the heck!


Summary

This article reviewed various options for encoding and decoding base64 data. We looked at the algorithms and at C++ code to implement them.

I did time trials on the various options and the results are shown in the following table:
                   Rate    Approximate 
Version:          (KB/ms)  Ratio
-----------------  -----   -----------
ToBase64Crypto     133     3.4 : 1     
ToBase64ATL        420     1.1 : 1     
ToBase64Simple     224     1.9 : 1     
ToBase64Fast       446       1 : 1     

FromBase64Crypto    92     8.7 : 1     
FromBase64ATL       84     9.5 : 1     
FromBase64Simple   260     3.1 : 1     
FromBase64Fast     805       1 : 1     


Notes:

Added post-publication:
A significantly faster encoding algorithm has been posted in the comments, below.  See
      http://www.experts-exchange.com/A_3216.html#c15564
By using an 8 KB lookup table, the encoder can process 688 KB/ms (over 50% faster!)
The CryptoAPI encoder is much slower even than an unoptimized "roll your own" encoder.
The ATL encoder is very fast, within the margin of error of the speed of my best effort in this article (Added: But NOT as good as the one in the article comments, here).
For decoding, even the "simple" version that I wrote is considerably faster than either the ATL version or the CryptoAPI version.
My ultra-fast decoder is nearly ten times as fast as the other available options, but it has the drawback of using a 128 KB lookup table -- which is probably impractical.  Note, too, that the timing results do not include the time taken to create that lookup table.
Testing methodology:

I read in a 1 MB JPG file and ran 1,000 iterations on it. I timed only the outer loop, and I used the relatively low-resolution GetTickCount() function. However, the duration of each test was at least several seconds (minimizing resolution issues), and the results were repeatable, with only small variations.

All code versions were 32-bit executables built in Release mode, with full optimization enabled.

I ran the tests on a 2.60 GHz AMD Athlon II QuadCore 620 CPU. While running, my CPU meter went to 28%, and indicated that one core was spiking at 100%.

Writing and testing this code was fun and challenging.  If you are interested in trying your hand at writing faster versions, then I'd be delighted to look at your code.  I'm quite certain that hand-optimized ASM code, and especially code that uses 64-bit registers and advanced MMX/SSE/AVX operations, can out-perform anything I've tried here.  But I'd especially like to see standard 32-bit code in which you have implemented an improved algorithm.  Perhaps a multi-threaded version that could take advantage of those idle cycles on a secondary CPU core...


Resources:

Formal description of base64 (RFC 2045)
   http://www.ietf.org/rfc/rfc2045.txt

Wikipedia article, Base64 (excellent coverage and quirk description)
  http://en.wikipedia.org/wiki/Base64

MSDN Documentation:
   CryptBinaryToString   http://msdn.microsoft.com/en-us/library/aa379887(v=VS.85).aspx
   CryptStringToBinary  http://msdn.microsoft.com/en-us/library/aa380285(VS.85).aspx

   ATL Base64Encode http://msdn.microsoft.com/en-us/library/ezec6x4e(v=VS.80).aspx
   ATL Base64Decode http://msdn.microsoft.com/en-us/library/2fzdww6e(VS.80).aspx

A website where you can encode or decode base64:
   http://www.opinionatedgeek.com/dotnet/tools/base64encode/

Script-accessible way to encode and decode base64 (uses MSXML)
   http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
   http://www.nonhostile.com/page-vb-net-base64-encoding-decoding.asp

Related Articles Written by Dan Rollins:
    Easy String Encryption Using CryptoAPI in C++  
    Fast Hexadecimal Encode and Decode  
    Convert String to int / Convert int to String in C++
    Binary Bit Flags: Tutorial and Usage Tips

Author: DanRollins


Comments

Expert Comment by Infinity08:
Reminds me of http://www.experts-exchange.com/Programming/Languages/CPP/Q_21988706.html ... I'm sure you remember too ;)

Author Comment by DanRollins:
Ha!  As I was working out those coding sequences, I knew that some of it seemed very familiar.  I guess that my memory horizon is four years ...  I did mention the hex-converter challenge in that article, but I failed to mention the later base64 challenge thread and the great contributions from the several coding geniuses who participated.  Thanks for reminding me.  Just to repeat...

Please see:
    How fast is your C/C++ Base64 Encoder?
    http://www.experts-exchange.com/Programming/Languages/CPP/Q_21988706.html 
for some additional discussion and some very clever, innovative, and interesting techniques.

For instance, a small change to ToBase64Simple() increases performance by about 5% (to 234 KB/ms) and can boost the ToBase64Fast() algorithm by 2% (to 458 KB/ms):
static char Base64DigitsX4[] = 
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
;
   .
   :
      BYTE m4= (BYTE)n;  n >>= 6;  // no need for & 0x3f 
      BYTE m3= (BYTE)n;  n >>= 6;
      BYTE m2= (BYTE)n;  n >>= 6;
      BYTE m1= (BYTE)n;
   .
   :
      BYTE b1 = Base64DigitsX4[m1]; // use longer table
... etc...



Author Comment by DanRollins:
Update:  
On review of that earlier "challenge thread," I realized I missed a potent version of the ToBase64 encoder function.  As this is the one that is probably most important (say, for high-volume web servers), I decided to post the code and describe the algorithm:

ToBase64Fast8192

The idea is to use an 8192-byte lookup table that converts 12 bits of source data directly into two base64 digits.  The thought of using an 8 KB table is not ludicrous, and the performance gain is substantial.  This is over 50% faster than the ToBase64Fast() function in the article!  It runs at 688 KB/ms -- which makes it considerably faster than the ATL version.

//-------------- modify the inner loop of the ToBase64 fn to..   
   WORD* pwDst= (WORD*)pDst;
   while( nLenSrc > 2 ) {
      DWORD n= pSrc[0];  // xxx1
      n <<= 8;           // xx1x
      n |= pSrc[1];      // xx12  
      n <<= 8;           // x12x
      n |= pSrc[2];      // x123  

      pwDst[0]= Base64Digits8192[ n >> 12 ]; 
      pwDst[1]= Base64Digits8192[ n & 0x00000fff ]; 

      nLenSrc -= 3;
      pwDst += 2;
      pSrc += 3;
   }


Here's the function that builds the lookup table.  A feasible alternative implementation would be to just let the compiler create a static lookup table, as shown below in the code:
WORD Base64Digits8192[ 4096 ];
void SetupTable8192()
{
   for ( int j=0; j<64; j++ ) {
      for ( int k=0; k<64; k++ ) {
         WORD w;
         w  = Base64Digits[k] << 8;
         w |= Base64Digits[j]; 
         Base64Digits8192[(j*64)+k]= w;
      }
   }
}
//--------------------------
// or have a pre-built table, something like:
BYTE Base64Digits8192[ ((64*64)*2)+1 ]=  // 8192+1 
"AAABACADAEAFAGAHAIAJAKALAMANAOAPAQARASATA ... A6A7A8A9A+A/"
"BABBBCBDBEBFBGBHBIBJBKBLBMBNBOBPBQBRBSBTB ... B6B7B8B9B+B/"
...
"9A9B9C9D9E9F9G9H9I9J9K9L9M9N9O9P9Q9R9S9T9...  5969798999+9/"
"+A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+...  5+6+7+8+9+++/"
"/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/...  5/6/7/8/9/+//"
;



Expert Comment by jprichey:
There appears to be a bug in the ToBase64Simple code that results in a buffer overrun and an incorrect encoding.  When I compared the output of both implementations they were consistent with each other, but did not match the output of the .NET Convert.ToBase64String().  Given .NET history, I first assumed it was .NET, but after further review I convinced myself that the problem was here.

Lines 14,15
      BYTE s2= 0; if (nLenSrc>0) s2=pSrc[1];  
      BYTE s3= 0; if (nLenSrc>1) s3=pSrc[2];

should be
      BYTE s2= 0; if (nLenSrc>1) s2=pSrc[1];  
      BYTE s3= 0; if (nLenSrc>2) s3=pSrc[2];

This is then also reproduced (although different) in the fast version
Lines 34,37
      int n1= (*pSrc & 0xfc) >> 2;
      int n2= (*pSrc & 0x03) << 4;
      pSrc++;
      n2 |= (*pSrc & 0xf0) >> 4;

should be
      int n1= (*pSrc & 0xfc) >> 2;
      int n2= (*pSrc & 0x03) << 4;
      if (nLenSrc > 1) {
            pSrc++;
            n2 |= (*pSrc & 0xf0) >> 4;
      }
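For reference, a self-contained simple encoder incorporating those corrected bounds checks might look like this (a sketch for illustration, not the article's exact listing):

```cpp
#include <cstring>

typedef unsigned char BYTE;

static const char Base64Digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode nLenSrc bytes of pSrc into pDst; returns the output length,
// or 0 if the destination buffer is too small.
int ToBase64Simple(const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst)
{
    int nLenOut = 0;
    while (nLenSrc > 0) {
        if (nLenOut + 4 > nLenDst) return 0;         // error: dest too small
        BYTE s1 = pSrc[0];
        BYTE s2 = 0; if (nLenSrc > 1) s2 = pSrc[1];  // the corrected checks:
        BYTE s3 = 0; if (nLenSrc > 2) s3 = pSrc[2];  // never read past the end
        pDst[0] = Base64Digits[(s1 >> 2) & 0x3f];
        pDst[1] = Base64Digits[((s1 & 0x03) << 4) | ((s2 >> 4) & 0x0f)];
        pDst[2] = Base64Digits[((s2 & 0x0f) << 2) | ((s3 >> 6) & 0x03)];
        pDst[3] = Base64Digits[s3 & 0x3f];
        if (nLenSrc < 3) pDst[3] = '=';              // pad incomplete triplets
        if (nLenSrc < 2) pDst[2] = '=';
        pDst += 4; nLenOut += 4;
        pSrc += 3; nLenSrc -= 3;
    }
    if (nLenOut < nLenDst) pDst[0] = 0;  // null-terminate if there is room
    return nLenOut;
}
```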

Author Comment

by:DanRollins
Thanks for the heads up.  
Both the ToBase64Simple() and ToBase64Fast() functions as originally published could, indeed, read beyond the end of the input buffer by one or two bytes.   I have corrected the source code, in the article.

None of my tests revealed incorrect output.   However, I did not validate against every possible input.  What input data did you use to detect the invalid output?


Expert Comment

by:jprichey
I should have said this in my first post - THANKS FOR THE GREAT ARTICLE!!!  I was so excited to help that I forgot my manners.

I think my test turned out to be dumb luck.  I simply needed an encryption routine that produced the same output in C++ and C# and wanted to use only Win32 and .NET APIs respectively.  However, since Win32 does not have a Base64 API, I needed a reasonable implementation, and yours fit the bill, because it was explained so simply and thoroughly.  My test was simply encrypting "HelloWorld" and comparing the output between the two implementations.  The output from the corrected code matched the C# and .NET exactly as follows:

Cryptobuf is:
    60 81 18 54 236 24 147 67 141 18 76 4 163 100 230 65
Cryptobuf in base64 is:
    PFESNuwYk0ONEkwEo2TmQQ==

Prior to the fix the base64 result was:
    PFESNuwYk0ONEkwEo2TmQf==

Because the cryptobuf is 16 bytes long (the natural result of TripleDES encryption), it didn't fit base64's 3-to-4 rule evenly and thus fell into the special-case code.  Because of the read beyond the end of the buffer, the last byte was picked up from the debug-mode "guard bytes" of the allocated cryptobuf, so both Simple and Fast had consistent data to convert and produced consistent results.  In release mode they might have produced different results because of the lack of "guard bytes" in the memory allocator (it depends on the library).

BTW, I found the buffer overrun by converting this code to C#.  When I got the inconsistent output from the C++ and C# routines, I originally assumed it was .NET, so I ported this code to C#.  When it ran I got an exception when the code attempted to read beyond the end of the buffer.

Thanks again!

Author Comment

by:DanRollins
Thanks!  Great response, and I really do appreciate your help.

I am very glad that I revisited this.  I just found another error in the code... In the ToBase64ATL() function, the ATL_BASE64_FLAG_NOPAD flag should not be used... it causes the function to omit the trailing equal signs (it seems strange that the option even exists).  I've updated the code in the article.

There's something interesting about your test string:
    BYTE pOrig[]="\x3c\x51\x12\x36\xec\x18\x93\x43\x8d\x12\x4c\x04\xa3\x64\xe6\x41";

All of the encoding functions (now) produce:
   PFESNuwYk0ONEkwEo2TmQQ==
but the other base64 string:
   PFESNuwYk0ONEkwEo2TmQf==
when decoded (by any of the decoding options, including the website in the References section) also produces the original data.

That is because the final quad of QQ== is equivalent to Qf==  ... as follows:
In the == scenario, there is only going to be one byte of output, but there are 12 input bits.  
Q is 0x10 and f is 0x1f  
   QQ is 0x00410000  while
   Qf is 0x0041f000
The low two bytes are discarded, so the difference between Q and f is ignored -- only the 0x41 part is used.

Thus, there are at least two "valid" ways to represent that final quad (actually, many more, for instance, Qe==, etc).

Any correctly-written decoder would produce the right output, given either base64 input.  

HOWEVER!!!

If you go to the original RFC, you'll see that the encoder is required to use zeros to flesh out an incomplete triplet (I wondered about that when I read it... "Self," I said to myself, "Isn't that irrelevant?")   So, even though Qf== decodes correctly, only QQ== is encoded correctly.

Lord help me, I really love this kind of puzzle!
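The quad arithmetic above is easy to check in a few lines (a sketch; DecodeLastQuad is a hypothetical helper, and digit values are found with strchr on the standard alphabet):

```cpp
#include <cstring>

// Decode the two significant digits of a final "xy==" quad into its
// single output byte.  The low 4 bits of the second digit are discarded,
// which is why QQ==, Qf==, Qe==, etc. all decode to the same byte.
unsigned char DecodeLastQuad(char c1, char c2)
{
    static const char digits[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    int n1 = (int)(strchr(digits, c1) - digits);  // all 6 bits are used
    int n2 = (int)(strchr(digits, c2) - digits);  // only the top 2 bits survive
    return (unsigned char)((n1 << 2) | (n2 >> 4));
}
```

With 'Q' = 0x10 and 'f' = 0x1f, both QQ== and Qf== yield 0x41, exactly as described above.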

Expert Comment

by:aikimark
@Dan

Once you've calculated the table (8k or 128k), could you turn it into a static resource?

Author Comment

by:DanRollins
Sure!  A minor tweak to the code in http:#c15564 can output the table to a text file, which could be added as a resource.  But it probably makes more sense to just put it right into the source code (a "prebuilt lookup table") as shown in that snippet.

--- Dan

Expert Comment

by:ignacvucko
Hi Dan,

Awesome article.
I'd like to use your 'fast-b64' routine in my app --- could you clarify the license?
Is it public domain / MIT / LGPL / other?

Thanks!
Ignac

Author Comment

by:DanRollins
Thanks!
You have my permission to use the code in any way you want.
-- Dan

Expert Comment

by:mrfreshly
Hi Dan,

Love the article.  

I'm doing some MSVC++ coding and thought I'd try your Base64Fast functions out.  

The first string I tried to encode was "encode me!" (10 bytes), which yielded "ZW5jb2RlIG1lIQ==".  But when I passed that back into the decode function, it returned 12 bytes: "encode me!¿p".

I tried "hello world!" which encoded to "aGVsbG8gd29ybGQh" and decoded back to "hello world!" perfectly.

I tried a four-letter word and the decode came back with 6 bytes, the extras looking like the same sort of non-alpha garbage characters.  So 4-character and 10-character sources seem to be causing the problem.

I tested the "ZW5jb2RlIG1lIQ==" string via the web based decoder and it decoded correctly.  There must be a gotcha in the FromBase64Fast code.  I'll update if/when I figure it out.

Update:


The condition of b2 or b3 being 99 never triggered when the "=" pad came up.  My solution is a total hack, but it seems to work for now; I'll update again with something sexier later.

Until then, consider these changes to the last section of code for "FromBase64Fast":
		if ((pwSrc[1] & 0x00FF)!=0x003D)  {
		//if (b2 != 99)  {
			if (nLenOut >= nLenDst)   return(0); // error
			*pDst++ = b2;  nLenOut++;
			if ((pwSrc[1] & 0xFF00)!=0x3D00)  {
		//		if (b3 != 99) {
				if (nLenOut >= nLenDst)   return(0); // error
				*pDst++ = b3;  nLenOut++;
			}
		}
//	*pDst = 0; // could terminate with NULL, here
	return(nLenOut);
}


I ran some generated data through it and it matched the results correctly.  More testing is needed, but so far so good.

Cheers,
Chris
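An alternative to masking the source WORDs for '=' is to count the trailing pad characters up front and compute the exact output length before decoding. A sketch (DecodedLength is a hypothetical helper, independent of the FromBase64Fast internals):

```cpp
// Expected decoded length of a padded base64 string: 3 bytes per 4 input
// characters, minus one byte per trailing '='.  Returns 0 for input whose
// length is not a positive multiple of 4.
int DecodedLength(const char* pSrc, int nLenSrc)
{
    if (nLenSrc <= 0 || (nLenSrc % 4) != 0) return 0;  // invalid input
    int nPad = 0;
    if (pSrc[nLenSrc - 1] == '=') nPad++;
    if (pSrc[nLenSrc - 2] == '=') nPad++;
    return (nLenSrc / 4) * 3 - nPad;
}
```

Knowing the output length in advance, the decoder can stop writing at exactly the right byte instead of inspecting each quad for padding.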
