Community Pick: Many members of our community have endorsed this article.
Editor's Choice: This article has been selected by our editors as an exceptional contribution.

Fast Base64 Encode and Decode

DanRollins
CERTIFIED EXPERT
Published:
Updated:
This article surveys and compares options for encoding and decoding base64 data.  It includes source code in C++ as well as examples of how to use standard Windows API functions for these tasks. We'll look at the algorithms — how encoding and decoding is usually done — and finally, I'll provide several examples of C++ source code, including a function that decodes base64 nearly ten times faster than commonly-used functions.


What is Base64?

Base64 is a way to represent binary data in a way that avoids some of the problems associated with binary data.  When you handle binary data, you need to take special care about embedded nulls, commas, quotes, apostrophes, the less-than, ampersand, and space characters, and so forth.  Base64 data, on the other hand, is simple text; you can assign it to a string data type, copy it, scan, and transfer it without worries.

You may see base64-encoded data in emails and in XML CDATA sections. If you have ever worked with HTTP headers, you have probably seen:
Content-Type: application/x-www-form-urlencoded
                      Authorization: Basic TXlVc2VySUQ6TXlQYXNzd29yZA==

Open in new window

The text after the word, Basic is a base64-encoding of a username and password (specifically, that string encodes MyUserName:MyPassword).

You may see base64 used to transport encrypted data, which often contains such otherwise "awkward" characters.  In an earlier article, Easy String Encryption Using CryptoAPI in C++, I recommended converting encrypted data to hexadecimal to avoid special-character handling problems. Base64 can be used similarly, and it has one advantage over hexadecimal encoding: It is more concise.  While hex-encoded data is twice the size of the original binary data, base64-encoded data is only 1.33 times the size. For instance, if a hexadecimal encoding is 5,000 bytes long, a base64 encoding of the same source data will be 3,336 bytes long (saving 1664 bytes of transmission bandwidth).

Binary uses two digits (01), base 10 — decimal — uses 10 digits (0123456789), base 16 — hexadecimal — uses 16 (0123456789ABCDEF). So, as you might expect, there are 64 "digits" used in base64 encoding:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

Each digit encodes six bits of source data.  So it takes four digits to encode three bytes of source data (4*6 = 3*8 = 24 bits).  This makes encoding and decoding much trickier than hex encoding.  Also, there is the issue that any base64-encoded data must end up as an exact multiple of four digits.  Since the source data is often not an exact multiple of three bytes in length, you need special end-of-string logic to pad the base64 output when encoding and to ignore that padding when decoding.


Encoding and Decoding Options

We will look at four options available to C++ programmers who need to encode and decode base64:

Windows CryptoAPI functions
The ATL (Active Template Library) functions
A pair of straightforward functions that I wrote
My ultimate high-speed functions
For each option, I wrote a small wrapper so that I could do testing with uniform parameters.  At the end of this article, I'll show a timing/performance comparison chart, and I'll provide links to documentation of the standard library functions.

For years, I've done my base64 work using a C++ class object written by PJ Naughter. I wanted to include his code in this article, but when I surfed around to find it, I learned that his programs no longer contain the base64-handling functions.  He switched to using the ATL functions a while back.  So I'll be showing that option.

The last one I'll show was a result of a "recreational programming" challenge to myself: Can I write code that will out-perform Microsoft? You might consider using this high-speed code if, for instance, your website must encode many large PDF or image files for transmission.


Windows CryptoAPI: CryptBinaryToString and CryptStringToBinary

The Windows CryptoAPI provides a set of general-purpose functions (CryptBinaryToString and CryptStringToBinary) that support base64 encoding and decoding. The following is a pair of functions that wrap that API:
#include <Wincrypt.h>
                      #pragma comment (lib, "Crypt32.lib")
                      
                      int ToBase64Crypto( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         DWORD nLenOut= nLenDst;
                         BOOL fRet= CryptBinaryToString(
                            (const BYTE*)pSrc, nLenSrc,
                            CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF,
                            pDst, &nLenOut
                         );
                         if (!fRet) nLenOut=0;  // failed
                         return( nLenOut );
                      }
                      int FromBase64Crypto( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         DWORD nLenOut= nLenDst;
                         BOOL fRet= CryptStringToBinary(
                            (LPCSTR)pSrc, nLenSrc,
                            CRYPT_STRING_BASE64,
                            (BYTE*)pDst, &nLenOut,
                            NULL,        // pdwSkip (not needed)
                            NULL         // pdwFlags (not needed)
                         );
                         if (!fRet) nLenOut=0;  // failed
                         return( nLenOut );
                      }

Open in new window

I used these CryptoAPI functions to validate my own functions and for performance testing. These were, as I expected, poor performers.  But I really can't fault you if you use these functions.  They are reliable and bullet-proof.


ATL functions: Base64Encode and Base64Decode

The ATL header files that have come with VC++ since VS2005 provide a nice pair of base64 tools (though the decoder is rather slow). The following is my wrapper for those functions.
#include <atlenc.h>
                      int ToBase64ATL( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         int nDestLen= nLenDst;
                         BOOL fRet= Base64Encode( pSrc, nLenSrc, pDst, &nDestLen,  ATL_BASE64_FLAG_NOCRLF );
                         if (!fRet) nDestLen=0;
                         return( nDestLen );
                      }
                      int FromBase64ATL( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         int nDestLen= nLenDst;
                         BOOL fRet= Base64Decode( (LPCSTR)pSrc, nLenSrc, (BYTE*)pDst, &nDestLen );
                         if (!fRet) nDestLen=0;
                         return( nDestLen );
                      }

Open in new window

The encoder turned out to be quite speedy, but I was not surprised at how poorly the decoder performed — ATL (and STL) code is generally written with flexibility as a priority (various CPUs, different character widths, special options such as handling embedded CRLFs, and so forth.) Even so, these are a good choice for your production code — you can be certain that these functions are in use at many thousands of sites that depend upon it for mission-critical operations.

An advantage of these functions is that the source code is on your disk. If you want to see how it's done, this is some high-quality code for you to review.


My own straightforward functions

These functions illustrate the bit manipulation needed for encoding and decoding.


Simple Version: Encoding

The base64 encoding algorithm can be described simply:
Read three bytes of input (24 bits).  For each 6 bits, lookup which base64 "digit" should be output, and  write its 1-byte value; write four output bytes for every three bytes of input.  Repeat until finished.
//  simple version illustrates the basic encode algorithm
                      //
                      static char Base64Digits[] =
                       "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
                      //
                      int ToBase64Simple( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         int nLenOut= 0;
                         while ( nLenSrc > 0 ) {
                            if (nLenOut+4 > nLenDst) return(0); // error
                      
                            // read three source bytes (24 bits) 
                            BYTE s1= pSrc[0];   // (but avoid reading past the end)
                            BYTE s2= 0; if (nLenSrc>1) s2=pSrc[1]; //------ corrected, thanks to  jprichey
                            BYTE s3= 0; if (nLenSrc>2) s3=pSrc[2];
                      
                            DWORD n;
                            n =  s1;    // xxx1
                            n <<= 8;    // xx1x
                            n |= s2;    // xx12  
                            n <<= 8;    // x12x
                            n |= s3;    // x123  
                      
                            //-------------- get four 6-bit values for lookups
                            BYTE m4= n & 0x3f;  n >>= 6;
                            BYTE m3= n & 0x3f;  n >>= 6;
                            BYTE m2= n & 0x3f;  n >>= 6;
                            BYTE m1= n & 0x3f;  
                      	
                            //------------------ lookup the right digits for output
                            BYTE b1 = Base64Digits[m1];
                            BYTE b2 = Base64Digits[m2];
                            BYTE b3 = Base64Digits[m3];
                            BYTE b4 = Base64Digits[m4];
                      
                            //--------- end of input handling
                            *pDst++ = b1;
                            *pDst++ = b2;
                            if ( nLenSrc >= 3 ) {  // 24 src bits left to encode, output xxxx
                               *pDst++ = b3;
                               *pDst++ = b4;
                            }
                            if ( nLenSrc == 2 ) {  // 16 src bits left to encode, output xxx=
                               *pDst++ = b3;
                               *pDst++ = '=';
                               }
                            if ( nLenSrc == 1 ) {  // 8 src bits left to encode, output xx==
                               *pDst++ = '=';
                               *pDst++ = '=';
                            }
                            pSrc    += 3;
                            nLenSrc -= 3;
                            nLenOut += 4;
                         }
                         // Could optionally append a NULL byte like so:
                         // *pDst++= 0; nLenOut++;
                         return( nLenOut );  
                      }

Open in new window

Once you have worked out the bit-handling (I tried a number of methods; I think that the one above is the easiest to follow), there is still the issue of what to do at the end of the input. There might NOT be a multiple of three bytes in the final pass through the loop.

This creates two problems: firstly, it would be improper (even fatal) to read even one byte beyond the end of the input buffer, so lines 14-15 simulate an input value of 0 when that would happen; secondly, when only one or two source bytes remain, the coded string must end with one or two padding characters.  Both exceptions require special handling.

If you have an eye for performance optimization, you will see that the above implementation is very inefficient.  Lines 39-50 check for the end-of-buffer cases in every pass through loop. Imagine encoding a 3 Mb buffer of data — the many "if" statements must be executed one million times. The above code also checks for overrun of the caller's output buffer each time through the loop. In the ultra-fast version, we'll move all of the special handling to the outside of the loop (and we'll use a few other optimization tricks, to boot).


Simple Version: Decoding

The decoding algorithm can also be stated simply:
Read four bytes (32 bits) of input (they are base64 digits); look up the binary value of each digit (six bits, values 0 to 63); rearrange those bits into 24 bits (three bytes) of the original source data, and write them out in the correct order.
And once again, we have end-of-buffer issues: If we get an equal sign, we know not to write a byte of output.  More specifically, instead of three bytes, we must output just one or two.

The table below is used to convert base64 digits and padding (A-Z, a-z, 0-9, +, /, or =) into a binary value.  The table is 256 bytes long so that the 8-bit character code from the input can be used directly as an index into the table.  Note that UNICODE or UTF or character translation issues don't apply here... the table just holds an ordered sequence of 8-bit binary values.
static BYTE LookupDigits[] = {
                      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, //gap: ctrl chars
                      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, //gap: ctrl chars
                      0,0,0,0,0,0,0,0,0,0,0,           //gap: spc,!"#$%'()*
                      62,                   // +
                       0, 0, 0,             // gap ,-.
                      63,                   // /
                      52, 53, 54, 55, 56, 57, 58, 59, 60, 61, // 0-9
                       0, 0, 0,             // gap: :;<
                      99,                   //  = (end padding)
                       0, 0, 0,             // gap: >?@
                       0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,
                      17,18,19,20,21,22,23,24,25, // A-Z
                       0, 0, 0, 0, 0, 0,    // gap: [\]^_`
                      26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,
                      43,44,45,46,47,48,49,50,51, // a-z    
                       0, 0, 0, 0,          // gap: {|}~ (and the rest...)
                      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
                      };

Open in new window

For instance, if we see a z in the input, we can use ...
    n= LookupDigits[ 'z' ];
... to obtain the 6-bit value, 51 (0x33, 00110011b), that z encodes.  If we see an equal sign (=), the looked-up value will be 99 which we can use as a flag during end-of-buffer handling.

So here's the relatively simple, though pretty fast, code:
//  simple version illustrates the basic decode algorithm
                      //
                      int FromBase64Simple( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         int nLenOut= 0;
                         for( int j=0; j<nLenSrc; j+=4 ) {
                            if ( nLenOut > nLenDst ) {
                               return( 0 ); // error, buffer too small
                            }
                            BYTE s1= LookupDigits[ *pSrc++ ];
                            BYTE s2= LookupDigits[ *pSrc++ ];
                            BYTE s3= LookupDigits[ *pSrc++ ];
                            BYTE s4= LookupDigits[ *pSrc++ ];
                      
                            BYTE d1= ((s1 & 0x3f) << 2) | ((s2 & 0x30) >> 4);
                            BYTE d2= ((s2 & 0x0f) << 4) | ((s3 & 0x3c) >> 2);
                            BYTE d3= ((s3 & 0x03) << 6) | ((s4 & 0x3f) >> 0);
                      
                            *pDst++ = d1;  nLenOut++;
                            if (s3==99) break;      // end padding found
                            *pDst++ = d2;  nLenOut++;
                            if (s4==99) break;      // end padding found
                            *pDst++ = d3;  nLenOut++;
                         }
                         return( nLenOut );
                      }

Open in new window

Note: For simplicity, the above code assumes well-formed input data (an exact multiple of four in length, with trailing '=' bytes as padding).
As with the simple encoder, this simple decoder could can also be improved by moving the end-of-buffer handling outside of the main loop.   I tried other tricks...

In an earlier article on hexadecimal conversion, I was able to squeeze out a bit more speed by reading and writing in 32-bit DWORD blocks.   I tried DWORD reads here, but perhaps because of CPU and L1 caching differences or perhaps because of the overhead of rearranging the "little-endian" bytes, I could not get a performance increase.  I went with the above since it is quite straightforward.

The incoming digits must be handled in a very specific way.  Lines 10-13 convert the input into a set of four 6-bit values.  Lines 15-17 pack the total of 24 bits into three bytes.  The lines mean: (15) the bits of the first digit are moved all the way to the left of the first byte, to which two bits of the second digit are added; (16) the remaining four bits are assembled with four bits of the third digit; (17) the two remaining bits and the entire last digit make up the third byte.  This is illustrated below.
// d1        d2      d3         s1       s2       s3       s4  
                      // 00000000 00000000 00000000   00aaaaaa 00bbbbbb 00cccccc 00dddddd 
                      // 00aaaaaa 00000000 00000000   s1 << 2          
                      // aaaaaa00 00000000 00000000            000000bb s2 >> 4
                      // aaaaaabb 00000000 00000000            
                      // aaaaaabb 00000000 00000000            bbbb???? s2 << 4
                      // aaaaaabb bbbb0000 00000000                     0000cccc s3 >> 2
                      // aaaaaabb bbbbcccc 00000000                     cc?????? s3 << 6
                      // aaaaaabb bbbbcccc cc000000                             00dddddd
                      // aaaaaabb bbbbcccc ccdddddd                              
                       

Open in new window

Now we're getting into my favorite part of this type of "recreational programming":
— Is there a better way to do that last part?
— There is!
   DWORD n32;
                         n32  = s1; n32 <<= 6;
                         n32 |= s2; n32 <<= 6;
                         n32 |= s3; n32 <<= 6;
                         n32 |= s4;
                      
                         BYTE d3= ( n32 & 0x00ff ); n32 >>= 8;
                         BYTE d2= ( n32 & 0x00ff ); n32 >>= 8;
                         BYTE d1= ( n32 & 0x00ff ); 

Open in new window

This variation rolls the bits into a temporary DWORD variable and back out — all in a shorter series of steps, and that turns out to be very efficient.
// DWORD temp variable                  four 6-bit values;
                      // 00000000 00000000 00000000 00000000  00aaaaaa 00bbbbbb 00cccccc 00dddddd 
                      // 00000000 00000000 00000000 00aaaaaa  -------- 00bbbbbb 00cccccc 00dddddd 
                      // 00000000 00000000 0000aaaa aa??????          
                      // 00000000 00000000 0000aaaa aabbbbbb           -------- 00cccccc 00dddddd 
                      // 00000000 000000aa aaaabbbb bb??????                    
                      // 00000000 000000aa aaaabbbb bbcccccc                    -------- 00dddddd 
                      // 00000000 aaaaaabb bbbbcccc cc??????                    
                      // 00000000 aaaaaabb bbbbcccc ccdddddd                             --------

Open in new window

This has fewer shifting and masking operations, but perhaps more importantly, the C++ compiler optimizes some of the steps into simple 8-bit register loads and other fast operations.


My High-Performance Functions

The following functions use optimization tricks and techniques to improve performance.


High-Speed Version: Encoding

This is similar to the "simple" version earlier, but I've taken the end-of-buffer handling out of the inner loop. I also improved upon the "DWORD roll-in" technique, by writing the DWORD out as a single operation.

// fastest encoding variation I've come up with (Not really, see comments after the article!)
                      //
                      int ToBase64Fast( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         int nLenOut= ((nLenSrc+2)/3)*4; // 4 out for every 3 in, rounded up
                         if ( nLenOut+1 > nLenDst ) {
                               return( 0 ); // fail!
                         }
                         DWORD* pdwDst= (DWORD*)pDst;
                         while( nLenSrc > 2 ) {
                            DWORD n= pSrc[0];  // xxx1 // roll the data in...
                            n <<= 8;           // xx1x
                            n |= pSrc[1];      // xx12  
                            n <<= 8;           // x12x
                            n |= pSrc[2];      // x123  
                      
                            BYTE m4= n & 0x3f;  n >>= 6;  // roll it out, 6 bits at a time
                            BYTE m3= n & 0x3f;  n >>= 6;
                            BYTE m2= n & 0x3f;  n >>= 6;
                            BYTE m1= n & 0x3f;  
                      
                            n  = Base64Digits[m4];  n <<= 8; 
                            n |= Base64Digits[m3];  n <<= 8;
                            n |= Base64Digits[m2];  n <<= 8;
                            n |= Base64Digits[m1]; 
                      
                            *pdwDst++ = n;   // write four bytes in one operation
                            nLenSrc -= 3;
                            pSrc += 3;
                         }
                         // -------------- end of buffer special handling (see text)
                         pDst= (char*)pdwDst;
                         if ( nLenSrc > 0 ) {  // some left after last triple
                            int n1= (*pSrc & 0xfc) >> 2;
                            int n2= (*pSrc & 0x03) << 4;
                            if (nLenSrc > 1 ) {  // corrected.  Thanks to jprichey
                                  pSrc++;
                                  n2 |= (*pSrc & 0xf0) >> 4;
                            }
                            *pDst++ = Base64Digits[n1]; // encode at least 2 outputs
                            *pDst++ = Base64Digits[n2];
                            if (nLenSrc == 2) {  // 2 src bytes left to encode, output xxx=
                               int n3= (*pSrc & 0x0f) << 2;
                               pSrc++;
                               n3 |= (*pSrc & 0xc0) >> 6;
                               *pDst++ = Base64Digits[n3];
                            }
                            if (nLenSrc == 1) {  // 1 src byte left to encode, output xx==
                               *pDst++ = '=';
                            }
                            *pDst++ = '=';
                         }
                         // *pDst= 0; nLenOut++ // could terminate with NULL, here
                         return( nLenOut );
                      }

Open in new window

The trick in lines 22-25 is to get the bytes in the right order in the DWORD so that the "little endian" output is in the right order. It's easy once you've seen it done :-)

Also, note that the function is overall much longer. But all of the additional code is repeated logic from the inner loop code... removed from the inner loop to avoid if/then conditional jumps. The CPU can spin through the inner loop without pre-fetch flushing or other unwanted interruptions.

I wrote some assembly-language code to replace the inner loop, but I could not get a performance boost. The compiler is much smarter than me... it shuffles the code sequence in strange ways (to, I presume, avoid pipeline stalls) and delving into that would be a topic for a different day.


High-Speed Version: Decoding

Some will say that with this function, I've edged over from the sublime and stepped into a steaming stinky pile of the ridiculous.  That's because the following base64 decoding function uses a 128Kb lookup table.

Yes, you heard me right.  Before running the first time, this function allocates and builds a 131,072-byte lookup table.  The result is that the inner loop can lookup values of two base-64 digits at the same time.  You can make the table smaller (as little as about 30K) and still do a 16-bit lookup, but that would involve doing some calculations on each byte of the input data — likely costing a significant fraction of the performance gain.

Here's the code:
// Utra-Fast base64-decoding, using 128KB lookup table
                      int FromBase64Fast( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
                      {
                         if ( NULL==gpLookup16 ) SetupLookup16();  // see below
                         int nLenOut= 0;
                         if ( nLenDst < ((nLenSrc / 4)-1) * 3 ) {
                            return (0); // (buffer too small)
                         }
                         int nLoopMax= (nLenSrc/4)-1;
                         WORD* pwSrc= (WORD*)pSrc;
                         for( int j=0; j<nLoopMax; j++ ) {
                            WORD s1= gpLookup16[ pwSrc[0] ];  // translate two "digits" at once
                            WORD s2= gpLookup16[ pwSrc[1] ];  // ... and two more
                      
                            DWORD n32;
                            n32  = s1;         // xxxxxxxx xxxxxxxx xx111111 222222xx
                            n32 <<= 10;        // xxxxxxxx 11111122 2222xxxx xxxxxxxx
                            n32 |= s2 >> 2;    // xxxxxxxx 11111122 22223333 33444444
                      
                            BYTE b3= ( n32 & 0x00ff ); n32 >>= 8;  // in reverse (WORD order)
                            BYTE b2= ( n32 & 0x00ff ); n32 >>= 8;
                            BYTE b1= ( n32 & 0x00ff ); 
                      
                            // *pDst++ = b1;  *pDst++ = b2;  *pDst++ = b3;  //slighly slower
                      
                            pDst[0]= b1;  // slightly faster
                            pDst[1]= b2;
                            pDst[2]= b3; 
                      
                            pwSrc += 2;
                            pDst += 3;
                         }
                         nLenOut= ((nLenSrc/4)-1) * 3;
                      
                         //-------------------- special handling outside of loop for end
                         WORD s1= gpLookup16[ pwSrc[0] ];
                         WORD s2= gpLookup16[ pwSrc[1] ]; 
                      
                         DWORD n32;
                         n32  = s1;      
                         n32 <<= 10;
                         n32 |= s2 >> 2;
                      
                         BYTE b3= ( n32 & 0x00ff ); n32 >>= 8;
                         BYTE b2= ( n32 & 0x00ff ); n32 >>= 8;
                         BYTE b1= ( n32 & 0x00ff ); 
                      
                         if (nLenOut >= nLenDst)       return(0); // error
                         *pDst++ = b1;  nLenOut++;
                         if (b2 != 99)  {
                            if (nLenOut >= nLenDst)   return(0); // error
                            *pDst++ = b2;  nLenOut++;
                         }
                         if (b3 != 99) {
                            if (nLenOut >= nLenDst)   return(0); // error
                            *pDst++ = b3;  nLenOut++;
                         }
                         return( nLenOut );
                      } 
                      
                      //----------------------------------------------------
                      // Using two-byte lookup table
                      // must call here before calling the above
                      //----------------------------------------------------
                      WORD* gpLookup16= 0;
                      
                      void SetupLookup16()
                      {
                         int nLenTbl= 256*256;            // yes, the table is 128Kb!
                         if (NULL==gpLookup16) {
                            gpLookup16= new WORD[nLenTbl];
                         }
                         WORD* p= gpLookup16;
                         for (int j= 0; j<256; j++ ) {
                            for (int k= 0; k<256; k++ ) {
                               WORD w;
                               w  = LookupDigits[k] << 8;
                               w |= LookupDigits[j] << 2 ; // pre-shifted! See notes
                               *p++ = w;
                            }
                         }
                      }

Open in new window

Note the simplicity of the inner loop (lines 12-31).  This is partially thanks to an optimization I discovered in creating the lookup table.  I pre-shifted the data values in the table to avoid a series of mask-and-shift operations in the loop.  Also, the one-byte variables (b1,b2,b3) end up being optimized into 8-bit register values by the compiler.  The entire inner loop is just 56 opcodes long!

The function calls SetupLookup16() the first time it is used.  If you were to wrap this into a C++ object, you would persist that table as an object member and initialize it in the constructor.  The time spent building the table is spread out across all invocations of the decoding function.

Is this at all practical? Probably not.

While I can imagine a web server needing an ultra-fast encoder to process Gigabytes of outgoing PDF files, it is difficult to come up with a scenario where high-speed decoding was worth the price in memory usage.  On the other hand, with 6GB of RAM coming standard on many home computers, 128K is only a tiny fraction of 1% of the system memory.  So, as Tom Cruise says in "Risky Business" — What the heck!


Summary

This article reviewed various options for encoding and decoding base64 data. We looked at the algorithms and at C++ code to implement them.

I did time trials on the various options and the results are shown in the following table:
                   Rate    Approximate 
                      Version:          (Kb/ms)  Ratio
                      -----------------  -----   -----------
                      ToBase64Crypto     133     3.4 : 1     
                      ToBase64ATL        420     1.1 : 1     
                      ToBase64Simple     224     1.9 : 1     
                      ToBase64Fast       446       1 : 1     
                      
                      FromBase64Crypto    92     8.7 : 1     
                      FromBase64ATL       84     9.5 : 1     
                      FromBase64Simple   260     3.1 : 1     
                      FromBase64Fast     805       1 : 1     

Open in new window

Notes:

Added post-publication:
A significantly faster encoding algorithm has been posted in the comments, below.  See
      https://www.experts-exchange.com/A_3216.html#c15564
By using an 8Kb lookup table, the encoder can process 688 Kb/ms (50% faster!)
The CryptoAPI encoder is much slower even than an unoptimized "roll your own" encoder.
The ATL encoder is very fast, within the margin of error of the speed of my best effort in this article (Added: But NOT as good as the one in the article comments, here).
For decoding, even the "simple" version that I wrote is considerably faster than either the ATL version or the CryptoAPI version.
My ultra-fast decoder is nearly ten times as fast as other available options, but it has the drawback of using a 128Kb lookup table -- which is probably impractical.  Note, too that timing results do not include time taken to create that lookup table.
Testing methodology:

I read-in a 1MB JPG file and ran 1000 iterations on it. I timed only the outer loop, and I used the relatively low-resolution GetTickCount() function. However, the duration of each test was at least several seconds (minimizing resolution issues) and results were repeatable, with only small variations.

All code versions were 32-bit executables built in Release mode, with full optimization enabled.

I ran the tests on a 2.60 GHz AMD Athlon II QuadCore 620 CPU. While running, my CPU meter went to 28%, and indicated that one core was spiking at 100%.

Writing and testing this code was fun and challenging.  If you are interested in trying your hand at writing faster versions, then I'd be delighted to look at your code.  I'm quite certain that hand-optimized ASM code, and especially code that uses 64-bit registers and advanced MMX/SSE/AVX operations, can out-perform anything I've tried here.  But I'd especially like to see standard 32-bit code in which you have implemented an improved algorithm.  Perhaps a multi-threaded version that could take advantage of those idle cycles on a secondary CPU core...


Resources:

Formal description of base64 (RFC 2045)
   http://www.ietf.org/rfc/rfc2045.txt

Wikipedia article, Base64 (excellent coverage and quirk description)
  http://en.wikipedia.org/wiki/Base64

MSDN Documentation:
   CryptBinaryToString   http://msdn.microsoft.com/en-us/library/aa379887(v=VS.85).aspx
   CryptStringToBinary  http://msdn.microsoft.com/en-us/library/aa380285(VS.85).aspx

   ATL Base64Encode http://msdn.microsoft.com/en-us/library/ezec6x4e(v=VS.80).aspx
   ATL Base64Decode http://msdn.microsoft.com/en-us/library/2fzdww6e(VS.80).aspx

A website where you can encode or decode base64:
   http://www.opinionatedgeek.com/dotnet/tools/base64encode/

Script-accessible way to encode and decode base64 (uses MSXML)
   http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
   http://www.nonhostile.com/page-vb-net-base64-encoding-decoding.asp

Related Articles Written by Dan Rollins:
    Easy String Encryption Using CryptoAPI in C++  
    Fast Hexadecimal Encode and Decode  
    Convert String to int / Convert int to String in C++
    Binary Bit Flags: Tutorial and Usage Tips

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
If you liked this article and want to see more from this author, please click the Yes button near the:
      Was this article helpful?
label that is just below and to the right of this text.   Thanks!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
5
43,630 Views
DanRollins
CERTIFIED EXPERT

Comments (12)

aikimarkGet vaccinated; Social distance; Wear a mask
CERTIFIED EXPERT
Top Expert 2014

Commented:
@Dan

Once you've calculated the table (8k or 128k), could you turn it into a static resource?
CERTIFIED EXPERT
Author of the Year 2009

Author

Commented:
Sure!  A minor tweak to the code in http:#c15564 can output the table to a text file, which could be added as a resource.  But it probably makes more sense to just put it right into the source code (a "prebuilt lookup table") as shown in that snippet.

--- Dan
Hi Dan,

Awesome article.
I'd like to use your 'fast-b64' routine in my app --- could you clarify the license?
Is it public domain / MIT / LGPL / other?

Thanks!
Ignac
CERTIFIED EXPERT
Author of the Year 2009

Author

Commented:
Thanks!
You have my permission to use the code in any way you want.
-- Dan
Hi Dan,

Love the article.  

I'm doing some MSVC++ coding and thought I'd try your Base64Fast functions out.  

The first string I tried to encode was "encode me!" (10 bytes) which yielded "ZW5jb2RlIG1lIQ==".  But then when I passed that back into the decode function it returns 12 bytes "encode me!¿p".

I tried "hello world!" which encoded to "aGVsbG8gd29ybGQh" and decoded back to "hello world!" perfectly.

I tried a four letter word and it came back with 6 after the decode that looked similar to other extra non-alpha characters.  So, 4 character and 10 character sources seem to be causing a problem.

I tested the "ZW5jb2RlIG1lIQ==" string via the web based decoder and it decoded correctly.  There must be a gotcha in the FromBase64Fast code.  I'll update if/when I figure it out.

Update:


The condition of b2 or b3 being 99 never triggered when the "=" pad came up.  My solution is a total hack but it seems to work for now, I'll update again with something sexier later.

Until then, consider these changes to the last section of code for "FromBase64Fast":
		if ((pwSrc[1] & 0x00FF)!=0x003D)  {
		//if (b2 != 99)  {
			if (nLenOut >= nLenDst)   return(0); // error
			*pDst++ = b2;  nLenOut++;
			if ((pwSrc[1] & 0xFF00)!=0x3D00)  {
		//		if (b3 != 99) {
				if (nLenOut >= nLenDst)   return(0); // error
				*pDst++ = b3;  nLenOut++;
			}
		}
//	*pDst = 0; // could terminate with NULL, here
	return(nLenOut);
}

Open in new window

I ran some generated data through it and it matched the results correctly.  More testing is needed, but so far so good.

Cheers,
Chris

View More

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.