# How fast is your C/C++ Base64 Encoder?

Are you up for a challenge?
Several of us C and ASM programmers had a fun learning experience a few years ago when we tried to outdo each other by writing "The World's Fastest" binary-to-hex converter and output-formatting function.  Now I'm throwing out the gauntlet again:

    See if you can write a C/C++ function that encodes binary data into a
    base64 output string -- and see if yours is faster than everybody else's.

For additional reference, the ASM version of this question is here: http:Q_21983447.html

A Base64 Encoder converts binary data into ASCII codes that can be transmitted/processed by systems that do not normally process raw binary... for instance, in email attachments, in HTTP headers, and as values in HTTP URLs.  The encoder outputs only the following characters:
       ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
plus, the output may end in zero, one, or two equal signs (=).

To encode the raw binary, you access it three bytes at a time, taking the data in groups of six bits.  For instance:
Hex:      0x07 0xFF 0x02
Binary:   0000 0111 1111 1111 0000 0010   (groups of 4 bits)
Binary:   000001 111111 111100 000010     (groups of 6 bits)
          \____/ \____/ \____/ \____/
decimal      1      63     60      2
lookup       B      /       8      C

The length of the output must always be a multiple of four, because when you DEcode the data, you will be taking groups of four base64 "digits" and turning them back into three binary bytes of output.  Therefore, when ENcoding, if your input data is not a multiple of three bytes, you will need to pad the end of the output with one or two equal signs (=).
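To make the grouping and padding rules concrete, here is a minimal, unoptimized reference encoder -- a sketch for checking correctness, not a speed entry. The name ToBase64Ref and the std::string return type are my own choices, not part of any submission:

```cpp
#include <cassert>
#include <string>

// The 64 output digits, indexed 0..63.
static const char kDigits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Reference encoder: 3 input bytes -> 4 output digits, '=' padding at the end.
std::string ToBase64Ref(const unsigned char* src, int len)
{
    std::string out;
    out.reserve(((len + 2) / 3) * 4);       // output is always a multiple of 4
    int i = 0;
    for (; i + 3 <= len; i += 3) {          // full 3-byte groups
        unsigned d = (src[i] << 16) | (src[i + 1] << 8) | src[i + 2];
        out += kDigits[(d >> 18) & 0x3F];
        out += kDigits[(d >> 12) & 0x3F];
        out += kDigits[(d >> 6) & 0x3F];
        out += kDigits[d & 0x3F];
    }
    int rem = len - i;                      // 0, 1, or 2 leftover bytes
    if (rem == 1) {
        out += kDigits[src[i] >> 2];
        out += kDigits[(src[i] << 4) & 0x30];
        out += "==";
    } else if (rem == 2) {
        out += kDigits[src[i] >> 2];
        out += kDigits[((src[i] << 4) & 0x30) | (src[i + 1] >> 4)];
        out += kDigits[(src[i + 1] << 2) & 0x3C];
        out += '=';
    }
    return out;
}
```

For the worked example above, the bytes 0x07 0xFF 0x02 encode to "B/8C".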

Also, to verify your output, you can use an Encoder at this URL:
   http://www.opinionatedgeek.com/dotnet/tools/Base64Encode/Default.aspx

=-=-=-=-=-=-=-=-=-=-=
I'll be speed-testing the submissions using code like this:

//---------------------------------------- (assume dest buf is large enough)
void ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
        // your code here!
}

I've already posted one C function here: http:/Assembly/Q_21983447.html#17516191

=-=-=-=-=-=-=-=-=-=-=
I'm betting that the ASM code will out-perform even the most optimized C code.  But as we saw in the Hex-converter challenge, a clever algorithm can often win over even the best hand-optimized ASM code (of course the ASM programmers can then implement the same algorithm and shave a few milliseconds and will surely win :-)

So submit your function and I'll post timing/speed comparisons.  The timings will measure thousands of passes at a large buffer of random input data, so if that changes the way you write your code, then keep it in mind.

Submit your C/C++ function and I'll measure it against everybody else's as well as the ASM versions and any standard library calls and API functions that I can find.  If this works out, I'll start a second challenge for a Base64 DEcoder.

Are you game?  500 points and your reputation as a hotshot C programmer are on the line :-)

-- Dan

Infinity08

Just a question: is this algorithm-only, or can we use a lookup table?
grg99

Hate to be a killjoy, but this is an academic exercise only -- in the real world, CPUs are tens to hundreds of times faster than any I/O device, even gigabit ethernet, so doing base64 conversion is not a significant bottleneck.

Still it's a fun challenge-- go to it!

DanRollins

grg99,
It's not entirely pointless.  Imagine that your server must send thousands of 4MB video clips as email attachments.  The encoding *might* become a throughput bottleneck if the CPU gets maxed out by multiple threads doing various related activities.  Decoding might be more of an issue -- the time spent in transit is a constant, but once it is on your computer, an efficient decoder could prevent delay between clicking the attachment and watching the video.

But, as you say, the practicality of the challenge is not the issue :-)

Infinity08,
There is likely to be at least one lookup table employed by most algorithms.  In fact, it would be interesting to see if there is a way to convert *without* a lookup table.  For instance, binary-to-hex conversion *could* use a simple lookup in a table like

    char table[]= "0123456789ABCDEF";
    c= table[b];   // where b starts as a 4-bit nybble and c ends up as an ASCII char

but most binary-to-hex algorithms do it without a lookup; e.g.,
    if (b>9) b += 7;    // where b is a 4-bit nybble
    c= b + '0';         // c is now a hex digit
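The two hex schemes can be written side by side and cross-checked for all 16 nybbles. This is just an illustrative sketch; the function names are mine:

```cpp
#include <cassert>

// Scheme 1: table lookup.
char HexByTable(unsigned b)             // b is a 4-bit nybble (0..15)
{
    static const char table[] = "0123456789ABCDEF";
    return table[b & 0x0F];
}

// Scheme 2: no table -- adding 7 skips the ":;<=>?@" gap between '9' and 'A'
// in the ASCII chart.
char HexByMath(unsigned b)
{
    b &= 0x0F;
    if (b > 9) b += 7;
    return char(b + '0');
}
```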

-- Dan
Infinity08

>> There is likely to be at least one lookup table employed by most algorithms.  In fact, it would be interesting to see if there is a way to convert *without* a lookup table.
I should have clarified: I wasn't referring to without versus with a lookup table, but whether you can use a lookup table to replace your algorithm. Just a practical question :)
DanRollins

For instance, one calculation scheme:

      switch (b) {                        // b is a 6-bit value (0...63)
          case 0x3e: b='+'; break;        // special cases (not part of any sequence)
          case 0x3f: b='/'; break;
          default:
              if (b>25) b += 6;           // Not one of "ABC...XYZ"
              if (b>57) b -= 75;          // Not one of "abc...xyz" -- must be "012...789"
              b += 'A';
      }
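Wrapped in a function (my own wrapper, for illustration), the switch scheme can be checked against the plain 64-character digit string for every 6-bit input:

```cpp
#include <cassert>

// Dan's calculation scheme as a function: maps a 6-bit value to its base64 digit.
char Base64DigitBySwitch(unsigned b)        // b is a 6-bit value (0..63)
{
    switch (b) {
        case 0x3e: return '+';              // special cases (not part of any sequence)
        case 0x3f: return '/';
        default:
            if (b > 25) b += 6;             // not one of "ABC...XYZ"
            if (b > 57) b -= 75;            // not "abc...xyz": must be "012...789"
            return char(b + 'A');
    }
}
```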
DanRollins

>>... but whether you can use a lookup table to replace your algorithm.

My reply is that a lookup operation (whatever part of the algorithm it "replaces") is not only valid, but likely to be used in the fastest algorithm.  There are no particular constraints other than that the encoder needs to be fast if you want a chance to win the challenge :-)
Infinity08

Perfect :)
Infinity08

Oh, and how much time do we have?
grg99

Ahh, you might not want to use a lookup table.  If you're encoding data, that's going to be flowing thru the cache, very likely flushing out the lookup table all the time.  That means it's often going to take a whole memory access time, approx 3 nsec (like, 18 instructions) to get ONE table value!

Only way to tell for sure is to time the code.

DanRollins

Until we think we've wrung the last ounce of performance out of the code :-)

Here is my timing methodology:
I will generate a 1MB buffer of random data (the same data for everybody)
I will then call the ToBase64 function 1000 times (process 1000 MB of data)
I will calculate the MB-per-second.
I'll run it three times and list the highest score.

Here is my function that does all of that:

void CBase64Dlg::OnButton1()
{
      CWaitCursor c;  // show that the code is running

      int nSrcLen= 1024*1024;       // one true MB
      int nPasses= 1000;

      BYTE* pData=      new BYTE[ nSrcLen ];
      char* pszEncoded= new char[ nSrcLen*2 ]; // really only need * 1.5, but close enough

      //----------- generate the random binary data
      strcpy( (char*)pData, "Hi There Everybody" );  // put recognizable text at the start
      int nFirst= strlen( (char*)pData );

      srand(123);  // seed the random number generator the same for everybody
      for (int j=nFirst; j< nSrcLen; j++ ) {
            pData[j]= rand() & 0xFF;
      }

      Sleep(0);              // start on a new timeslice
      int nStartTick= GetTickCount();
      for (int k=0; k<nPasses; k++) {
            ToBase64( pData, nSrcLen, pszEncoded ); // encode 1MB of data
      }
      int nTicksTotal= GetTickCount() - nStartTick;  // duration of the test in ms (note: +- about 9ms)

      delete[] pData;
      delete[] pszEncoded;

      //--------- output the results
      CString sMsg; sMsg.Format(
            "\r\nIt took %d ms to process %d MB of data\r\n"
            "That is %.2f MB per second\r\n",
            (int) nTicksTotal,
            (int) nPasses,
            (double) nPasses/((double)nTicksTotal/1000)
      );
      m_ctlEdOutput.SetSel(32767, 32767);
      m_ctlEdOutput.ReplaceSel( sMsg );
}

=-=-=-=-==-=-=-=-=-=-=
Notes:
* The test loops enough times to make the inaccuracy of the GetTickCount() timer unimportant.
* The test is likely to be interrupted many times by other processes, but that should be
   relatively consistent.  By running it three times (and verifying *similar* results) that
   should not skew the results.
* I'm currently running on a 2.6 GHz Athlon.  All times will be taken from here.
   If you do your own timings, you should be able to obtain "relative" values (a faster algorithm
   should be faster, regardless of CPU speed).
* I'll eyeball the result output to verify that the first part is encoded correctly:
   Hi There Everybody          (source)
   SGkgVGhlcmUgRXZlcnlib2R5    (encoded)
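For anyone timing on a non-Windows box, the same methodology can be sketched portably with std::chrono in place of Sleep/GetTickCount. This is my own sketch, not Dan's harness; the encoder below handles whole 3-byte groups only and is just a stand-in for whichever candidate function you are measuring:

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <vector>

static const char kDigits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Stand-in encoder; replace with the function under test.
void ToBase64(const unsigned char* pSrc, int nSrcLen, char* pszOutBuf)
{
    for (int j = 0; j + 3 <= nSrcLen; j += 3) {
        unsigned d = (pSrc[j] << 16) | (pSrc[j + 1] << 8) | pSrc[j + 2];
        *pszOutBuf++ = kDigits[(d >> 18) & 0x3F];
        *pszOutBuf++ = kDigits[(d >> 12) & 0x3F];
        *pszOutBuf++ = kDigits[(d >> 6) & 0x3F];
        *pszOutBuf++ = kDigits[d & 0x3F];
    }
}

// Same plan as the dialog handler: random buffer, many passes, report MB/sec.
double MeasureMBPerSec(int nSrcLen, int nPasses)
{
    std::vector<unsigned char> data(nSrcLen);
    std::vector<char> out(nSrcLen * 2);      // only ~4/3 needed, but close enough
    srand(123);                              // same seed for everybody
    for (int j = 0; j < nSrcLen; j++) data[j] = rand() & 0xFF;

    auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < nPasses; k++)
        ToBase64(data.data(), nSrcLen, out.data());
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;
    return (double(nSrcLen) / (1024.0 * 1024.0)) * nPasses / secs.count();
}
```

steady_clock is monotonic, so the measurement cannot be skewed by wall-clock adjustments the way a tick counter can be by timer granularity.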
DanRollins

Timing results to date:
MB/sec    Description
--------  -----------------------------------------------------------------------------------------------------------
   58.59  Dan's C "standard" encoder  (inner loop at http:/Assembly/Q_21983447.html#17516191)
   53.42  craigwardman's ASM (rolled loops)
   63.36  craigwardman's ASM (unrolled loops)

I can also tell you that I have written an alternative C version that is running at
   97.86 MB/sec
which I'll reveal if nobody else comes up with the same technique.  I suspect that conversion to ASM will put it over 100.

Please!  Don't let these timings discourage you.  Until I've timed your function on my specific computer, there is no way to know how fast it is relative to the other submissions.  Also, there can be great benefit in comparing the merits and problems of slow functions.  It's all about the challenge and the learning experience.
DanRollins

Note: On the C functions, I set the compiler optimization to "maximum speed."  From past experience, that can have a significant advantage over "Debug (no optimization)" but ends up nearly identical to "minimum size."
DanRollins

MB/sec    Description
--------  -----------------------------------------------------------------------------------------------------------
   17.11  The Base64Coder object from http://support.microsoft.com/default.aspx?scid=kb;[LN];Q191239
  104.05  CryptBinaryToStringA (in Crypt32.dll)

The bar is raised!  A standard API function spins through at a perhaps unbeatable rate.
But I think we can beat it.
Infinity08

What's the deadline for submission, and how do we submit?
DanRollins

A submission should be source code of the one function, posted right here in this thread.  Submit when you have a working solution.  Or don't submit at all, just join the discussion and give us your ideas and thoughts.  I'm not stating a specific time limit -- I'll keep this and the ASM thread open as long as interest continues.
DanRollins

Note: I messed up the compiler optimization settings in previous time trials.  Current (corrected) results are:

MB/sec    Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
  54.42   xToBase64 / craigwardman (ASM, std loops)
  61.81   yToBase64 / craigwardman (ASM, unrolled loops)
  36.80   aToBase64 / Base64Coder http://support.microsoft.com/default.aspx?scid=kb;[LN];Q191239
 104.92   cToBase64 / CryptBinaryToString API call
 156.86   z2ToBase64 / Dan's 'standard' C code (modified, see below)
 188.78   zToBase64 / Dan's 'standard' C code
 198.13   mToBase64 / Dan's "mystery" algorithm (C code)

Note that the ASM and API function timings are consistent with previous measurements, but the C code got much faster (as expected if the compiler optimization was wrong).

It's interesting that both of the C functions are considerably faster than the ASM functions and that both of them are almost twice as fast as Microsoft's CryptoAPI code :-)  The ASM code generated by the compiler is shown in the ASM thread.

=-=-=-=-=-=-=-=-=-=
In my 'standard C' algorithm, I changed just these lines:
      b1= BYTE((d & 0x00FC0000) >> 18);
      b2= BYTE((d & 0x0003F000) >> 12);
      b3= BYTE((d & 0x00000FC0) >> 6 );
      b4= BYTE((d & 0x0000003F) >> 0 );
...to what I thought would be faster... this sequence:
      b4= BYTE(d & 0x0000003F);
      d >>= 6;
      b3= BYTE(d & 0x0000003F);
      d >>= 6;
      b2= BYTE(d & 0x0000003F);
      d >>= 6;
      b1= BYTE(d & 0x0000003F);

It turned out that the modified code was actually slower!  And not by a trivial amount... fully 17% slower!  It just goes to show that sometimes the intuitive approach is not the best/fastest.

Attempting to avoid the use of the intermediate variables b1,b2,b3,b4 I tried:
      p[3]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
      p[2]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
      p[1]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
      p[0]= abDigitLookup[ (d & 0x0000003F) ];
      p += 4;
...which gave a very slight boost.
But there was no additional gain from combining into longer statements like so:
      *p++ = abDigitLookup[ BYTE((d & 0x00FC0000) >> 18) ];
      *p++ = abDigitLookup[ BYTE((d & 0x0003F000) >> 12) ];
      *p++ = abDigitLookup[ BYTE((d & 0x00000FC0) >> 6 ) ];
      *p++ = abDigitLookup[ BYTE((d & 0x0000003F)      ) ];

-- Dan
DanRollins

grg99,
I missed your last post... sorry.

That is a significant point... as you cycle through the large source/input buffer, the CPU will continuously update the L1 cache, possibly to the detriment of ultra-fast access to the lookup table.

I wonder if there is a way to tell the CPU that a particular part of the cache (the lookup table) is more important than the data that you have already processed in the source buffer.  On the other hand, if it is watching the action (e.g., counting accesses), it could notice that the lookup table is used often and it might decide to keep it in cache (these modern CPUs are sometimes very smart about stuff like that :-)

-- Dan
Infinity08

DanRollins

Thanks for the submission!
That's a clever and interesting use of lookup tables, fer sure :-)

>> What are the L1 and L2 cache sizes of your CPU btw?

There is a sticker on my PC indicating an L2 cache of 512K.  I don't know for sure how large the L1 cache is, but a standard size for an AMD Athlon 2800 L1 cache is 64KB, so that's a good guess.  If so, then even these larger lookup tables should fit cleanly in the fastest cache.

The timing results:
MB/sec    Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
168.43  ToBase64_infinity08_standard / Infinity08 (std)
181.21  ToBase64_infinity08_lookup_b / Infinity08 (multi-table lookup, 6K of tables)
  79.59  ToBase64_infinity08_standard (no optimization)
  90.01  ToBase64_infinity08_lookup_b (no optimization)

188.78  zToBase64 / Dan's 'standard' C code
198.13  mToBase64 / Dan's "mystery" algorithm (C code)

That is, your second version is about 7% faster than your "standard" one, but still slightly slower than "Dan's std" and "Dan's mystery algorithm."

Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.

I also verified the optimization was set at "max speed" and I did a "sanity check" by re-running the other tests -- which came out similar to the previously-posted values.

=--=-=-=-=-=-=-=-=-=
It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor.  Your main loop is very tight, requiring only about 79 bytes of opcodes (compared to 93 for my "std" function).  In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables (lookup_table_b2 and lookup_table_b3).  Here is the ASM emitted by the compiler for the main loop of the ToBase64_infinity08_lookup_b function:

          >>  *pszOutBuf++ = lookup_table_b1[*pSrc];
    xor     edx,edx
    lea     edi,[esi+1]
    mov     dl,byte ptr [esi]
          >> *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
    xor     ebx,ebx
    inc     eax
    mov     dl,byte ptr lookup_table_b1 (0040766c)[edx]
    mov     byte ptr [eax-1],dl
    mov     dl,byte ptr [esi]
    mov     bl,byte ptr [edi]
    and     edx,3
    shl     edx,8
    lea     esi,[edi+1]
    inc     eax
    mov     dl,byte ptr [edx+ebx+40626Ch]
          >> *pSrc++;
          >> *pszOutBuf++ = lookup_table_b3[*pSrc & 0x0F][*(pSrc + 1)];
    xor     ebx,ebx
    mov     byte ptr [eax-1],dl
    mov     dl,byte ptr [edi]
    mov     bl,byte ptr [esi]
    and     edx,0Fh
    shl     edx,8
    inc     eax
    mov     dl,byte ptr [edx+ebx+40666Ch]
    mov     byte ptr [eax-1],dl
          >> *pSrc++;
          >> *pszOutBuf++ = lookup_table_b4[*pSrc];
    xor     edx,edx
    mov     dl,byte ptr [esi]
    inc     eax
          >> *pSrc++;
    inc     esi
    dec     ebp
    mov     dl,byte ptr lookup_table_b4 (0040776c)[edx]
    mov     byte ptr [eax-1],dl
    jne     ToBase64_infinity08_lookup_b+2Dh ; (first opcode above)
    pop     edi
    pop     ebp
    pop     ebx

=-=-=-=-=-=-=--=
Although there is out-of-sequence "obfuscation" by the compiler, a statement like:
     lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
is a lookup into a 2-d array, which means that two index values need to be calculated...

          >>  *pszOutBuf++ = lookup_table_b1[*pSrc];       // simple 1-d lookup
    xor     edx,edx
    mov     dl,byte ptr [esi]                   // get source byte
    mov     dl,byte ptr lookup_table_b1[edx]    // look it up
    mov     byte ptr [eax-1],dl                 // store the output

          >>  *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)]; ;; complex 2-d lookup
    xor     edx,edx
    mov     dl,byte ptr [esi]                   // get src byte
    and     edx,3                               // calc first part of the index
    shl     edx,8
    xor     ebx,ebx
    mov     bl,byte ptr [edi]                   // calc second part of the index
    mov     dl,byte ptr [edx+ebx+40626Ch]       // look up the value
    mov     byte ptr [eax-1],dl                 // store the output

Hoping to cut down some of the work, I tried this:  We all know that...
      char table[4][256];
      z= table[x][y];
...equates to...
      char table[1024];
      z= table[(x*256) + y];
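That equivalence is easy to verify directly; here is a standalone check (the fill pattern is an arbitrary choice of mine):

```cpp
#include <cassert>

char table2d[4][256];

// A [4][256] array and a flat pointer to its first element address the same
// bytes, so table2d[x][y] and flat[(x*256) + y] must always agree.
bool FlattenedIndexingMatches()
{
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 256; y++)
            table2d[x][y] = char((x * 31 + y) & 0x7F);   // arbitrary fill

    const char* flat = &table2d[0][0];
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 256; y++)
            if (flat[(x * 256) + y] != table2d[x][y])
                return false;
    return true;
}
```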

So I used this code:
    char* t2= &lookup_table_b2[0][0];
    char* t3= &lookup_table_b3[0][0];

    while (nSrcLen >= 3) {
          BYTE b1= *pSrc++;
          BYTE b2= *pSrc++;
          BYTE b3= *pSrc++;
          *pszOutBuf++ = lookup_table_b1[b1];
          *pszOutBuf++ = t2[((b1 & 0x03)*256)+b2];
          *pszOutBuf++ = t3[((b2 & 0x0F)*256)+b3];
          *pszOutBuf++ = lookup_table_b4[b3];
          nSrcLen -= 3;
    }

But the score came out the same.  The local variables b1,b2,b3 wind up as loads of registers (never getting saved to memory), so that shouldn't cause a slowdown.

After trying several alternatives, I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance.  Yet :-)
Infinity08

>> Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.

Even if you put it inside the timer, it shouldn't change a lot, because we're processing about 1GB of data.

The problem is that using lookup tables that way makes the algorithm very compiler dependent (memory architecture dependent). When I tested it on my laptop (Intel Pentium M 1.5 GHz - 64KB L1, 1MB L2), I got these results:

    My standard code (no optimisations): 80.92 MB/s
    My standard code (with optimisations): 95.74 MB/s
    My lookup code b (no optimisations): 84.84 MB/s
    My lookup code b (with optimisations): 313.09 MB/s

You notice the huuuge increase in performance?  My bet is that the size of my lookup tables is exactly right for my system, but not for yours. Another possibility is different compiler flags. For the record, I used these for g++:

    -Wall -ansi -fexpensive-optimizations -O3

>> It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor.

Yep, but a lot depends on the memory caching that happens. If after 1 pass, all 4 lookup tables are in the L1 cache completely, then the other 999 passes will be lightning fast. If however they get flushed out regularly, the advantage of the lookup tables disappears fast.
My code will be fast when encoding large amounts of data (like in this setup), but for smaller amounts of data, it doesn't win as much and will probably even be slower than regular algorithms.

>> In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables

I would love to eliminate those 2D tables indeed, and I've tried a bit, but everything I tried until now resulted in worse performance.

>> I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance.  Yet :-)

I'm not giving up :)
DanRollins

Wow, that is a rather dramatic difference for your _d version.
I'll see if I can find a "Genuine Intel" processor to see if that is the difference (I'm testing with an AMD).
Also, there may be a way to force the lookup tables into cache and keep them there (at least in the ASM versions).  I'll see if I can find out anything about that.

=-=-=-=-=-=-=
I love synergy... your code illustrated a possibility that I had not considered... a 256-byte table lets us avoid a mask operation on each lookup.

char chrTbl256[]=
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void z7ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
      char* p= pszOutBuf;
      DWORD d;
      for ( int j=0; j< nSrcLen; j+=3 ) {
            d = *pSrc++; d <<= 8;
            d|= *pSrc++; d <<= 8;
            d|= *pSrc++;

            p[3]= chrTbl256[ BYTE(d) ]; d >>= 6;   // note: avoids the & 0x0000003F
            p[2]= chrTbl256[ BYTE(d) ]; d >>= 6;
            p[1]= chrTbl256[ BYTE(d) ]; d >>= 6;
            p[0]= chrTbl256[ BYTE(d) ];

            p+= 4;
      }
      // fixup for final bytes (padding) not shown
}

That avoids 4 billion AND operations across the test and certainly has a measurable effect.
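The "fixup for final bytes" that z7ToBase64 omits could look something like this sketch of mine (not Dan's actual code): after the main loop consumes every full 3-byte group, a 1- or 2-byte tail still produces exactly four output characters.

```cpp
#include <cassert>

static const char kTail64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode the leftover bytes (nRem = nSrcLen % 3) and pad with '='.
// Returns the number of characters written: 0 or 4.
int Base64PadTail(const unsigned char* pSrc, int nRem, char* p)
{
    if (nRem == 1) {
        p[0] = kTail64[pSrc[0] >> 2];
        p[1] = kTail64[(pSrc[0] << 4) & 0x30];
        p[2] = '=';
        p[3] = '=';
        return 4;
    }
    if (nRem == 2) {
        p[0] = kTail64[pSrc[0] >> 2];
        p[1] = kTail64[((pSrc[0] << 4) & 0x30) | (pSrc[1] >> 4)];
        p[2] = kTail64[(pSrc[1] << 2) & 0x3C];
        p[3] = '=';
        return 4;
    }
    return 0;   // input length was a multiple of 3: nothing to do
}
```

Note that the main loop of z7ToBase64 as posted would also need its bound changed to stop before the partial group, e.g. `j+3 <= nSrcLen`, before handing the tail to a fixup like this.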

MB/sec    Internal name // Description
--------  -----------------------------------------------------------------------------------------------------------
188.78  zToBase64 / Dan's 'standard' C code
197.13  zToBase64 / Dan's 'standard' C code with 256-byte lookup table (thanks to Infinity08's idea)

I predict that in the ASM version, that very simple and obvious concept (simple and obvious once somebody shows it to you, lol :-) will have an even greater impact.  That's because in ASM, we can swap bytes around very easily (e.g., just clear EDX, then move, say, BL (source data) to DL and use EDX as the lookup index).

-- Dan
Infinity08

Ok, a small update: made the lookup tables smaller, and saved some operations in the loop:

    void ToBase64_infinity08_lookup_e(BYTE *pSrc, int nSrcLen, char *pszOutBuf) {
      BYTE *end = pSrc + nSrcLen - (nSrcLen % 3);
      while (pSrc < end) {
        *pszOutBuf++ = lookup_table_e1[*pSrc];
        *pszOutBuf++ = lookup_table_e2[(*pSrc & 0x03) | (*(pSrc + 1) & 0xF0)];
        pSrc++;
        *pszOutBuf++ = lookup_table_e3[(*pSrc & 0x0F) | (*(pSrc + 1) & 0xC0)];
        pSrc++;
        *pszOutBuf++ = lookup_table_e4[*pSrc];
        pSrc++;
      }
      if (nSrcLen %= 3) {
        *pszOutBuf++ = lookup_table_e1[*pSrc];
        if (--nSrcLen) {
          *pszOutBuf++ = lookup_table_e2[(*pSrc & 0x03) + (*(pSrc + 1) & 0xF0)];
          pSrc++;
          *pszOutBuf++ = lookup_base64[(*pSrc << 2) & 0x3C];
          *pszOutBuf++ = '=';
        }
        else {
          *pszOutBuf++ = lookup_base64[(*pSrc << 4) & 0x30];
          *pszOutBuf++ = '=';
          *pszOutBuf++ = '=';
        }
      }
      return;
    }

lookup tables are generated like this:

    char lookup_table_e1[256];
    char lookup_table_e2[256];
    char lookup_table_e3[256];
    char lookup_table_e4[256];

    void generate_lookup_table_e() {
      int i = 0;

      for (i = 0; i < 256; i++) {
        lookup_table_e1[i] = lookup_base64[(i >> 2) & 0x3F];
      }
      for (i = 0; i < 256; i++) {
        lookup_table_e2[i] = lookup_base64[((i << 4) & 0x30) | ((i >> 4) & 0x0F)];
      }
      for (i = 0; i < 256; i++) {
        lookup_table_e3[i] = lookup_base64[((i << 2) & 0x3C) | ((i >> 6) & 0x03)];
      }
      for (i = 0; i < 256; i++) {
        lookup_table_e4[i] = lookup_base64[i & 0x3F];
      }
      return;
    }
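As a sanity check (my own, not part of the submission), the four e-tables can be regenerated and compared against a plain shift-and-mask encoder for arbitrary 3-byte groups:

```cpp
#include <cassert>

static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

char e1[256], e2[256], e3[256], e4[256];

// Same generation rules as generate_lookup_table_e() above.
void GenerateETables()
{
    for (int i = 0; i < 256; i++) {
        e1[i] = lookup_base64[(i >> 2) & 0x3F];
        e2[i] = lookup_base64[((i << 4) & 0x30) | ((i >> 4) & 0x0F)];
        e3[i] = lookup_base64[((i << 2) & 0x3C) | ((i >> 6) & 0x03)];
        e4[i] = lookup_base64[i & 0x3F];
    }
}

// For one 3-byte group (b1,b2,b3), the four table lookups must reproduce the
// four digits a straightforward shift-and-mask encoder emits.
bool ETablesMatchDirect(unsigned b1, unsigned b2, unsigned b3)
{
    unsigned d = (b1 << 16) | (b2 << 8) | b3;
    return e1[b1] == lookup_base64[(d >> 18) & 0x3F]
        && e2[(b1 & 0x03) | (b2 & 0xF0)] == lookup_base64[(d >> 12) & 0x3F]
        && e3[(b2 & 0x0F) | (b3 & 0xC0)] == lookup_base64[(d >> 6) & 0x3F]
        && e4[b3] == lookup_base64[d & 0x3F];
}
```

The trick the e-tables exploit is that each output digit depends on at most 6 "interesting" bits, so those bits can be packed into a single-byte index with disjoint masks instead of a 2-d lookup.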

Just out of curiosity, what optimisation flags do you use?
Infinity08

>> Wow, that is a rather dramatic difference for your _d version.
I tested the same code on an AMD Athlon64 3700+, and didn't notice the same huge performance increase I had on the Pentium M. I think I'll do the rest of the tests on this one, as it should be closer to yours :)

>> your code illustrated a possibility that I had not considered... a 256-byte table lets us avoid a mask operation on each lookup.
Yep, a lookup table can be a very handy tool when optimizing for speed. But it only really works when there's a lot of data to process, and when you can ensure that the lookup table stays in fast memory (preferably the L1 cache) the whole time. That's the difficulty, especially since there are so many different architectures, and they all have different behaviour when it comes to caching.
DanRollins

I've set my VC++ 6.0 compiler to use /O2 (oh 2, "maximize speed"), which the book says is equivalent to:
   /Og /Oi /Ot /Oy /Ob1 /Gs /Gf /Gy
   /Og   global optimizations, automatic-register allocation, and loop optimization
   /Oi   replaces certain function calls with intrinsic (inline) forms
   /Ot   instructs the compiler to favor speed over size
   /Oy   suppresses creation of frame pointers on the call stack
   /Ob1  inline expansion of functions (irrelevant)
   /Gs   (irrelevant, has to do with stack probes)
   /Gf   puts a single copy of identical strings into the .EXE file
   /Gy   (irrelevant) packages individual functions (COMDATs)
DanRollins

MB/sec    Internal name // Description
--------  --------------------------------------------------------------------------------------
168.43  ToBase64_infinity08_standard / Infinity08 (std)
181.21  ToBase64_infinity08_lookup_b / Infinity08 (multi-table lookup, 6K of tables)

179.79  ToBase64_infinity08_lookup_e / Infinity08 (multi-table lookup, 1K of tables)

By my measurements, it's a bit slower than your previous try (within the margin of error).

It would be tremendously interesting to me to see the opcodes generated for the version that you measured as having such a high rate (for comparison with the opcodes I posted in http:#17547635).  Your other measurements were relatively proportional to mine, but that one was off the scale by comparison.  Can you post that ASM listing?  Whatever tricks your compiler is using would come in handy in the ASM version, at least.

I'd hate to jump to the conclusion that it all boils down to L1 cache-related issues if it is really something else.
Infinity08

The compiler I'm using : g++ (GCC) version 3.3.1 (MinGW32 special 20030804-1)

flags without optimization : -Wall
flags with optimization : -Wall -ansi -fexpensive-optimizations -O3

I've just listed the main loops.

NO OPTIMIZATION:
============

a) standard:
    ----------

L99:
      cmpl      $2, 12(%ebp)
      jg        L101
      jmp       L100
L101:
      movl      16(%ebp), %eax
      movl      %eax, %edx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      shrb      $2, %al
      movzbl    %al, %eax
      andl      $63, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%edx)
      leal      16(%ebp), %eax
      incl      (%eax)
      movl      16(%ebp), %eax
      movl      %eax, %ecx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      sall      $4, %eax
      movl      %eax, %edx
      andl      $48, %edx
      movl      8(%ebp), %eax
      incl      %eax
      movzbl    (%eax), %eax
      shrb      $4, %al
      movzbl    %al, %eax
      andl      $15, %eax
      orl       %edx, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%ecx)
      leal      16(%ebp), %eax
      incl      (%eax)
      incl      8(%ebp)
      movl      16(%ebp), %eax
      movl      %eax, %ecx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      sall      $2, %eax
      movl      %eax, %edx
      andl      $60, %edx
      movl      8(%ebp), %eax
      incl      %eax
      movzbl    (%eax), %eax
      shrb      $6, %al
      movzbl    %al, %eax
      andl      $3, %eax
      orl       %edx, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%ecx)
      leal      16(%ebp), %eax
      incl      (%eax)
      incl      8(%ebp)
      movl      16(%ebp), %eax
      movl      %eax, %edx
      movl      8(%ebp), %eax
      movzbl    (%eax), %eax
      andl      $63, %eax
      movzbl    _lookup_base64(%eax), %eax
      movb      %al, (%edx)
      leal      16(%ebp), %eax
      incl      (%eax)
      incl      8(%ebp)
      leal      12(%ebp), %eax
      subl      $3, (%eax)
      jmp       L99

b) multi-table lookup, 6K of tables (b):
    ---------------------------------------

L115:
      cmpl    $2, 12(%ebp)
      jg      L117
      jmp     L116
L117:
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_b1(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      andl    $3, %eax
      sall    $8, %eax
      addl    %edx, %eax
      addl    $_lookup_table_b2, %eax
      movzbl  (%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      andl    $15, %eax
      sall    $8, %eax
      addl    %edx, %eax
      addl    $_lookup_table_b3, %eax
      movzbl  (%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_b4(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      leal    12(%ebp), %eax
      subl    $3, (%eax)
      jmp     L115

c) multi-table lookup, 1K of tables (e):
    ---------------------------------------

L136:
      movl    8(%ebp), %eax
      cmpl    -8(%ebp), %eax
      jb      L138
      jmp     L137
L138:
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_e1(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movl    %eax, %edx
      andl    $3, %edx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %eax
      andl    $240, %eax
      orl     %edx, %eax
      movzbl  _lookup_table_e2(%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %ecx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movl    %eax, %edx
      andl    $15, %edx
      movl    8(%ebp), %eax
      incl    %eax
      movzbl  (%eax), %eax
      andl    $192, %eax
      orl     %edx, %eax
      movzbl  _lookup_table_e3(%eax), %eax
      movb    %al, (%ecx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      movl    16(%ebp), %eax
      movl    %eax, %edx
      movl    8(%ebp), %eax
      movzbl  (%eax), %eax
      movzbl  _lookup_table_e4(%eax), %eax
      movb    %al, (%edx)
      leal    16(%ebp), %eax
      incl    (%eax)
      incl    8(%ebp)
      jmp     L136

WITH OPTIMIZATION:
==============

a) standard:
    ----------

L171:
      movzbl  (%ecx), %eax
      subl    $3, %esi
      shrb    $2, %al
      andl    $63, %eax
      movzbl  _lookup_base64(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      sall    $4, %edx
      shrb    $4, %al
      andl    $48, %edx
      andl    $15, %eax
      orl     %eax, %edx
      movzbl  _lookup_base64(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      sall    $2, %edx
      shrb    $6, %al
      andl    $60, %edx
      andl    $3, %eax
      orl     %eax, %edx
      movzbl  _lookup_base64(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      incl    %ecx
      andl    $63, %eax
      movzbl  _lookup_base64(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      cmpl    $2, %esi
      jg      L171

b) multi-table lookup, 6K of tables (b):
    ---------------------------------------

L195:
      movzbl  (%ecx), %eax
      subl    $3, %esi
      movzbl  _lookup_table_b1(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      movzbl  1(%ecx), %edx
      incl    %ecx
      andl    $3, %eax
      sall    $8, %eax
      movzbl  _lookup_table_b2(%edx,%eax), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      movzbl  1(%ecx), %edx
      incl    %ecx
      andl    $15, %eax
      sall    $8, %eax
      movzbl  _lookup_table_b3(%edx,%eax), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      incl    %ecx
      movzbl  _lookup_table_b4(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      cmpl    $2, %esi
      jg      L195

c) multi-table lookup, 1K of tables (e):
    ---------------------------------------

L228:
      movzbl  (%ecx), %eax
      movzbl  _lookup_table_e1(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      andl    $3, %edx
      andl    $240, %eax
      orl     %eax, %edx
      movzbl  _lookup_table_e2(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %edx
      movzbl  1(%ecx), %eax
      incl    %ecx
      andl    $15, %edx
      andl    $192, %eax
      orl     %eax, %edx
      movzbl  _lookup_table_e3(%edx), %eax
      movb    %al, (%ebx)
      incl    %ebx
      movzbl  (%ecx), %eax
      incl    %ecx
      movzbl  _lookup_table_e4(%eax), %edx
      movb    %dl, (%ebx)
      incl    %ebx
      cmpl    %edi, %ecx
      jb      L228
Infinity08

>> I'd hate to jump to the conclusion that it all boils down to L1 cache-related issues if it is really something else.
Well, I've tested the exact same executable on an AMD Athlon64 and an Intel Pentium M, with the results I described earlier.
I know that the comparison is unfair, as they're completely different architectures - but on the other hand, that's exactly what's causing the different behavior imo.

I would be interested to see you try the g++ compiler with the same flags I used, to see what results you get, because there could also be a difference in optimization between Visual C++ and GCC.

These are the optimizations enabled by the -O3 flag :

  -falign-functions : Align the start of functions to the machine dependent byte boundary
  -falign-jumps : Align branch targets to the machine dependent byte boundary
  -falign-labels : Align all branch targets to the machine dependent byte boundary
  -falign-loops : Align loops to the machine dependent byte boundary
  -fcaller-saves : Enable values to be allocated in registers that will be clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls
  -fcprop-registers : Perform a copy-propagation pass to try to reduce scheduling dependencies
  -fcrossjumping : Perform cross-jumping transformation
  -fcse-follow-jumps : In common subexpression elimination, scan through jump instructions when the target of the jump is not reached by any other path
  -fcse-skip-blocks : This is similar to -fcse-follow-jumps, but causes CSE to follow jumps which conditionally skip over blocks
  -fdefer-pop : For machines which must pop arguments after a function call, the compiler normally lets arguments accumulate on the stack for several function calls and pops them all at once
  -fdelayed-branch : If supported for the target machine, attempt to reorder instructions to exploit instruction slots available after delayed branch instructions
  -fdelete-null-pointer-checks : Use global dataflow analysis to identify and eliminate useless checks for null pointers
  -fexpensive-optimizations : Perform a number of minor optimizations that are relatively expensive
  -fforce-mem : Force memory operands to be copied into registers before doing arithmetic on them
  -fgcse : Perform a global common subexpression elimination pass. This pass also performs global constant and copy propagation
  -fgcse-lm : When -fgcse-lm is enabled, global common subexpression elimination will attempt to move loads which are only killed by stores into themselves
  -fgcse-sm : When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination
  -fguess-branch-probability : Guess branch probabilities using a randomized model
  -fif-conversion : Attempt to transform conditional jumps into branch-less equivalents
  -fif-conversion2 : Use conditional execution (where available) to transform conditional jumps into branch-less equivalents
  -finline-functions : Integrate all simple functions into their callers
  -floop-optimize : Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well
  -fmerge-constants : Attempt to merge identical constants (string constants and floating point constants) across compilation units
  -foptimize-sibling-calls : Optimize sibling and tail recursive calls
  -fpeephole2 : Enable machine-specific peephole optimizations
  -fregmove : Attempt to reassign register numbers in move instructions and as operands of other simple instructions in order to maximize the amount of register tying
  -frename-registers : Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation
  -freorder-blocks : Reorder basic blocks in the compiled function in order to reduce the number of taken branches and improve code locality
  -freorder-functions : Reorder functions in the object file in order to improve code locality
  -frerun-cse-after-loop : Re-run common subexpression elimination after loop optimizations have been performed
  -frerun-loop-opt : Run the loop optimizer twice
  -fsched-interblock : Schedule instructions across basic blocks
  -fsched-spec : Allow speculative motion of non-load instructions
  -fschedule-insns : If supported for the target machine, attempt to reorder instructions to eliminate execution stalls due to required data being unavailable
  -fschedule-insns2 : Similar to -fschedule-insns, but requests an additional pass of instruction scheduling after register allocation has been done
  -fstrength-reduce : Perform the optimizations of loop strength reduction and elimination of iteration variables
  -fstrict-aliasing : Allows the compiler to assume the strictest aliasing rules applicable to the language being compiled
  -fthread-jumps : Perform optimizations where we check to see if a jump branches to a location where another comparison subsumed by the first is found
Infinity08

>> (1) Converting two bytes at once by using a 64x64 lookup table.
I've actually tried that (and some variations on it), but it creates other problems that bring down performance:
  a) the total size of the lookup tables would be 8 kB, which could be too big for general usage (or it might not);
  b) you're dealing with 2D tables, which require a lot of operations just to calculate the address;
  c) each entry in the lookup table will be two bytes instead of one, so you're not gaining a lot in that respect.
The code I tried using this "trick" was slower than my standard code.

>> (2) Keeping the conversion table in the cache by periodic uses of cache PREFETCH instructions. Or on the x64, read-specific prefetch.
Can you give a bit more info on those? They sound interesting. Although, I think, if the data to be encoded is sufficiently "random", we won't need to keep the tables in the cache this way, as they should be kept there by the encoding process itself.

>> (3) Unrolling the main loop a few times.
Certain compiler optimizations already do that. It's worth a shot, but in this particular case, I'm not sure it'll win us much.

>> (4) spinning off separate threads to take advantage of multiple cores or CPUs.
We're kind of limited by Dan's test system here :) (2.6 GHz Athlon).

>> (5) Carefully overlapping instructions to take into account the two to four instruction pipes available on Pentium and newer CPUs.
That's usually handled by the compiler (if it's doing a good job), and is very difficult to control from C (you really need assembler to do this well).
And it's system dependent, which we should avoid, imo.

>> (6) Using MMX, SSE, and other instructions that process two to eight operands at one gulp.
Worth a try, I guess. I don't know the MMX and SSE instruction sets too well though ... And again, it's hard to control using C instead of ASM.

Infinity08

It might be fast, but it's not encoding the string correctly (SA==BIAABIAABIAABIAABIAA...) :) A few things:

1) you're depending on the endianness of the machine (running it on a machine with different endianness will mess up the algorithm).
2) the masking operations in the while loop don't do what you expect them to do.
3) you're depending on the sizes of WORDs and LONGs.
4) the *pChar += 3 part will only work to a certain extent, until the byte overflows.
5) the if at the end uses pszOutBuf while that has never been incremented, so it overwrites the start of the string.

Points 1) and 3) are the exact reasons why I didn't use the tricks you did ...
grg99

BTW, all those "*pSrc++;" lines almost certainly should be "pSrc++;". You want to bump the pointer, not the thing it points to.

DanRollins

Infinity08,
Thanks for posting that ASM. Â I'll be looking at it closely soon...

grg99,
The "mystery algorithm" that I mentiond earlier uses the 8096-byte table with the two-byte lookup scheme and a 12-bit index...
Â
I'm glad you hit on it ! (I felt sure it would come up sooner or later :-)

=-=-=-=-=-=-=-=-=-=-=-=
char Lookup12bits[ ((64*64)*2)+1 ]=  // 8192+1
"BABBBCBDBEBFBGBHBIBJBKBLBMBNBOBPBQBRBSBTB ... B6B7B8B9B+B/"
...
"9A9B9C9D9E9F9G9H9I9J9K9L9M9N9O9P9Q9R9S9T9...  5969798999+9/"
"+A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+...  5+6+7+8+9+++/"
"/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/...  5/6/7/8/9/+//"
;

//------------------------- Dan's "mystery algorithm" (similar to grg99's code)
void mToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
    char* p= pszOutBuf;
    DWORD d;
    WORD* pw=     (WORD*)pszOutBuf;
    WORD* pawTbl= (WORD*)Lookup12bits;

    while ( nSrcLen >= 3 ) {
        d = *pSrc++; d <<= 8;
        d|= *pSrc++; d <<= 8;
        d|= *pSrc++;
        nSrcLen -= 3;

        *pw++ = pawTbl[ d >> 12 ];  // no need for a mask
        *pw++ = pawTbl[ d & 0x00000FFF ];
    }
    // ... code to fixup for end-of-src logic
}
=-=-=-=-=-=-=-=-=-=-=-=
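For completeness, the elided end-of-src fixup is where the equal-sign padding from the problem statement comes in. Here is one possible sketch (not the actual omitted code; it is written against the plain one-character lookup table rather than the 12-bit pair table, and base64_tail is a made-up helper name):

```c
static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Tail fixup after the main 3-byte loop: 0, 1 or 2 source bytes remain.
   One leftover byte yields two digits plus "==", two leftover bytes yield
   three digits plus "=".  Returns a pointer just past the written output. */
char *base64_tail(const unsigned char *pSrc, int nLeft, char *pOut)
{
    if (nLeft == 1) {
        *pOut++ = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        *pOut++ = lookup_base64[ (pSrc[0] << 4) & 0x30];
        *pOut++ = '=';
        *pOut++ = '=';
    } else if (nLeft == 2) {
        *pOut++ = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        *pOut++ = lookup_base64[((pSrc[0] << 4) & 0x30) | ((pSrc[1] >> 4) & 0x0F)];
        *pOut++ = lookup_base64[ (pSrc[1] << 2) & 0x3C];
        *pOut++ = '=';
    }
    *pOut = '\0';
    return pOut;
}
```

Either way, the output length stays a multiple of four, which is what the decoder relies on.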

Infinity08 is right... it's all about endianness... For this algorithm to work, the table has to be arranged with the pairs in the opposite of the expected order -- as built by generate_lookup_table_grg12() or as in my static table.

My version is running a little faster than the others (on my box, by my measure, I'm getting 208.50 MB/s), so it's the fastest of all except for grg99's...

However, I can't post an "official" timing for grg99's code because the output is not yet correct. It's sure to be fast code, but some of what's missing will require a bit of additional code to fix (for instance, the endianness of the INPUT bytes needs to be rectified, and the source pointer needs to be updated -- you are currently reading and re-reading the same bytes over and over). The required changes will affect the timing, of course.

-- Dan
DanRollins

Re: grg99's other suggestions:

(1) 8192-byte (64*64*2) lookup table
I think this will be the fastest technique short of a 1,048,576-byte (64*64*64*4) lookup table -- which seems like overkill, and due to cache thrashing, may not be faster, anyway.

(2) PREFETCH instructions.
I could try this out on the ASM side, but I'm not sure how much difference it will make. Most L1 caches are far larger than the 8192-byte lookup table, so once it's been fetched, it gets accessed constantly -- so it seems likely to stay there.
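From the C side, the experiment is cheap to set up. This is just an untimed sketch of mine, using GCC's __builtin_prefetch builtin (other compilers would need their own intrinsic); the hint here is on the *source* data one cache line ahead, since the table should already be resident:

```c
static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Standard 3-bytes-in / 4-digits-out loop, plus a software prefetch of the
   source data one cache line (64 bytes) ahead of the current position.
   __builtin_prefetch is a GCC builtin; the #if keeps the code portable. */
void to_base64_prefetch(const unsigned char *pSrc, int nSrcLen, char *pszOut)
{
    while (nSrcLen >= 3) {
#if defined(__GNUC__)
        __builtin_prefetch(pSrc + 64);   /* hint: this line is needed soon */
#endif
        *pszOut++ = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        *pszOut++ = lookup_base64[((pSrc[0] << 4) & 0x30) | ((pSrc[1] >> 4) & 0x0F)];
        *pszOut++ = lookup_base64[((pSrc[1] << 2) & 0x3C) | ((pSrc[2] >> 6) & 0x03)];
        *pszOut++ = lookup_base64[  pSrc[2]       & 0x3F];
        pSrc += 3;
        nSrcLen -= 3;
    }
    *pszOut = '\0';   /* end-of-src padding logic omitted, as elsewhere */
}
```

Whether the hint wins anything here is exactly the open question; a hardware prefetcher may already be doing the same job for a sequential read pattern.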

(3) Â Unrolling the main loop a few times.
This *could* have a significant effect. I know that processing 6 input bytes into 8 output bytes (and running the loop nSrcLen/6 times) gave me a boost in some ASM code I was trying...
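In C, that kind of 6-in/8-out unroll might be sketched like this (an illustrative sketch only, not the ASM code referred to above, and the function name is made up; the leftover 0..5 bytes still need the ordinary loop plus '=' padding):

```c
static const char lookup_base64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* One iteration consumes 6 source bytes and emits 8 digits -- the standard
   3->4 body written out twice, halving the per-iteration loop overhead. */
void to_base64_unroll2(const unsigned char *pSrc, int nSrcLen, char *pOut)
{
    while (nSrcLen >= 6) {
        pOut[0] = lookup_base64[ (pSrc[0] >> 2) & 0x3F];
        pOut[1] = lookup_base64[((pSrc[0] << 4) & 0x30) | ((pSrc[1] >> 4) & 0x0F)];
        pOut[2] = lookup_base64[((pSrc[1] << 2) & 0x3C) | ((pSrc[2] >> 6) & 0x03)];
        pOut[3] = lookup_base64[  pSrc[2]       & 0x3F];
        pOut[4] = lookup_base64[ (pSrc[3] >> 2) & 0x3F];
        pOut[5] = lookup_base64[((pSrc[3] << 4) & 0x30) | ((pSrc[4] >> 4) & 0x0F)];
        pOut[6] = lookup_base64[((pSrc[4] << 2) & 0x3C) | ((pSrc[5] >> 6) & 0x03)];
        pOut[7] = lookup_base64[  pSrc[5]       & 0x3F];
        pSrc += 6;  pOut += 8;  nSrcLen -= 6;
    }
    /* 0..5 remaining bytes: fall through to the ordinary 3->4 loop + padding */
    pOut[0] = '\0';
}
```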

(4) take advantage of multiple cores or CPUs.
I'd certainly explore this if there was a $1,000,000 prize... It would probably nearly double the throughput to have two cores working simultaneously... say one on the first half and the other on the last half.

(5) Â overlapping/pairing instructions (optimize for dual pipeline architecture)...
As Infinity08 mentioned, the compiler will probably do a pretty good job of this. I did see some modest gains in some ASM code I was playing around with. But the inner loop ends up being very tight, so there aren't a lot of opportunities... you pretty much need the result of each register as soon as you can set its value.

(6) Â Using MMX ... two to eight operands at one gulp.
My most recent ASM try (unpublished) builds an 8-byte output into an MMX register (from six bytes of input) and dumps a quadword at each pass. There is even a "non-temporal" version of the MOVQ opcode (MOVNTQ) that does not fill the cache with the output data (which never gets used again). This technique is, indeed, the fastest I've timed yet... I'll post it in the ASM thread in due time.
Infinity08

>> The "mystery algorithm" that I mentiond earlier uses the 8096-byte table with the two-byte lookup scheme and a 12-bit index...
The only problem with this kind of algorithm, is that you need to take into account machines with different endianness. That's still doable, just use different code on different machines. But it gets harder on machines that have 1byte words for example.
It was my first thought too, but I abandoned the idea for the reasons I mentioned.
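Picking the right table layout at run time only takes a trivial check, e.g. this sketch (is_little_endian is a made-up helper name):

```c
/* One common way to choose a table variant at startup: test how the machine
   lays out a multi-byte integer in memory.  Returns 1 on little-endian
   machines (the low-order byte comes first), 0 on big-endian ones. */
int is_little_endian(void)
{
    unsigned int one = 1;
    return *(unsigned char *)&one == 1;
}
```

The check runs once, so its cost is irrelevant next to the encoding loop; the remaining cost is maintaining two table-generation paths.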

>> BTW, all those "*pSrc++;" lines, they almost certainly should be "pSrc++;" You want to bump the pointer, not the thing it points to.
heh, that's a code editing problem (copy-paste). But the operator precedence still makes sure that the ++ is done before the *. So, the dereference operation is useless, but doesn't create a problem either, as it's optimized out by the compiler.
grg99

>But the operator precedence still makes sure that the ++ is done before the *.

Huh? To my knowledge, *p++ means, and has always meant, dereference p, then increment p.

But you're right, most compilers will optimize away the useless load, unless p is declared volatile.

And I was wrong, it doesn't increment what p points to. Dang language.

I will try modifying my code to handle both endians, hope that doesn't slow it down too much.
Also it would be fun to try a 3-hextet lookup table, although that's going to be wildly cache-unfriendly, all for a 50% gain at most. Oh, what fun!

Also an interesting point: if I comment out the line that stores the output data in the output array (two bytes at a time), the speed DOUBLES, which implies the code is going just about as fast as memory can accept 16-bit words of data. One possibility is to buffer up 32 bits of output at a time and see if that helps a bit.
Infinity08

>> Huh? To my knowledge, *p++ means, and has always meant, dereference p, then increment p.
I meant that the ++ has higher precedence, so *p++ is the same as *(p++). I might have formulated it a bit incorrectly by using the word "done", though :) (the "done" I used referred to the compilation, not the execution)
DanRollins

I saw the *p++ and (because it worked :-) figured the same thing: It increments the pointer, which now points to a new location, but since the code does not use the data at the pointer, the effect is the same as p++. I guess that if you want to increment the value that is AT the pointer, you need to use (*p)++ to override operator precedence.
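A tiny sketch makes the precedence behavior easy to verify:

```c
/* *p++ parses as *(p++): the pointer advances, and the dereferenced value is
   read and (here) discarded.  (*p)++ is the form that increments the pointee. */
int demo_ptr_increment(void)
{
    int a[2] = { 10, 20 };
    int *p = a;

    *p++;        /* advances p; a[0] is untouched           */
    (*p)++;      /* p now points at a[1]; increments it to 21 */

    return (p == a + 1) && a[0] == 10 && a[1] == 21;
}
```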

Other notes:
The Microsoft ATL Server includes a header file (atlenc.h) that contains a Base64 Encoder. In case you don't have Visual Studio 2003, I found the source code here:

The code does a number of things, such as inserting line breaks every 76 bytes (which is mentioned in the RFC, and might have made this a slightly different challenge, but it's trivial enough that I guess we can ignore it). But when you pare out such superfluous stuff, the Microsoft code has a main loop that looks like this:

int nLoopCnt(nSrcLen/3);

for ( int j=0; j< nLoopCnt; j++ ) {
    DWORD dwCurr(0);
    for (int n=0; n<3; n++) {
        dwCurr |= *pSrc++;
        dwCurr <<= 8;
    }
    for (int k=0; k<4; k++) {
        BYTE b = (BYTE)(dwCurr>>26);
        *pszOutBuf++ = abDigitLookup[b];
        dwCurr <<= 6;
    }
}
//--- end-of-source logic omitted

This comes in at 166 MB/s. When you unroll the loops, it goes to 188 MB/s (about the rate of "Dan's std" code).

Tell me if I'm wrong, but it looks like this will work regardless of endianness...

They used a clever technique to break out the 6-bit indexes, that interestingly enough, matches with the trick used in craigwardman's technique on the ASM side:

   1) Accumulate three source bytes, shifting to the left, 8 bits at a time
      (the first byte ends up in the top of the DWORD accumulator)
   2) The lookup index is the accumulator value, shifted right by 26 (the upper 6 bits)
      (this eliminates one masking operation)
   3) Look up the output byte and save it
   4) Move the remaining bits to the left by 6 bits
   5) ... rinse and repeat (steps 2-4, four times)

I think it's an interesting way to do it, mainly because it never would have occurred to me! In my heyday of ASM programming on the 8086 (circa 1985), a SHIFT operation took a specific number of clock cycles *plus* some cycles per bit position shifted. So a >> 26 still *looks like* an expensive operation to me :-)
DanRollins

It does seem that using a single DWORD read to grab the three bytes could be more efficient than three separate 1-BYTE reads... But the endian thing makes it a hassle. There are several easy solutions in ASM, but I can't work out an efficient way to code it in C...

// ---------- the 'standard' way to collect the three bytes
      DWORD d;
      d = *pSrc++; d <<= 8;
      d|= *pSrc++; d <<= 8;
      d|= *pSrc++;

// ---------- a try that reads a full DWORD and rearranges...
      DWORD d, e= *(DWORD*)pSrc;
      d = ((e & 0x0000FF00))
        | ((e & 0x000000FF ) << 16)
        | ((e & 0x00FF0000 ) >> 16);
      pSrc += 3;

The second version ran considerably slower than the first in my time trials.

Does anybody have any ideas on how to get a potential speedup by using full-DWORD reads without losing time rearranging the bytes for subsequent processing?
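One candidate idea (a sketch of mine, not timed on the test box): on x86 the rearrangement is exactly a byte swap, so the triple can be fetched as bswap(e) >> 8. Newer GCC versions expose __builtin_bswap32 for exactly this; the portable shift form below should compile to similar code on a good compiler. Note the assumptions: a little-endian load, and a read of one byte past the final triple, so the source buffer needs a byte of slack.

```c
#include <string.h>

/* Build d = b0<<16 | b1<<8 | b2 from one 32-bit load instead of three
   byte loads: byte-swap the word and drop the low byte.  On x86 the swap
   can be a single BSWAP instruction.  Assumes a little-endian machine and
   one readable byte of slack after the triple. */
unsigned int load_triple_le(const unsigned char *pSrc)
{
    unsigned int e;
    memcpy(&e, pSrc, 4);                        /* little-endian DWORD load */
    e = (e >> 24) | ((e >> 8) & 0x0000FF00)     /* portable 32-bit bswap    */
      | ((e << 8) & 0x00FF0000) | (e << 24);
    return e >> 8;                              /* b0<<16 | b1<<8 | b2      */
}
```

With the hex example from the top of the thread, bytes 0x07 0xFF 0x02 should come back as 0x0007FF02, ready for the 6-bit extraction.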
Infinity08

>> Tell me if I'm wrong, but it looks like this will work regardless of endianness...
It will, but really it's not much different than this (my standard code in this case - but any similar approach works too) :

      while (nSrcLen >= 3) {
        *pszOutBuf++ = lookup_base64[ (*pSrc >> 2) & 0x3F];
        *pszOutBuf++ = lookup_base64[((*pSrc << 4) & 0x30) | (((*(pSrc + 1)) >> 4) & 0x0F)];
        pSrc++;
        *pszOutBuf++ = lookup_base64[((*pSrc << 2) & 0x3C) | (((*(pSrc + 1)) >> 6) & 0x03)];
        pSrc++;
        *pszOutBuf++ = lookup_base64[ (*pSrc      ) & 0x3F];
        pSrc++;
        nSrcLen -= 3;
      }

When we purely count operations (without compiler optimizations), we get :

    lookup : 6
    assign : 4
    in/decrement : 10
    shift : 5
    and : 6
    or : 2
    array lookup : 4
    comparison : 1

And for this code :

    int nLoopCnt(nSrcLen/3);
    for ( int j=0; j< nLoopCnt; j++ ) {
      DWORD dwCurr(0);
      for (int n=0; n<3; n++) {
        dwCurr |= *pSrc++;
        dwCurr <<= 8;
      }
      for (int k=0; k<4; k++) {
        BYTE b = (BYTE)(dwCurr>>26);
        *pszOutBuf++ = abDigitLookup[b];
        dwCurr <<= 6;
      }
    }

we get :

    lookup : 3
    assign : 8
    in/decrement : 15
    shift : 11
    and : 0
    or : 3
    array lookup : 4
    comparison : 8

Comparing these two tables (I know they're not an exact measurement of performance, but they give a good idea), we can make these observations :

1) the lookups and assigns can be ignored, as the compiler will optimize them so it comes down to 3 lookups and 4 assigns anyway. (without optimization, we're looking at 10 vs. 11 lookups and assigns)

2) the increments and decrements can be easily optimized by the compiler, and are of similar magnitude anyway : 10 vs. 15

3) the number of "real" operations (shift, and, or) is about the same : 13 vs. 14

4) the array lookups are the same (4), which is to be expected, since the same lookup table is used

5) the comparisons are easily optimized by the compiler too, so they can be ignored

Which makes 3) the most important observation : both codes will take about the same time.

The thing is, if you want to keep your code endianness and word-size friendly, then you can't get much better code than the ones we already came up with, because you'll always end up with similar operation counts as I gave above.
The only real improvement I can see so far is to increase the size of the lookup table, and take advantage of that. But the problem is that the performance of code making use of that will change a lot between different architectures.

>> Does anybody have any ideas on how to get a potential speedup by using full-DWORD reads without losing time rearranging the bytes for subsequent processing?
I've been trying to look for a way since you posted this challenge, but I always end up with the same amount of ands, ors and shifts.

The only approach that would work in theory is to have a 4 * 2^24 byte lookup table, which maps 3 bytes to 4 bytes. But in practice that doesn't work, since the size of the L1 cache is too limited (I still have to find a 67 MB L1 cache heh). But just for argument's sake, this would allow code like this :

      unsigned int src = 0;
      while (nSrcLen >= 3) {
        src = *pSrc++ << 16;
        src |= *pSrc++ << 8;
        src |= *pSrc++;
        *((unsigned int*) pszOutBuf) = lookup_base64[src];
        pszOutBuf += 4;
        nSrcLen -= 3;
      }

Granted, it also depends on 4-byte unsigned ints, but it doesn't depend on endianness (as long as the lookup table is generated on the same machine), and it's as fast as you can get (minus some minor optimizations probably) IF you have a sufficiently large L1 cache ...
DanRollins

It's instructional to test the idea of using a Monster Table:

DWORD adwTbl[256*256*256];   // a 64MB lookup table!

void GenMonsterTbl() {
    BYTE* pb = (BYTE*) &adwTbl[0];
    for ( int j=0; j<256*256*256; j++ ) {
        DWORD d = j;
        *pb++ = abDigitLookup[ BYTE((d & 0x00FC0000) >> 18) ];
        *pb++ = abDigitLookup[ BYTE((d & 0x0003F000) >> 12) ];
        *pb++ = abDigitLookup[ BYTE((d & 0x00000FC0) >> 6 ) ];
        *pb++ = abDigitLookup[ BYTE((d & 0x0000003F)      ) ];
    }
}

void ToBase64_MonsterTbl( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
    DWORD* pdw = (DWORD*)pszOutBuf;
    for ( int j=0; j < nSrcLen; j+=3 ) {
        DWORD d;
        d = *pSrc++; d <<= 8;
        d |= *pSrc++; d <<= 8;
        d |= *pSrc++;
        *pdw++ = adwTbl[d];  // output four bytes
    }
}

As predicted, the speed is abysmal... around 13MB/s.

Since the code itself is so concise (a mere 42 bytes of opcodes in the loop), the difference can ONLY be memory accesses, and though it makes FAR FEWER such accesses (a third as many table lookups and a fourth as many output writes), virtually every lookup is a cache miss.  Since the accesses jump around so much, it's effectively like having no cache at all!

And please don't tell me "I told you so" ... I already knew it. Â But I'm a pragmatic type of guy and it's reassuring to see predicted results :-)

=-=-=-=-=-=-=-=-=
I'm still mystified by the ultra-fast times measured on Infinity08's "6K of tables" code... the ASM output will take some deciphering before I can understand it.  I'll take another look if I can find time later today.

-- Dan
DanRollins

OK!
I analyzed Infinity08's Xxx_b function's pseudo-ASM listing and converted it to an _asm block.
The GCC compiler used some opcode techniques that saved time... for instance, it uses
   movzx eax, (...any byte)
to avoid needing
   xor   eax, eax
   mov   al, (...any byte)
when it will need to use all of eax as an index.  Other than that, the low-level instruction streams are basically similar.

My timing results were similar to earlier:

MB/sec   Internal name // Description (on AMD Athlon 2800)
-------- --------------------------------------------------------------------------------------
180.45   ...infinity08_lookup_b / Infinity08 (6K of tables)  Compiled C
181.81   ...infinity08_lookup_b / Infinity08 (6K of tables)  My ASM conversion

208.43   mToBase64 / Dan's "mystery algorithm" (8096-byte lookup table)

HOWEVER!!!!!

I then ran the exact same executables on 2.5 GHz Celeron (genuine Intel) with these results:

MB/sec   Internal name // Description (on 2.5 GHz Celeron)
-------- --------------------------------------------------------------------------------------
306.53   ...infinity08_lookup_b / Infinity08 (6K of tables)  Compiled C
432.21   ...infinity08_lookup_b / Infinity08 (6K of tables)  My ASM conversion

453.01   mToBase64 / Dan's "mystery algorithm" (8096-byte lookup table)

The times are all significantly better on the Intel architecture, even though it's running a Celeron (which is known to have smallish caches).

The GCC-generated output (my ASM conversion) was significantly faster than the same algorithm compiled/optimized by the VC++ compiler.  I think that tells us that the GCC optimizer is smarter (at least when the output is executed on Intel).  The difference between the two (26%) is dramatic... just as Infinity08 said, and this confirms his measurements.

I really can't see enough difference in the opcodes to justify such a dramatic difference, but there it is.  The Intel architecture must be better at cache prediction and pipelining instructions -- a LOT better.

I also ran the 2-byte lookup version for comparison... it's still the fastest (on either architecture), but only by a negligible 0.5% on the Intel Celeron (1.5% on my Athlon).
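For anyone playing along at home, here's a bare-bones sketch of the 2-byte-lookup idea (a simplified reconstruction for illustration, NOT my actual mToBase64 source; padding is omitted and the names are my own): a 4096-entry table maps each 12-bit group straight to its two output characters, so every 3-byte input group costs two table hits instead of four single-digit lookups.

```cpp
#include <cstdint>
#include <cstring>

static const char b64digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// 4096 entries x 2 chars = 8 KB: each 12-bit value maps directly
// to its two base64 output characters.
static char twoCharTbl[4096][2];

static void GenTwoCharTbl() {
    for (int i = 0; i < 4096; ++i) {
        twoCharTbl[i][0] = b64digits[i >> 6];    // high 6 bits
        twoCharTbl[i][1] = b64digits[i & 0x3F];  // low 6 bits
    }
}

// Encodes whole 3-byte groups only ('=' padding left out for brevity):
// two table hits per group instead of four.
static void ToBase64_TwoCharTbl(const uint8_t* pSrc, int nSrcLen, char* pszOut) {
    while (nSrcLen >= 3) {
        uint32_t d = (pSrc[0] << 16) | (pSrc[1] << 8) | pSrc[2];
        memcpy(pszOut,     twoCharTbl[d >> 12],   2);  // top 12 bits
        memcpy(pszOut + 2, twoCharTbl[d & 0xFFF], 2);  // bottom 12 bits
        pSrc += 3; pszOut += 4; nSrcLen -= 3;
    }
    *pszOut = '\0';
}
```

Note the table is stored as char pairs rather than packed 16-bit words, so it works the same on either endianness.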

-- Dan
Infinity08

>> The times are all significantly better on the Intel architecture, even though it's running a Celeron (which is known to have smallish caches).
Well, we can't forget which CPUs we tried :

Intel : Pentium M and Pentium Celeron
AMD : Athlon XP and Athlon64

There's a serious difference in architecture between the two. But more importantly : both Intel processors are in a lower performance class than the AMD processors. It would be interesting to see how the code performs on a Pentium 4 or a Duron or something similar.

It could be that encoding algorithms that use small lookup tables have an advantage on systems in the low performance class (Pentium M, Celeron, Duron), however contradictory that may sound.
grg99

You have to watch out: some of the newer instructions like movzx cause pipeline stalls on the early Pentiums, so doing the two simpler instructions might be as fast or faster!

It's interesting that the 24-bit lookup table is so dang slow.  Shows how slow inexpensive memory really is compared to the CPU these days.

I tried some experiments with using MMX instructions to process EIGHT six-bit quantities at a time.  It only took about eight MMX instructions to process all eight, BUT that time is swamped by the SIXTEEN shifts it takes to position each six-bit quantity in a separate byte.  Sigh.
If there were an MMX instruction to pack/unpack 6-bit fields into 8-bit bytes, this would be a huge winner, but alas no, not even in the latest SSE4 or AMD extensions.

One last possible trick -- investigate non-sequential processing, such as processing all the first bytes, then the seconds, then the thirds.  That way you or the compiler could keep the mask and shift values in registers, AND position all the bits in 4 or 8 bytes with just ONE shift. Wow, gotta try this!

DanRollins

Innovative idea there, grg99... it would be "paradoxical" if taking multiple passes through the data actually saved time, but there might be some other advantage that overrides everything else.

In my MMX tests, I was able to avoid a lot of shift ops by moving back and forth between MMX registers and general registers (where BSWAP and XCHG are available).  The big win seems to be in using a non-temporal 8-byte move to output the results.  If you care to share, I'd love to see any MMX stuff you've tried in the ASM version of this thread (http:Q_21983447.html :-)
DanRollins

Â  Â https://www.experts-exchange.com//Assembly/Q_21983447.html#17711344
I wrote a version of the 8096-byte lookup algorithm that confirms the assertion that "caching is king" -- i.e., that all of the algorithms are losing time to the cache pollution that occurs when we plow through 1MB of output.

In that example, I changed just one opcode from MOVQ to MOVNTQ and saw the fastest results to date: a nearly 30% speed increase from changing that *one opcode.*  The MOVNTQ opcode tells the CPU to store the 8-byte MMX register directly to main memory -- bypassing the caches (and thus allowing other data to stay cached).

I think that it's safe to say that when processing large chunks of memory, such as streaming video or this 1MB block of input, this is an overriding concern.

It might be a more valid test of the encoding algorithms to try a variation of my timing benchmark... a variation in which a much smaller buffer (say 10K) is encoded 100,000,000 times.  In such a case, all of the input, all of the output, and all of the lookup tables would be in cache for the entire run.

But... I think this thread is running down.  I'll leave this open for a while longer to see if we can get some more submissions or even some general interest.  If nothing pops, I'll close up and award the points.

-- Dan
DanRollins

This has been an interesting and thoroughly enjoyable intellectual exercise.  I -- and I hope the other participants -- have learned a lot.  For one thing, the poor performance of the monster lookup table (cache thrashing) was enlightening.  For another, the significant speed advantage of a genuine Intel CPU (in this particular scenario) surprised me.  Infinity08's multiple-table scheme was innovative -- I'd never have thought of it :-)

We all put more into this than we would for a standard EE question, so the point awards are not true "compensation" -- the real reward here has been the participation itself.  Infinity08 posted the code that ran fastest with correct output.  grg99's contribution *showed the way* to the fastest algorithm measured here (the 8096-byte lookup table), but he did not submit a debugged version for actual timing comparison.  grg99's other contributions, including the no-table-at-all code, were interesting and his comments insightful, and I am choosing to award a share of the points to him.

For what it's worth: every submission beat all of the base64 encoders in Microsoft's APIs and other published utilities.  So that's a bonus :-)

Thanks to everyone who participated.  And whether you just "lurked" or are reading this at some time in the future, I hope that you enjoyed this as much as I did :-)

-- Dan
Infinity08

>> the real reward here has been the participation itself.
Wise words :)

>> For what it's worth: every submission beat all of the base64 encoders in Microsoft's APIs and other published utilities.  So that's a bonus :-)
Heh.

I enjoyed this discussion/challenge, Dan! Thanks for that!