DanRollins
asked on
How fast is your C/C++ Base64 Encoder?
Are you up for a challenge?
Several of us C and ASM programmers had a fun learning experience a few years ago when we tried to outdo each other by writing "The World's Fastest" binary-to-hex converter and output-formatting function. Now I'm throwing out the gauntlet again:
See if you can write a C/C++ function that encodes binary data into a
base64 output string -- and see if yours is faster than everybody else's.
For additional reference, the ASM version of this question is here: http:Q_21983447.html
A Base64 Encoder converts binary data into ASCII codes that can be transmitted/processed by systems that do not normally process raw binary... for instance, in email attachments, in HTTP headers, and as values in HTTP URLs. The encoder outputs only the following characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
plus, the output may end in zero, one, or two equal signs (=)
To encode the raw binary, you access it three bytes at a time, taking the data in groups of six bits. For instance:
Hex:       0x07      0xFF      0x02
Binary:    0000 0111   1111 1111   0000 0010    (groups of 4 bits)
Binary:    000001  111111  111100  000010       (groups of 6 bits)
decimal:        1      63      60       2
lookup:         B       /       8       C
The length of the output must always be an even multiple of four, because when you DEcode the data, you will be taking groups of four base64 "digits" and turning them back into three binary bytes of output. Therefore, when ENcoding, if your input data is not an even multiple of three, you will need to pad the end of the output with one or two equal signs (=)
If you need more info, then (as always) RTFR! (read the friggin RFC :-) http://www.ietf.org/rfc/rfc3548.txt
Also, to verify your output, you can use an Encoder at this URL:
http://www.opinionatedgeek.com/dotnet/tools/Base64Encode/Default.aspx
=-=-=-=-=-=-=-=-=-=-=
I'll be speed-testing the submissions using code like this:
//------------------------ (assume dest buf is large enough)
void ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
   // your code here!
}
I've already posted one C function here: http:/Assembly/Q_21983447.html#17516191
=-=-=-=-=-=-=-=-=-=-=
I'm betting that the ASM code will out-perform even the most optimized C code. But as we saw in the Hex-converter challenge, a clever algorithm can often win over even the best hand-optimized ASM code (of course the ASM programmers can then implement the same algorithm and shave a few milliseconds and will surely win :-)
So submit your function and I'll post timing/speed comparisons. The timings will measure thousands of passes at a large buffer of random input data, so if that changes the way you write your code, then keep it in mind.
Submit your C/C++ function and I'll measure it against everybody else's as well as the ASM versions and any standard library calls and API functions that I can find. If this works out, I'll start a second challenge for Base64 DEcoder.
Are you game? 500 points and your reputation as a hotshot C programmer are on the line :-)
-- Dan
Just a question : is this algorithm only, or can we use a lookup table ?
Hate to be a killjoy, but this is an academic exercise only-- in the real world, CPU's are tens to hundreds of times faster than any I/O device, even gigabit ethernet, so doing base64 conversion is not a significant bottleneck.
Still it's a fun challenge-- go to it!
ASKER
grg99,
It's not entirely pointless. Imagine that your server must send thousands of 4MB video clips as email attachments. The encoding *might* become a throughput bottleneck if the CPU gets maxed out by multiple threads doing various related activities. Decoding might be more of an issue -- the time spent in transit is a constant, but once it is on your computer, an efficient decoder could prevent delay between clicking the attachment and watching the video.
But, as you say, the practicality of the challenge is not the issue :-)
Infinity08,
There is likely to be at least one lookup table employed by most algorithms. In fact, it would be interesting to see if there is a way to convert *without* a lookup table. For instance, binary-to-hex conversion *could* use a simple lookup in a table like
   char table[]= "0123456789ABCDEF";
   c= table[b];       // b starts as a 4-bit nybble; c ends up as an ASCII char
but most binary-to-hex algorithms do it without a lookup; e.g.,
   if (b>9) b += 7;   // b is a 4-bit nybble
   c= b + '0';        // c is a hex digit
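Both nybble-to-hex approaches can be wrapped into tiny functions to confirm they agree (a throwaway sketch; the function names are mine):

```cpp
// The two nybble-to-hex approaches from the comment above, side by side.
static const char hexTable[] = "0123456789ABCDEF";

char HexByLookup(unsigned b) { return hexTable[b & 0x0F]; }

char HexByMath(unsigned b)
{
    b &= 0x0F;
    if (b > 9) b += 7;       // skip the ':'..'@' gap between '9' and 'A'
    return (char)(b + '0');  // 10 -> 17 -> 'A', 15 -> 22 -> 'F'
}
```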
-- Dan
>> There is likely to be at least one lookup table employed by most algorithms. In fact, it would be interesting to see if there is a way to convert *without* a lookup table.
I should have clarified: I wasn't referring to without versus with a lookup table, but whether you can use a lookup table to replace your algorithm. Just a practical question :)
ASKER
For instance, one calculation scheme:
switch (b) {                   // b is a 6-bit value (0...63)
   case 0x3e: b= '+'; break;   // special cases (not part of any sequence)
   case 0x3f: b= '/'; break;
   default:
      if (b>25) b += 6;        // not one of "ABC...XYZ"
      if (b>57) b -= 75;       // not one of "abc...xyz"; must be "012...789"
      b += 'A';
}
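Wrapped into a function, the calculation above can be sanity-checked against the plain 64-character digit string (`SixBitsToChar` is a name I just made up for the sketch):

```cpp
// The calculation scheme from the comment above, as a function.
// Maps a 6-bit value (0..63) to its Base64 digit with no lookup table.
char SixBitsToChar(int b)
{
    switch (b) {
    case 0x3e: return '+';   // special cases (not part of any run)
    case 0x3f: return '/';
    default:
        if (b > 25) b += 6;  // not "A".."Z": jump to the 'a' run
        if (b > 57) b -= 75; // not "a".."z": must be '0'..'9'
        return (char)(b + 'A');
    }
}
```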
ASKER
>>... but whether you can use a lookup table to replace your algorithm.
My reply is that a lookup operation (whatever part of the algorithm it "replaces") is not only valid, but likely to be used in the fastest algorithm. There are no particular constraints other than that the decoder needs to be fast if you want a chance to win the challenge :-)
Perfect :)
Oh, and how much time do we have ?
Ahh, you might not want to use a lookup table. If you're encoding data, that's going to be flowing thru the cache, very likely flushing out the lookup table all the time. That means it's often going to take a whole memory access time, approx 3nsec (like, 18 instructions) to get ONE table value!
Only way to tell for sure is to time the code.
ASKER
Until we think we've wrung the last ounce of performance out of the code :-)
Here is my timing methodology:
I will generate a 1MB buffer of random data (the same data for everybody)
I will then call the ToBase64 function 1000 times (process 1000 MB of data)
I will calculate the MB-per-second.
I'll run it three times and list the highest score.
Here is my function that does all of that:
void CBase64Dlg::OnButton1()
{
   CWaitCursor c;            // show that the code is running
   int nSrcLen= 1024*1024;   // one true MB
   int nPasses= 1000;
   BYTE* pData= new BYTE[ nSrcLen ];
   char* pszEncoded= new char[ nSrcLen*2 ];  // really only need *1.5, but close enough
   //----------- generate the random binary data
   strcpy( (char*)pData, "Hi There Everybody" );  // put recognizable text at the start
   int nFirst= strlen( (char*)pData );
   srand(123);               // seed the random number generator the same for everybody
   for (int j=nFirst; j < nSrcLen; j++ ) {
      pData[j]= rand() & 0xFF;
   }
   Sleep(0);                 // start on a new timeslice
   int nStartTick= GetTickCount();
   for (int k=0; k < nPasses; k++) {
      ToBase64( pData, nSrcLen, pszEncoded );  // encode 1MB of data
   }
   int nTicksTotal= GetTickCount() - nStartTick;  // test duration in ms (note: +/- about 9ms)
   delete[] pData;
   delete[] pszEncoded;
   //--------- output the results
   CString sMsg; sMsg.Format(
      "\r\nIt took %d ms to process %d MB of data\r\n"
      "That is %.2f MB per second\r\n",
      (int) nTicksTotal,
      (int) nPasses,
      (double) nPasses / ((double)nTicksTotal/1000)
   );
   m_ctlEdOutput.SetSel( 32767, 32767 );
   m_ctlEdOutput.ReplaceSel( sMsg );
}
=-=-=-=-==-=-=-=-=-=-=
Notes:
* The test loops enough times to make the inaccuracy of the GetTickCount() timer unimportant.
* The test is likely to be interrupted many times by other processes, but that should be
relatively consistent. By running it three times (and verifying *similar* results), that
should not skew the results.
* I'm currently running on a 2.6GHz Athlon. All times will be taken from here.
If you do your own timings, you should still obtain meaningful *relative* values (a faster
algorithm should be faster regardless of CPU speed).
* I'll eyeball the result output to verify that the first part is encoded correctly:
Hi There Everybody (source)
SGkgVGhlcmUgRXZlcnlib2R5 (encoded)
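For anyone timing on a non-Windows compiler, the same methodology can be approximated with std::chrono in place of GetTickCount()/Sleep(). A sketch, with buffer size and pass count as parameters so you can scale down (`MeasureMBPerSec` and `EncodeFn` are my own names, not part of the harness above):

```cpp
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <vector>

// Portable approximation of the timing harness above.
// EncodeFn stands in for any ToBase64-style entry point.
typedef void (*EncodeFn)(const unsigned char*, int, char*);

double MeasureMBPerSec(EncodeFn fn, int nSrcLen, int nPasses)
{
    std::vector<unsigned char> src(nSrcLen);
    std::vector<char> dst(nSrcLen * 2);        // ~1.34x needed; 2x is safe
    std::strcpy((char*)&src[0], "Hi There Everybody");
    std::srand(123);                           // same seed for everybody
    for (int j = (int)std::strlen((char*)&src[0]); j < nSrcLen; ++j)
        src[j] = (unsigned char)(std::rand() & 0xFF);

    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < nPasses; ++k)
        fn(&src[0], nSrcLen, &dst[0]);
    std::chrono::duration<double> sec = std::chrono::steady_clock::now() - t0;

    double mb = (double)nPasses * nSrcLen / (1024.0 * 1024.0);
    return mb / sec.count();
}
```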
ASKER
Timing results to date:
MB/sec    Description
--------  --------------------------
 58.59    Dan's C "standard" encoder (inner loop at http:/Assembly/Q_21983447.html#17516191 )
 53.42    craigwardman's ASM (rolled loops)
 63.36    craigwardman's ASM (unrolled loops)
I can also tell you that I have written an alternative C version that is running at
97.86 MB/sec
which I'll reveal if nobody else comes up with the same technique. I suspect that conversion to ASM will put it over 100.
Please! Don't let these timings discourage you. Until I've timed your function on my specific computer, there is no way to know how fast it is relative to the other submissions. Also, there can be great benefit in comparing merits and problems with slow functions. It's all about the challenge and the learning experience.
ASKER
Note: On the C functions, I set the compiler optimization on "maximum speed." From past experience, that can have a significant advantage over "Debug (no optimization)" but ends up nearly identical to "minimum size."
ASKER
Some additional results:
MB/sec    Description
--------  --------------------------
 17.11    The Base64Coder object from http://support.microsoft.com/default.aspx?scid=kb;[LN];Q191239
104.05    CryptBinaryToStringA (in Crypt32.dll)
The bar is raised! A standard API function spins through at perhaps unbeatable rates.
But I think we can beat it.
What's the deadline for submission, and how do we submit ?
ASKER
A submission should be source code of the one function, posted right here in this thread. Submit when you have a working solution. Or don't submit at all, just join the discussion and give us your ideas and thoughts. I'm not stating a specific time limit -- I'll keep this and the ASM thread open as long as interest continues.
ASKER
Note: I messed up the compiler optimization settings in previous time trials. Current (corrected) results are:
MB/sec Internal name // Description
--------  --------------------------
54.42 xToBase64 / craigwardman (ASM, std loops)
61.81 yToBase64 / craigwardman (ASM, unrolled loops)
36.80 aToBase64 / Base64Coder http://support.microsoft.com/default.aspx?scid=kb;[LN];Q191239
104.92 cToBase64 / CryptBinaryToString API call
156.86 z2ToBase64 / Dan's 'standard' C code (modified, see below)
188.78 zToBase64 / Dan's 'standard' C code
198.13 mToBase64 / Dan's "mystery" algorithm (C code)
Note that the ASM and API function timings are consistent with previous measurements, but the C code got much faster (as expected, given that the compiler optimization had been wrong).
It's interesting that both of the C functions are considerably faster than the ASM functions and that both of them are almost twice as fast as Microsoft's CryptoAPI code :-) The ASM code generated by the compiler is shown in the ASM thread.
=-=-=-=-=-=-=-=-=-=
In my 'standard C' algorithm, I changed just these lines:
b1= BYTE((d & 0x00FC0000) >> 18);
b2= BYTE((d & 0x0003F000) >> 12);
b3= BYTE((d & 0x00000FC0) >> 6 );
b4= BYTE((d & 0x0000003F) >> 0 );
...to what I thought would be faster... this sequence:
b4= BYTE(d & 0x0000003F);
d >>= 6;
b3= BYTE(d & 0x0000003F);
d >>= 6;
b2= BYTE(d & 0x0000003F);
d >>= 6;
b1= BYTE(d & 0x0000003F);
It turned out that the modified code was actually slower! And not by a trivial amount... fully 17% slower! It just goes to show that the intuitive approach is sometimes not the best/fastest.
Attempting to avoid the use of the intermediate variables b1,b2,b3,b4 I tried:
p[3]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
p[2]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
p[1]= abDigitLookup[ (d & 0x0000003F) ]; d >>= 6;
p[0]= abDigitLookup[ (d & 0x0000003F) ];
p += 4;
...which gave a very slight boost.
But there was no additional gain for combining into longer statements like so:
*p++ = abDigitLookup[ BYTE((d & 0x00FC0000) >> 18) ];
*p++ = abDigitLookup[ BYTE((d & 0x0003F000) >> 12) ];
*p++ = abDigitLookup[ BYTE((d & 0x00000FC0) >> 6 ) ];
*p++ = abDigitLookup[ BYTE((d & 0x0000003F) ) ];
-- Dan
ASKER
grg99,
I missed your last post... sorry.
That is a significant point... as you cycle through the large source/input buffer, the CPU will continuously update the L1 cache, possibly to the detriment of ultra-fast access to the lookup table.
I wonder if there is a way to tell the CPU that a particular part of the cache (the lookup table) is more important than the data that you have already processed in the source buffer. On the other hand, if it is watching the action (e.g., counting accesses), it could notice that the lookup table is used often and decide to keep it in cache (these modern CPUs are sometimes very smart about stuff like that :-)
-- Dan
ASKER CERTIFIED SOLUTION
ASKER
Thanks for the submission!
That's a clever and interesting use of lookup tables, fer sure :-)
>> What are the L1 and L2 cache sizes of your CPU btw ?
There is a sticker on my PC indicating an L2 cache of 512K. I don't know for sure how large the L1 cache is, but a standard size for an AMD Athlon 2800 L1 cache is 64KB, so that's a good guess. If so, then even these larger lookup tables should fit cleanly in the fastest cache.
The timing results:
MB/sec Internal name // Description
--------  --------------------------
168.43    ToBase64_infinity08_standard / Infinity08 (std)
181.21    ToBase64_infinity08_lookup_b / Infinity08 (multi-table lookup, 6K of tables)
 79.59    ToBase64_infinity08_standard (no optimization)
 90.01    ToBase64_infinity08_lookup_b (no optimization)
188.78 zToBase64 / Dan's 'standard' C code
198.13 mToBase64 / Dan's "mystery" algorithm (C code)
That is, your second version is about 7% faster than your "standard" one, but still slightly slower than "Dan's std" and "Dan's mystery algorithm."
Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.
I also verified the optimization was set at "max speed" and did a "sanity check" by re-running the other tests -- which came out similar to the previously-posted values.
=--=-=-=-=-=-=-=-=-=
It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor. Your main loop is very tight, requiring only about 79 bytes of opcodes (compared to 93 for my "std" function). In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables (lookup_table_b2 and lookup_table_b3). Here is the ASM emitted by the compiler for the main loop of the ToBase64_infinity08_lookup_b function:
>> *pszOutBuf++ = lookup_table_b1[*pSrc];
xor edx,edx
lea edi,[esi+1]
mov dl,byte ptr [esi]
>> *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
xor ebx,ebx
inc eax
mov dl,byte ptr lookup_table_b1 (0040766c)[edx]
mov byte ptr [eax-1],dl
mov dl,byte ptr [esi]
mov bl,byte ptr [edi]
and edx,3
shl edx,8
lea esi,[edi+1]
inc eax
mov dl,byte ptr [edx+ebx+40626Ch]
>> *pSrc++;
>> *pszOutBuf++ = lookup_table_b3[*pSrc & 0x0F][*(pSrc + 1)];
xor ebx,ebx
mov byte ptr [eax-1],dl
mov dl,byte ptr [edi]
mov bl,byte ptr [esi]
and edx,0Fh
shl edx,8
inc eax
mov dl,byte ptr [edx+ebx+40666Ch]
mov byte ptr [eax-1],dl
>> *pSrc++;
>> *pszOutBuf++ = lookup_table_b4[*pSrc];
xor edx,edx
mov dl,byte ptr [esi]
inc eax
>> *pSrc++;
inc esi
dec ebp
mov dl,byte ptr lookup_table_b4 (0040776c)[edx]
mov byte ptr [eax-1],dl
jne ToBase64_infinity08_lookup_b+2Dh ; (first opcode above)
pop edi
pop ebp
pop ebx
=-=-=-=-=-=-=--=
Although there is out-of-sequence "obfuscation" by the compiler, a statement like:
lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
is a lookup into a 2-d array, which means that two index values need to be calculated...
>> *pszOutBuf++ = lookup_table_b1[*pSrc]; // simple 1-d lookup
xor edx,edx
mov dl,byte ptr [esi] // get source byte
mov dl,byte ptr lookup_table_b1[edx] // look it up
mov byte ptr [eax-1],dl // store the output
>> *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)]; ;; complex 2-d lookup
xor edx,edx
mov dl,byte ptr [esi] // get src byte
and edx,3 // calc first part of the index
shl edx,8
xor ebx,ebx
mov bl,byte ptr [edi] // calc second part of the index
mov dl,byte ptr [edx+ebx+40626Ch] // look up the value
mov byte ptr [eax-1],dl // store the output
Hoping to cut down some of the work, I tried this: We all know that...
   char table[4][256];
   z= table[x][y];
...equates to...
   char table[1024];
   z= table[(x*256) + y];
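That equivalence is easy to demonstrate directly (a throwaway sketch; C guarantees a 2-D array is laid out row by row, so `table[x][y]` lives at flat offset `x*256 + y`):

```cpp
// A 2-D char array and its flattened view address the same bytes.
char table2d[4][256];

char LookupFlat(int x, int y)
{
    const char* flat = &table2d[0][0];   // first byte of the whole array
    return flat[(x * 256) + y];          // identical to table2d[x][y]
}
```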
So I used this code:
   char* t2= &lookup_table_b2[0][0];
   char* t3= &lookup_table_b3[0][0];
   while (nSrcLen >= 3) {
      BYTE b1= *pSrc++;
      BYTE b2= *pSrc++;
      BYTE b3= *pSrc++;
      *pszOutBuf++ = lookup_table_b1[b1];
      *pszOutBuf++ = t2[((b1 & 0x03)*256) + b2];
      *pszOutBuf++ = t3[((b2 & 0x0F)*256) + b3];
      *pszOutBuf++ = lookup_table_b4[b3];
      nSrcLen -= 3;
   }
But the score came out the same. The local variables b1,b2,b3 wind up in registers (never getting saved to memory), so they shouldn't cause a slowdown.
After trying several alternatives, I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance. Yet :-)
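For reference, here is one plausible way the four tables could be populated, reconstructed purely from how they are indexed in the loop above -- so this is my reading of the scheme, not necessarily Infinity08's actual init code. Each table answers "given the bits I can see, what is the output digit?": b1 uses the top 6 bits of byte 1, b2 combines the low 2 bits of byte 1 with the top 4 of byte 2, b3 combines the low 4 bits of byte 2 with the top 2 of byte 3, and b4 uses the low 6 bits of byte 3 (256 + 1024 + 4096 + 256 = 5632 bytes, the "6K of tables" above):

```cpp
// Plausible initialization for the four-table lookup scheme,
// reconstructed from the indexing in the encoding loop.
static const char kDigits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

char lookup_table_b1[256];     // indexed by byte 1
char lookup_table_b2[4][256];  // indexed by [byte1 & 0x03][byte 2]
char lookup_table_b3[16][256]; // indexed by [byte2 & 0x0F][byte 3]
char lookup_table_b4[256];     // indexed by byte 3

void InitBase64Tables()
{
    for (int i = 0; i < 256; ++i) {
        lookup_table_b1[i] = kDigits[i >> 2];    // top 6 bits of byte 1
        lookup_table_b4[i] = kDigits[i & 0x3F];  // low 6 bits of byte 3
        for (int hi = 0; hi < 4; ++hi)           // low 2 of b1 + top 4 of b2
            lookup_table_b2[hi][i] = kDigits[(hi << 4) | (i >> 4)];
        for (int hi = 0; hi < 16; ++hi)          // low 4 of b2 + top 2 of b3
            lookup_table_b3[hi][i] = kDigits[(hi << 2) | (i >> 6)];
    }
}
```

With tables built this way, the 0x07 0xFF 0x02 example from the top of the thread again comes out as "B/8C".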
That's a clever and interesting use of lookup tables, fer sure :-)
>> What are the L1 and L2 cache sizes of your CPU btw ?
There is a sticker on my PC indicating an L2 cache of 512K. I don't know for sure how large the L1 cache is, but a standard size for an AMD Athlon 2800 L1 cache is 64KB, so that's a good guess. If so, then even these larger lookup tables should fit cleanly in the fastest cache.
The timing results:
MB/sec Internal name // Description
-------- --------------------------
168.43 ToBase64_infinity08_standa
181.21 ToBase64_infinity08_lookup
79.59 ToBase64_infinity08_standa
90.01 ToBase64_infinity08_lookup
188.78 zToBase64 / Dan's 'standard' C code
198.13 mToBase64 / Dan's "mystery" algorithm (C code)
That is, your second version is about 7% faster than your "standard" one, but still slightly slower than "Dan's std" and "Dan's mystery algorithm."
Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.
I also verified the optimization was set at "max speed" and I did a "sanity check" by re-running the other tests -- which came out similar to the previously-posted values
=--=-=-=-=-=-=-=-=-=
It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor. Your main loop is very tight, requiring only about 79 bytes of opcodes (compared to 93 for my "std" function). In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables (lookup_table_b2 and lookup_table_b3). Here is the ASM emitted by the compiler for the main loop of the ToBase64_infinity08_lookup
>> *pszOutBuf++ = lookup_table_b1[*pSrc];
xor edx,edx
lea edi,[esi+1]
mov dl,byte ptr [esi]
>> *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
xor ebx,ebx
inc eax
mov dl,byte ptr lookup_table_b1 (0040766c)[edx]
mov byte ptr [eax-1],dl
mov dl,byte ptr [esi]
mov bl,byte ptr [edi]
and edx,3
shl edx,8
lea esi,[edi+1]
inc eax
mov dl,byte ptr [edx+ebx+40626Ch]
>> *pSrc++;
>> *pszOutBuf++ = lookup_table_b3[*pSrc & 0x0F][*(pSrc + 1)];
xor ebx,ebx
mov byte ptr [eax-1],dl
mov dl,byte ptr [edi]
mov bl,byte ptr [esi]
and edx,0Fh
shl edx,8
inc eax
mov dl,byte ptr [edx+ebx+40666Ch]
mov byte ptr [eax-1],dl
>> *pSrc++;
>> *pszOutBuf++ = lookup_table_b4[*pSrc];
xor edx,edx
mov dl,byte ptr [esi]
inc eax
>> *pSrc++;
inc esi
dec ebp
mov dl,byte ptr lookup_table_b4 (0040776c)[edx]
mov byte ptr [eax-1],dl
jne ToBase64_infinity08_lookup
pop edi
pop ebp
pop ebx
=-=-=-=-=-=-=--=
Although there is out-of-sequence "obfuscation" by the compiler, a statement like:
lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)];
is a lookup into a 2-d array, which means that two index values need to be calculated...
>> *pszOutBuf++ = lookup_table_b1[*pSrc]; // simple 1-d lookup
xor edx,edx
mov dl,byte ptr [esi] // get source byte
mov dl,byte ptr lookup_table_b1[edx] // look it up
mov byte ptr [eax-1],dl // store the output
>> *pszOutBuf++ = lookup_table_b2[*pSrc & 0x03][*(pSrc + 1)]; ;; complex 2-d lookup
xor edx,edx
mov dl,byte ptr [esi] // get src byte
and edx,3 // calc first part of the index
shl edx,8
xor ebx,ebx
mov bl,byte ptr [edi] // calc second part of the index
mov dl,byte ptr [edx+ebx+40626Ch] // look up the value
mov byte ptr [eax-1],dl // store the output
Hoping to cut down some of the work, I tried this: We all know that...
char table[4][256];
z= table[x][y]
...equates to...
char table[1024];
z= table[(x*256) + y]
So I used this code:
char* t2= &lookup_table_b2[0][0];
char* t3= &lookup_table_b3[0][0];
while (nSrcLen >= 3) {
BYTE b1= *pSrc++;
BYTE b2= *pSrc++;
BYTE b3= *pSrc++;
*pszOutBuf++ = lookup_table_b1[b1];
*pszOutBuf++ = t2[((b1 & 0x03)*256)+b2];
*pszOutBuf++ = t3[((b2 & 0x0F)*256)+b3];
*pszOutBuf++ = lookup_table_b4[b3];
nSrcLen -= 3;
}
But the score came out the same. The local variables b1, b2, and b3 wind up as register loads (never getting saved to memory), so that shouldn't cause a slowdown.
After trying several alternatives, I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance. Yet :-)
>> Before you ask... Yes, I was very careful to populate the lookup tables outside of the timer -- I even tried doing it in InitDialog() just to be certain -- and the throughput was no different than if done just before starting the clock.
Even if you put it inside the timer, it shouldn't change a lot, because we're processing about 1GB of data.
>> That is, your second version is about 7% faster than your "standard" one
The problem is that using lookup tables that way makes the algorithm very compiler dependent (memory architecture dependent). When I tested it on my laptop (Intel Pentium M 1.5GHz - 64kB L1, 1MB L2), I got these results:
My standard code (no optimisations): 80.92 MB/s
My standard code (with optimisations): 95.74 MB/s
My lookup code b (no optimisations): 84.84 MB/s
My lookup code b (with optimisations): 313.09 MB/s
You notice the huuuge increase in performance? My bet is that the size of my lookup tables is exactly right for my system, but not for yours. Another possibility is different compiler flags. For the record, I used these for g++:
-Wall -ansi -fexpensive-optimizations -O3
>> It seems to me that using the more extensive lookup tables *ought* to increase the speed by a larger factor.
Yep, but a lot depends on the memory caching that happens. If after 1 pass, all 4 lookup tables are in the L1 cache completely, then the other 999 passes will be lightning fast. If however they get flushed out regularly, the advantage of the lookup tables disappears fast.
My code will be fast when encoding large amounts of data (like in this setup), but for smaller amounts of data, it doesn't win as much and will probably even be slower than regular algorithms.
>> In trying to analyse the code, I think the bottleneck is that there is still a lot of index calculating going on -- primarily when determining the index values into the 2-d tables
I would love to eliminate those 2D tables indeed, and I've tried a bit, but everything I tried until now resulted in worse performance.
>> I have to surmise that though this particular lookup scheme improves upon a standard 64-byte table lookup, we are not seeing maximal performance. Yet :-)
I'm not giving up :)
ASKER
Wow, that is a rather dramatic difference for your _d version.
I'll see if I can find a "Genuine Intel" processor to see if that is the difference (I'm testing with an AMD).
Also, there may be a way to force the lookup tables into cache and keep them there (at least in the ASM versions). I'll see if I can find out anything about that.
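One portable approximation of "forcing" the tables into cache (short of ASM prefetch instructions) is simply to touch one byte per cache line before starting the clock. A hedged sketch -- `WarmTable` is a hypothetical helper, not code from the thread, and the 64-byte line size is an assumption:

```cpp
#include <cstddef>

// Touch one byte per (assumed) 64-byte cache line so the table is
// resident before the timed loop starts.  The volatile read keeps the
// compiler from deleting the loop; returning the sum also defeats
// dead-code elimination at the call site.
static unsigned WarmTable( const char* table, size_t size )
{
    unsigned sink = 0;
    for ( size_t i = 0; i < size; i += 64 )
        sink += (unsigned char)*(volatile const char*)( table + i );
    return sink;
}
```

Calling `WarmTable(&lookup_table_b2[0][0], 1024)` (and likewise for the other tables) just before the encode loop would pre-load the lines; whether they then *stay* in L1 during the run is still up to the CPU's replacement policy.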
=-=-=-=-=-=-=
I love synergy... your code illustrated a possibility that I had not considered... a 256-byte table lets us avoid a mask operation on each lookup.
char chrTbl256[]=
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" ;
void z7ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
char* p= pszOutBuf;
DWORD d;
for ( int j=0; j< nSrcLen; j+=3 ) {
d = *pSrc++; d <<= 8;
d|= *pSrc++; d <<= 8;
d|= *pSrc++;
p[3]= chrTbl256[ BYTE(d) ]; d >>= 6; // note: avoids the & 0x0000003F
p[2]= chrTbl256[ BYTE(d) ]; d >>= 6;
p[1]= chrTbl256[ BYTE(d) ]; d >>= 6;
p[0]= chrTbl256[ BYTE(d) ];
p+= 4;
}
// fixup for final bytes (padding) not shown
}
That avoids 4 billion AND operations across the test and certainly has a measurable effect.
MB/sec Internal name // Description
--------  --------------------------
188.78 zToBase64 / Dan's 'standard' C code
197.13 zToBase64 / Dan's 'standard' C code with 256-byte lookup table (thanks to Infinity08's idea)
I predict that in the ASM version, that very simple and obvious concept (simple and obvious, once somebody shows it to you, lol :-) will have an even greater impact. That's because in ASM, we can swap bytes around very easily (eg, just clear EDX, then move, say BL (source data) to DL and use EDX as the lookup index).
-- Dan
Ok, a small update : made the lookup tables smaller, and saved some operations in the loop :
void ToBase64_infinity08_lookup_e(BYTE *pSrc, int nSrcLen, char *pszOutBuf) {
BYTE *end = pSrc + nSrcLen - (nSrcLen % 3);
while (pSrc < end) {
*pszOutBuf++ = lookup_table_e1[*pSrc];
*pszOutBuf++ = lookup_table_e2[(*pSrc & 0x03) | (*(pSrc + 1) & 0xF0)];
pSrc++;
*pszOutBuf++ = lookup_table_e3[(*pSrc & 0x0F) | (*(pSrc + 1) & 0xC0)];
pSrc++;
*pszOutBuf++ = lookup_table_e4[*pSrc];
pSrc++;
}
if (nSrcLen %= 3) {
*pszOutBuf++ = lookup_table_e1[*pSrc];
if (--nSrcLen) {
*pszOutBuf++ = lookup_table_e2[(*pSrc & 0x03) + (*(pSrc + 1) & 0xF0)];
pSrc++;
*pszOutBuf++ = lookup_base64[(*pSrc << 2) & 0x3C];
*pszOutBuf++ = '=';
}
else {
*pszOutBuf++ = lookup_base64[(*pSrc << 4) & 0x30];
*pszOutBuf++ = '=';
*pszOutBuf++ = '=';
}
}
return;
}
The lookup tables are generated like this:
char lookup_table_e1[256];
char lookup_table_e2[256];
char lookup_table_e3[256];
char lookup_table_e4[256];
void generate_lookup_table_e() {
int i = 0;
for (i = 0; i < 256; i++) {
lookup_table_e1[i] = lookup_base64[(i >> 2) & 0x3F];
}
for (i = 0; i < 256; i++) {
lookup_table_e2[i] = lookup_base64[((i << 4) & 0x30) | ((i >> 4) & 0x0F)];
}
for (i = 0; i < 256; i++) {
lookup_table_e3[i] = lookup_base64[((i << 2) & 0x3C) | ((i >> 6) & 0x03)];
}
for (i = 0; i < 256; i++) {
lookup_table_e4[i] = lookup_base64[i & 0x3F];
}
return;
}
Just out of curiosity, what optimisation flags do you use?
>> Wow, that is a rather dramatic difference for your _d version.
I tested the same code on an AMD Athlon64 3700+, and didn't notice the same huge performance increase I had on the Pentium M. I think I'll do the rest of the tests on this one, as it should be closer to yours :)
>> your code illustrated a possibility that I had not considered... a 256-byte table lets us avoid a mask operation on each lookup.
Yep, a lookup table can be a very handy tool when optimizing for speed. But it only really works when there's a lot of data to process, and when you can ensure that the lookup table stays in fast memory (preferably the L1 cache) the whole time. That's the difficulty, especially since there are so many different architectures, and they all have a different behaviour when it comes to caching.
ASKER
I've set my VC++ 6.0 compiler to use /O2 (oh 2, "maximize speed") which the book says is equivalent to:
/Og /Oi /Ot /Oy /Ob1 /Gs /Gf /Gy
/Og global optimizations, automatic-register allocation, and loop optimization
/Oi generates intrinsic forms of certain library functions (mostly irrelevant here)
/Ot instructs the compiler to favor speed over size.
/Oy suppresses creation of frame pointers on the call stack
/Ob1 inline expansion of functions (irrelevant)
/Gs (irrelevant, has to do with stack probes)
/Gf puts single copy of identical strings into the .EXE file
/Gy (irrelevant) package individual functions (COMDATs)
ASKER
MB/sec Internal name // Description
--------  --------------------------
168.43  ToBase64_infinity08_standard / Infinity08 (std)
181.21  ToBase64_infinity08_lookup_b / Infinity08 (multi-table lookup, 6K of tables)
179.79  ToBase64_infinity08_lookup_e / Infinity08 (multi-table lookup, 1K of tables)
By my measurements, it's a bit slower than your previous try (within the margin of error).
It would be tremendously interesting to me to see the opcodes generated for the version that you measured as having such a high rate (for comparison with the opcodes I posted in http:#17547635). Your other measurements were relatively proportional to mine, but that one was off the scale by comparison. Can you post that ASM listing? Whatever tricks your compiler is using would come in handy in the ASM version at least.
I'd hate to jump to the conclusion that it all boils down to L1 cache-related issues if it is really something else.
The compiler I'm using: g++ (GCC) version 3.3.1 (MinGW32 special 20030804-1)
flags without optimization: -Wall
flags with optimization: -Wall -ansi -fexpensive-optimizations -O3
I've just listed the main loops.
NO OPTIMIZATION:
============
a) standard:
----------
L99:
cmpl $2, 12(%ebp)
jg L101
jmp L100
L101:
movl 16(%ebp), %eax
movl %eax, %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
shrb $2, %al
movzbl %al, %eax
andl $63, %eax
movzbl _lookup_base64(%eax), %eax
movb %al, (%edx)
leal 16(%ebp), %eax
incl (%eax)
movl 16(%ebp), %eax
movl %eax, %ecx
movl 8(%ebp), %eax
movzbl (%eax), %eax
sall $4, %eax
movl %eax, %edx
andl $48, %edx
movl 8(%ebp), %eax
incl %eax
movzbl (%eax), %eax
shrb $4, %al
movzbl %al, %eax
andl $15, %eax
orl %edx, %eax
movzbl _lookup_base64(%eax), %eax
movb %al, (%ecx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
movl 16(%ebp), %eax
movl %eax, %ecx
movl 8(%ebp), %eax
movzbl (%eax), %eax
sall $2, %eax
movl %eax, %edx
andl $60, %edx
movl 8(%ebp), %eax
incl %eax
movzbl (%eax), %eax
shrb $6, %al
movzbl %al, %eax
andl $3, %eax
orl %edx, %eax
movzbl _lookup_base64(%eax), %eax
movb %al, (%ecx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
movl 16(%ebp), %eax
movl %eax, %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
andl $63, %eax
movzbl _lookup_base64(%eax), %eax
movb %al, (%edx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
leal 12(%ebp), %eax
subl $3, (%eax)
jmp L99
b) multi-table lookup, 6K of tables (b):
----------------------------------------
L115:
cmpl $2, 12(%ebp)
jg L117
jmp L116
L117:
movl 16(%ebp), %eax
movl %eax, %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
movzbl _lookup_table_b1(%eax), %eax
movb %al, (%edx)
leal 16(%ebp), %eax
incl (%eax)
movl 16(%ebp), %eax
movl %eax, %ecx
movl 8(%ebp), %eax
incl %eax
movzbl (%eax), %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
andl $3, %eax
sall $8, %eax
addl %edx, %eax
addl $_lookup_table_b2, %eax
movzbl (%eax), %eax
movb %al, (%ecx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
movl 16(%ebp), %eax
movl %eax, %ecx
movl 8(%ebp), %eax
incl %eax
movzbl (%eax), %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
andl $15, %eax
sall $8, %eax
addl %edx, %eax
addl $_lookup_table_b3, %eax
movzbl (%eax), %eax
movb %al, (%ecx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
movl 16(%ebp), %eax
movl %eax, %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
movzbl _lookup_table_b4(%eax), %eax
movb %al, (%edx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
leal 12(%ebp), %eax
subl $3, (%eax)
jmp L115
c) multi-table lookup, 1K of tables (e):
----------------------------------------
L136:
movl 8(%ebp), %eax
cmpl -8(%ebp), %eax
jb L138
jmp L137
L138:
movl 16(%ebp), %eax
movl %eax, %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
movzbl _lookup_table_e1(%eax), %eax
movb %al, (%edx)
leal 16(%ebp), %eax
incl (%eax)
movl 16(%ebp), %eax
movl %eax, %ecx
movl 8(%ebp), %eax
movzbl (%eax), %eax
movl %eax, %edx
andl $3, %edx
movl 8(%ebp), %eax
incl %eax
movzbl (%eax), %eax
andl $240, %eax
orl %edx, %eax
movzbl _lookup_table_e2(%eax), %eax
movb %al, (%ecx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
movl 16(%ebp), %eax
movl %eax, %ecx
movl 8(%ebp), %eax
movzbl (%eax), %eax
movl %eax, %edx
andl $15, %edx
movl 8(%ebp), %eax
incl %eax
movzbl (%eax), %eax
andl $192, %eax
orl %edx, %eax
movzbl _lookup_table_e3(%eax), %eax
movb %al, (%ecx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
movl 16(%ebp), %eax
movl %eax, %edx
movl 8(%ebp), %eax
movzbl (%eax), %eax
movzbl _lookup_table_e4(%eax), %eax
movb %al, (%edx)
leal 16(%ebp), %eax
incl (%eax)
incl 8(%ebp)
jmp L136
WITH OPTIMIZATION:
==============
a) standard:
----------
L171:
movzbl (%ecx), %eax
subl $3, %esi
shrb $2, %al
andl $63, %eax
movzbl _lookup_base64(%eax), %edx
movb %dl, (%ebx)
incl %ebx
movzbl (%ecx), %edx
movzbl 1(%ecx), %eax
incl %ecx
sall $4, %edx
shrb $4, %al
andl $48, %edx
andl $15, %eax
orl %eax, %edx
movzbl _lookup_base64(%edx), %eax
movb %al, (%ebx)
incl %ebx
movzbl (%ecx), %edx
movzbl 1(%ecx), %eax
incl %ecx
sall $2, %edx
shrb $6, %al
andl $60, %edx
andl $3, %eax
orl %eax, %edx
movzbl _lookup_base64(%edx), %eax
movb %al, (%ebx)
incl %ebx
movzbl (%ecx), %eax
incl %ecx
andl $63, %eax
movzbl _lookup_base64(%eax), %edx
movb %dl, (%ebx)
incl %ebx
cmpl $2, %esi
jg L171
b) multi-table lookup, 6K of tables (b):
----------------------------------------
L195:
movzbl (%ecx), %eax
subl $3, %esi
movzbl _lookup_table_b1(%eax), %edx
movb %dl, (%ebx)
incl %ebx
movzbl (%ecx), %eax
movzbl 1(%ecx), %edx
incl %ecx
andl $3, %eax
sall $8, %eax
movzbl _lookup_table_b2(%edx,%eax), %eax
movb %al, (%ebx)
incl %ebx
movzbl (%ecx), %eax
movzbl 1(%ecx), %edx
incl %ecx
andl $15, %eax
sall $8, %eax
movzbl _lookup_table_b3(%edx,%eax), %eax
movb %al, (%ebx)
incl %ebx
movzbl (%ecx), %eax
incl %ecx
movzbl _lookup_table_b4(%eax), %edx
movb %dl, (%ebx)
incl %ebx
cmpl $2, %esi
jg L195
c) multi-table lookup, 1K of tables (e):
----------------------------------------
L228:
movzbl (%ecx), %eax
movzbl _lookup_table_e1(%eax), %edx
movb %dl, (%ebx)
incl %ebx
movzbl (%ecx), %edx
movzbl 1(%ecx), %eax
incl %ecx
andl $3, %edx
andl $240, %eax
orl %eax, %edx
movzbl _lookup_table_e2(%edx), %eax
movb %al, (%ebx)
incl %ebx
movzbl (%ecx), %edx
movzbl 1(%ecx), %eax
incl %ecx
andl $15, %edx
andl $192, %eax
orl %eax, %edx
movzbl _lookup_table_e3(%edx), %eax
movb %al, (%ebx)
incl %ebx
movzbl (%ecx), %eax
incl %ecx
movzbl _lookup_table_e4(%eax), %edx
movb %dl, (%ebx)
incl %ebx
cmpl %edi, %ecx
jb L228
>> I'd hate to jump to the conclusion that it all boils down down to L1 cache-related issues if it is really something else.
Well, I've tested the exact same executable on an AMD Athlon64 and an Intel Pentium M, with the results I described earlier.
I know that the comparison is unfair, as they're completely different architectures - but on the other hand, that's exactly what's causing the different behavior imo.
I would be interested to see you try the g++ compiler with the same flags I used to see what results you get. Because there could also be a difference in optimization between Visual C++ and GCC.
These are the optimizations enabled by the -O3 flag:
-falign-functions : Align the start of functions to the machine dependent byte boundary
-falign-jumps : Align branch targets to the machine dependent byte boundary
-falign-labels : Align all branch targets to the machine dependent byte boundary
-falign-loops : Align loops to the machine dependent byte boundary
-fcaller-saves : Enable values to be allocated in registers that will be clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls
-fcprop-registers : Perform a copy-propagation pass to try to reduce scheduling dependencies
-fcrossjumping : Perform cross-jumping transformation
-fcse-follow-jumps : In common subexpression elimination, scan through jump instructions when the target of the jump is not reached by any other path
-fcse-skip-blocks : This is similar to -fcse-follow-jumps, but causes CSE to follow jumps which conditionally skip over blocks
-fdefer-pop : For machines which must pop arguments after a function call, the compiler normally lets arguments accumulate on the stack for several function calls and pops them all at once
-fdelayed-branch : If supported for the target machine, attempt to reorder instructions to exploit instruction slots available after delayed branch instructions
-fdelete-null-pointer-checks : Use global dataflow analysis to identify and eliminate useless checks for null pointers
-fexpensive-optimizations : Perform a number of minor optimizations that are relatively expensive
-fforce-mem : Force memory operands to be copied into registers before doing arithmetic on them
-fgcse : Perform a global common subexpression elimination pass. This pass also performs global constant and copy propagation
-fgcse-lm : When -fgcse-lm is enabled, global common subexpression elimination will attempt to move loads which are only killed by stores into themselves
-fgcse-sm : When -fgcse-sm is enabled, A store motion pass is run after global common subexpression elimination
-fguess-branch-probability : Guess branch probabilities using a randomized model
-fif-conversion : Attempt to transform conditional jumps into branch-less equivalents
-fif-conversion2 : Use conditional execution (where available) to transform conditional jumps into branch-less equivalents
-finline-functions : Integrate all simple functions into their callers
-floop-optimize : Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well
-fmerge-constants : Attempt to merge identical constants (string constants and floating point constants) across compilation units
-foptimize-sibling-calls : Optimize sibling and tail recursive calls
-fpeephole2 : Enable machine-specific peephole optimizations
-fregmove : Attempt to reassign register numbers in move instructions and as operands of other simple instructions in order to maximize the amount of register tying
-frename-registers : Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation
-freorder-blocks : Reorder basic blocks in the compiled function in order to reduce number of taken branches and improve code locality
-freorder-functions : Reorder functions in the object file in order to improve code locality
-frerun-cse-after-loop : Re-run common subexpression elimination after loop optimizations has been performed
-frerun-loop-opt : Run the loop optimizer twice
-fsched-interblock : Schedule instructions across basic blocks
-fsched-spec : Allow speculative motion of non-load instructions
-fschedule-insns : If supported for the target machine, attempt to reorder instructions to eliminate execution stalls due to required data being unavailable
-fschedule-insns2 : Similar to -fschedule-insns, but requests an additional pass of instruction scheduling after register allocation has been done
-fstrength-reduce : Perform the optimizations of loop strength reduction and elimination of iteration variables
-fstrict-aliasing : Allows the compiler to assume the strictest aliasing rules applicable to the language being compiled
-fthread-jumps : Perform optimizations where we check to see if a jump branches to a location where another comparison subsumed by the first is found
Well, I've tested the exact same executable on an AMD Athlon64 and an Intel Pentium M, with the results I described earlier.
I know that the comparison is unfair, as they're completely different architectures - but on the other hand, that's exactly what's causing the different behavior imo.
I would be interested to see you try the g++ compiler with the same flags I used to see what results you get. Because there could also be a difference in optimization between Visual C++ and GCC.
These are the optimizations enabled by the -O3 flag :
-falign-functions : Align the start of functions to the machine dependent byte boundary
-falign-jumps : Align branch targets to the machine dependent byte boundary
-falign-labels : Align all branch targets to the machine dependent byte boundary
-falign-loops : Align loops to the machine dependent byte boundary
-fcaller-saves : Enable values to be allocated in registers that will be clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls
-fcprop-registers : Perform a copy-propagation pass after register allocation to try to reduce scheduling dependencies
-fcrossjumping : Perform cross-jumping transformation
-fcse-follow-jumps : In common subexpression elimination, scan through jump instructions when the target of the jump is not reached by any other path
-fcse-skip-blocks : This is similar to -fcse-follow-jumps, but causes CSE to follow jumps which conditionally skip over blocks
-fdefer-pop : For machines which must pop arguments after a function call, the compiler normally lets arguments accumulate on the stack for several function calls and pops them all at once
-fdelayed-branch : If supported for the target machine, attempt to reorder instructions to exploit instruction slots available after delayed branch instructions
-fdelete-null-pointer-checks : Use global dataflow analysis to identify and eliminate useless checks for null pointers
-fexpensive-optimizations : Perform a number of minor optimizations that are relatively expensive
-fforce-mem : Force memory operands to be copied into registers before doing arithmetic on them
-fgcse : Perform a global common subexpression elimination pass. This pass also performs global constant and copy propagation
-fgcse-lm : When -fgcse-lm is enabled, global common subexpression elimination will attempt to move loads which are only killed by stores into themselves
-fgcse-sm : When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination
-fguess-branch-probability : Guess branch probabilities using heuristics
-fif-conversion : Attempt to transform conditional jumps into branch-less equivalents
-fif-conversion2 : Use conditional execution (where available) to transform conditional jumps into branch-less equivalents
-finline-functions : Integrate all simple functions into their callers
-floop-optimize : Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well
-fmerge-constants : Attempt to merge identical constants (string constants and floating point constants) across compilation units
-foptimize-sibling-calls : Optimize sibling and tail recursive calls
-fpeephole2 : Enable machine-specific peephole optimizations
-fregmove : Attempt to reassign register numbers in move instructions and as operands of other simple instructions in order to maximize the amount of register tying
-frename-registers : Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation
-freorder-blocks : Reorder basic blocks in the compiled function in order to reduce number of taken branches and improve code locality
-freorder-functions : Reorder functions in the object file in order to improve code locality
-frerun-cse-after-loop : Re-run common subexpression elimination after loop optimizations have been performed
-frerun-loop-opt : Run the loop optimizer twice
-fsched-interblock : Schedule instructions across basic blocks
-fsched-spec : Allow speculative motion of non-load instructions
-fschedule-insns : If supported for the target machine, attempt to reorder instructions to eliminate execution stalls due to required data being unavailable
-fschedule-insns2 : Similar to -fschedule-insns, but requests an additional pass of instruction scheduling after register allocation has been done
-fstrength-reduce : Perform the optimizations of loop strength reduction and elimination of iteration variables
-fstrict-aliasing : Allows the compiler to assume the strictest aliasing rules applicable to the language being compiled
-fthread-jumps : Perform optimizations where we check to see if a jump branches to a location where another comparison subsumed by the first is found
>> (1) Converting two bytes at once by using a 64x64 lookup table.
I've actually tried that (and some variations on it), but it creates other problems that bring down performance, like:
a) the total size of the lookup tables would be 8kB, which could be too big for general usage (or it might not)
b) you're dealing with 2D tables, which require a lot of operations to just calculate the address
c) each entry in the lookup table will be two bytes instead of 1, so you're not gaining a lot in that respect.
The code I tried using this "trick" was slower than my standard code.
>> (2) Keeping the conversion table in the cache by periodic uses of cache PREFETCH instructions. Or on the x64, read-specific prefetch.
Can you give a bit more info on those ? They sound interesting. Although, I think, if the data to be encoded is sufficiently "random", we won't need to keep them in the cache this way, as they should be kept in there by the encoding process itself.
>> (3) Unrolling the main loop a few times.
Certain compiler optimizations already do that. It's worth a shot, but in this particular case, I'm not sure it'll win us much.
>> (4) spinning off separate threads to take advantage of multiple cores or CPUs.
We're kind of limited by Dan's test system here :) (2.6Ghz Athlon).
>> (5) Carefully overlapping instructions to take into account the two to four instruction pipes available on Pentium and newer CPUs.
That's usually handled by the compiler (if it's doing a good job), and is very difficult to control using C (you really need assembler to do this well).
And it's system dependent, which we should avoid, imo.
>> (6) Using MMX, SSE, and other instructions that process two to eight operands at one gulp.
Worth a try I guess. I don't know the MMX, SSE instruction sets too well though ... And again, it's hard to control using C instead of ASM.
Want to join us and try to be faster ? :)
It might be fast, but it's not encoding the string correctly (SA==BIAABIAABIAABIAABIAA...) :) A few things:
1) you're depending on the endianness of the machine (running it on a machine with different endianness will mess up the algorithm).
2) the masking operations in the while loop don't do what you expect them to do.
3) you're depending on the sizes of WORD's and LONG's.
4) the *pChar += 3 part will only work to a certain extent, until the byte overflows.
5) the if at the end uses pszOutBuf while that has never been incremented, so it overwrites the start of the string.
points 1) and 3) are the exact reasons why I didn't use the tricks you did ...
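Since points 1) and 3) keep coming up, here's a minimal runtime check that table-based code could use to pick the right variant at startup (a sketch; the helper name is illustrative, not from the thread):

```c
/* Returns 1 on a little-endian machine, 0 on a big-endian one.
   (is_little_endian is an illustrative name, not from the thread.) */
static int is_little_endian(void)
{
    unsigned int one = 1u;
    /* A little-endian machine stores the least significant byte first. */
    return *(const unsigned char *)&one == 1;
}
```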
BTW, all those "*pSrc++;" lines, they almost certainly should be "pSrc++;" You want to bump the pointer, not the thing it points to.
ASKER
Infinity08,
Thanks for posting that ASM. I'll be looking at it closely soon...
grg99,
The "mystery algorithm" that I mentioned earlier uses the 8192-byte table with the two-byte lookup scheme and a 12-bit index...
I'm glad you hit on it ! (I felt sure it would come up sooner or later :-)
=-=-=-=-=-=-=-=-=-=-=-=
char Lookup12bits[ ((64*64)*2)+1 ]=   // 8192+1
"AAABACADAEAFAGAHAIAJAKALAMANAOAPAQARASATA ... A6A7A8A9A+A/"
"BABBBCBDBEBFBGBHBIBJBKBLBMBNBOBPBQBRBSBTB ... B6B7B8B9B+B/"
...
"9A9B9C9D9E9F9G9H9I9J9K9L9M9N9O9P9Q9R9S9T9 ... 5969798999+9/"
"+A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+ ... 5+6+7+8+9+++/"
"/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/ ... 5/6/7/8/9/+//"
;
//------------------------- Dan's "mystery algorithm" (similar to grg99's code)
void mToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
    char* p = pszOutBuf;
    DWORD d;
    WORD* pw     = (WORD*)pszOutBuf;
    WORD* pawTbl = (WORD*)Lookup12bits;
    while ( nSrcLen >= 3 ) {
        d  = *pSrc++; d <<= 8;
        d |= *pSrc++; d <<= 8;
        d |= *pSrc++;
        nSrcLen -= 3;
        *pw++ = pawTbl[ d >> 12 ];         // no need for a mask
        *pw++ = pawTbl[ d & 0x00000FFF ];
    }
    // ... code to fixup for end-of-src logic
}
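For completeness, a hedged sketch of what the omitted end-of-src fixup typically looks like (standard '=' padding per the RFC; abDigitLookup is the usual 64-character digit table, and EncodeTail is an illustrative name, not code from the thread):

```c
static const char abDigitLookup[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode the 1 or 2 leftover source bytes and pad with '=' so the
   output length stays a multiple of four.  Returns chars written. */
static int EncodeTail( const unsigned char* pSrc, int nLeft, char* pOut )
{
    if ( nLeft == 1 ) {
        pOut[0] = abDigitLookup[ pSrc[0] >> 2 ];
        pOut[1] = abDigitLookup[ (pSrc[0] & 0x03) << 4 ];
        pOut[2] = '=';
        pOut[3] = '=';
        return 4;
    }
    if ( nLeft == 2 ) {
        pOut[0] = abDigitLookup[ pSrc[0] >> 2 ];
        pOut[1] = abDigitLookup[ ((pSrc[0] & 0x03) << 4) | (pSrc[1] >> 4) ];
        pOut[2] = abDigitLookup[ (pSrc[1] & 0x0F) << 2 ];
        pOut[3] = '=';
        return 4;
    }
    return 0;   /* nLeft == 0 : input was a multiple of 3, nothing to do */
}
```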
=-=-=-=-=-=-=-=-=-=-=-=
Infinity08 is right... it's all about endianness... For this algorithm to work, the table has to be arranged with the pairs in the opposite of the expected order -- as built by generate_lookup_table_grg12() or as in my static table.
My version is running a little faster than the others (on my box, by my measure, I'm getting 208.50 MB/s), so it's fastest for all except for grg99's...
However, I can't post an "official" timing for grg99's code because the output is not yet correct. It's sure to be fast code, but some of what's missing will require a bit of additional code to fix (for instance, the endianness of the INPUT bytes needs to be rectified, and the source pointer needs to be updated -- you are currently reading and re-reading the same bytes over and over). The required changes will affect the timing, of course.
-- Dan
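A sketch of how such a pair table could be generated at startup instead of hard-coding the string (assumes a little-endian machine, where a WORD store writes its low byte first; names like GenPairTbl and awPairTbl are illustrative, not from the thread):

```c
static const char abDigits[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* 4096 two-byte entries, one per 12-bit index.  On a little-endian
   machine the FIRST output character must sit in the LOW byte of each
   16-bit entry, so that a single WORD store writes it first. */
static unsigned short awPairTbl[64 * 64];

static void GenPairTbl(void)
{
    int i;
    for (i = 0; i < 64 * 64; ++i) {
        int hi = (i >> 6) & 0x3F;   /* first  base64 digit */
        int lo =  i       & 0x3F;   /* second base64 digit */
        awPairTbl[i] = (unsigned short)
            (  (unsigned char)abDigits[hi]
            | ((unsigned char)abDigits[lo] << 8) );
    }
}
```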
ASKER
Re: grg99's other suggestions:
(1) 8192-byte (64*64*2) lookup table
I think this will be the fastest technique short of a 1,048,576-byte (64*64*64*4) lookup table -- which seems like overkill, and due to cache thrashing, may not be faster, anyway.
(2) PREFETCH instructions.
I could try this out on the ASM side, but I'm not sure how much difference it will make. L1 caches are typically far larger than the 8192-byte lookup table, so once it's been fetched, it gets accessed constantly -- so it seems likely to stay there.
(3) Unrolling the main loop a few times.
This *could* have a significant effect. I know that processing 6 input bytes into 8 output bytes (and running the loop nSrcLen/6 times) gave me a boost in some ASM code I was trying...
(4) take advantage of multiple cores or CPUs.
I'd certainly explore this if there was a $1,000,000 prize... It would probably nearly double the throughput to have two cores working simultaneously... say one on the first half and the other on the last half.
(5) overlapping/pairing instructions (optimize for dual pipeline architecture)...
As Infinity08 mentioned, the compiler will probably do a pretty good job of this. I did see some modest gains in some ASM code I was playing around with. But the inner loop ends up being very tight, so there aren't a lot of opportunities... you pretty much need the results of each register as soon as you can set its value.
(6) Using MMX ... two to eight operands at one gulp.
My most recent ASM try (unpublished) builds an 8-byte output into an MMX register (six bytes of input) and dumps a quadword at each pass. There is even a "non-temporal" version of the MOVQ opcode (MOVNTQ) that does not fill the cache with the output data (that never gets used again). This technique is, indeed the fastest I've timed yet... I'll post it in the ASM thread in due time.
>> The "mystery algorithm" that I mentioned earlier uses the 8192-byte table with the two-byte lookup scheme and a 12-bit index...
The only problem with this kind of algorithm is that you need to take into account machines with different endianness. That's still doable, just use different code on different machines. But it gets harder on machines that have 1-byte words, for example.
It was my first thought too, but I abandoned the idea for the reasons I mentioned.
>> BTW, all those "*pSrc++;" lines, they almost certainly should be "pSrc++;" You want to bump the pointer, not the thing it points to.
heh, that's a code editing problem (copy-paste). But the operator precedence still makes sure that the ++ is done before the *. So, the dereference operation is useless, but doesn't create a problem either, as it's optimized out by the compiler.
>But the operator precedence still makes sure that the ++ is done before the *.
Huh? To my knowledge, *p++ means, and has always meant, dereference p, then increment p.
But you're right, most compilers will optimize away the useless load, unless p is declared volatile.
And I was wrong, it doesn't increment what p points to. Dang language.
I will try modifying my code to handle both endians; hope that doesn't slow it down too much.
Also it would be fun to try a 3-hextet lookup table, although that's going to be wildly cache-unfriendly, all for a 50% gain at most. Oh, what fun!
Also an interesting point: if I comment out the line that stores the output data in the output array (two bytes at a time), the speed DOUBLES, which implies the code is going just about as fast as memory can accept (16-bit words) of data. One possibility is to buffer up 32 bits of output at a time and see if that helps a bit.
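The 32-bit output-buffering idea can be sketched like this (assumes a little-endian machine and an unaligned-tolerant DWORD store, as elsewhere in the thread; EmitFour is an illustrative name, not code from the thread):

```c
typedef unsigned int DWORD;

static const char abDigits[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Pack four base64 digits into one DWORD and store them with a single
   32-bit write.  d24 holds three source bytes: b0<<16 | b1<<8 | b2.
   Little-endian: the first output char goes in the low byte. */
static void EmitFour( DWORD d24, char* pOut )
{
    DWORD w =  (DWORD)(unsigned char)abDigits[(d24 >> 18) & 0x3F]
            | ((DWORD)(unsigned char)abDigits[(d24 >> 12) & 0x3F] << 8)
            | ((DWORD)(unsigned char)abDigits[(d24 >>  6) & 0x3F] << 16)
            | ((DWORD)(unsigned char)abDigits[ d24        & 0x3F] << 24);
    *(DWORD*)pOut = w;   /* one 32-bit write instead of four 8-bit ones */
}
```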
>> Huh? To my knowledge, *p++ means, and has always meant, dereference p, then increment p.
I meant that the ++ has higher precedence, so *p++ is the same as *(p++). I might have formulated it a bit incorrectly by using the word "done" though :) (the "done" I used referred to the compilation, not the execution)
ASKER
I saw the *p++ and (because it worked :-) figured the same thing: It increments the pointer, which now points to a new location, but since the code does not use the data at the pointer, the effect is the same as p++. I guess that if you want to increment the value that is AT the pointer, you need to use (*p)++ to override operator precedence.
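The difference is easy to demonstrate in a toy function (illustrative, not from the thread):

```c
/* *p++  : dereference, then advance the pointer -- same as *(p++).
   (*p)++: increment the value p points at; the pointer stays put. */
static void demo(int* a, int** ppOut)
{
    int* p = a;
    (void)*p++;   /* a[0] is read and discarded; p now points at a[1] */
    (*p)++;       /* a[1] is incremented; p still points at a[1]      */
    *ppOut = p;
}
```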
Other notes:
The Microsoft ATL Server contains a header file (atlenc.h) that contains a Base64 Encoder. In case you don't have VS Studio 2003, I found the source code here:
http://groups.google.com/group/microsoft.public.vc.language/browse_thread/thread/2fbd5a4418ca6e2c
The code does a number of things, such as inserting line breaks every 76 bytes (which is mentioned in the RFC, and might have made this a slightly different challenge, but it's trivial enough that I guess we can ignore it). But when you pare out such superfluous stuff, the Microsoft code has a main loop that looks like this:
int nLoopCnt( nSrcLen/3 );
for ( int j=0; j< nLoopCnt; j++ ) {
    DWORD dwCurr(0);
    for (int n=0; n<3; n++) {
        dwCurr |= *pSrc++;
        dwCurr <<= 8;
    }
    for (int k=0; k<4; k++) {
        BYTE b = (BYTE)(dwCurr>>26);
        *pszOutBuf++ = abDigitLookup[b];
        dwCurr <<= 6;
    }
}
//--- end-of-source logic omitted
This comes in at 166 MB/s. When you unroll the loops, it goes to 188 MB/s (about the rate of "Dan's std" code).
Tell me if I'm wrong, but it looks like this will work regardless of endianness...
They used a clever technique to break out the 6-bit indexes, that interestingly enough, matches with the trick used in craigwardman's technique on the ASM side:
1) Accumulate three source bytes, shifting to the left, 8 bits at-a-time
(first byte ends up in the top of the DWORD accumulator)
2) The lookup index is the accumulator value, shifted right by 26 (the upper 6 bits)
(this eliminates one masking operation)
3) look up the output byte and save it
4) move the remaining bits to the left by 6 bits
(discard the top bits, already used)
5) ... rinse and repeat ( steps 2-4, four times)
I think it's an interesting way to do it, mainly because it never would have occurred to me! In my heyday of ASM programming on the 8086 (circa 1985), a SHIFT operation took a specific number of clock cycles *plus* some cycles per bit position shifted. So a >> 26 still *looks like* an expensive operation to me :-)
ASKER
It does seem that using a single DWORD read to grab the three bytes could be more efficient than three separate 1-BYTE reads... But the endian thing makes it a hassle. There are several easy solutions in ASM, but I can't work out an efficient way to code it in C...
// ---------- the 'standard' way to collect the three bytes
DWORD d;
d = *pSrc++; d <<= 8;
d|= *pSrc++; d <<= 8;
d|= *pSrc++;
// ---------- a try that reads a full DWORD and rearranges...
DWORD d, e= *(DWORD*)pSrc;
d = ((e & 0x0000FF00))
| ((e & 0x000000FF ) << 16)
| ((e & 0x00FF0000 ) >> 16);
pSrc += 3;
The second version ran considerably slower than the first in my time trials.
Does anybody have any ideas on how to get a potential speed-up by doing DWORD reads without losing time rearranging the bytes for subsequent processing?
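One hedged possibility (not timed in the thread): on a little-endian machine, byte-swap the whole DWORD and shift the unwanted low byte away, so the rearrangement collapses to a swap plus one shift. A portable shift-based swap is sketched below; compilers with a byte-swap intrinsic (or the x86 BSWAP opcode) could reduce it to a single instruction. The same aliasing/alignment caveats as the DWORD-read code above apply, and note it reads one byte past the three it consumes.

```c
typedef unsigned int DWORD;

/* Portable 32-bit byte swap; an intrinsic or BSWAP could replace it. */
static DWORD Swap32( DWORD e )
{
    return  (e >> 24)
         | ((e >>  8) & 0x0000FF00u)
         | ((e <<  8) & 0x00FF0000u)
         |  (e << 24);
}

/* Gather three source bytes into the low 24 bits of a DWORD with one
   32-bit read (a little-endian load puts the first byte in the low end).
   Caller must guarantee at least 4 readable bytes at pSrc. */
static DWORD GatherThree( const unsigned char* pSrc )
{
    DWORD e = *(const DWORD*)pSrc;   /* e = b3<<24 | b2<<16 | b1<<8 | b0 */
    return Swap32(e) >> 8;           /* b0<<16 | b1<<8 | b2 */
}
```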
>> Tell me if I'm wrong, but it looks like this will work regardless of endianness...
It will, but really it's not much different than this (my standard code in this case - but any similar approach works too) :
while (nSrcLen >= 3) {
    *pszOutBuf++ = lookup_base64[ (*pSrc >> 2) & 0x3F];
    *pszOutBuf++ = lookup_base64[((*pSrc << 4) & 0x30) | (((*(pSrc + 1)) >> 4) & 0x0F)];
    pSrc++;
    *pszOutBuf++ = lookup_base64[((*pSrc << 2) & 0x3C) | (((*(pSrc + 1)) >> 6) & 0x03)];
    pSrc++;
    *pszOutBuf++ = lookup_base64[ (*pSrc) & 0x3F];
    pSrc++;
    nSrcLen -= 3;
}
When we purely count operations (without compiler optimizations), we get :
lookup : 6
assign : 4
in/decrement : 10
shift : 5
and : 6
or : 2
array lookup : 4
comparison : 1
And for this code :
int nLoopCnt( nSrcLen/3 );
for ( int j=0; j< nLoopCnt; j++ ) {
    DWORD dwCurr(0);
    for (int n=0; n<3; n++) {
        dwCurr |= *pSrc++;
        dwCurr <<= 8;
    }
    for (int k=0; k<4; k++) {
        BYTE b = (BYTE)(dwCurr>>26);
        *pszOutBuf++ = abDigitLookup[b];
        dwCurr <<= 6;
    }
}
we get :
lookup : 3
assign : 8
in/decrement : 15
shift : 11
and : 0
or : 3
array lookup : 4
comparison : 8
Comparing these two tables (I know they're not an exact measurement of performance, but they give a good idea), we can make these observations :
1) the lookups and assigns can be ignored, as the compiler will optimize them so it comes down to 3 lookups and 4 assigns anyway. (without optimization, we're looking at 10 vs. 11 lookups and assigns)
2) the increments and decrements can be easily optimized by the compiler, and are of similar magnitude anyway : 10 vs. 15
3) the number of "real" operations (shift, and, or) is about the same : 13 vs. 14
4) the array lookups are the same (4), which is to be expected, since the same lookup table is used
5) the number of comparisons are easily optimized by the compiler too, so they can be ignored
Which makes 3) the most important observation : both codes will take about the same time.
The thing is, if you want to keep your code endianness and word-size friendly, then you can't get much better code than the ones we already came up with, because you'll always end up with similar operation counts as I gave above.
The only real improvement I can see so far is to increase the size of the lookup table, and take advantage of that. But the problem is that the performance of code making use of that will change a lot between different architectures.
>> Does anybody have any ideas on how to get a potential speed up by reading DWORD reads without losing time rearranging the bytes for subsequent processing?
I've been trying to look for a way since you posted this challenge, but I always end up with the same amount of ands, ors and shifts.
The only approach that would work in theory is to have a 4 * 2^24 byte lookup table, which maps 3 bytes to 4 bytes. But in practice that doesn't work, since the size of the L1 cache is too limited (I still have to find a 67 MB L1 cache heh). But just for argument's sake, this would allow code like this :
unsigned int src = 0;
while (nSrcLen >= 3) {
    src  = *pSrc++ << 16;
    src |= *pSrc++ << 8;
    src |= *pSrc++;
    *((unsigned int*) pszOutBuf) = lookup_base64[src];
    pszOutBuf += 4;
    nSrcLen -= 3;
}
Granted, it also depends on 4-byte unsigned ints, but it doesn't depend on endianness (as long as the lookup table is generated on the same machine), and it's as fast as you can get (minus some minor optimizations probably) IF you have a sufficiently large L1 cache ...
ASKER
It's instructional to test the idea of using a Monster Table:
DWORD adwTbl[256*256*256]; /// a 64MB lookup table!
void GenMonsterTbl() {
BYTE* pb= (BYTE*) &adwTbl[0];
for ( int j=0; j<256*256*256; j++ ) {
DWORD d= j;
*pb++ = abDigitLookup[ BYTE((d & 0x00FC0000) >> 18) ];
*pb++ = abDigitLookup[ BYTE((d & 0x0003F000) >> 12) ];
*pb++ = abDigitLookup[ BYTE((d & 0x00000FC0) >> 6 ) ];
*pb++ = abDigitLookup[ BYTE((d & 0x0000003F) ) ];
}
}
void ToBase64_MonsterTbl( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
DWORD* pdw= (DWORD*)pszOutBuf;
for ( int j=0; j< nSrcLen; j+=3 ) {
DWORD d;
d = *pSrc++; d <<= 8;
d|= *pSrc++; d <<= 8;
d|= *pSrc++;
*pdw++ = adwTbl[d]; // output four bytes
}
}
As predicted, the speed is abysmal... around 13MB/s.
Since the code itself is so concise (a mere 42 bytes of opcodes in the loop) the difference can ONLY be memory accesses and though it makes FAR FEWER such accesses (a third as many table lookups and a fourth as many output writes), virtually every lookup is a cache miss. Since the accesses jump around so much, it's effectively like having no cache at all!
And please don't tell me "I told you so" ... I already knew it. But I'm a pragmatic type of guy and it's reassuring to see predicted results :-)
=-=-=-=-=-=-=-=-=
I'm still mystified by the ultra-fast times measured on Infinity08's "6K of tables" code... the ASM output will take some deciphering before I can understand it. I'll take another look if I can find time later today.
-- Dan
ASKER
OK!
I analyzed Infinity08's Xxx_b function's pseudo-ASM listing and converted it to an _asm block.
The GCC compiler used some opcode techniques that saved time... for instance, it uses
movzx eax, (...any byte)
to avoid needing
xor eax,eax
mov al, (...any byte)
when it will need to use all of eax as an index. Other than that, the low-level instruction streams are basically similar.
My timing results were similar to earlier:
MB/sec Internal name // Description (on AMD Athlon 2800)
-------- --------------------------
180.45 ...infinity08_lookup_b / Infinity08 (6K of tables) Compiled C
181.81 ...infinity08_lookup_b / Infinity08 (6K of tables) My ASM conversion
208.43 mToBase64 / Dan's "mystery algorithm" (8096-byte lookup table)
HOWEVER!!!!!
I then ran the exact same executables on 2.5 GHz Celeron (genuine Intel) with these results:
MB/sec Internal name // Description (on 2.51 Celeron)
-------- --------------------------
306.53 ...infinity08_lookup_b / Infinity08 (6K of tables) Compiled C
432.21 ...infinity08_lookup_b / Infinity08 (6K of tables) My ASM conversion
453.01 mToBase64 / Dan's "mystery algorithm" (8096-byte lookup table)
The times are all significantly better on the Intel architecture, even though it's running a Celeron (which is known to have smallish caches).
The GCC-generated output (my ASM conversion) was significantly faster than the same algorithm when compiled/optimized by the VC++ compiler. I think that tells us that the GCC optimizer is smarter (at least when the output is executed on Intel). The difference between the two (26%) is dramatic... just as Infinity08 said, and this confirms his measurements.
I really can't see enough difference in the opcodes to justify such a dramatic difference, but there it is. The Intel architecture must be better at cache prediction and pipelining instructions -- a LOT better.
I also ran the 2-byte lookup version for comparison... it's still the fastest (either architecture), but only by a negligible 0.5% on the Intel Celeron (1.5% on my Athlon).
-- Dan
>> The times are all significantly better on the Intel architecture, even though it's running a Celeron (which is known to have smallish caches).
Well, we can't forget which CPUs we tried :
Intel : Pentium M and Pentium Celeron
AMD : Athlon XP and Athlon64
There's a serious difference in architecture between the two. But more importantly : both Intel processors are in a lower performance class than the AMD processors. It would be interesting to see how the code performs on a Pentium 4 or a Duron or something similar.
It could be that encoding algorithms that use small lookup tables have an advantage on systems in the low performance class (Pentium M, Celeron, Duron), however contradictory that may sound.
You have to watch out, some of the newer instructions like movzx cause pipeline stalls on the early Pentiums, so doing the two simpler instructions might be as fast or faster!
It's interesting that the 24-bit lookup table is so dang slow. Shows how slow inexpensive memory really is compared to the CPU these days.
I tried some experiments with using MMX instructions to process EIGHT sixbit quantities at a time. Only took about eight MMX instructions to process all eight sixbits, BUT that time is swamped by the SIXTEEN shifts it takes to position each sixbit quantity in a separate byte. Sigh.
If there were an MMX instruction to pack/unpack 6-bit fields into 8-bit bytes this would be a huge winner, but alas no, not even in the latest SSE4 or AMD extensions...
One last possible trick-- investigate non-sequential processing, such as processing all the first bytes, then the seconds, then the thirds. That way you or the compiler could keep the masks and shifts values in registers, AND position all the bits in 4 or 8 bytes with just ONE shift. Wow, gotta try this!
ASKER
Innovative idea there grg99... it would be "paradoxical" if taking two passes through the data actually saved time, but there might be some other advantage that overrides everything else.
In my MMX tests, I was able to avoid a lot of shift ops by moving back and forth from MMX registers to general registers (where BSWAP and XCHG are available). The big win seems to be in using a non-temporal 8-byte move to output the results. If you care to share, I'd love to see any working MMX stuff you've tried in the ASM version of this thread (http:Q_21983447.html :-)
ASKER
In the ASM thread,
https://www.experts-exchange.com/questions/21983447/How-fast-is-your-ASM-Base64-Encoder.html#17711344
I wrote a version of the 8096-byte lookup algorithm code that confirms the assertion that "caching is king" -- i.e., that all of the algorithms are losing time to the cache pollution that occurs when we plow through 1MB of output.
In that example, I changed just one opcode from MOVQ to MOVNTQ and saw the fastest results to date, nearly 30% speed increase by changing just that *one opcode.* The MOVNTQ opcode tells the CPU to store the 8-byte MMX register directly to main memory -- bypassing the caches (and thus allowing other data to stay cached).
I think that it's safe to say that when processing large chunks of memory, such as streaming video or this 1MB block of input, this is an overriding concern.
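(For anyone following along in C rather than ASM: the same cache-bypassing store is reachable from C/C++ through the SSE2 intrinsic _mm_stream_si32, which emits MOVNTI. The sketch below is my own illustration of the technique Dan describes, not his actual MOVNTQ code; it assumes a little-endian x86 target and skips padding.)

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si32 (MOVNTI), _mm_sfence
#include <cstdint>

static const char kDigits[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Same scalar algorithm as elsewhere in the thread, but the four output
// bytes are packed into one 32-bit word and written with a non-temporal
// store, so the output stream never pollutes the cache.
void ToBase64_Stream(const uint8_t* pSrc, int nSrcLen, char* pszOutBuf) {
    while (nSrcLen >= 3) {
        uint32_t d = (pSrc[0] << 16) | (pSrc[1] << 8) | pSrc[2];
        uint32_t out =  (uint32_t)(uint8_t)kDigits[(d >> 18) & 0x3F]
                     | ((uint32_t)(uint8_t)kDigits[(d >> 12) & 0x3F] << 8)
                     | ((uint32_t)(uint8_t)kDigits[(d >>  6) & 0x3F] << 16)
                     | ((uint32_t)(uint8_t)kDigits[ d        & 0x3F] << 24);
        _mm_stream_si32((int*)pszOutBuf, (int)out);  // bypass the cache
        pszOutBuf += 4;
        pSrc += 3;
        nSrcLen -= 3;
    }
    _mm_sfence();  // drain the write-combining buffers before returning
}
```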
It might be a more valid test of the encoding algorithms to try a variation of my timing benchmark... a variation in which a much smaller buffer (say 10K) is encoded 100,000,000 times. In such a case, all of the input and all of the output and all of the lookup tables would be in cache for the entire run.
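That small-buffer variation is easy to sketch. The harness below is my own, not the benchmark actually used in this thread; the encoder inside it is just a stand-in for whichever submission is being timed, and the names are made up.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

static const char kDigits[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Stand-in encoder: any submission with this signature can be dropped in.
// (Padding is skipped; the benchmark keeps bufSize a multiple of 3.)
static void ToBase64_UnderTest(const uint8_t* pSrc, int nSrcLen, char* pszOutBuf) {
    while (nSrcLen >= 3) {
        uint32_t d = (pSrc[0] << 16) | (pSrc[1] << 8) | pSrc[2];
        *pszOutBuf++ = kDigits[(d >> 18) & 0x3F];
        *pszOutBuf++ = kDigits[(d >> 12) & 0x3F];
        *pszOutBuf++ = kDigits[(d >>  6) & 0x3F];
        *pszOutBuf++ = kDigits[ d        & 0x3F];
        pSrc += 3;
        nSrcLen -= 3;
    }
}

// Encode one small buffer over and over and report MB/s, so the input,
// output, and lookup tables all stay cache-resident for the whole run.
double MeasureMBPerSec(int bufSize, long iterations) {
    std::vector<uint8_t> src(bufSize);
    for (int i = 0; i < bufSize; ++i)
        src[i] = (uint8_t)(i * 37 + 11);            // arbitrary test pattern
    std::vector<char> dst(bufSize / 3 * 4 + 4);
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        ToBase64_UnderTest(src.data(), bufSize, dst.data());
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return (double)bufSize * (double)iterations / (secs * 1024.0 * 1024.0);
}
```

Usage would be something like `printf("%.2f MB/s\n", MeasureMBPerSec(9 * 1024, 100000));` -- 9K is a multiple of 3, so no padding path is exercised.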
But... I think this thread is running down. I'll leave this open for a while longer to see if we can get some more submissions or even some general interest. If nothing pops, I'll close up and award the points.
-- Dan
ASKER
This has been an interesting and thoroughly enjoyable intellectual exercise. I -- and I hope the other participants -- have learned a lot. For one thing the poor performance of the monster lookup table (cache thrashing) was enlightening. For another, the significant speed advantage of a genuine Intel CPU (in this particular scenario) surprised me. Infinity08's multiple-table scheme was innovative -- I'd have never thought of it :-)
We all put more into this than we would for a standard EE question, so the point awards are not true "compensation" -- the real reward here has been the participation itself. Infinity08 posted the code that ran fastest with correct output. grg99's contribution *showed the way* to the fastest algorithm measured here (the 8096-byte lookup table) but he did not submit a debugged version for actual timing comparison. grg99's other contributions, including the no-table-at-all code, were interesting and his comments insightful, and I am choosing to award a share of the points to him.
For what it's worth: Every submission beat all of the base64 encoders in a Microsoft API or other published utility. So that's a bonus :-)
Thanks to everyone who participated and even if you just "lurked" or if you read this at some time in the future, I hope that you enjoyed this as much as I did :-)
-- Dan
>> the real reward here has been the participation itself.
Wise words :)
>> For what it's worth: Every submission beat all of the base64 encoders in a Microsoft API or other published utility. So that's a bonus :-)
Heh.
I enjoyed this discussion/challenge, Dan ! Thanks for that !