DanRollins
asked on
How fast is your ASM Base64 Encoder?
I was just reading over an old Q here from several years ago...
How fast is your ASM hex converter?
http:/Assembly/Q_20272901.html
which was more of a fun challenge than a real question. I'm not sure how many ASM Experts are participating in this TA these days, but I thought I'd throw out the gauntlet again.
The challenge is to write -- in 80x86 ASM -- an ultra-fast Base64 Encoder.
A Base64 Encoder converts binary data into ASCII codes that can be transmitted/processed by systems that do not normally handle raw binary... for instance, in email attachments, in HTTP headers, and as values in HTTP URLs. The encoder outputs only the following characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
plus, the output may end in zero, one, or two equal signs (=)
To encode the raw binary, you access it three bytes at a time, taking the data in groups of six bits. For instance:
Hex:      0x07      0xFF      0x02
Binary:   0000 0111 1111 1111 0000 0010   (groups of 4 bits)
Binary:   000001 111111 111100 000010     (groups of 6 bits)
          \____/ \____/ \____/ \____/
decimal:    1      63     60     2
lookup:     B      /      8      C
The length of the output must always be a multiple of four, because when you DEcode the data, you will be taking groups of four base64 "digits" and turning them back into three binary bytes of output. Therefore, when ENcoding, if your input data is not a multiple of three bytes long, you will need to pad the end of the output with one or two equal signs (=)
If you need more info, then (as always) RTFR! (read the friggin RFC :-) http://www.ietf.org/rfc/rfc3548.txt
Also, to verify your output, you can use an Encoder at this URL:
http://www.opinionatedgeek.com/dotnet/tools/Base64Encode/Default.aspx
=-=-=-=-=-=-=-=-=-=-=
I'll be speed-testing the submissions using code like this:
//------------------------ (assume dest buf is large enough)
void ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf ) // {{modified 9/13/2006 --DR, PE}}
{
_asm{
// your code here!
}
}
So if you write your code in a format that can be pasted into an _asm block and compiled by Visual Studio 6, it will make it a bit easier.
If there is any interest in this, I'll post a C-language version for use as a comparison (there is one here: http://research.microsoft.com/invisible/src/crypto/cyphers/base64/base64.c.htm -- though it does not look particularly efficient) and I'll also time the ATL function Base64Encode for reference (my guess is that it is pretty fast... unbeatable?). The CryptoAPI also provides a function, CryptBinaryToString, that will do the work, and it might be interesting to see how fast it runs. If this works out, I'll start a second challenge for a Base64 ASM Decoder.
Are you game? 500 points and your reputation as an ASM programmer are on the line :-)
-- Dan
how do we know the # of bytes pointed to by pSrc?
ok, well based on having the length passed in as a parameter I have done the following code:
--------------------------
void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
__asm{
// pushad (removed for speed - uncomment if necessary)
xor ecx, ecx //byte idx
src_byteLoop:
xor edx, edx
mov esi, pSrc //bytes from source
add esi, ecx
xor eax, eax
//read in 3 bytes separately (to allow for little endian layout)
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
padded_byteLoop:
mov ebx, 4 //4 sets (of 6 bits) 24bits (3bytes)
next_bitset:
xor edx, edx
mov edi, 6 //6bit groups
bitLoop:
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
dec edi //bit--
jnz bitLoop
shr edx, 1 //div 2 (final bit should have sig 1)
mov esi, chr_table
add esi, edx //set the pointer to correct char
mov al, byte ptr [esi] //al is spare, use it!
mov edi, pszOutBuf
mov byte ptr [edi], al //put char in buffer
add pszOutBuf,1 //next buf
dec ebx //set--
jnz next_bitset
add ecx,3
mov esi, len
sub esi, ecx //len - done = remaining
jle finished
sub esi,3
jge src_byteLoop //still got some groups of 3 bytes
//need to pad out some extra bytes
xor edx, edx
mov edi, pSrc //bytes from source
add edi, ecx
xor eax, eax
//read in 3 bytes regardless of junk data following pSrc
mov ah, byte ptr[edi]
mov al, byte ptr[edi+1]
shl eax, 16
mov ah, byte ptr[edi+2]
//as per the RFC, any padded bytes should be 0s
mov ebx, 0xFFFFFF
neg esi //esi = 3 - remaining bytes (1 or 2)
lea esi, dword ptr[esi*8+8] //shift count: 24 (1 byte left) or 16 (2 bytes left)
xchg esi, ecx //shl needs its count in cl
shl ebx, cl
xchg esi, ecx
and eax, ebx //zero out the junk bits
jmp padded_byteLoop
finished:
neg esi
jz end
//some bytes were padding, put them as =
padChars:
mov edi, pszOutBuf
sub edi, esi
mov byte ptr[edi], 0x3d
dec esi
jnz padChars
end:
// popad (removed for speed - uncomment if necessary)
}
}
----------------
some notes:
1) I could unroll the fixed-count loops for faster performance if necessary. I have left them rolled for now, for readability and to see how quickly the code runs before unrolling.
2) If this were in pure asm and not in an __asm block, it would be quicker, because there would be no need to double-dereference the variables in order to get at their data.
3) I am a novice asm programmer, so the above was written to just 'do' what it should. I have given some thought to speed (such as not using the stack, using registers instead of memory locations, and avoiding lengthy instructions such as div), but there may be better ways of achieving what my code does.. please let me know, I am always eager to learn!!
for the records, is the version with loops unrolled, which after a small amount of testing seems considerably quicker..
--------------------------
void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
__asm{
xor ecx, ecx //byte idx
src_byteLoop:
xor edx, edx
mov esi, pSrc //bytes from source
add esi, ecx
xor eax, eax
//read in 3 bytes separately (to allow for little endian layout)
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
padded_byteLoop:
//4 sets (of 6 bits) 24bits (3bytes)
//set1
xor edx, edx
//6bit groups
//1
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//2
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//3
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//4
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//5
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//6
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
shr edx, 1 //div 2 (final bit should have sig 1)
mov esi, chr_table
add esi, edx //set the pointer to correct char
mov al, byte ptr [esi] //al is spare, use it!
mov edi, pszOutBuf
mov byte ptr [edi], al //put char in buffer
add pszOutBuf,1 //next buf
//set2
xor edx, edx
//1
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//2
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//3
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//4
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//5
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//6
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
shr edx, 1 //div 2 (final bit should have sig 1)
mov esi, chr_table
add esi, edx //set the pointer to correct char
mov al, byte ptr [esi] //al is spare, use it!
mov edi, pszOutBuf
mov byte ptr [edi], al //put char in buffer
add pszOutBuf,1 //next buf
//set3
xor edx, edx
//1
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//2
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//3
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//4
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//5
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//6
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
shr edx, 1 //div 2 (final bit should have sig 1)
mov esi, chr_table
add esi, edx //set the pointer to correct char
mov al, byte ptr [esi] //al is spare, use it!
mov edi, pszOutBuf
mov byte ptr [edi], al //put char in buffer
add pszOutBuf,1 //next buf
//set4
xor edx, edx
//1
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//2
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//3
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//4
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//5
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
//6
shl eax, 1 //get MSB
adc edx,0 //add 1 if carry
shl edx, 1 //keep significance
shr edx, 1 //div 2 (final bit should have sig 1)
mov esi, chr_table
add esi, edx //set the pointer to correct char
mov al, byte ptr [esi] //al is spare, use it!
mov edi, pszOutBuf
mov byte ptr [edi], al //put char in buffer
add pszOutBuf,1 //next buf
add ecx,3
mov esi, len
sub esi, ecx //len - done = remaining
jle finished
sub esi,3
jge src_byteLoop //still got some groups of 3 bytes
//need to pad out some extra bytes
xor edx, edx
mov edi, pSrc //bytes from source
add edi, ecx
xor eax, eax
//read in 3 bytes regardless of junk data following pSrc
mov ah, byte ptr[edi]
mov al, byte ptr[edi+1]
shl eax, 16
mov ah, byte ptr[edi+2]
//as per the RFC, any padded bytes should be 0s
mov ebx, 0xFFFFFF
neg esi //esi = 3 - remaining bytes (1 or 2)
lea esi, dword ptr[esi*8+8] //shift count: 24 (1 byte left) or 16 (2 bytes left)
xchg esi, ecx //shl needs its count in cl
shl ebx, cl
xchg esi, ecx
and eax, ebx //zero out the junk bits
jmp padded_byteLoop
finished:
neg esi
jz end
//some bytes were padding, put them as =
padChars:
mov edi, pszOutBuf
sub edi, esi
mov byte ptr[edi], 0x3d
dec esi
jnz padChars
end:
}
}
------------------------
Now I'll stop flooding the thread :)
ASKER
Hi craigwardman!
Thanks for popping in. You are right! The source length must be an input parameter. Using my PageEditor tools, I've modified the question text to reflect that (and fixed my link to the RFC).
>>...please let me know I am always eager to learn!!
I hope that this can be a learning experience for everybody involved :-)
I'll be looking at your code soon. First I want to show my first shots at a C-language version... I've found that coding it up in C is often a good first step... to make sure I understand the algorithm.
My first shot at an inner loop looked like this:
for ( int j=0; j< nSrcLen; j+=3 ) {
d = *pSrc++; d <<= 8; // load three bytes ... in the right order for processing
d|= *pSrc++; d <<= 8;
d|= *pSrc++;
BYTE b1= BYTE((d & 0x00FC0000) >> 18);
BYTE b2= BYTE((d & 0x0003F000) >> 12);
BYTE b3= BYTE((d & 0x00000FC0) >> 6 );
BYTE b4= BYTE((d & 0x0000003F) >> 0 );
*p++ = abDigitLookup[ b1 ];
*p++ = abDigitLookup[ b2 ];
*p++ = abDigitLookup[ b3 ];
*p++ = abDigitLookup[ b4 ];
}
I'm omitting the end-of-source buffer padding logic, but that shows the general operation. The endian thing threw me at first... I thought I could just do a DWORD load to get the source bytes, but they end up in the wrong order for subsequent processing via bit shifts. A DWORD load should be much faster -- I think that it should be possible to avoid the three separate byte loads in the ASM version by using XCHG and/or BSWAP.
Before optimization, that sequence of C code generates 337 bytes of machine language opcodes.
After optimization for speed or size it is 130 or 125 bytes, respectively.
In my next version, I tried a slight variation... calculating the outputs taking the data one byte at a time.
for (int j=0; j< nSrcLen; j+=3 ) {
d = *pSrc++; // 0000 0000 aaaa aaaa
b1= BYTE((d & 0x00000FC) >> 2); // 0000 0000 AAAA AA00
d <<= 8; // aaaa aaaa 0000 0000
d |= *pSrc++; // aaaa aaaa bbbb bbbb
b2= BYTE((d & 0x00003F0) >> 4); // 0000 00AA BBBB 0000
d <<= 8; // bbbb bbbb 0000 0000
d |= *pSrc++; // bbbb bbbb cccc cccc
b3= BYTE((d & 0x0000FC0) >> 6 ); // bbbb BBBB CCcc cccc
b4= BYTE((d & 0x000003F) >> 0 ); // bbbb bbbb ccCC CCCC
*p++ = abDigitLookup[b1];
*p++ = abDigitLookup[b2];
*p++ = abDigitLookup[b3];
*p++ = abDigitLookup[b4];
}
The "apparent" advantage would be using shorter shifts (e.g., SHR 2 instead of SHR 18), but I think I'm wasting my time... the CPU can shift 18 positions as fast as it shifts one. This version of the code actually generates more opcodes (by about 10). I could save a few (about 5) by avoiding the temp values (b1,b2,b3,b4), using code like:
*p++ = abDigitLookup[BYTE((d & 0x00000FC) >> 2)];
...but I HATE it when I see code like that -- the compiler basically optimizes away the temp variables, so the advantage of readability usually wins. Of course, if a few nanoseconds (or bytes of opcode) are important, then it might be worth doing.
BTW, by comparison, craigwardman's first submission is 120 bytes -- smaller and probably faster (measurements to be taken soon) than even the most optimized of my compiled C versions.
=-=-=-=-=-=-=-=-=-=-=
My end-of-src processing takes a different tack from that of craigwardman. After I'm all done, I do:
*p= 0; // null terminate the output
p--; // back up to the output char before the end
while ( j-- > nSrcLen ) {
*(p--) = '=';
}
In other words, I go ahead and process the one or two bytes of junk at the end of the source in the main loop, but at the end, go back and overwrite my extra output with the padding character. I observed that if the nSrcLen MOD 3 is 1, then there will be two output characters that need to be overwritten and if nSrcLen MOD 3 is 2 then there will be one. I arranged a while loop to use the main loop index (j) and nSrcLen to handle that.
Note that in both my code and craigwardman's submission, we both ASSUME that we have legal access to up to two bytes of source beyond nSrcLen. Assuming anything like that can be dangerous... Imagine if the caller's buffer has been allocated in such a way that reading even one byte beyond the end of the input buffer would cause an illegal access.
Query:
Should we require that there be no access beyond the specified length of the input buffer? I think that doing so will slow down the final code...
An alternative would be to pretend that we have documented a requirement that the source buffer be a multiple of 3 bytes long (though nSrcLen can be anything -- an exact count of the length of the data to encode). The problem with that option is that it's ugly -- there's no way it would be a reasonable requirement for production code.
I personally don't mind it if we decide to allow access beyond the end of the buffer for this speed contest, but I think the issue should be mentioned, in any case.
Thoughts?
=-=-=-=-=-=-=-=-=-=-=
I'm going to post a link to this thread in the C++ TA to see if we can get some more interest.
I'll be back with some timing/speed data as soon as I can.
-- Dan
Hi Dan,
I think for the purpose of this thread, allowing reads beyond the end of pSrc is ok -- as you say, a proper implementation may need some extra code to prevent it attempting this, but I imagine it would only be a problem if pSrc pointed to an area of memory that happened to sit right next to a protected page..
----
Using XCHG is a good idea; I didn't realise this would automatically account for little-endianness.
---
regarding the padding at the end of the stream..
>>if the nSrcLen MOD 3 is 1, then there will be two output characters that need to be overwritten
this is true, but since groups of 3 bytes are converted into 4 chars, you will also have 2 chars that are not '=' signs..
The only way to ensure these two are correct is to follow the RFC and zero out any incomplete 24-bit triplet, as you could otherwise end up encoding junk data that sits after the src (unless, of course, you only read up to the end of the src). It must have been this text that made me decide to read in the junk bytes:
>>Special processing is performed if fewer than 40 bits are available
at the end of the data being encoded. A full encoding quantum is
always completed at the end of a body. When fewer than 40 input bits
are available in an input group, zero bits are added (on the right)
to form an integral number of 5-bit groups. Padding at the end of
the data is performed using the "=" character.
I suppose, on second reading, they expect you to 'add' the zero bits, not overwrite the junk..
I'm not sure without seeing the whole of your source, but what would happen in the following situation:
BYTE lpSrc[] = { 0x01, 0x02, 0x03, 0x04 }; // imagine the junk that follows this is 0xFF 0xFF
the last byte's first 6 bits will encode ok, but the next 2 bits will be combined with 4 bits of the junk, so if you don't zero these out then the last char will be wrong..
Anyway, better go now - need to set off out!!
I think for the purpose of this thread, allowing the reading beyond the pSrc is ok - as you say in a proper implementation it may need some extra code to prevent it attempting this operation, but I imagine this would only be a problem if pSrc pointed to an area in memory that aligned itself right next to a protected page..
----
Using XCHG is a good idea, i didnt realise this would automatically account for little endianness.
---
regarding the padding at the end of the stream..
>>if the nSrcLen MOD 3 is 1, then there will be two output characters that need to be overwritten
this is true, but as groups of 3 are converted into 4 chars you will also have 2 chars that are not '=' signs..
The only way to ensure these two are correct is to follow the RFC of zeroing out any incomplete 24bit triplet.. as you could end up encoding junk data that is after the src.. (unless of course you are only reading upto the end of the src).. It must have been this text that made me decide to read in junk bytes:
>>Special processing is performed if fewer than 40 bits are available
at the end of the data being encoded. A full encoding quantum is
always completed at the end of a body. When fewer than 40 input bits
are available in an input group, zero bits are added (on the right)
to form an integral number of 5-bit groups. Padding at the end of
the data is performed using the "=" character.
I suppose, on 2nd reading, they expect you to 'add' the bits, not overwrite the junk..
I'm not sure without seeing the whole of your source, but what would happen in the following situation:
BYTE lpSrc[]={0x01, 0x02, 0x03, 0x04}; imagine the junk that follows this is 0xFF 0xFF
the last byte's first 6 bits will encode OK, but the next 2 bits will be combined with 4 bits of the junk, so if you don't zero these out then the last char will be wrong..
Anyway, better go now - need to set off out!!
ASKER
I'll look into it and comment on that shortly.
In the meantime, the initial timings are in, and your "Unrolled loops" version is fastest! (so far :-).
I've posted the timing methodologies and the current results at:
http:/Cplusplus/Q_21988706.html#17524209
wow, those figures should remove any doubt that unrolling loops improves performance!
10MB/sec quicker than the original :)
hmm, your 'other' C function sounds interesting.. I can't say that I'm a very strong reader of C unless I've written the code myself - too many long and complex statements. Will you be releasing your compiler-generated asm for this function?
also.. when you compile your code, are you enabling compiler optimisations? This point may be worth mentioning in the C thread: different compilers might optimise the code in different ways, so the generated asm may be quite different from the C source and it becomes a contest of which compiler has the best optimisations..
ASKER
Re...
>> regarding the padding at the end of the stream..
Your quote was from the part of the RFC that describes base32 encoding (a minor oversight, since the principle is the same).
You are correct... I need to change my end-of-source handling. The 'junk' at the end of the source buffer can and does affect the final base64 'digit.'
My encoding of 0x01 0x02 0x03 0x04 0xFF 0xFF is
AQIDBP==
0x01 0x02 0x03 0x04 0xFF 0xFF
0000 0001 0000 0010 0000 0011 0000 0100 1111 1111 1111 1111 // 4-bit groups
000000 010000 001000 000011 000001 001111 111111 111111 // 6-bit groups
0(A) 32(Q) 16(I) 3(D) 1(B) 15(P) 63(/) 63(/) // decimal(base64)
...whereas, if the 'junk' values at the end are 0...
0x01 0x02 0x03 0x04 0x00 0x00
0000 0001 0000 0010 0000 0011 0000 0100 0000 0000 0000 0000 // 4-bit groups
000000 010000 001000 000011 000001 000000 000000 000000 // 6-bit groups
0(A) 32(Q) 16(I) 3(D) 1(B) 0(A) 0(A) 0(A) // decimal(base64)
meaning the final base64 digit should be 'A'
AQIDBP== (mine)
AQIDBA== (correct per the RFC)
It's interesting that my (incorrect) encoding actually DEcodes correctly:
AQID BP== (mine, incorrect)
A Q I D
AQID: 000000 010000 001000 000011 // 6-bit groups
: 0000 0001 0000 0010 0000 0011 // 4-bit groups
: 0x01 0x02 0x03 // fine so far...
B P = =
BP==: 000001 001111 ?????? ?????? // 6-bit groups
: 0000 0100 0011 ???? ???? ???? // 4-bit groups
: 0x04 3? ?? // also fine, since only the 0x04 is output
That's because the DEcoder, seeing two padding chars at the end, will decode only the first six bits of the final 24 (and discard the final 18 bits). If it sees just the one padding char, it will decode the first 12 bits and discard the final six. The values of the discarded bits do not affect the output.
Empirically testing at that base64 decoding website bears it out... when there are two padding chars, the final base64 digit (before the padding) can have a variety of values without affecting the binary output.
=-=-=-=-=-=-=-=-=
All that said... the RFC (page 4) is clear:
>> When fewer than 24 input bits are available in an input group, zero bits
are added (on the right) to form an integral number of 6-bit groups.
Thus, to be within spec, a valid base64 ENcoder function must "encode" up to two extra bytes of 0 if the buffer is not an even multiple of 3.
I'll work on my function to see if I can find an optimal way to do it right.
Nice catch, by the way :-)
ASKER
Here is the compiler-generated ASM of "Dan's standard' function when optimized for speed:
>> void zToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf ) {
push ebp
mov ebp,esp
>> char* p= pszOutBuf; DWORD d; BYTE b1, b2, b3, b4;
>> for ( int j=0; j< nSrcLen; j+=3 ) {
mov eax,dword ptr [ebp+0Ch]
mov ecx,dword ptr [ebp+10h]
push edi
xor edi,edi
test eax,eax
jle zToBase64+81h ;; to loop exit
push ebx ;;
push esi
push 3
add eax,2
xor edx,edx
pop esi
div eax,esi
mov esi,dword ptr [pSrc]
push 3Fh
mov dword ptr [pszOutBuf],eax
lea edi,[eax+eax*2]
pop eax
>> d= *pSrc++; d <<= 8;
movzx edx,byte ptr [esi]
>> d|= *pSrc++; d <<= 8;
movzx ebx,byte ptr [esi+1]
inc esi
shl edx,8
or edx,ebx
inc esi
shl edx,8
>> d|= *pSrc++;
movzx ebx,byte ptr [esi]
or edx,ebx
inc esi
>> b1= BYTE((d & 0x00FC0000) >> 18);
mov ebx,edx
shr ebx,12h
and ebx,eax
>> b2= BYTE((d & 0x0003F000) >> 12);
>> b3= BYTE((d & 0x00000FC0) >> 6 );
>> b4= BYTE((d & 0x0000003F) >> 0 );
>> *p = abDigitLookup[ b1 ]; p++;
mov bl,byte ptr abDigitLookup[ebx]
mov byte ptr [ecx],bl
mov ebx,edx
shr ebx,0Ch
and ebx,eax
inc ecx
>> *p = abDigitLookup[ b2 ]; p++;
mov bl,byte ptr abDigitLookup[ebx]
mov byte ptr [ecx],bl
mov ebx,edx
shr ebx,6
and ebx,eax
inc ecx
and edx,eax
>> *p = abDigitLookup[ b3 ]; p++;
mov bl,byte ptr abDigitLookup[ebx]
mov byte ptr [ecx],bl
>> *p = abDigitLookup[ b4 ]; p++;
mov dl,byte ptr abDigitLookup[edx]
inc ecx
mov byte ptr [ecx],dl
inc ecx
dec dword ptr [pszOutBuf]
jne zToBase64+28h ;; top of loop
pop esi
pop ebx
>> }
>> // note: TBD: I have not fixed this end-of-source logic
>>*p= 0;
and byte ptr [ecx],0
>> p--;
dec ecx
>> while ( j-- > nSrcLen ) {
mov eax,edi
dec edi
cmp eax,dword ptr [nSrcLen]
jle zToBase64+98h (004016f0)
sub edi,dword ptr [nSrcLen]
inc edi
>> *(p--) = '=';
mov byte ptr [ecx],3Dh
dec ecx
dec edi
jne zToBase64+91h (004016e9)
pop edi
>> }
>> }
pop ebp
ret
=-=-=-=-=-=-=-=--==--=
I have to confess that some of what it's doing seems pretty obscure. It seems to prepare the AND masks in advance -- way in advance -- and to do various things in a rather non-sequential manner.
Looking at this, it did pop out at me that this sequence:
b1= BYTE((d & 0x00FC0000) >> 18);
b2= BYTE((d & 0x0003F000) >> 12);
b3= BYTE((d & 0x00000FC0) >> 6 );
b4= BYTE((d & 0x0000003F) >> 0 );
...is inefficient. The better way would be to populate b4, then shift d, then populate b3, then shift d, etc. That would use the same mask in the same bit positions. I think that may be what the compiler is doing... reorganizing my clumsy code into a lean, mean base64-encoding machine!
>>Your quote was from the part of the RFC that describes base32 encoding
I thought the text seemed different :)
I'll have to take some time to fully analyse the code the compiler has produced.. Compiler optimizations can obscure the code well away from the original source, especially where instructions have been interleaved (this lets the processor work on two instructions simultaneously without affecting the logic of the code) - that is one optimization that would be very difficult to do by hand! It would also account for the non-sequential manner, although I have only scanned the code..
A couple of cringy things, though: the use of the 'div' command - a slow command - and the use of the stack (beyond just setting up the stack frame)..
some of it looks like it's optimized for size..
push 3Fh
mov dword ptr [pszOutBuf],eax
lea edi,[eax+eax*2]
pop eax
why not just
mov dword ptr [pszOutBuf],eax
lea edi,[eax+eax*2]
mov eax, 3Fh
I would imagine that 'mov r32, imm32' is quicker than pushing and popping the stack!?
Anyway, it's that time again!
ASKER
You're right. That was set to minimize size. Optimized for speed, the main code looks like:
>> for ( int j=0; j< nSrcLen; j+=3 ) {
mov ebx,dword ptr [esp+0Ch]
push ebp
xor ebp,ebp
test ebx,ebx
jle zToBase64+88h (exit loop)
lea edx,[ebx+2]
mov eax,0AAAAAAABh
mul eax,edx
push esi
mov esi,dword ptr [esp+10h]
push edi
mov edi,edx
shr edi,1
lea ebp,[edi+edi*2]
>> d = *pSrc++; d <<= 8;
again:
xor eax,eax
>> d|= *pSrc++; d <<= 8;
xor edx,edx
mov al,byte ptr [esi]
mov dl,byte ptr [esi+1]
inc esi
shl eax,8
or eax,edx
inc esi
>> d|= *pSrc++;
xor edx,edx
mov dl,byte ptr [esi]
shl eax,8
or eax,edx
inc esi
>> b1= BYTE((d & 0x00FC0000) >> 18);
mov edx,eax
shr edx,12h
and edx,3Fh
>> b2= BYTE((d & 0x0003F000) >> 12);
>> b3= BYTE((d & 0x00000FC0) >> 6 );
>> b4= BYTE((d & 0x0000003F) >> 0 );
>> *p = abDigitLookup[ b1 ]; p++;
inc ecx
mov dl,byte ptr abDigitLookup[edx]
mov byte ptr [ecx-1],dl
mov edx,eax
shr edx,0Ch
and edx,3Fh
>> *p = abDigitLookup[ b2 ]; p++;
inc ecx
mov dl,byte ptr abDigitLookup[edx]
mov byte ptr [ecx-1],dl
mov edx,eax
shr edx,6
and edx,3Fh
and eax,3Fh
>> *p = abDigitLookup[ b3 ]; p++;
inc ecx
mov dl,byte ptr abDigitLookup[edx]
mov byte ptr [ecx-1],dl
>> *p = abDigitLookup[ b4 ]; p++;
mov al,byte ptr abDigitLookup[eax]
mov byte ptr [ecx],al
inc ecx
dec edi
jne zToBase64+27h (label "again")
pop edi
pop esi
=-=-=--==-=-=-=-=-
There are many differences, including the ones you mentioned. For instance, the digit lookup code:
optimized for size:
mov bl,byte ptr abDigitLookup[ebx]
mov byte ptr [ecx],bl
mov ebx,edx
shr ebx,0Ch
and ebx,eax ; must be smaller (not faster)
inc ecx
optimized for speed:
inc ecx
mov dl,byte ptr abDigitLookup[edx]
mov byte ptr [ecx-1],dl ; interesting! offset-1 after INC?
mov edx,eax
shr edx,0Ch
and edx,3Fh ; must be faster (not smaller)
=-=-=-=-=-=-=-=-=-=-=-=-=-
There is still a MUL (now instead of a DIV) in the lead-in code. One would think that the optimizer would avoid either one. I think I recall from a previous discussion that MUL and DIV are not as inefficient as they used to be back in the day (of programming the 8086 with your teeth :-)
Anyway, after changing the optimization, the timing for the C code is much faster (faster even than the "unrolled loops" ASM version!). See http:/Cplusplus/Q_21988706.html#17532671
-- Dan
I had a look at using the XCHG command to load the DWORD into a register in the correct byte order.. According to my tests, this is slower than having the three separate byte-loading commands!
I have re-written my code, since it was so easily beaten by your mystery code and Crypt32.dll - and now, seemingly, the standard C function! I don't think my manual attempts at optimizing my code will ever match those of the compiler!
Anyway, here is my new submission, ready for speed testing.. This one comes with the loops already unrolled...
--------------------------
void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
__asm{
mov ecx, len
mov esi, pSrc //bytes from source
src_byteLoop:
xor eax, eax
//read 3 bytes
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
//manipulate in edx bitset1
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//manipulate in edx bitset2
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//manipulate in edx bitset3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//manipulate in edx bitset4
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//done these bytes
add esi, 3
sub ecx, 3
cmp ecx, 3
jge src_byteLoop //still got src bytes
xor ebx, ebx //set to zero
cmp ecx, 0
jz finished
//need to pad out some extra bytes
xor eax, eax //zero
//read in 3 bytes regardless of junk data following pSrc
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
sub ecx, 3 //bytes just read
neg ecx //+ve inverse
mov esi, ecx //save how many bytes need padding
//as per the RFC, any padded bytes should be 0s
mov ebx, 0xFFFFFF
lea ecx, dword ptr[ecx*8+8] //calculate bitmask to shift
shl ebx, cl
and eax, ebx //mask out the junk bytes
//manipulate in edx byte 1
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//manipulate in edx byte 2
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//manipulate in edx byte 3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
//manipulate in edx byte 4
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
finished:
test esi, esi
jz end
//some bytes were padding, put them as =
padChars:
mov edi, pszOutBuf
sub edi, esi //move ptr back for num bytes to pad
mov byte ptr[edi], 0x3d //=
dec esi
jnz padChars
end:
}
}
--------------------------
I have moved away from my original approach of working with 1 bit at a time, which avoids a lot of instructions!
My local testing shows this code is approximately 2x faster than my original unrolled loop code :)
ASKER
Good work! That code is, indeed, over twice as fast as your previous code:
MB/sec Internal name // Description
-------- --------------------------
61.81 yToBase64 / craigwardman (ASM, unrolled loops)
131.68 y2ToBase64 / craigwardman (ASM, unrolled loops, later version)
But alas! It's still not as fast as the optimized C. Just checking the raw size of your code: it is 286 bytes vs 154 bytes for the optimized C.
One thing I noticed is that you are continually updating a memory variable, e.g.,
mov edi, pszOutBuf
mov byte ptr [edi], dl //put char in buffer
add pszOutBuf,1 //next buf
and that's something the compiler avoids... it just uses a register pointer:
mov byte ptr [ecx],dl
inc ecx
I think you can shave some time by avoiding all of those load/stores. Also, as I recall
ADD mem, 1
looks tight, but it is rather slow... My understanding is that (internally) the CPU must read the value from memory into a temporary register, add 1, then write the data back out -- and that can be a killer, since you need that sequence four times for every input group.
-- Dan
Ah right, thanks for that tip!
I will jig the registers around and see if I can free one up for this purpose; if I see a substantial improvement in my test results here (since they seem to agree with yours), I will re-post..
Craig.
in fact, there is a bug in the above code (near the end I set ebx to zero as a flag but then test esi - too much jigging!).. but I will fix it in the new version..
ok here it is!! damn CPUs should have more registers!
This version seems to be another 30-40% faster than my current fastest version:
--------------------------
void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
char* chr_table="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
__asm{
mov ecx, len
mov esi, pSrc //bytes from source
mov ebx, pszOutBuf
src_byteLoop:
xor eax, eax
//read 3 bytes
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
//manipulate in edx bitset1
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx bitset2
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx bitset3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx bitset4
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//done these bytes
add esi, 3
sub ecx, 3
cmp ecx, 3
jge src_byteLoop //still got src bytes
xor eax, eax //set to zero (pad count)
cmp ecx, 0
jz finished
//need to pad out some extra bytes
//read in 3 bytes regardless of junk data following pSrc (eax already zero from above)
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
sub ecx, 3 //bytes just read
neg ecx //+ve inverse
mov edx, ecx //save how many bytes need padding
//as per the RFC, any padded bytes should be 0s
mov esi, 0xFFFFFF
lea ecx, dword ptr[ecx*8+8] //calculate bitmask to shift
shl esi, cl
and eax, esi //mask out the junk bytes
mov ecx, edx //restore pad count
//manipulate in edx byte 1
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx byte 2
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx byte 3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx byte 3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
mov eax, ecx //'return' pad count
finished:
test eax, eax
jz end
//some bytes were padding, put them as =
sub ebx, eax //move ptr back for num bytes to pad
padChars:
mov byte ptr[ebx], 0x3d //=
inc ebx
dec eax
jnz padChars
end:
}
}
-------------------------- --
let me know how the MB/s goes!
Thanks..
This version seems to be another 30-40% faster than my current fastest version:
--------------------------
void ToBase64( BYTE* pSrc, char* pszOutBuf, int len )
{
char* chr_table = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
__asm{
mov ecx, len
mov esi, pSrc //bytes from source
mov ebx, pszOutBuf
src_byteLoop:
xor eax, eax
//read 3 bytes
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
//manipulate in edx bitset1
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx bitset2
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx bitset3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx bitset4
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//done these bytes
add esi, 3
sub ecx, 3
cmp ecx, 3
jge src_byteLoop //still got src bytes
xor eax, eax //set to zero (pad count)
cmp ecx, 0
jz finished
//need to pad out some extra bytes
//read in 3 bytes regardless of junk data following pSrc (eax already zeroed above)
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax, 16
mov ah, byte ptr[esi+2]
sub ecx, 3 //bytes just read
neg ecx //+ve inverse
mov edx, ecx //save how many bytes need padding
//as per the RFC, any padded bytes should be 0s
mov esi, 0xFFFFFF
lea ecx, dword ptr[ecx*8+8] //calculate bitmask to shift
shl esi, cl
and eax, esi //mask out the junk bytes
mov ecx, edx //restore pad count
//manipulate in edx byte 1
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx byte 2
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx byte 3
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
//manipulate in edx byte 4
mov edx, eax
shr edx, 26
//done first 6 bits
shl eax, 6
mov edi, chr_table
add edi, edx //set the pointer to correct char
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl //put char in buffer
inc ebx //next buf
mov eax, ecx //'return' pad count
finished:
test eax, eax
jz end
//some bytes were padding, put them as =
sub ebx, eax //move ptr back for num bytes to pad
padChars:
mov byte ptr[ebx], 0x3d //=
inc ebx
dec eax
jnz padChars
end:
}
}
--------------------------
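For readers following along in C, here is a plain-C rendering of the same pack-and-shift algorithm (ToBase64_ref is our own reference sketch, not one of the timed submissions): pack three source bytes into the top 24 bits of a dword, then peel off four 6-bit indices exactly as the asm's shl-6/shr-26 pairing does.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Plain-C sketch of the asm algorithm above, for reference. */
static const char chr_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void ToBase64_ref(const uint8_t *pSrc, char *pszOutBuf, int len)
{
    char *out = pszOutBuf;
    int i, j;
    for (i = 0; i + 3 <= len; i += 3) {          /* whole 3-byte groups */
        uint32_t v = ((uint32_t)pSrc[i]   << 24) |
                     ((uint32_t)pSrc[i+1] << 16) |
                     ((uint32_t)pSrc[i+2] << 8);
        for (j = 0; j < 4; j++) {
            *out++ = chr_table[v >> 26];         /* top 6 bits -> digit */
            v <<= 6;                             /* queue up next 6     */
        }
    }
    if (i < len) {                               /* 1 or 2 bytes left   */
        int pad = 3 - (len - i);                 /* 2 or 1 '=' signs    */
        uint32_t v = (uint32_t)pSrc[i] << 24;
        if (len - i == 2)
            v |= (uint32_t)pSrc[i+1] << 16;
        for (j = 0; j < 4 - pad; j++) {
            *out++ = chr_table[v >> 26];
            v <<= 6;
        }
        for (j = 0; j < pad; j++)
            *out++ = '=';                        /* RFC padding         */
    }
    *out = '\0';
}
```

Feeding it the 0x07 0xFF 0x02 example from the question produces "B/8C", matching the worked bit diagram above.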
let me know how the MB/s goes!
Thanks..
ASKER
That does look more efficient.
It seems to me that the end-of-source logic could be done more elegantly (than basically repeating the logic of the main loop) but I haven't worked it out myself, so that is just a hunch -- anyway, that portion of the code is executed just once, so it probably does not have much effect on the overall performance.
I'll run the speed test tomorrow on my computer at the office.
-- Dan
I suspect that the final part could only be improved for size/elegance; I think the fastest way is to repeat the code, as this avoids the need for one if not two conditional jumps in the code..
although it's not very good coding practice as it would be harder to maintain, I could imagine the compiler doing this as an optimisation if it knew it could.. (i.e. if the repeated code were put into a function my guess is that the compiler would compile it inline - as I have)
ASKER
Good work! That code is 25% faster.
MB/sec Internal name // Description
-------- --------------------------
61.81 yToBase64 / craigwardman (ASM, unrolled loops)
131.68 y2ToBase64 / craigwardman (ASM, unrolled loops, later version)
175.34 y3ToBase64 / craigwardman (ASM, unrolled loops, later version -- fewer load/stores)
You're right -- the compiler will unroll loops (reusing previously-generated code) if it recognizes that doing so will avoid jumps and "special casing" within the all-important inner loop. I see nothing at all wrong with repeating the previous code to handle the special cases at the end.
=-=-=-=-=-=-=-=-=-=-=-=-=-
Have you been following the C++ version of this?
https://www.experts-exchange.com/questions/21988706/How-fast-is-your-C-C-Base64-Encoder.html
C++ Expert infinity08 employed a novel multi-table lookup algorithm to squeeze out some more throughput. As of now, his function and yours are outputting at very nearly the same rate.
Hi Dan,
Yes I am keeping an eye on the C thread too, although I don't think I will be able to contribute much to it!
Well having the fastest ASM version is ok for now :) I will be having a think about some other ways that I could speed up the algo..
There may be a couple more enhancements I can make to the code, but they are really superficial and I'm not sure how much difference they will make; I think a re-think is required for a major performance gain..
----
looking at some of your previous posts..
I'm not sure how fast the MUL is these days; I just know it and DIV are renowned for being slow, but that MUL is faster than DIV..
I suppose if the compiler can't find any other way to do the divide then it will have to settle for MUL (using the multiplicative inverse) as this
is still faster than DIV.. (unless the multiplicative inverse has to be worked out on the fly, but in the code it's hard coded as 0xAAAAAAAB)
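For anyone puzzling over that constant, here is a hedged C sketch of the multiplicative-inverse trick (div3 is our name): 0xAAAAAAAB is ceil(2^33 / 3), so a 64-bit multiply followed by a 33-bit right shift reproduces unsigned division by 3 for every 32-bit input.

```c
#include <stdint.h>
#include <assert.h>

/* Reciprocal-multiply replacement for DIV: x / 3 for unsigned 32-bit x.
 * 0xAAAAAAAB == ceil(2^33 / 3), so (x * 0xAAAAAAAB) >> 33 floors
 * exactly like integer division. */
static uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}
```

This is the same transformation the compiler applies when dividing the length counter by 3; the multiply-and-shift is far cheaper than a DIV on most x86 cores.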
also,
inc ecx
mov dl,byte ptr abDigitLookup[edx]
mov byte ptr [ecx-1],dl ; interesting! offset-1 after INC?
I'm not sure why it has done this, perhaps this was the best option for code interleaving..
I have noticed that when optimized for size the function does not use a stack frame, this frees up the EBP register.. which could be useful in my code - as I have a few possible ideas that may or may not improve performance.. (could you confirm that no stack frame is set up when you compile using my ASM version?)
Thanks.
ASKER CERTIFIED SOLUTION
correction to my original post of today's date...
>>I have noticed that when optimized for size the function does not use a stack frame,
should be
I have noticed that when optimized for SPEED the function does not use a stack frame,
ASKER
Just a note:
I tried using the XLATB opcode for the table lookup...
It seems ideal: Just point EBX to the start of the table (one time only), load a byte into AL, execute XLAT, and AL contains the looked-up value. Although the code ends up tight and clean, I saw no significant performance increase. My guess is that the microcode for that instruction is not particularly fast (I seem to recall verifying that from the Intel docs a while back).
Speed-testing your new code now...
hmm, an interesting mnemonic, I haven't seen this one used before..
When you say you saw no increase, was there any decrease? or do you think the CPU just translates the opcode into its sub parts..
i.e.
mov al, byte ptr[ebx+al]
..
I have noticed on several occasions now, how XCHG is slower at copying a DWORD to/from memory than manually copying it a byte at a time (using multiple mov instructions)..
It appears this may only be quicker when you just XCHG r32, r32 (as it is fewer instructions - might even resolve to the same in the CPU as doing it manually!) - so I don't see why Intel have implemented the memory option, as there is a significant loss of performance, which might be overlooked by some developers..
ASKER
Excellent! Another 10%+ speed increase!
MB/sec Internal name // Description
-------- --------------------------
61.81 yToBase64 / craigwardman (ASM, unrolled loops)
131.68 y2ToBase64 / craigwardman (ASM, later version)
175.34 y3ToBase64 / craigwardman (ASM, later version -- fewer load/stores)
196.92 y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])
I was about to suggest a PUSH/POP of EBP to give you another register, but it looks like you figured that out :-)
It's clear that using...
mov bl, byte ptr [edi+edx] // indirect addressing base+offset
mov byte ptr[ebp], bl
...is a big win over a sequence like..
mov edi, chr_table
add edi, edx
mov dl, byte ptr [edi]
mov byte ptr [ebx], dl
=-=-=-=-=-=-=-=-=-=
I didn't notice your sweeeeeet trick with the...
mov edx, eax
shl eax, 6
shr edx, 26
...sequence in your previous code... it took me a while to see what's going on there. That lets you avoid an AND operation (you will always have exactly 6 bits in your lookup table index). Shifts are *very* fast, so that should end up being at least as fast as the technique of using a 256-byte lookup table with duplicated values (as in http:/Cplusplus/Q_21988706.html#17554449).
Nyeece :-)
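For anyone decoding that trick, the mov/shl/shr pairing can be written out in C (split6 is our name for this sketch): copying the working dword and shifting the copy right by 26 leaves the top six bits already masked, while shifting the original left by 6 queues up the next group.

```c
#include <stdint.h>
#include <assert.h>

/* Extract four 6-bit lookup indices from a dword whose top 24 bits
 * hold three packed source bytes -- no AND instruction needed. */
static void split6(uint32_t v, unsigned idx[4])
{
    int i;
    for (i = 0; i < 4; i++) {
        idx[i] = v >> 26;  /* top 6 bits fall into [5:0] pre-masked */
        v <<= 6;           /* expose the next 6-bit group           */
    }
}
```

Running it on the question's 0x07 0xFF 0x02 example (packed as 0x07FF0200) yields the indices 1, 63, 60, 2 from the worked bit diagram.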
thanks for your compliments!
Just to clarify to anyone else reading the code, if you look at my first generation code, I was shifting out one bit at a time from EAX, then slowly shifting them left to keep the significance in EDX, I then realised I may as well copy the whole DWORD at once EAX->EDX, lose the 6 MSBs in the EAX, and keep only the 6 MSBs in EDX (but re-aligned to the correct significance), which of course is achieved by shifting right 26 times...
that 196.92 is slower than expected.. My tests here were into the 230MB/s range on my 1.7GHz AMD Sempron - was that an average, or a first run?
using the pasted copy of your code for testing:
It took 4265 ms to process 1000 MB of data
That is 234.47 MB per second
Press any key to continue
But I suppose there could be other factors involved and all code submitted will be affected..
ASKER
That timing was the best of three runs.
In all the tests, I've found that the second and third are usually a few MB/s faster than the first, but usually only 1 or 2 MB/s faster. I sometimes click the button a few more times as a sanity check (in case some high-priority background task suddenly kicked in) but I've never seen it vary by more than a few MB/s.
True, your times will be different from what I post, but if you use your computer to test one or two of the other submissions, you should see proportional values. Your computer is faster than mine, so higher values are expected. But if you measure your previous tries, those, too, will be faster. For instance, I calculated about a 10% speed increase between your last two submissions. I'd expect you to see (about) the same 10% difference -- even though both actual results would be higher than what I measured.
=-=-=-=-=-=-=-=-=
Regarding XCHG:
>> I have noticed on several occasions now, how XCHG is slower at copying a DWORD to/from memory than manually copying it a byte at a time
An XCHG between register and memory is bound to be slow... it means a read from, and a write to memory. One possible use for that opcode would be in a multi-tasking semaphore system -- it "atomically" obtains a value from memory and stores a new value there without any chance that some other process could interrupt the sequence.
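That semaphore use can be sketched in C11 (spin_lock/spin_unlock and lock_word are our illustrative names, nothing to do with the encoder): atomic_exchange on an int compiles down to exactly this kind of locked XCHG on x86.

```c
#include <stdatomic.h>
#include <assert.h>

/* Minimal test-and-set spinlock built on XCHG's atomic
 * read-modify-write behaviour.  0 = free, 1 = held. */
static atomic_int lock_word;

static void spin_lock(void)
{
    /* Swap in 1; if the old value was already 1, another thread
     * holds the lock, so keep spinning until we swap against a 0. */
    while (atomic_exchange(&lock_word, 1) == 1)
        ;
}

static void spin_unlock(void)
{
    atomic_store(&lock_word, 0);  /* release */
}
```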
What I was recommending with a register-to-register XCHG was something like
mov BX, WORD PTR [ESI] ; bytes in wrong order
xchg BL,BH ; now in correct order..
One thing I've experimented with is the BSWAP opcode. I tried replacing a sequence like:
mov bh, BYTE PTR [esi]
mov bl, BYTE PTR [esi+1]
shl ebx,8
mov bl, BYTE PTR [esi+2]
...with something like..
mov ebx, DWORD PTR [esi] ;; gets four bytes at once
bswap ebx // endian adjust
shr ebx, 8 // discard the extra and (inefficiently?) re-read it next pass
...with the idea of avoiding three separate one-byte reads by using a single four-byte read plus an adjustment to get the bytes into the desired order for subsequent processing.
My timing tests indicate that the two options work at nearly identical speeds. Most likely, after the first one-byte access, all three bytes are in high-speed cache, so the second and third bytes are instantly available -- nullifying any advantage. I know for certain that on a 4.77 MHz 8086 IBM PC, two separate 8-bit accesses took nearly twice as long as a single 16-bit access... but a LOT of what I used to know back in 1985 is only marginally useful information these days :-)
-- Dan
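The BSWAP experiment above can be mirrored in portable C (load3_bytes/load3_bswap are our names for the two strategies). The bswap version assumes a little-endian host, as the x86 asm does; the manual rotate is exactly what the single BSWAP opcode performs.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Three single-byte reads, as in the original asm. */
static uint32_t load3_bytes(const uint8_t *p)
{
    return ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | p[2];
}

/* One 4-byte read + byte swap + shift, as in the BSWAP experiment.
 * Assumes a little-endian host (true for x86). */
static uint32_t load3_bswap(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                 /* reads one extra byte, like the asm */
    v = ((v & 0x000000FFu) << 24) |   /* manual byte swap == BSWAP          */
        ((v & 0x0000FF00u) <<  8) |
        ((v & 0x00FF0000u) >>  8) |
        ((v & 0xFF000000u) >> 24);
    return v >> 8;                    /* discard the extra low byte         */
}
```

Both produce the same 24-bit value; as noted above, cache effects mean the timing difference between them is small in practice.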
hehe, well I was 1 year old in 1985 so I'm sure you have more useful info than me :)
I have run the speed test with all of my versions, here are the results:
ver1 - rolled loops - 46.18 MB per second
ver1.2 - unrolled loops - 52.03 MB per second
ver2 - ready unrolled - 112.47 MB per second
ver2.2 - no mem access - 163.69 MB per second
ver2.3 - EBP version - 231.91 MB per second
As you can see, there seems to be a bigger jump at my end; I wonder if there is something specific to my CPU/mobo that it likes my register only (ebp version) code so much ...
ASKER
Hey! Making me feel OLD will NOT improve your timings <G>
>> i wonder if there is something specific to my CPU/mobo...
That's a reasonable assumption. Things like DDR RAM and front-side bus speed would probably have a proportionally greater impact as you eliminate more memory accesses, while raw opcode count would be less important.
I'm also wary of the timings for another reason... I might have more or fewer simultaneous tasks time-sharing the CPU. If the other tasks do a lot of work, they would tend to overwrite the L2/L1 cache, forcing a reload when my timeslice comes back around. Also, the faster the code runs, the fewer the number of chances it has to be interrupted by other tasks. So there could be an accelerating upward curve, where even a slight efficiency gain yields significant throughput increase...
Yes that is true..
I suppose it really depends on the architecture and the maximum performance of the PC you run the code on now, rather than the code itself..
For example, the Infinity08_lookup_b that he measured > 300MB/s I have tested on my machine and got an average of 187MB/s.. Again my processor is AMD so this appears to be entirely dependent on architecture..
On reading the C thread, it seems as though you have also looked into this and got some confirmative (is that a word?) results...
Hey Dan, are you still taking submissions?
ASKER
Absolutely!
I think that there are a lot of avenues that can still be explored. I hope that this is as much (or more of) a learning experience as a contest. Even little changes can have a measurable effect. I don't think we've come close to wringing this thing out :-)
I have some new ideas myself, but I've been waiting for some more Experts to join in!
I'm just going to start where craigwardman has left off and try to optimize his code further. I'll hopefully have something by monday.
Brian
Hi Dan,
I've finally got around to having a tinker with this.
I've got a working algorithm that should be quite quick when converted to assembler. Unfortunately my assembler is rusty enough to make this a very slow process. Meanwhile here's the algorithm.
// Includes.
#include <stdio.h>
// The digits of the number.
char * digits = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
void PrintTable(int * c, int * carry, bool cAsChars)
{
int i;
// The characters.
if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
printf ("\n");
// The carries.
for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
printf ("\n");
}
void GenerateTables ( void )
{
int i;
int c [256];
int carry [256];
// I want tables of character and carry for each byte of the three.
/*
The carry from the previous calculation will be added to whatever is in
the c array to lookup in the digits array.
*/
// Table for first byte.
// Top 6 bits is character.
// Bottom 2 is carry to top 2 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>2;
carry[i] = (i & 0x03) << 4;
}
PrintTable(c,carry,true);
// Table for second byte.
// Top 4 bits is character.
// Bottom 4 is carry to top 4 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>4;
carry[i] = (i & 0x0f) << 2;
}
PrintTable(c,carry,false);
// Table for third byte.
// Top 2 bits is character.
// Bottom 6 is fourth character.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>6;
carry[i] = (i & 0x3f);
}
PrintTable(c,carry,false);
}
// The tables generated by the above code.
//int c0 [] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
int carry0 [] = {0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
int c1 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
int carry1 [] = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
int c2 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
int carry2 [] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
// Here for debugging.
char output [1024];
void ToBase64 (unsigned char * data, int bytes, char * s)
{
int remainder [] = {0,2,3};
// How many chars to generate. 4 for every 3 bytes plus a remainder.
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
while ( chars > 0 )
{
int carry = 0;
// First character.
*s++ = c0[*data];
carry = carry0[*data];
data += 1; bytes -= 1; chars -= 1;
// Second character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c1[*data]];
carry = carry1[*data];
data += 1; bytes -= 1;
} else {
// We must consume the bottom two bits of the first byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Third character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c2[*data]];
carry = carry2[*data];
data += 1; bytes -= 1;
} else {
// Consume the carry from the last byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Fourth character.
if ( chars > 0 )
{
*s = digits[carry];
chars -= 1;
} else {
*s = '=';
}
s += 1;
}
}
void atest ( void )
{
unsigned char test [] = "12345678";
//unsigned char test [] = {0x00};
// Generate my tables.
//GenerateTables();
memset(output,0,sizeof(out put));
// Convert it.
//ToBase64asm(test,sizeof( test)-1,ou tput);
ToBase64(test,sizeof(test) -1,output) ;
// Print it.
printf("%s\n",output);
}
Given enough registers this would be pretty optimal I believe. However, register juggling seems to be impeding the implementation. Join in if you wish.
Paul
PS: I am already considering holding *data in a register but I am short as it is.
I've finally got around to having a tinker with this.
I've got a working algorithm that should be quite quick when converted to assembler. Unfortunately my assembler is rusty enough to make this a very slow process. Meanwhile here's the algorithm.
// Includes.
#include <stdio.h>
// The digits of the number.
char * digits = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
void PrintTable(int * c, int * carry, bool cAsChars)
{
int i;
// The characters.
if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
printf ("\n");
// The carries.
for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
printf ("\n");
}
void GenerateTables ( void )
{
int i;
int c [256];
int carry [256];
// I want tables of character and carry for each byte of the three.
/*
The carry from the previous calculation will be added to whatever is in
the c array to lookup in the digits array.
*/
// Table for first byte.
// Top 6 bits is character.
// Bottom 2 is carry to top 2 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>2;
carry[i] = (i & 0x03) << 4;
}
PrintTable(c,carry,true);
// Table for second byte.
// Top 4 bits is character.
// Bottom 4 is carry to top 4 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>4;
carry[i] = (i & 0x0f) << 2;
}
PrintTable(c,carry,false);
// Table for third byte.
// Top 2 bits is character.
// Bottom 6 is fourth character.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>6;
carry[i] = (i & 0x3f);
}
PrintTable(c,carry,false);
}
// The tables generated by the above code.
//int c0 [] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
int carry0 [] = {0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
int c1 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
int carry1 [] = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
int c2 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
int carry2 [] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
// Here for debugging.
char output [1024];
void ToBase64 (unsigned char * data, int bytes, char * s)
{
int remainder [] = {0,2,3};
// How many chars to generate. 4 for every 3 bytes plus a remainder.
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
while ( chars > 0 )
{
int carry = 0;
// First character.
*s++ = c0[*data];
carry = carry0[*data];
data += 1; bytes -= 1; chars -= 1;
// Second character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c1[*data]];
carry = carry1[*data];
data += 1; bytes -= 1;
} else {
// We must consume the bottom two bits of the first byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Third character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c2[*data]];
carry = carry2[*data];
data += 1; bytes -= 1;
} else {
// Consume the carry from the last byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Fourth character.
if ( chars > 0 )
{
*s = digits[carry];
chars -= 1;
} else {
*s = '=';
}
s += 1;
}
}
void atest ( void )
{
unsigned char test [] = "12345678";
//unsigned char test [] = {0x00};
// Generate my tables.
//GenerateTables();
memset(output,0,sizeof(output));
// Convert it.
//ToBase64asm(test,sizeof(test)-1,output);
ToBase64(test,sizeof(test)-1,output);
// Print it.
printf("%s\n",output);
}
Given enough registers this would be pretty optimal I believe. However, register juggling seems to be impeding the implementation. Join in if you wish.
Paul
PS: I am already considering holding *data in a register but I am short as it is.
Hi DanRollins,
How about this? Just slap it in the above code and call it.
void ToBase64asm(unsigned char * data, int bytes, char * s)
{
// Insignificant bit, easier done in C.
char remainder [] = {0,2,3};
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
// Safety harness.
if ( bytes == 0 ) return;
if ( bytes > 0xff ) return; // Cover this with an outer loop later.
_asm {
// NOTE: I am assuming ds == es!
// This can be catered for easily if not the case.
// si = data would be nice but I need si for table lookups.
mov esi,data
// di = s
mov edi,s
// bh = bytes
mov ebx,bytes
mov bh,bl
// cx = chars
mov ecx,chars
charsLoop:
// bl = carry
// edx spare
// First character.
//*s++ = c0[*data];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c0[esi]
stosb
//carry = carry0[*data];
mov bl,carry0[esi]
mov esi,edx // si = data
//data += 1; bytes -= 1; chars -= 1;
inc esi
dec bh
dec cx
// Second character.
//if ( chars > 0 )
//{
jcxz secondIsEquals
// chars -= 1;
dec cx;
// if ( bytes > 0 )
// {
test bh,bh
jz secondBytesIsZero
// *s = digits[carry+c1[*data]];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c1[esi]
add al,bl
// carry = carry1[*data];
mov bl,carry1[esi]
//continued *s = digits[carry+c1[*data]];
mov esi,eax // si = carry+c1[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec bh
jmp doneSecond
secondBytesIsZero:
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneSecond
secondIsEquals:
mov al,'='
stosb
doneSecond:
// Third character.
//if ( chars > 0 )
//{
jcxz thirdIsEquals
// chars -= 1;
dec cx;
// if ( bytes > 0 )
// {
test bh,bh
jz thirdBytesIsZero
// *s = digits[carry+c2[*data]];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c2[esi]
add al,bl
// carry = carry2[*data];
mov bl,carry2[esi]
//continued *s = digits[carry+c2[*data]];
mov esi,eax // si = carry+c2[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec bh
jmp doneThird
thirdBytesIsZero:
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneThird
thirdIsEquals:
mov al,'='
stosb
doneThird:
// Fourth character.
//if ( chars > 0 )
//{
jcxz fourthIsEquals
// chars -= 1;
dec cx
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneFourth
fourthIsEquals:
// *s = '=';
mov al,'='
stosb
doneFourth:
// Next char.
//loop charsLoop
jcxz done
jmp charsLoop
done:
}
}
How does that measure up? Some of the lookups could be calculated to improve things but I think the algorithm is sound.
Paul
Hi DanRollins,
Here's a polished up version.
// Includes.
#include <stdio.h>
#include <conio.h>
// The digits of the number.
char digits [] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
void PrintTable(int * c, int * carry, bool cAsChars)
{
int i;
// The characters.
if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
printf ("\n");
// The carries.
for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
printf ("\n");
}
void GenerateTables ( void )
{
int i;
int c [256];
int carry [256];
// I want tables of character and carry for each byte of the three.
/*
The carry from the previous calculation will be added to whatever is in
the c array to lookup in the digits array.
*/
// Table for first byte.
// Top 6 bits is character.
// Bottom 2 is carry to top 2 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>2;
carry[i] = (i & 0x03) << 4;
}
PrintTable(c,carry,true);
// Table for second byte.
// Top 4 bits is character.
// Bottom 4 is carry to top 4 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>4;
carry[i] = (i & 0x0f) << 2;
}
PrintTable(c,carry,false);
// Table for third byte.
// Top 2 bits is character.
// Bottom 6 is fourth character.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>6;
carry[i] = (i & 0x3f);
}
PrintTable(c,carry,false);
}
// The tables generated by the above code.
//int c0 [] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
char carry0 [] = {0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
char c1 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
char carry1 [] = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
char c2 [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
char carry2 [] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
// Here for debugging.
char output [1024];
void ToBase64 (unsigned char * data, int bytes, char * s)
{
int remainder [] = {0,2,3};
// How many chars to generate. 4 for every 3 bytes plus a remainder.
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
while ( chars > 0 )
{
int carry = 0;
// First character.
*s++ = c0[*data];
carry = carry0[*data];
data += 1; bytes -= 1; chars -= 1;
// Second character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c1[*data]];
carry = carry1[*data];
data += 1; bytes -= 1;
} else {
// We must consume the bottom two bits of the first byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Third character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c2[*data]];
carry = carry2[*data];
data += 1; bytes -= 1;
} else {
// Consume the carry from the last byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Fourth character.
if ( chars > 0 )
{
*s = digits[carry];
chars -= 1;
} else {
*s = '=';
}
s += 1;
}
}
void ToBase64asm(unsigned char * data, int bytes, char * s)
{
// Insignificant bit, easier done in C.
char remainder [] = {0,2,3};
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
// Safety harness.
if ( bytes == 0 ) return;
if ( bytes > 0xff ) return; // Cover this with an outer loop later.
_asm {
// NOTE: I am assuming ds == es!
// This can be catered for easily if not the case.
// si = data would be nice but I need si for table lookups.
mov esi,data
// di = s
mov edi,s
// bh = bytes
mov ebx,bytes
mov bh,bl
// cx = chars
mov ecx,chars
charsLoop:
// bl = carry
// edx spare
// First character.
//*s++ = c0[*data];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c0[esi]
stosb
//carry = carry0[*data];
mov bl,carry0[esi]
mov esi,edx // si = data
//data += 1; bytes -= 1; chars -= 1;
inc esi
dec bh
dec ecx
// Second character.
//if ( chars > 0 )
//{
jcxz secondIsEquals
// chars -= 1;
dec ecx;
// if ( bytes > 0 )
// {
test bh,bh
jz secondBytesIsZero
// *s = digits[carry+c1[*data]];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c1[esi]
add al,bl
// carry = carry1[*data];
mov bl,carry1[esi]
//continued *s = digits[carry+c1[*data]];
mov esi,eax // si = carry+c1[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec bh
jmp doneSecond
secondBytesIsZero:
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneSecond
secondIsEquals:
mov al,'='
stosb
doneSecond:
// Third character.
//if ( chars > 0 )
//{
jcxz thirdIsEquals
// chars -= 1;
dec ecx;
// if ( bytes > 0 )
// {
test bh,bh
jz thirdBytesIsZero
// *s = digits[carry+c2[*data]];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c2[esi]
add al,bl
// carry = carry2[*data];
mov bl,carry2[esi]
//continued *s = digits[carry+c2[*data]];
mov esi,eax // si = carry+c2[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec bh
jmp doneThird
thirdBytesIsZero:
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneThird
thirdIsEquals:
mov al,'='
stosb
doneThird:
// Fourth character.
//if ( chars > 0 )
//{
jcxz fourthIsEquals
// chars -= 1;
dec ecx
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneFourth
fourthIsEquals:
// *s = '=';
mov al,'='
stosb
doneFourth:
// Next char.
//loop charsLoop
jcxz done
jmp charsLoop
done:
}
}
void atest ( void )
{
unsigned char test [] = "1234567";
// Generate my tables.
//GenerateTables();
// Convert it.
memset(output,0,sizeof(output)); ToBase64(test,sizeof(test)-1,output); printf("%s\n",output);
memset(output,0,sizeof(output)); ToBase64asm(test,sizeof(test)-1,output); printf("%s\n",output);
_getch();
}
I've often found that it is worth the attempt to pre-calculate as much as possible. However, in this case it may end up being significantly less efficient. I would be interested in the outcome.
Note the limitation of a maximum of 255 bytes to encode. A simple outer loop would be needed to process larger buffers. Each chunk must be a multiple of 3 bytes so that only the final chunk can produce '=' padding, and 255 bytes per chunk (85 groups of 3) would probably be the most efficient.
Paul
ASKER
I'll test it on Tuesday (when I have access to the PC I've been using) so the timings will be comparable.
>>Note the limitation of a maximum of 255 bytes to encode....
The test harness creates a 1MB buffer and calls the encoding function 1000 times... I haven't had time to look at your submission yet, so forgive me if I'm misunderstanding, but are you saying that your function will not encode the entire 1MB buffer?
-- Dan
Hi Dan,
Here's a testable version with an outer loop that handles the 255 byte limitation:
// Includes.
#include <stdio.h>
#include <conio.h>
// The digits of the number.
char digits [] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
void PrintTable(int * c, int * carry, bool cAsChars)
{
int i;
// The characters.
if ( cAsChars ) for ( i = 0; i < 256; i++ ) printf("%c", digits[c[i]]);
else for ( i = 0; i < 256; i++ ) printf("%d,", c[i]);
printf ("\n");
// The carries.
for ( i = 0; i < 256; i++ ) printf("%d,", carry[i]);
printf ("\n");
}
void GenerateTables ( void )
{
int i;
int c [256];
int carry [256];
// I want tables of character and carry for each byte of the three.
/*
The carry from the previous calculation will be added to whatever is in
the c array to lookup in the digits array.
*/
// Table for first byte.
// Top 6 bits is character.
// Bottom 2 is carry to top 2 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>2;
carry[i] = (i & 0x03) << 4;
}
PrintTable(c,carry,true);
// Table for second byte.
// Top 4 bits is character.
// Bottom 4 is carry to top 4 of next 6.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>4;
carry[i] = (i & 0x0f) << 2;
}
PrintTable(c,carry,false);
// Table for third byte.
// Top 2 bits is character.
// Bottom 6 is fourth character.
for ( i = 0; i < 256; i++ )
{
c[i] = i>>6;
carry[i] = (i & 0x3f);
}
PrintTable(c,carry,false);
}
// The tables generated by the above code.
//int c0 [] = {
//  0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,
//  4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,
//  8,8,8,8,9,9,9,9,10,10,10,10,11,11,11,11,
//  12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,
//  16,16,16,16,17,17,17,17,18,18,18,18,19,19,19,19,
//  20,20,20,20,21,21,21,21,22,22,22,22,23,23,23,23,
//  24,24,24,24,25,25,25,25,26,26,26,26,27,27,27,27,
//  28,28,28,28,29,29,29,29,30,30,30,30,31,31,31,31,
//  32,32,32,32,33,33,33,33,34,34,34,34,35,35,35,35,
//  36,36,36,36,37,37,37,37,38,38,38,38,39,39,39,39,
//  40,40,40,40,41,41,41,41,42,42,42,42,43,43,43,43,
//  44,44,44,44,45,45,45,45,46,46,46,46,47,47,47,47,
//  48,48,48,48,49,49,49,49,50,50,50,50,51,51,51,51,
//  52,52,52,52,53,53,53,53,54,54,54,54,55,55,55,55,
//  56,56,56,56,57,57,57,57,58,58,58,58,59,59,59,59,
//  60,60,60,60,61,61,61,61,62,62,62,62,63,63,63,63};
char c0 [] = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZaaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnooooppppqqqqrrrrssssttttuuuuvvvvwwwwxxxxyyyyzzzz0000111122223333444455556666777788889999++++////";
char carry0 [] = {
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48,
    0,16,32,48,0,16,32,48,0,16,32,48,0,16,32,48};
char c1 [] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
    6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
    9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
    10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,
    11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,
    12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,
    13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,
    14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,
    15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15};
char carry1 [] = {
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60};
char c2 [] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
char carry2 [] = {
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
    16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
    32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
    48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
    16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
    32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
    48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
    16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
    32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
    48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
    16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
    32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
    48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
// Here for debugging.
char output [(1024*1024)/3*4+1];
void ToBase64 (unsigned char * data, int bytes, char * s)
{
int remainder [] = {0,2,3};
// How many chars to generate. 4 for every 3 bytes plus a remainder.
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
while ( chars > 0 )
{
int carry = 0;
// First character.
*s++ = c0[*data];
carry = carry0[*data];
data += 1; bytes -= 1; chars -= 1;
// Second character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c1[*data]];
carry = carry1[*data];
data += 1; bytes -= 1;
} else {
// We must consume the bottom two bits of the first byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Third character.
if ( chars > 0 )
{
if ( bytes > 0 )
{
*s = digits[carry+c2[*data]];
carry = carry2[*data];
data += 1; bytes -= 1;
} else {
// Consume the carry from the last byte.
*s = digits[carry];
}
chars -= 1;
} else {
*s = '=';
}
s += 1;
// Fourth character.
if ( chars > 0 )
{
*s = digits[carry];
chars -= 1;
} else {
*s = '=';
}
s += 1;
}
}
void _ToBase64asm(unsigned char * data, int bytes, char * s, int chars)
{
_asm {
// NOTE: I am assuming ds == es!
// This can be catered for easily if not the case.
// si = data would be nice but I need si for table lookups.
mov esi,data
// di = s
mov edi,s
// bh = bytes
mov ebx,bytes
mov bh,bl
// cx = chars
mov ecx,chars
charsLoop:
// bl = carry
// edx spare
// First character.
//*s++ = c0[*data];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c0[esi]
stosb
//carry = carry0[*data];
mov bl,carry0[esi]
mov esi,edx // si = data
//data += 1; bytes -= 1; chars -= 1;
inc esi
dec bh
dec ecx
// Second character.
//if ( chars > 0 )
//{
jcxz secondIsEquals
// chars -= 1;
dec ecx;
// if ( bytes > 0 )
// {
test bh,bh
jz secondBytesIsZero
// *s = digits[carry+c1[*data]];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c1[esi]
add al,bl
// carry = carry1[*data];
mov bl,carry1[esi]
//continued *s = digits[carry+c1[*data]];
mov esi,eax // si = carry+c1[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec bh
jmp doneSecond
secondBytesIsZero:
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneSecond
secondIsEquals:
mov al,'='
stosb
doneSecond:
// Third character.
//if ( chars > 0 )
//{
jcxz thirdIsEquals
// chars -= 1;
dec ecx;
// if ( bytes > 0 )
// {
test bh,bh
jz thirdBytesIsZero
// *s = digits[carry+c2[*data]];
xor eax,eax
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c2[esi]
add al,bl
// carry = carry2[*data];
mov bl,carry2[esi]
//continued *s = digits[carry+c2[*data]];
mov esi,eax // si = carry+c2[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec bh
jmp doneThird
thirdBytesIsZero:
// *s = digits[carry];
mov edx,esi
xor ax,ax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneThird
thirdIsEquals:
mov al,'='
stosb
doneThird:
// Fourth character.
//if ( chars > 0 )
//{
jcxz fourthIsEquals
// chars -= 1;
dec ecx
// *s = digits[carry];
mov edx,esi
xor eax,eax
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
jmp doneFourth
fourthIsEquals:
// *s = '=';
mov al,'='
stosb
doneFourth:
// Next char.
//loop charsLoop
jcxz done
jmp charsLoop
done:
}
}
void ToBase64asm(unsigned char * data, int bytes, char * s)
{
// Insignificant bit, easier done in C.
char remainder [] = {0,2,3};
int chars = ((bytes / 3) * 4) + remainder[bytes%3];
// Safety harness.
if ( bytes == 0 ) return;
// Do all the 255byte blocks.
while ( bytes > 0xff )
{
_ToBase64asm(data, 0xff, s, (0xff / 3) * 4);
data += 0xff;
bytes -= 0xff;
s += (0xff / 3) * 4;
chars -= (0xff / 3) * 4;
}
// And the last block.
if ( bytes > 0 ) _ToBase64asm(data, bytes, s, chars);
}
void atest ( void )
{
unsigned char test [] = "1234567";
unsigned char * test2 = (unsigned char *)malloc(1024 * 1024);
// Generate my tables.
//GenerateTables();
// Convert it.
//memset(output,0,sizeof(output)); ToBase64(test2,1024 * 1024,output); printf("%s\n",output);
memset(output,0,sizeof(output)); ToBase64asm(test2,1024 * 1024,output); printf("%s\n",output);
memset(output,0,sizeof(output)); ToBase64(test,sizeof(test)-1,output); printf("%s\n",output);
memset(output,0,sizeof(output)); ToBase64asm(test,sizeof(test)-1,output); printf("%s\n",output);
_getch();
}
PS: Nice to see your pic on this page again. :-)
Paul
ASKER
Hi Paul!
I finally grabbed a moment to look at your code. That's a clever use of lookup tables and (interestingly) avoids the use of binary shift operations during the main loop -- preferring to precalculate the values and store them into lookup tables. Let me see if I understand the algorithm:
Examine the data three bytes at-a-time and output four bytes per triplet. For each triplet:
1) The first input byte can be looked up in a table that just repeats each output character four times (c0[]). That effectively ignores the lowest two bits (same output regardless of bits 0 and 1). That yields the first output.
2) Remember the part of the data that was ignored (you refer to it as "the carry"). This will be one of the values 0,16,32,48 (0x00,0x10,0x20,0x30), looked up from the carry0[] table.
3) Calculate the sum of the carry from step #2 plus a lookup from another table (c1[]). The c1[] table is 16 repeats of each possible value; that is, it outputs the same value regardless of the low four bits of the input. By adding that looked-up value to the saved carry, we have an index into the standard Base64 lookup table (digits[]). That yields the second output.
4) Remember the part of the second-byte data that was ignored in step #3 (the low four bits). Use that as an index into the carry1[] table (which contains 0,4,8,...,52,56,60, repeated 16 times). That is, this table is those low four bits shifted left by two.
5) Again calculate an index into the standard Base64 lookup table, from the combined value of the carry1[] lookup and a lookup that effectively uses the top two bits of the next input byte.
6) Repeat a similar procedure for the last byte.
That is: For each byte except the first, you lookup a value from a table that ignores certain bits and you combine that with a value from another table that ignores other bits, resulting in an index into the standard lookup table.
I'm impressed :-) This is one of the reasons I so enjoy these challenge questions... that is an approach that would not have occurred to me. But in retrospect, it makes perfect sense.
Now as to the timings:
MB/sec Internal name // Description
-------- ----------------------------------------------------------------
131.24 pcToBase64 / PaulCaswell's C (7 lookup tables)
140.66 pcToBase64asm / PaulCaswell's ASM conversion (7 lookup tables)
196.92 y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])
Alas, clever as it is, it is not as fast as some of the other techniques. Also, the C-to-ASM conversion was only slightly faster. I think part of the reason for that is the ASM version uses that 255-byte chunking technique and thus requires an outer loop. Each of the 1000 times through the test loop, the outer loop of your function needs to make about 4,000 function calls; that is a killer -- there is a lot of 'hidden' overhead in each CALL to _ToBase64asm().
Another thing that holds it back (I believe) is that at each input byte, you are checking for end-of-input and jumping around the end-of-input logic. The faster algorithms avoid millions of those compare-and-jump sequences by just bulling through and doing all three bytes in the all-important inner-loop logic and then doing a one-time adjustment at the end.
Don't be too disappointed! craigwardman's first submissions needed some tweaks here and there before he reached high-speed nirvana :-) Also, your code is already *way* faster than either of the Microsoft "submissions" CryptBinaryToStringA in Crypt32.dll or the CBase64Coder object :-)
-- Dan
Hi Dan,
Thanks for the feedback. As you can see this is just the first hit.
I take the point about the outer loop. That will go soon enough. It was a last-minute hack to get something functional once I noticed your test mechanism.
My next stage is to find a balance between shift/mask and table lookup. I am unsure how many cycles I can cut that way but it's certainly central to the algorithm so anything I can squeeze out of that will be a bonus.
That is an excellent idea to stick the '=' padding in at the end. I think that will be my first change.
The table technique goes back to my old 6502 days when there were no maths functions at all except add/sub. I have to admit that taking a whole K for tables would not have been acceptable, especially when you only have 64 of them. Precalculating is an excellent optimisation as I am sure you are aware. I am sure many of the 'old' algorithms would benefit a little.
I'll be back with the next iteration soon I hope.
Paul
That seems to have improved things a bit.
There may still be some small improvements to find, but the extra register seems to help.
void ToBase64asm(unsigned char * data, int bytes, char * s)
{
_asm {
// NOTE: I am assuming ds == es!
// This can be catered for easily if not the case.
// si = data would be nice but I need si for table lookups.
mov esi,data
// di = s
mov edi,s
// cx = bytes
mov ecx,bytes
// I will only use al. Empty all of ax.
xor eax,eax
bytesLoop:
// bl = carry
// edx spare, used for si push.
// First character.
//*s++ = c0[*data];
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c0[esi]
stosb
//carry = carry0[*data];
mov bl,carry0[esi]
mov esi,edx // si = data
//data += 1; bytes -= 1;
inc esi
dec ecx
//if ( bytes > 0 )
jcxz noMoreBytes
//{
// // Second character.
// *s++ = digits[carry+c1[*data]];
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c1[esi]
add al,bl
// carry = carry1[*data];
mov bl,carry1[esi]
//continued *s = digits[carry+c1[*data]];
mov esi,eax // si = carry+c1[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
dec ecx
// // Third character.
// if ( bytes > 0 )
jcxz noMoreBytes
// {
// *s++ = digits[carry+c2[*data]];
mov al,[esi]
mov edx,esi
mov esi,eax // si = *data
mov al,c2[esi]
add al,bl
// carry = carry2[*data];
mov bl,carry2[esi]
//continued *s = digits[carry+c2[*data]];
mov esi,eax // si = carry+c2[*data]
mov al,digits[esi];
stosb
mov esi,edx // si = data
// data += 1; bytes -= 1;
inc esi
//dec ecx // Done by loop
noMoreBytes:
//// Consume the last carry.
//*s++ = digits[carry];
mov edx,esi
mov al,bl
mov esi,eax
mov al,digits[esi]
stosb
mov esi,edx
// Next char.
jcxz done
loop bytesLoop
done:
}
}
I hope you've kept the previous run so I don't have to post all the tables again. Note that there is no outer loop now.
Paul
Not sure how the final figures will come out, but I've just tested craigwardman's final code against this one and I get:
Paul 298ms
Craig 3,603ms
for 1000 * 1MB.
I look forward to the final figures.
I wonder if it would be worth running your test on different cpu/mobo setups to see if any of them are particularly good in a certain environment.
I suspect mine makes more use of data cache than instruction cache.
Paul
ASKER
My first timing test on your last submission really threw me... it was so much faster than any other that I had to check it carefully. Alas, I did find a problem with it...
When you use the opcode,
JCXZ
you are only testing the 16-bit CX register. Since nSrcLen starts out at
1024*1024 = 1048576 = 0x00100000
then the first
DEC ECX
sets the 16-bit CX register to 0xFFFF and the subsequent
JCXZ
opcodes will end up exiting the loop after processing only about 65,000 characters (of the original 1MB). When I replaced the several JCXZ opcodes with
JECXZ
I got valid timings that are more in line with what I expected. Please verify my result, but I'm pretty sure that this is what happened.
Please note that I would have made the exact same mistake (coming from an 8088 background and having last done "actual production ASM" coding in about 1989 :-) and I had to look up the opcode in the Intel literature to be sure... so don't feel too bad :-)
BTW, the LOOP opcode works on ECX (unless the CPU is set to globally use 16-bit mode) -- I had to look that up because I was not at all certain.
With the changes you made, I did see a slight improvement over your previous submission.
MB/sec Internal name // Description
-------- ----------------------------------------------------------------
140.66 pcToBase64asm / PaulCaswell's ASM conversion (7 lookup tables)
142.17 pc2ToBase64asm / PaulCaswell's (7 lookup tables) later revision after JECXZ
196.92 y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])
=-=-=-=-=-=-=-=-=-=-=
>> I wonder if it would be worth running your test on different cpu/mobo setups...
>> I suspect mine makes more use of data cache than instruction cache.
In fact, it turns out that Genuine Intel processors do end up processing significantly higher volumes for what is purported to be "an equivalent" GHz rating. I show some results for running on a Celeron in the C++ version of this Q:
https://www.experts-exchange.com/questions/21988706/How-fast-is-your-C-C-Base64-Encoder.html#17581593
In that thread, we also talked a bit about the L1/L2 caching situation. The timing test, as designed, is certain to "thrash the cache" since it processes 1MB chunks while most caches are only 512K or less. In my next two posts, I'll illustrate that rather dramatically.
-- Dan
ASKER
This is my "mystery algorithm" -- it employs a 8096-byte lookup table... I look up a 16-bit output value for each 12 bits of input. One of the tricks of this is that the table has the byte-pairs reversed to avoid fix-up before storing the output
Another thing about this is that it reads four bytes at-a-time, but discards one of them for each 3-byte triplet. So it is reading a lot of bytes twice. I thought there would be a big improvlent when I used that technique, but for various reasons, it's not nuch faster (if any) than just reading the three bytes one-at-a-time.
I also used the "scaled indexed indirect adressing mode" which is a first for me... an opcode like
mov bx, WORD PTR Lookup12bits[edx*2]
...is faster than the two-opcode sequence...
shl edx,1
mov bx, WORD PTR Lookup12bits[edx]
and it also preserves the value of edx (though I don't use that feature).
Note that i'm "cheating" because I've omitted the end-of-input fixup that really needs to be done, but since that is an insignificant fraction of the overall time (conpared to processing the entire MB of data) I don't feel bad about omitting it.
//------------------------
// 8192-byte lookup table; not using MMX
// timed at 219.92 MB/s
//
//char Lookup12bits[ ((64*64)*2)+1 ]= // 8192+1
//"AAABACADAEAFAGAHAIAJAKALAMANAOAPAQARASATA ... A6A7A8A9A+A/"
//"BABBBCBDBEBFBGBHBIBJBKBLBMBNBOBPBQBRBSBTB ... B6B7B8B9B+B/"
//...
//
//"9A9B9C9D9E9F9G9H9I9J9K9L9M9N9O9P9Q9R9S9T9 ... 95969798999+9/"
//"+A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+ ... +5+6+7+8+9+++/"
//"/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/ ... /5/6/7/8/9/+//"
//;
//
void q5ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
_asm {
mov edi, pszOutBuf;
mov esi, pSrc;
mov ecx, esi;
add ecx, nSrcLen ;; all done when esi > ecx
again:
mov eax, DWORD PTR [esi]
bswap eax // big-endian adjust
shr eax,8 // discard byte n+3 (using only first 3)
mov edx,eax
and edx, 0x00000fff
mov bx, WORD PTR Lookup12bits[edx*2]
shl ebx,16 // save the first two bytes in bits 31-16
shr eax,12 // no need to mask here
mov bx, WORD PTR Lookup12bits[eax*2] // next two bytes in bits 15-0
mov DWORD PTR [edi], ebx
add edi, 4
add esi, 3
cmp esi, ecx
jle again
}
// end-of-input adjustment code omitted
}
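For readers who want to follow the algorithm without the inline ASM, here is a hedged, portable-C sketch of the same 12-bit pair-lookup idea. The names `build_lut12` and `ToBase64_lut12` are my own, not from the thread; the sketch writes the two characters per entry in natural order (so it skips the byte-swap trick), and it includes the end-of-input fixup that the ASM version omits:

```c
#include <string.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* 8192-byte table: two output characters for every 12-bit input value */
static char lut12[64 * 64 * 2];

static void build_lut12(void)
{
    for (int i = 0; i < 64 * 64; ++i) {
        lut12[i * 2]     = b64[i >> 6];   /* high 6 bits -> first char  */
        lut12[i * 2 + 1] = b64[i & 63];   /* low 6 bits  -> second char */
    }
}

/* Encode nSrcLen bytes; writes the padded result plus a NUL terminator. */
static void ToBase64_lut12(const unsigned char *pSrc, int nSrcLen, char *out)
{
    int i = 0;
    for (; i + 3 <= nSrcLen; i += 3) {
        unsigned v = (pSrc[i] << 16) | (pSrc[i + 1] << 8) | pSrc[i + 2];
        memcpy(out,     &lut12[(v >> 12)   * 2], 2);  /* top 12 bits */
        memcpy(out + 2, &lut12[(v & 0xfff) * 2], 2);  /* low 12 bits */
        out += 4;
    }
    /* end-of-input fixup: the part the ASM versions omit */
    int rem = nSrcLen - i;
    if (rem == 1) {
        unsigned v = pSrc[i] << 4;                 /* 8 bits -> 12, zero-pad */
        memcpy(out, &lut12[v * 2], 2);
        out[2] = '='; out[3] = '='; out += 4;
    } else if (rem == 2) {
        unsigned v = (pSrc[i] << 8) | pSrc[i + 1]; /* 16 bits */
        memcpy(out, &lut12[(v >> 4) * 2], 2);
        out[2] = b64[(v & 0xf) << 2];
        out[3] = '='; out += 4;
    }
    *out = '\0';
}
```

The loop body maps directly onto the ASM above: one 24-bit read, two 12-bit table lookups, one 4-byte store.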
ASKER
This is a variation of the above... It uses a 12-bit index into the 8192-byte table. The difference is that this code accumulates EIGHT bytes of output in an MMX register, and when it stores them to memory, it uses a special version of the MOVQ opcode that prevents the CPU from caching the output.
The MOVNTQ opcode is designed for "streaming output"... Normally, if you store a variable to memory, it is very likely that you will need to access it again soon -- so it's smart for the CPU to put it in the fastest cache. But in situations like this one, I'd rather not clutter the cache with data that I won't need to access any time soon. As best as I can tell, this lets the lookup table and (perhaps part of) the source data stay in L1 cache. The timing results show that changing just that one SINGLE opcode makes a significant difference:
MB/sec Internal name // Description
-------- --------------------------
140.66 pcToBase64asm / PaulCaswell's ASM conversion (7 lookup tables)
142.17 pc2ToBase64asm / PaulCaswell's (7 lookup tables) later revision after JECXZ
196.92 y4ToBase64 / craigwardman (ASM, later version -- uses bp and [edi+edx])
219.92 q5ToBase64 / DanRollins's "mystery algo" -- 8096-byte lookup table (no MMX)
219.92 q6ToBase64 / Dan's "mystery algo" -- accumulating 8 bytes to output in an MMX register
307.69 q6xToBase64 / same as q6ToBase64 but using MOVNTQ rather than MOVQ
=-=-=-=-=-=-=-=-=-=-=
//------------------------
// 8192-byte lookup table; using MMX for output (and cache management)
// timed at 307.69 MB/s
//
// PLUS, it collects 8 bytes into an MMX register for output
// Note that using MOVQ, it runs at "only" 219,
// but with MOVNTQ, it zooms in at 307
//
void q6ToBase64( BYTE* pSrc, int nSrcLen, char* pszOutBuf )
{
_asm {
EMMS
mov edi, pszOutBuf;
mov esi, pSrc;
mov ecx, esi;
add ecx, nSrcLen ;; all done when esi > ecx
again:
mov eax, DWORD PTR [esi]
bswap eax // endian adjust
shr eax,8 // discard byte n+3 (using only first 3)
mov edx,eax
and edx, 0x00000fff
mov bx, WORD PTR Lookup12bits[edx*2]
shl ebx,16
shr eax,12
mov bx, WORD PTR Lookup12bits[eax*2]
add esi, 3 // point to next three bytes
movd mm0, ebx
//------------------------ repeat above for next three bytes
mov eax, DWORD PTR [esi]
bswap eax
shr eax,8
mov edx,eax
and edx, 0x00000fff
mov bx, WORD PTR Lookup12bits[edx*2]
shl ebx,16
shr eax,12
mov bx, WORD PTR Lookup12bits[eax*2]
movd mm1, ebx
psllq mm1, 32 // shift into the high dword
por mm0, mm1
//movq [edi],mm0 // runs at 219 MB/s
//movntq [edi],mm0 // runs at 307 MB/s
_emit 0x0F
_emit 0xE7
_emit 0x07
add edi, 8
add esi, 3
cmp esi, ecx
jle again
}
// end-of-input adjustment code omitted
}
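To make the two-triplets-per-store structure easier to see without MMX, here is a hedged plain-C sketch (my own names, not code from the thread; it assumes a little-endian host, like the x86 code it mirrors, and omits tail handling just as the ASM does). It packs eight output characters into one 64-bit value and issues a single 8-byte store per iteration, which is where the ASM version uses MOVQ or MOVNTQ:

```c
#include <stdint.h>
#include <string.h>

static const char b64tab[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode one 3-byte triplet into 4 chars packed into a 32-bit value,
 * first character in the lowest byte (little-endian store order). */
static uint32_t pack4(const unsigned char *p)
{
    uint32_t v = (p[0] << 16) | (p[1] << 8) | p[2];
    return (uint32_t)b64tab[(v >> 18) & 63]
         | (uint32_t)b64tab[(v >> 12) & 63] << 8
         | (uint32_t)b64tab[(v >>  6) & 63] << 16
         | (uint32_t)b64tab[ v        & 63] << 24;
}

/* Process two triplets per iteration and issue ONE 8-byte store,
 * mirroring the q6 structure (MMX accumulate + single store). */
static void encode_pairs(const unsigned char *src, int len, char *dst)
{
    int i = 0;
    for (; i + 6 <= len; i += 6) {
        uint64_t q = (uint64_t)pack4(src + i)
                   | (uint64_t)pack4(src + i + 3) << 32;
        memcpy(dst, &q, 8);   /* the ASM version uses MOVNTQ here */
        dst += 8;
    }
    *dst = '\0';              /* tail handling omitted, as in the ASM */
}
```

The single wide store is the whole point: with MOVNTQ it bypasses the cache hierarchy, leaving L1 free for the lookup table and source data.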
Dan,
Wow! That's some amazing stuff. Pretty impressive performance too.
I have read in many places that there is a performance hit when reading on odd byte boundaries. I wonder if there is a way of keeping the fourth byte you read for the next time around. Perhaps read 4 bytes and then 2, preserving the fourth for the next pass. It would certainly make it fiddly, but your discovery that MOVNTQ makes such a significant difference suggests that the main performance bottleneck is memory I/O.
Thanks for the jecxz thing; I really couldn't understand why it was so much faster. :-(
Good Challenge! :-)
Paul
ASKER
Coming from an 8088 background and having written code for the Mac II (Motorola 68020), I, too, was worried about those odd-address accesses. But in some time trials, I saw no particular performance loss. I formed a theory as to why:
My theory is that accessing data that's in L1 cache is nearly as fast as accessing a register. If so, then once a cache line is filled (typically 32 bytes, I think), you can get to any byte -- regardless of its physical address -- without an odd-byte penalty.
In this app, we "stream" through the input... touching each byte sequentially. As soon as we access a byte, the group of bytes next to it gets cached. Since every byte in the 1MB sample will be accessed, there is no time lost to cache misses -- there would be exactly the same number of cache misses in either case.
It might be fun to look for evidence that the theory is correct... write a test that randomly accesses all over the place in, say, a 10 MB buffer. On the first test, take care that all WORD accesses start on even boundaries, and on the next test, make sure that they all start on odd boundaries. In that test, nearly every access would be a cache miss... I'd wager that the timing for even-byte reads would be similar to those for odd-byte reads.
A special case would be a test contrived to access WORDs in which each and every access was odd and also straddled a cache-line boundary. That would cause TWO cache misses on every access, and the performance would be terrible.
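A quick C sketch of such a test (entirely hypothetical naming; you'd wrap the call in a timer, which I omit here). It generates pseudo-random offsets forced to a chosen parity, so you can compare even-offset versus odd-offset WORD reads over a buffer much larger than the cache:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sum n 16-bit reads at pseudo-random offsets within buf, every offset
 * forced to the requested parity (odd = 1 for odd addresses). */
static uint32_t sum_words(const unsigned char *buf, size_t size,
                          int odd, int n)
{
    uint32_t seed = 12345, total = 0;
    for (int i = 0; i < n; ++i) {
        seed = seed * 1103515245u + 12345u;     /* cheap LCG */
        size_t off = ((size_t)seed % (size - 3)) & ~(size_t)1;
        off += odd ? 1 : 0;                     /* force even/odd parity */
        uint16_t w;
        memcpy(&w, buf + off, 2);               /* one WORD read */
        total += w;
    }
    return total;
}
```

Time `sum_words(buf, size, 0, n)` against `sum_words(buf, size, 1, n)` with a buffer several times the cache size; the theory predicts nearly identical numbers, since both variants miss the cache at the same rate.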
=-=-=-=-=-=-=-=
I also thought of optimizing the reads by grabbing 8-byte chunks into an MMX register. But I spent so many opcodes shuffling the bytes into the right order for processing that the results were no good. That technique might show promise using the sort of multi-table pre-calculated lookup techniques you used...
I read that the latest processors now support a PSHUFB opcode that could handle the endian rearranging all at once... Using SSE, I could read 128 bits (16 bytes, 5 source triplets) into a single SSE register and grind out 20 bytes of output at a time. :-)
My first idea was to use XLAT as well, though that may just be because I never had a chance to do so :).
I'm not feeling crazy enough to implement it right now (3 AM), but here's an idea:
How about a 16M look-up table?
Yea, it's not wise to talk assembly at this hour...
What I meant was a LUT with 2^24 entries, one for each byte triplet... With 4 bytes per entry, which would be... 2^24*4 = 16M*4 = 64M
I guess it comes down to how fast RAMs are these days...
ASKER
hi aib_42,
Thanks for joining in!
In the C++ version of this thread, I tried the 64MB lookup table (http:/Cplusplus/Q_21988706.html#17579860). It seemed like a worthwhile idea, but the performance was terrible. As best as I can tell, it is because the table can't remain in cache, so the time lost to cache misses far outweighs the advantage of having a tiny "code footprint."
Bits Bits Table
Input Output Size (Bytes)
6 8 64 (64)*1
12 16 8,192 (64*64)*2
18 24 786,432 (64*64*64)*3
24 32 67,108,864 (64*64*64*64)*4
So far, the 8192-byte lookup table mechanism (two output bytes for each possible pair of 6-bit values) seems to be the best. The next larger lookup table would be 786K, which might work on a CPU that had 1MB of L1 cache.
There might be a clever variation that uses two or more different smaller lookup tables (Infinity08 came up with one such system in the C++ thread), but IMO any system using a table larger than the cache will perform poorly.
-- Dan
Funny enough, I had to write a BASE64 encoding routine in C today for an SMTP client.
I didn't like any of the snippets I found on the web. Perhaps I should start a "how _readable_ is your base64 encoder" challenge? :)
This is how I did it:
loop on every byte of the source, keep track of the previous byte, too
keep a counter modulo 4, this will be called "cycle"
switch on 'cycle':
0 -> read the first 6 bits of the byte
1 -> read the last 2 bits of the previous byte | first 4 of the current
2 -> read the last 4 bits of the previous byte | first 2 of the current
3 -> read the last 6 bits of the previous byte, roll back the input one place
padding is easy, too, it's another switch on 'cycle'.
I think you can call this "de-optimization by loop rolling".
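That description translates almost line-for-line into C. Here's a hedged sketch of it (my own naming and code, not the actual routine from that SMTP client):

```c
#include <string.h>

static const char b64tab[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* One output character per loop pass; 'cycle' (mod 4) selects which bits
 * of the previous/current byte to use.  Cycle 3 consumes no new input,
 * so it rolls the input index back one place. */
static void readable_b64(const unsigned char *src, int len, char *dst)
{
    unsigned char prev = 0;
    int cycle = 0;
    for (int i = 0; i < len; ++i, cycle = (cycle + 1) & 3) {
        unsigned char b = src[i];
        switch (cycle) {
        case 0: *dst++ = b64tab[b >> 2];                          break;
        case 1: *dst++ = b64tab[((prev & 0x03) << 4) | (b >> 4)]; break;
        case 2: *dst++ = b64tab[((prev & 0x0f) << 2) | (b >> 6)]; break;
        case 3: *dst++ = b64tab[prev & 0x3f]; --i;                break;
        }
        prev = b;
    }
    switch (cycle) {   /* padding is another switch on 'cycle' */
    case 1: *dst++ = b64tab[(prev & 0x03) << 4]; *dst++ = '='; *dst++ = '='; break;
    case 2: *dst++ = b64tab[(prev & 0x0f) << 2]; *dst++ = '=';               break;
    case 3: *dst++ = b64tab[prev & 0x3f];                                    break;
    }
    *dst = '\0';
}
```

Slow, but you can audit every bit movement at a glance, which was the whole point of the exercise.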
ASKER
Well, I guess it *is* time to close this one up.
There were only two "official" submissions in the ASM category. By my measure, craigwardman's was somewhat faster than PaulCaswell's, so I sent more points his way. FWIW, all of the submissions were far faster than the generic utilities available from the standard Microsoft libraries :-)
My own submission http:#17711344 ended up fastest -- by using MMX opcodes and using some underhanded tweaks of the caching control...
The C++ version of this is here: http:/Cplusplus/Q_21988706.html and it includes a very interesting discussion about CPU (AMD vs Intel) and compiler differences, etc.
In the end, this thread was more about playing around with opcodes and learning a few things than it was a contest.
Thanks to all who participated!
-- Dan