DanRollins
asked on
How fast is your ASM hex converter?
Note [3/24/2002]: Continuing at https://www.experts-exchange.com/assembly/Q.20280946.html
Note [3/31/2002]: And then continued here: http:Q.20283475.html
Just throwing out the gauntlet:
Write an 80x86 routine that converts 16 bytes of data to hex and ASCII format. The ASCII 16 bytes should show . (period) for characters <32 or >127. Output format is 65 bytes:
XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX aaaaaaaaaaaaaaaa\0
//------------------------ ---------- ------
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm{
// your code here!
}
}
Prove that your code is the fastest on earth. Show that it is faster than similar code written in straight (optimized) C. We could run tests by converting a 10 MB buffer and timing the results.
Is anyone game?
Note [3/31/2002]: And then continued here: http:Q.20283475.html
Just throwing out the gauntlet:
Write an 80x86 routine that converts 16 bytes of data to hex and ASCII format. The ASCII 16 bytes should show . (period) for characters <32 or >127. Output format is 65 bytes:
XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX aaaaaaaaaaaaaaaa\0
//------------------------
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm{
// your code here!
}
}
Prove that your code is the fastest on earth. Show that it is faster than similar code written in straight (optimized) C. We could run tests by converting a 10 MB buffer and timing the results.
Is anyone game?
ASKER
Sure! It could be used in a debugger. I chose it because I'm petty sure I can dig up some binary-to-hex tricks that the C optimizer won't know about :-)
Here is a not-t0o-efficient C version of the target routine:
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
int j;
BYTE c, d;
for( j=0; j<16; j++ ) {
c= *pSrc++;
d= c >> 4; // get the hi nybble
if ( d < 10 ) *pszBuf++ = d+'0';
else *pszBuf++ = d+'0' + ('A' - ('9'+1)); // tip: this is d+0x37
d= c & 0x0f; // get the lo nybble
if ( d < 10 ) *pszBuf++ = d+'0';
else *pszBuf++ = d+'0' + ('A' - ('9'+1));
if ( j == 7 ) *pszBuf++ = '-'; // space betwee bytes
else *pszBuf++ = ' '; // dash betwee bytes
}
*pszBuf++ = ' '; // space before ascii
pSrc -= 16; // back to the start
for( j=0; j<16; j++ ) {
c= *pSrc++;
if ( c<32 || c> 127 ) {
c= '.';
}
*pszBuf++ = c;
}
*pszBuf= '\0';
}
Give it a go!
-- Dan
Here is a not-t0o-efficient C version of the target routine:
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
int j;
BYTE c, d;
for( j=0; j<16; j++ ) {
c= *pSrc++;
d= c >> 4; // get the hi nybble
if ( d < 10 ) *pszBuf++ = d+'0';
else *pszBuf++ = d+'0' + ('A' - ('9'+1)); // tip: this is d+0x37
d= c & 0x0f; // get the lo nybble
if ( d < 10 ) *pszBuf++ = d+'0';
else *pszBuf++ = d+'0' + ('A' - ('9'+1));
if ( j == 7 ) *pszBuf++ = '-'; // space betwee bytes
else *pszBuf++ = ' '; // dash betwee bytes
}
*pszBuf++ = ' '; // space before ascii
pSrc -= 16; // back to the start
for( j=0; j<16; j++ ) {
c= *pSrc++;
if ( c<32 || c> 127 ) {
c= '.';
}
*pszBuf++ = c;
}
*pszBuf= '\0';
}
Give it a go!
-- Dan
>>Prove that your code is the fastest on earth.
I'm not a bad assambly programmer and might try to write the fastest.
But how in gods name can you prove any code is the fastest other than not being able to find faster code. Wich could also mean you're bad at searching or you have bad competition.
Would it not be more easy to understand to write:
void HexLineOut(BYTE pSrc[16], char pszBuf[65])
I'm not a bad assambly programmer and might try to write the fastest.
But how in gods name can you prove any code is the fastest other than not being able to find faster code. Wich could also mean you're bad at searching or you have bad competition.
Would it not be more easy to understand to write:
void HexLineOut(BYTE pSrc[16], char pszBuf[65])
ASKER
OK, I'll up the ante...
You must prove that your code is the fastest in the known Universe, and all Not-Yet-Found Parallel Universes, and Whatever Higher Dimensions Nobody Has Ever Imagined, and in any Arbitrary Namespaces That Begin with Z or Feature the Word Omega.
Anything less and you don't win! Supporting evidence such as: "Well, my Mommy said it was really fast when she wasn't playing Quake" will be given more weight in the final judging than "I heard it say 'Zoom' a coupla times".
So get to work or you won't have a chance at the golden apple! The Judges' Decision is Familial!
-- Dan
You must prove that your code is the fastest in the known Universe, and all Not-Yet-Found Parallel Universes, and Whatever Higher Dimensions Nobody Has Ever Imagined, and in any Arbitrary Namespaces That Begin with Z or Feature the Word Omega.
Anything less and you don't win! Supporting evidence such as: "Well, my Mommy said it was really fast when she wasn't playing Quake" will be given more weight in the final judging than "I heard it say 'Zoom' a coupla times".
So get to work or you won't have a chance at the golden apple! The Judges' Decision is Familial!
-- Dan
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Smarter than what? who?
If you want to write an Eiffel version, then by all means, do it! It's hard to imagine that it could be faster than *well-written* hand-optimized ASM code, but perhaps it could make use of multiple processors or something.
-- Dan
If you want to write an Eiffel version, then by all means, do it! It's hard to imagine that it could be faster than *well-written* hand-optimized ASM code, but perhaps it could make use of multiple processors or something.
-- Dan
> Smarter than what? who?
Big blue against the people: computer-optimized c and opt code against the postings here of course :)
Let's keep it fair: one processor. But I guess you're right, especially with an (apparently) trivial task as this, how can a computer outperform a human being?
But I'm getting curious. Let's get to work.
Cheers
Big blue against the people: computer-optimized c and opt code against the postings here of course :)
Let's keep it fair: one processor. But I guess you're right, especially with an (apparently) trivial task as this, how can a computer outperform a human being?
But I'm getting curious. Let's get to work.
Cheers
ASKER
Here is some helper code that I am placing in a dialog-based app for testing purposes. So far, my best benchmark is around 20,000 characters per ms (a 450MHz P3), but I really haven't started with the hand-optimization yet.
BOOL FileToBuf( LPCSTR sFilename, BYTE* pBuf, int nMax )
{
int nLen= 0;
HFILE hFile= _lopen(sFilename, OF_READ );
if (hFile == HFILE_ERROR ) {
return( FALSE );
}
int nFileLen= _llseek( hFile, 0L, SEEK_END );
_llseek( hFile, 0L, SEEK_SET );
for (int j=0; j<nFileLen; j += 1024) {
if (nLen + 1024 > nMax ) {
break;
}
int nActual= _lread( hFile, pBuf, 1024 );
pBuf += nActual;
nLen += nActual;
}
_lclose( hFile );
return( nLen );
}
void CHexConvertDlg::DumpBuf( BYTE* p, int nLen, BOOL fDisplay /*=TRUE*/ )
{
char szBuf[68];
for (int j= 0; j< nLen; j += 16 ) {
HexLineOut( &p[j], szBuf );
if ( fDisplay ) {
szBuf[65]='\r'; szBuf[66]='\n'; szBuf[67]='\0';
m_ctlEdit.SetSel(-1,-1);
m_ctlEdit.ReplaceSel(szBuf );
m_ctlEdit.SendMessage(EM_S CROLLCARET ,0,0);
}
}
}
void CHexConvertDlg::OnButton1( )
{
UpdateData( TRUE ); // get the value of m_fShowOutput amd m_nIterCnt
BYTE* pBuf= new BYTE[800000];
int nFileLen= FileToBuf("c:\\temp\\test. txt", pBuf, 800000 );
nFileLen &= 0xfffffff0; // round to closest 16
int nTotalBytes= 0;
timeBeginPeriod(1);
DWORD nStartTick= timeGetTime();
for (int j=0; j< m_nIterCnt; j++ ) {
m_ctlEdit.SetWindowText("" );
DumpBuf( pBuf, nFileLen, m_fShowOutput );
nTotalBytes += nFileLen;
}
DWORD nTotalTicks= timeGetTime()-nStartTick;
timeEndPeriod(1);
DWORD nCntPerTick= -1;
if ( nTotalTicks != 0 ) nCntPerTick= nTotalBytes/nTotalTicks;
CString sMsg;
sMsg.Format( "%d bytes converted in %d ms\n\n %d conversions per ms", nTotalBytes, nTotalTicks, nCntPerTick );
MessageBox(sMsg);
delete pBuf;
}
BOOL FileToBuf( LPCSTR sFilename, BYTE* pBuf, int nMax )
{
int nLen= 0;
HFILE hFile= _lopen(sFilename, OF_READ );
if (hFile == HFILE_ERROR ) {
return( FALSE );
}
int nFileLen= _llseek( hFile, 0L, SEEK_END );
_llseek( hFile, 0L, SEEK_SET );
for (int j=0; j<nFileLen; j += 1024) {
if (nLen + 1024 > nMax ) {
break;
}
int nActual= _lread( hFile, pBuf, 1024 );
pBuf += nActual;
nLen += nActual;
}
_lclose( hFile );
return( nLen );
}
void CHexConvertDlg::DumpBuf( BYTE* p, int nLen, BOOL fDisplay /*=TRUE*/ )
{
char szBuf[68];
for (int j= 0; j< nLen; j += 16 ) {
HexLineOut( &p[j], szBuf );
if ( fDisplay ) {
szBuf[65]='\r'; szBuf[66]='\n'; szBuf[67]='\0';
m_ctlEdit.SetSel(-1,-1);
m_ctlEdit.ReplaceSel(szBuf
m_ctlEdit.SendMessage(EM_S
}
}
}
void CHexConvertDlg::OnButton1(
{
UpdateData( TRUE ); // get the value of m_fShowOutput amd m_nIterCnt
BYTE* pBuf= new BYTE[800000];
int nFileLen= FileToBuf("c:\\temp\\test.
nFileLen &= 0xfffffff0; // round to closest 16
int nTotalBytes= 0;
timeBeginPeriod(1);
DWORD nStartTick= timeGetTime();
for (int j=0; j< m_nIterCnt; j++ ) {
m_ctlEdit.SetWindowText(""
DumpBuf( pBuf, nFileLen, m_fShowOutput );
nTotalBytes += nFileLen;
}
DWORD nTotalTicks= timeGetTime()-nStartTick;
timeEndPeriod(1);
DWORD nCntPerTick= -1;
if ( nTotalTicks != 0 ) nCntPerTick= nTotalBytes/nTotalTicks;
CString sMsg;
sMsg.Format( "%d bytes converted in %d ms\n\n %d conversions per ms", nTotalBytes, nTotalTicks, nCntPerTick );
MessageBox(sMsg);
delete pBuf;
}
Something like this should already be pretty good (no jump/loops, does some pairing, tries to avoid excessive memory access, uses aligned addresses where easily possible):
// Assuming:
// esi = binary data buffer
// edi = destination char buffer, size 66 bytes
sub ebx,ebx
mov edx,[esi+00]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+00],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+02],32
shr edx,8
mov [edi+03],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+05],32
shr edx,8
mov [edi+06],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+08],32
mov [edi+09],ax
mov byte ptr [edi+11],32
mov edx,[esi+04]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+12],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+14],32
shr edx,8
mov [edi+15],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+17],32
shr edx,8
mov [edi+18],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+20],32
mov [edi+21],ax
mov byte ptr [edi+23],45
mov edx,[esi+08]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+24],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+26],32
shr edx,8
mov [edi+27],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+29],32
shr edx,8
mov [edi+30],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+32],32
mov [edi+33],ax
mov byte ptr [edi+35],32
mov edx,[esi+12]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+36],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+38],32
shr edx,8
mov [edi+39],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+41],32
shr edx,8
mov [edi+42],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+44],32
mov [edi+45],ax
mov byte ptr [edi+47],32
mov edx,[esi+00]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov [edi+48],eax
mov edx,[esi+04]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov [edi+52],eax
mov edx,[esi+08]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov [edi+56],eax
mov edx,[esi+12]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov byte ptr [edi+65],0
ret
@hex:
db '000102030405060708090A0B0 C0D0E0F'
db '101112131415161718191A1B1 C1D1E1F'
db '202122232425262728292A2B2 C2D2E2F'
db '303132333435363738393A3B3 C3D3E3F'
db '404142434445464748494A4B4 C4D4E4F'
db '505152535455565758595A5B5 C5D5E5F'
db '606162636465666768696A6B6 C6D6E6F'
db '707172737475767778797A7B7 C7D7E7F'
db '808182838485868788898A8B8 C8D8E8F'
db '909192939495969798999A9B9 C9D9E9F'
db 'A0A1A2A3A4A5A6A7A8A9AAABA CADAEAF'
db 'B0B1B2B3B4B5B6B7B8B9BABBB CBDBEBF'
db 'C0C1C2C3C4C5C6C7C8C9CACBC CCDCECF'
db 'D0D1D2D3D4D5D6D7D8D9DADBD CDDDEDF'
db 'E0E1E2E3E4E5E6E7E8E9EAEBE CEDEEEF'
db 'F0F1F2F3F4F5F6F7F8F9FAFBF CFDFEFF'
@ascii:
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 032,033,034,035,036,037,03 8,039,040, 041,042,04 3,044,045, 046,047
db 048,049,050,051,052,053,05 4,055,056, 057,058,05 9,060,061, 062,063
db 064,065,066,067,068,069,07 0,071,072, 073,074,07 5,076,077, 078,079
db 080,081,082,083,084,085,08 6,087,088, 089,090,09 1,092,093, 094,095
db 096,097,098,099,100,101,10 2,103,104, 105,106,10 7,108,109, 110,111
db 112,113,114,115,116,117,11 8,119,120, 121,122,12 3,124,125, 126,127
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
db 046,046,046,046,046,046,04 6,046,046, 046,046,04 6,046,046, 046,046
// Assuming:
// esi = binary data buffer
// edi = destination char buffer, size 66 bytes
sub ebx,ebx
mov edx,[esi+00]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+00],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+02],32
shr edx,8
mov [edi+03],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+05],32
shr edx,8
mov [edi+06],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+08],32
mov [edi+09],ax
mov byte ptr [edi+11],32
mov edx,[esi+04]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+12],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+14],32
shr edx,8
mov [edi+15],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+17],32
shr edx,8
mov [edi+18],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+20],32
mov [edi+21],ax
mov byte ptr [edi+23],45
mov edx,[esi+08]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+24],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+26],32
shr edx,8
mov [edi+27],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+29],32
shr edx,8
mov [edi+30],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+32],32
mov [edi+33],ax
mov byte ptr [edi+35],32
mov edx,[esi+12]
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
shr edx,8
mov [edi+36],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+38],32
shr edx,8
mov [edi+39],ax
mov bl,dl
mov ax,word ptr [ebx*2+@hex]
mov byte ptr [edi+41],32
shr edx,8
mov [edi+42],ax
mov ax,word ptr [edx*2+@hex]
mov byte ptr [edi+44],32
mov [edi+45],ax
mov byte ptr [edi+47],32
mov edx,[esi+00]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov [edi+48],eax
mov edx,[esi+04]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov [edi+52],eax
mov edx,[esi+08]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov [edi+56],eax
mov edx,[esi+12]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+@ascii]
rol edx,8
and edx,$000000FF
shl eax,8
mov al,byte ptr [edx+@ascii]
mov byte ptr [edi+65],0
ret
@hex:
db '000102030405060708090A0B0
db '101112131415161718191A1B1
db '202122232425262728292A2B2
db '303132333435363738393A3B3
db '404142434445464748494A4B4
db '505152535455565758595A5B5
db '606162636465666768696A6B6
db '707172737475767778797A7B7
db '808182838485868788898A8B8
db '909192939495969798999A9B9
db 'A0A1A2A3A4A5A6A7A8A9AAABA
db 'B0B1B2B3B4B5B6B7B8B9BABBB
db 'C0C1C2C3C4C5C6C7C8C9CACBC
db 'D0D1D2D3D4D5D6D7D8D9DADBD
db 'E0E1E2E3E4E5E6E7E8E9EAEBE
db 'F0F1F2F3F4F5F6F7F8F9FAFBF
@ascii:
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 032,033,034,035,036,037,03
db 048,049,050,051,052,053,05
db 064,065,066,067,068,069,07
db 080,081,082,083,084,085,08
db 096,097,098,099,100,101,10
db 112,113,114,115,116,117,11
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
db 046,046,046,046,046,046,04
ASKER
AvonWyss,
Excellent! Cool! I'm glad there is some action here!
Just a little sloppy... I needed to fix two bugs (see code below)
The code tests out at *very nearly as fast* as a 'reference' C routine I wrote that uses a lookup table for the 'Hex' data (I did not implement the lookup table for the ASCII output -- though that is clearly going to improve the speed when implemented.).
That you are unrolling the loops and collecting the four bytes before outputing is a definite win, but I'm sure there is pleny of room for improvement.
I think that all of the lines that use indirect addressing are costing you: For instance,
mov ax,word ptr [ebx*2+hex]
or
mov al,byte ptr [ebx+ascii]
are probably expensive in comparison to a straight lookup through the index register and an increment of that register each time (I could be wrong though...).
Here is the VC++ -ized version of your code, with my changes inidcated:
char hex[]=
"000102030405060708090A0B0 C0D0E0F"
"101112131415161718191A1B1 C1D1E1F"
"202122232425262728292A2B2 C2D2E2F"
"303132333435363738393A3B3 C3D3E3F"
"404142434445464748494A4B4 C4D4E4F"
"505152535455565758595A5B5 C5D5E5F"
"606162636465666768696A6B6 C6D6E6F"
"707172737475767778797A7B7 C7D7E7F"
"808182838485868788898A8B8 C8D8E8F"
"909192939495969798999A9B9 C9D9E9F"
"A0A1A2A3A4A5A6A7A8A9AAABA CADAEAF"
"B0B1B2B3B4B5B6B7B8B9BABBB CBDBEBF"
"C0C1C2C3C4C5C6C7C8C9CACBC CCDCECF"
"D0D1D2D3D4D5D6D7D8D9DADBD CDDDEDF"
"E0E1E2E3E4E5E6E7E8E9EAEBE CEDEEEF"
"F0F1F2F3F4F5F6F7F8F9FAFBF CFDFEFF";
char ascii[]= {
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99, 100,101,102,103,104,105,10 6,107,108, 109,110,11 1,
112,113,114,115,116,117,11 8,119,120, 121,122,12 3,124,125, 126,127,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46
};
void AvonWysHexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm {
mov esi, pSrc; ; <<--- Added
mov edi, pszBuf; ; <<--- Added
sub ebx,ebx
mov edx,[esi+00]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+00],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+2],32
shr edx,8
mov [edi+03],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+5],32
shr edx,8
mov [edi+6],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+8],32
mov [edi+9],ax
mov byte ptr [edi+11],32
mov edx,[esi+4]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+12],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+14],32
shr edx,8
mov [edi+15],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+17],32
shr edx,8
mov [edi+18],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+20],32
mov [edi+21],ax
mov byte ptr [edi+23],45
mov edx,[esi+8]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+24],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+26],32
shr edx,8
mov [edi+27],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+29],32
shr edx,8
mov [edi+30],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+32],32
mov [edi+33],ax
mov byte ptr [edi+35],32
mov edx,[esi+12]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+36],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+38],32
shr edx,8
mov [edi+39],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+41],32
shr edx,8
mov [edi+42],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+44],32
mov [edi+45],ax
mov byte ptr [edi+47],32
mov edx,[esi+00]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+48],eax
mov edx,[esi+04]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+52],eax
mov edx,[esi+8]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+56],eax
mov edx,[esi+12]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+60],eax ;<<<--- added
mov byte ptr [edi+64],0 ;<<<--- fixed
}
}
-- Dan
P.S. I'm still looking for a winner...
Excellent! Cool! I'm glad there is some action here!
Just a little sloppy... I needed to fix two bugs (see code below)
The code tests out at *very nearly as fast* as a 'reference' C routine I wrote that uses a lookup table for the 'Hex' data (I did not implement the lookup table for the ASCII output -- though that is clearly going to improve the speed when implemented.).
That you are unrolling the loops and collecting the four bytes before outputing is a definite win, but I'm sure there is pleny of room for improvement.
I think that all of the lines that use indirect addressing are costing you: For instance,
mov ax,word ptr [ebx*2+hex]
or
mov al,byte ptr [ebx+ascii]
are probably expensive in comparison to a straight lookup through the index register and an increment of that register each time (I could be wrong though...).
Here is the VC++ -ized version of your code, with my changes inidcated:
char hex[]=
"000102030405060708090A0B0
"101112131415161718191A1B1
"202122232425262728292A2B2
"303132333435363738393A3B3
"404142434445464748494A4B4
"505152535455565758595A5B5
"606162636465666768696A6B6
"707172737475767778797A7B7
"808182838485868788898A8B8
"909192939495969798999A9B9
"A0A1A2A3A4A5A6A7A8A9AAABA
"B0B1B2B3B4B5B6B7B8B9BABBB
"C0C1C2C3C4C5C6C7C8C9CACBC
"D0D1D2D3D4D5D6D7D8D9DADBD
"E0E1E2E3E4E5E6E7E8E9EAEBE
"F0F1F2F3F4F5F6F7F8F9FAFBF
char ascii[]= {
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99, 100,101,102,103,104,105,10
112,113,114,115,116,117,11
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46
};
void AvonWysHexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm {
mov esi, pSrc; ; <<--- Added
mov edi, pszBuf; ; <<--- Added
sub ebx,ebx
mov edx,[esi+00]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+00],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+2],32
shr edx,8
mov [edi+03],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+5],32
shr edx,8
mov [edi+6],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+8],32
mov [edi+9],ax
mov byte ptr [edi+11],32
mov edx,[esi+4]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+12],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+14],32
shr edx,8
mov [edi+15],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+17],32
shr edx,8
mov [edi+18],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+20],32
mov [edi+21],ax
mov byte ptr [edi+23],45
mov edx,[esi+8]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+24],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+26],32
shr edx,8
mov [edi+27],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+29],32
shr edx,8
mov [edi+30],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+32],32
mov [edi+33],ax
mov byte ptr [edi+35],32
mov edx,[esi+12]
mov bl,dl
mov ax,word ptr [ebx*2+hex]
shr edx,8
mov [edi+36],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+38],32
shr edx,8
mov [edi+39],ax
mov bl,dl
mov ax,word ptr [ebx*2+hex]
mov byte ptr [edi+41],32
shr edx,8
mov [edi+42],ax
mov ax,word ptr [edx*2+hex]
mov byte ptr [edi+44],32
mov [edi+45],ax
mov byte ptr [edi+47],32
mov edx,[esi+00]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+48],eax
mov edx,[esi+04]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+52],eax
mov edx,[esi+8]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+56],eax
mov edx,[esi+12]
rol edx,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
shl eax,8
mov bl,dl
mov al,byte ptr [ebx+ascii]
rol edx,8
and edx,0xFF
shl eax,8
mov al,byte ptr [edx+ascii]
mov [edi+60],eax ;<<<--- added
mov byte ptr [edi+64],0 ;<<<--- fixed
}
}
-- Dan
P.S. I'm still looking for a winner...
Dan, thanks for trying my code. Note that the bugs are due to the helper code I used when writing it, which I then stopped again and obviously broke my code.
Also, a test like this makes it obvious that nowadays processors and compilers already do a pretty good job on generating fast code. For optimal use of the processor pipelines, lots of simple commands are about the best code you can generate, and the compilers do exactly this.
Anyways, there are pretty huge differences with the processors available by means of pairing and other things. However, indirect addressing should no longer be costy on any processor since the Pentium (1 cycle allways, but may impact pairing due to the additional registers used). What could be a real neckbreaker are my 16-bit reads and writes on uneven addresses. I believe these are not very good, but I didn't really have a better idea. Maybe I could also collect 32 bits of data to write before doing so (e.g. "XX X","X XX","XX X" etc. pattern). I I have some time again, I'll look into it.
Is there a method to align ASM data or code (the compiler also does this) in memory in the inline assembler of VC++?
Also, a test like this makes it obvious that nowadays processors and compilers already do a pretty good job on generating fast code. For optimal use of the processor pipelines, lots of simple commands are about the best code you can generate, and the compilers do exactly this.
Anyways, there are pretty huge differences with the processors available by means of pairing and other things. However, indirect addressing should no longer be costy on any processor since the Pentium (1 cycle allways, but may impact pairing due to the additional registers used). What could be a real neckbreaker are my 16-bit reads and writes on uneven addresses. I believe these are not very good, but I didn't really have a better idea. Maybe I could also collect 32 bits of data to write before doing so (e.g. "XX X","X XX","XX X" etc. pattern). I I have some time again, I'll look into it.
Is there a method to align ASM data or code (the compiler also does this) in memory in the inline assembler of VC++?
ASKER
>>Is there a method to align ASM data or code
If you posted this Q in the Assembly TA, I could get my first Assembly TA points!
The VC++ _asm assembler supports the EVEN and ALIGN directives. But only for code (it inserts NOPs as needed).
It does not support DB, DW, etc. for defining data (as you may have inferred from my mods to your code). But since you can define it outside of the _asm block, I think it will automatically be aligned in the same manner as other C data structures (it is a project setting that defaults to 8 bytes).
>>indirect addressing should no longer be costy on any
>> processor since the Pentium
I can't make heads-nor-tails of opcode timings in the later 80x86 documentation. It used to be there were nice simple charts showing the clock-count for each addressing mode and opcode.
>>Neckbreaker...16-bit reads and writes on uneven addresses
I agree, this must be costing something.
>>Maybe I could also collect 32 bits of data to write
>>before doing so (e.g. "XX X","X XX","XX X" etc. pattern).
This shouldn't be too hard since the loops are unrolled anyway... as soon as you have four bytes accumulated, output'em.
-- Dan
If you posted this Q in the Assembly TA, I could get my first Assembly TA points!
The VC++ _asm assembler supports the EVEN and ALIGN directives. But only for code (it inserts NOPs as needed).
It does not support DB, DW, etc. for defining data (as you may have inferred from my mods to your code). But since you can define it outside of the _asm block, I think it will automatically be aligned in the same manner as other C data structures (it is a project setting that defaults to 8 bytes).
>>indirect addressing should no longer be costy on any
>> processor since the Pentium
I can't make heads-nor-tails of opcode timings in the later 80x86 documentation. It used to be there were nice simple charts showing the clock-count for each addressing mode and opcode.
>>Neckbreaker...16-bit reads and writes on uneven addresses
I agree, this must be costing something.
>>Maybe I could also collect 32 bits of data to write
>>before doing so (e.g. "XX X","X XX","XX X" etc. pattern).
This shouldn't be too hard since the loops are unrolled anyway... as soon as you have four bytes accumulated, output'em.
-- Dan
You should also not count the time it takes to display on the screen. At this point, you are at the mercy of the video display to see how fast it can display a line of text in the edit control.
Instead, turn off the edit control's redraw before you start. Then turn it back on when you are done:
timeBeginPeriod(1);
DWORD nStartTick= timeGetTime();
m_ctlEdit.SetRedraw(FALSE) ;
// make calls to dump hex
DWORD nTotalTicks= timeGetTime()-nStartTick;
timeEndPeriod(1);
m_ctlEdit.SetRedraw(TRUE);
Of course, by doing this, you will no longer see the updates on the screen until it is finished. If that is a part of your design, then you cannot do this. But realize that you are timing the output as well as the conversion.
Instead, turn off the edit control's redraw before you start. Then turn it back on when you are done:
timeBeginPeriod(1);
DWORD nStartTick= timeGetTime();
m_ctlEdit.SetRedraw(FALSE)
// make calls to dump hex
DWORD nTotalTicks= timeGetTime()-nStartTick;
timeEndPeriod(1);
m_ctlEdit.SetRedraw(TRUE);
Of course, by doing this, you will no longer see the updates on the screen until it is finished. If that is a part of your design, then you cannot do this. But realize that you are timing the output as well as the conversion.
ASKER
>>You should also not count the time it takes to display on the screen.
You are right thui!
Notice that I have a flag named m_fShowOutput that get passed into the DumpBuf routine. When I benchmark other people's code, I turn this ON. No matter how well-written the code, everybody always scores out at about 9 conversions per millisconed. But when I test my own code, I 'accidentlally' turn it off. Longshot, but perhaps that is why my code is currently yielding about 21,000 conversions per millisecond.
-- Dan
P.S.
I am about to publish my own Hex-to-ASCII routine and when I do that all challengers will cry alligator tears of frustration, knowing that the Universe will never see faster code. So now's your chance.
You are right thui!
Notice that I have a flag named m_fShowOutput that get passed into the DumpBuf routine. When I benchmark other people's code, I turn this ON. No matter how well-written the code, everybody always scores out at about 9 conversions per millisconed. But when I test my own code, I 'accidentlally' turn it off. Longshot, but perhaps that is why my code is currently yielding about 21,000 conversions per millisecond.
-- Dan
P.S.
I am about to publish my own Hex-to-ASCII routine and when I do that all challengers will cry alligator tears of frustration, knowing that the Universe will never see faster code. So now's your chance.
DanRollins, in this case, please do test the code again without output at all, so that the conditions are the same for both applications... and post the resultsd again :-)
ASKER
Oh PUUUULLLEEEEEASE.
How stupid is this Mr. DanRollins. Hm, he was able to convert my raw ASM into runnnable VC++ code and he fixed two of my coding errors. He suggested some rather subtle improvements to my code and helped me learn about code alignment in an _asm block. Even so, he is probably so dense that he could not recognoize the difference between 9 conversions per millisecond and 20,000 conversions per milliseconfd. I'd better remind him to not display the lines as they are converted.
sheesh.
-- Dan
How stupid is this Mr. DanRollins. Hm, he was able to convert my raw ASM into runnnable VC++ code and he fixed two of my coding errors. He suggested some rather subtle improvements to my code and helped me learn about code alignment in an _asm block. Even so, he is probably so dense that he could not recognoize the difference between 9 conversions per millisecond and 20,000 conversions per milliseconfd. I'd better remind him to not display the lines as they are converted.
sheesh.
-- Dan
ASKER
Here is jonin's submission from the C++ TA. It boils down to:
inline void f()
{
printf("%x %x %x %x %x %x %x %x-%x %x %x %x %x %x %x %x",
data[0],data[1],data[2],da ta[3],data [4],data[5 ],data[6],
data[7],data[8],data[9],da ta[10],dat a[11],data [12],data[ 13],
data[14],data[15]);
printf(" %c%c%c%c%c%c%c%c%c%c%c%c%c %c%c%c\n",
ascii[data[0]],
ascii[data[1]],
ascii[data[2]],
ascii[data[3]],
ascii[data[4]],
ascii[data[5]],
ascii[data[6]],
ascii[data[7]],
ascii[data[8]],
ascii[data[9]],
ascii[data[10]],
ascii[data[11]],
ascii[data[12]],
ascii[data[13]],
ascii[data[14]],
ascii[data[15]]);
}
=--=-=-=-=-=-=-=-=-==-
I will convert from printf to the sprintf and run a timing test on it tonight, but I'm afraid that it will be considerably slower than some other tries. Also, I fear that it won't meet the specs in that if any of the bytes < 0x10, they will display as a single hex digit rather than the two digits in the spec.
jonin says that even bad ASM will be 10 times faster than fast C, but I disagree. My reference C fn is just slightly faster than AvonWyss's ASM code. But it avoids the (relatively) massive overhead involved in pushing 32 data values onto the stack and calling printf.
-- Dan
inline void f()
{
printf("%x %x %x %x %x %x %x %x-%x %x %x %x %x %x %x %x",
data[0],data[1],data[2],da
data[7],data[8],data[9],da
data[14],data[15]);
printf(" %c%c%c%c%c%c%c%c%c%c%c%c%c
ascii[data[0]],
ascii[data[1]],
ascii[data[2]],
ascii[data[3]],
ascii[data[4]],
ascii[data[5]],
ascii[data[6]],
ascii[data[7]],
ascii[data[8]],
ascii[data[9]],
ascii[data[10]],
ascii[data[11]],
ascii[data[12]],
ascii[data[13]],
ascii[data[14]],
ascii[data[15]]);
}
=--=-=-=-=-=-=-=-=-==-
I will convert from printf to the sprintf and run a timing test on it tonight, but I'm afraid that it will be considerably slower than some other tries. Also, I fear that it won't meet the specs in that if any of the bytes < 0x10, they will display as a single hex digit rather than the two digits in the spec.
jonin says that even bad ASM will be 10 times faster than fast C, but I disagree. My reference C fn is just slightly faster than AvonWyss's ASM code. But it avoids the (relatively) massive overhead involved in pushing 32 data values onto the stack and calling printf.
-- Dan
I'm sorry to say but I don't see 1 single assembly command.
And I clearly remember reading:
_asm{
// your code here!
}
So in my humble opinion you do not qualify to get the points.
And I clearly remember reading:
_asm{
// your code here!
}
So in my humble opinion you do not qualify to get the points.
ASKER
Hi NicoLan...
I invited some C/C++ experts to try their hand at this challenge. I think that it will be interesting to see how well a C-language optimizer -- with a cleverly implemented algorithm -- can do against raw ASM.
-- Dan
I invited some C/C++ experts to try their hand at this challenge. I think that it will be interesting to see how well a C-language optimizer -- with a cleverly implemented algorithm -- can do against raw ASM.
-- Dan
ASKER
Here is Pavlik's C/C++ submission (from http:Q.20280145.html )
I will test it tonight and post the timings! Note that I am not certain that it meets all criteria in the specification, but I won't hold that against him if it can be modified to meet the spec without slowing down.
static const long digits[256] =
{
0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330,
0x6430, 0x6530, 0x6630,
0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331,
0x6431, 0x6531, 0x6631,
0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332,
0x6432, 0x6532, 0x6632,
0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333,
0x6433, 0x6533, 0x6633,
0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334,
0x6434, 0x6534, 0x6634,
0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335,
0x6435, 0x6535, 0x6635,
0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336,
0x6436, 0x6536, 0x6636,
0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337,
0x6437, 0x6537, 0x6637,
0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338,
0x6438, 0x6538, 0x6638,
0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339,
0x6439, 0x6539, 0x6639,
0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361,
0x6461, 0x6561, 0x6661,
0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362,
0x6462, 0x6562, 0x6662,
0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363,
0x6463, 0x6563, 0x6663,
0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364,
0x6464, 0x6564, 0x6664,
0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365,
0x6465, 0x6565, 0x6665,
0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366,
0x6466, 0x6566, 0x6666
};
static const char chars[256] =
{
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'\x20', '\x21','\x22','\x23','\x24 ','\x25',' \x26','\x2 7','\x28', '\x29','\x 2a','\x2b' ,'\x2c','\ x2d','\x2e ','\x2f',
'\x30', '\x31','\x32','\x33','\x34 ','\x35',' \x36','\x3 7','\x38', '\x39','\x 3a','\x3b' ,'\x3c','\ x3d','\x3e ','\x3f',
'\x40', '\x41','\x42','\x43','\x44 ','\x45',' \x46','\x4 7','\x48', '\x49','\x 4a','\x4b' ,'\x4c','\ x4d','\x4e ','\x4f',
'\x50', '\x51','\x52','\x53','\x54 ','\x55',' \x56','\x5 7','\x58', '\x59','\x 5a','\x5b' ,'\x5c','\ x5d','\x5e ','\x5f',
'\x60', '\x61','\x62','\x63','\x64 ','\x65',' \x66','\x6 7','\x68', '\x69','\x 6a','\x6b' ,'\x6c','\ x6d','\x6e ','\x6f',
'\x70', '\x71','\x72','\x73','\x74 ','\x75',' \x76','\x7 7','\x78', '\x79','\x 7a','\x7b' ,'\x7c','\ x7d','\x7e ','\x7f',
'\x80', '\x81','\x82','\x83','\x84 ','\x85',' \x86','\x8 7','\x88', '\x89','\x 8a','\x8b' ,'\x8c','\ x8d','\x8e ','\x8f',
'\x90', '\x91','\x92','\x93','\x94 ','\x95',' \x96','\x9 7','\x98', '\x99','\x 9a','\x9b' ,'\x9c','\ x9d','\x9e ','\x9f',
'\xa0', '\xa1','\xa2','\xa3','\xa4 ','\xa5',' \xa6','\xa 7','\xa8', '\xa9','\x aa','\xab' ,'\xac','\ xad','\xae ','\xaf',
'\xb0', '\xb1','\xb2','\xb3','\xb4 ','\xb5',' \xb6','\xb 7','\xb8', '\xb9','\x ba','\xbb' ,'\xbc','\ xbd','\xbe ','\xbf',
'\xc0', '\xc1','\xc2','\xc3','\xc4 ','\xc5',' \xc6','\xc 7','\xc8', '\xc9','\x ca','\xcb' ,'\xcc','\ xcd','\xce ','\xcf',
'\xd0', '\xd1','\xd2','\xd3','\xd4 ','\xd5',' \xd6','\xd 7','\xd8', '\xd9','\x da','\xdb' ,'\xdc','\ xdd','\xde ','\xdf',
'\xe0', '\xe1','\xe2','\xe3','\xe4 ','\xe5',' \xe6','\xe 7','\xe8', '\xe9','\x ea','\xeb' ,'\xec','\ xed','\xee ','\xef',
'\xf0', '\xf1','\xf2','\xf3','\xf4 ','\xf5',' \xf6','\xf 7','\xf8', '\xf9','\x fa','\xfb' ,'\xfc','\ xfd','\xfe ','\xff'
};
long* pBuf = (long*) buf;
for (int i = 4; i > 0; --i)
{
*pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
++pBuf;
*pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
++pBuf;
*pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
++pBuf;
dump += 4;
}
dump -= 16;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
*pBuf = 0;
}
I will test it tonight and post the timings! Note that I am not certain that it meets all criteria in the specification, but I won't hold that against him if it can be modified to meet the spec without slowing down.
static const long digits[256] =
{
0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330,
0x6430, 0x6530, 0x6630,
0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331,
0x6431, 0x6531, 0x6631,
0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332,
0x6432, 0x6532, 0x6632,
0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333,
0x6433, 0x6533, 0x6633,
0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334,
0x6434, 0x6534, 0x6634,
0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335,
0x6435, 0x6535, 0x6635,
0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336,
0x6436, 0x6536, 0x6636,
0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337,
0x6437, 0x6537, 0x6637,
0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338,
0x6438, 0x6538, 0x6638,
0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339,
0x6439, 0x6539, 0x6639,
0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361,
0x6461, 0x6561, 0x6661,
0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362,
0x6462, 0x6562, 0x6662,
0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363,
0x6463, 0x6563, 0x6663,
0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364,
0x6464, 0x6564, 0x6664,
0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365,
0x6465, 0x6565, 0x6665,
0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366,
0x6466, 0x6566, 0x6666
};
static const char chars[256] =
{
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'\x20', '\x21','\x22','\x23','\x24
'\x30', '\x31','\x32','\x33','\x34
'\x40', '\x41','\x42','\x43','\x44
'\x50', '\x51','\x52','\x53','\x54
'\x60', '\x61','\x62','\x63','\x64
'\x70', '\x71','\x72','\x73','\x74
'\x80', '\x81','\x82','\x83','\x84
'\x90', '\x91','\x92','\x93','\x94
'\xa0', '\xa1','\xa2','\xa3','\xa4
'\xb0', '\xb1','\xb2','\xb3','\xb4
'\xc0', '\xc1','\xc2','\xc3','\xc4
'\xd0', '\xd1','\xd2','\xd3','\xd4
'\xe0', '\xe1','\xe2','\xe3','\xe4
'\xf0', '\xf1','\xf2','\xf3','\xf4
};
long* pBuf = (long*) buf;
for (int i = 4; i > 0; --i)
{
*pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
++pBuf;
*pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
++pBuf;
*pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
++pBuf;
dump += 4;
}
dump -= 16;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
*pBuf = 0;
}
I do the following:
1. read data 4 bytes at a time
2. operate on lower nibbles of each byte
3. make a mask that corresponds to each nibble. if the nibble is less than 10, mask is 0xff, else mask is 0x00
4. add (mask & 0x30303030) to original nibbles
5. add (mask & 0x37373737) to original nibbles
6. at this point, the register should contain 4 bytes, each a ASCII representation of the corresponding lower nibble.
7. store these values, and repeat with higher nibbles.
code:
mov edx, [esi+00]
and edx, 0x0f0f0f0f
mov ebx, edx
sub ebx, 0x0a0a0a0a
and ebx, 0xf0f0f0f0
mov ecx, ebx
shr ecx, 4
or ebx, ecx
// at this point, ebx should contain the mask for each nibble
mov ecx, ebx
and ecx, 0x30303030
add edx, ecx
mov ecx, ebx
not ecx
and ecx, 0x37373737
add edx, ecx
// at this point, edx contains ASCII representation of each nibble.
// my ASM is too rusty, so don't know how to populate the result buffer.
1. read data 4 bytes at a time
2. operate on lower nibbles of each byte
3. make a mask that corresponds to each nibble. if the nibble is less than 10, mask is 0xff, else mask is 0x00
4. add (mask & 0x30303030) to original nibbles
5. add (mask & 0x37373737) to original nibbles
6. at this point, the register should contain 4 bytes, each a ASCII representation of the corresponding lower nibble.
7. store these values, and repeat with higher nibbles.
code:
mov edx, [esi+00]
and edx, 0x0f0f0f0f
mov ebx, edx
sub ebx, 0x0a0a0a0a
and ebx, 0xf0f0f0f0
mov ecx, ebx
shr ecx, 4
or ebx, ecx
// at this point, ebx should contain the mask for each nibble
mov ecx, ebx
and ecx, 0x30303030
add edx, ecx
mov ecx, ebx
not ecx
and ecx, 0x37373737
add edx, ecx
// at this point, edx contains ASCII representation of each nibble.
// my ASM is too rusty, so don't know how to populate the result buffer.
ASKER
First, analysis of jonnin's code. I had to modify his format string to enforce two-hex digits per byte and I use sprintf to output to the buffer as specified in the challenge.
This is processing a measly 450 conversions per millisecond -- fully optimized and wNOT displaying the output. He mentions in his post that ASM is likely to be 100 times as fast, and he is right.
Here is the working version (same ascii[] array as in AvonWyss entry) :
void JonninHexLineOut( BYTE* pSrc, char* pszBuf )
{
BYTE* data= pSrc;
sprintf(pszBuf,
"%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x-"
"%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x",
data[0],data[1],data[2],da ta[3],data [4],data[5 ],data[6],
data[7],data[8],data[9],da ta[10],dat a[11],data [12],data[ 13],
data[14],data[15]);
sprintf(&pszBuf[47], " %c%c%c%c%c%c%c%c%c%c%c%c%c %c%c%c",
ascii[data[0]], ascii[data[1]], ascii[data[2]], ascii[data[3]],
ascii[data[4]], ascii[data[5]], ascii[data[6]], ascii[data[7]],
ascii[data[8]], ascii[data[9]], ascii[data[10]], ascii[data[11]],
ascii[data[12]], ascii[data[13]], ascii[data[14]], ascii[data[15]]);
}
This is processing a measly 450 conversions per millisecond -- fully optimized and wNOT displaying the output. He mentions in his post that ASM is likely to be 100 times as fast, and he is right.
Here is the working version (same ascii[] array as in AvonWyss entry) :
void JonninHexLineOut( BYTE* pSrc, char* pszBuf )
{
BYTE* data= pSrc;
sprintf(pszBuf,
"%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x-"
"%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x",
data[0],data[1],data[2],da
data[7],data[8],data[9],da
data[14],data[15]);
sprintf(&pszBuf[47], " %c%c%c%c%c%c%c%c%c%c%c%c%c
ascii[data[0]], ascii[data[1]], ascii[data[2]], ascii[data[3]],
ascii[data[4]], ascii[data[5]], ascii[data[6]], ascii[data[7]],
ascii[data[8]], ascii[data[9]], ascii[data[10]], ascii[data[11]],
ascii[data[12]], ascii[data[13]], ascii[data[14]], ascii[data[15]]);
}
ASKER
Now Pavlik submitted a C-language entry that is pretty awesome. This fn uses two lookup tables and it process the input in 32-bit units and outputs in 32-bit units (using a clever masking scheme for the hex part, I might add).
The original code had these (relatively minor) problems
1) it did not put the dash after the 8th set of hex digits
2) it outputs characters > 127 (rather than '.' as specified).
3) A subtle err occurs because of signed vs unsigned values. The error goes away when I modify his chars[256] array to type 'unsigned char' (it was 'char').
(anyone care to predict how the bug manifest itself -- what the output looked like before my fix?)
Anyway, when compiled with full optimization, this is processing a whopping
40,300 conversions per millisecond!
nearly TWICE as fast an any routine so far! And it is C, NOT ASM!
Pavlik, if you unroll that first loop, you can get a minor increase. I'm surprized you didn't do it.
Here is the code he submitted (with my minor changes to meet spec):
static const long digits[256] =
{
0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330, 0x6430, 0x6530, 0x6630,
0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331, 0x6431, 0x6531, 0x6631,
0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332, 0x6432, 0x6532, 0x6632,
0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333, 0x6433, 0x6533, 0x6633,
0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334, 0x6434, 0x6534, 0x6634,
0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335, 0x6435, 0x6535, 0x6635,
0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336, 0x6436, 0x6536, 0x6636,
0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337, 0x6437, 0x6537, 0x6637,
0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338, 0x6438, 0x6538, 0x6638,
0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339, 0x6439, 0x6539, 0x6639,
0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361, 0x6461, 0x6561, 0x6661,
0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362, 0x6462, 0x6562, 0x6662,
0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363, 0x6463, 0x6563, 0x6663,
0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364, 0x6464, 0x6564, 0x6664,
0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365, 0x6465, 0x6565, 0x6665,
0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366, 0x6466, 0x6566, 0x6666
};
static const unsigned char chars[256] = //<<----- [DR] fixed (was signed: char)
{
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'\x20', '\x21','\x22','\x23','\x24 ','\x25',' \x26','\x2 7','\x28', '\x29','\x 2a','\x2b' ,'\x2c','\ x2d','\x2e ','\x2f',
'\x30', '\x31','\x32','\x33','\x34 ','\x35',' \x36','\x3 7','\x38', '\x39','\x 3a','\x3b' ,'\x3c','\ x3d','\x3e ','\x3f',
'\x40', '\x41','\x42','\x43','\x44 ','\x45',' \x46','\x4 7','\x48', '\x49','\x 4a','\x4b' ,'\x4c','\ x4d','\x4e ','\x4f',
'\x50', '\x51','\x52','\x53','\x54 ','\x55',' \x56','\x5 7','\x58', '\x59','\x 5a','\x5b' ,'\x5c','\ x5d','\x5e ','\x5f',
'\x60', '\x61','\x62','\x63','\x64 ','\x65',' \x66','\x6 7','\x68', '\x69','\x 6a','\x6b' ,'\x6c','\ x6d','\x6e ','\x6f',
'\x70', '\x71','\x72','\x73','\x74 ','\x75',' \x76','\x7 7','\x78', '\x79','\x 7a','\x7b' ,'\x7c','\ x7d','\x7e ','\x7f',
'\x80', '\x81','\x82','\x83','\x84 ','\x85',' \x86','\x8 7','\x88', '\x89','\x 8a','\x8b' ,'\x8c','\ x8d','\x8e ','\x8f',
'\x90', '\x91','\x92','\x93','\x94 ','\x95',' \x96','\x9 7','\x98', '\x99','\x 9a','\x9b' ,'\x9c','\ x9d','\x9e ','\x9f',
'\xa0', '\xa1','\xa2','\xa3','\xa4 ','\xa5',' \xa6','\xa 7','\xa8', '\xa9','\x aa','\xab' ,'\xac','\ xad','\xae ','\xaf',
'\xb0', '\xb1','\xb2','\xb3','\xb4 ','\xb5',' \xb6','\xb 7','\xb8', '\xb9','\x ba','\xbb' ,'\xbc','\ xbd','\xbe ','\xbf',
'\xc0', '\xc1','\xc2','\xc3','\xc4 ','\xc5',' \xc6','\xc 7','\xc8', '\xc9','\x ca','\xcb' ,'\xcc','\ xcd','\xce ','\xcf',
'\xd0', '\xd1','\xd2','\xd3','\xd4 ','\xd5',' \xd6','\xd 7','\xd8', '\xd9','\x da','\xdb' ,'\xdc','\ xdd','\xde ','\xdf',
'\xe0', '\xe1','\xe2','\xe3','\xe4 ','\xe5',' \xe6','\xe 7','\xe8', '\xe9','\x ea','\xeb' ,'\xec','\ xed','\xee ','\xef',
'\xf0', '\xf1','\xf2','\xf3','\xf4 ','\xf5',' \xf6','\xf 7','\xf8', '\xf9','\x fa','\xfb' ,'\xfc','\ xfd','\xfe ','\xff'
};
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
BYTE* dump = pSrc;
long* pBuf = (long*) pszBuf;
for (int i = 4; i > 0; --i) {
*pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
++pBuf;
*pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
++pBuf;
*pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
++pBuf;
dump += 4;
}
dump -= 16;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
*pBuf = 0;
pszBuf[23]= '-'; // <<----- [DR] Added
}
=-=-=-=-=-=-=-=-
I'm admitting here and now that I may not be able to beat that by much.
-- Dan
The original code had these (relatively minor) problems
1) it did not put the dash after the 8th set of hex digits
2) it outputs characters > 127 (rather than '.' as specified).
3) A subtle err occurs because of signed vs unsigned values. The error goes away when I modify his chars[256] array to type 'unsigned char' (it was 'char').
(anyone care to predict how the bug manifest itself -- what the output looked like before my fix?)
Anyway, when compiled with full optimization, this is processing a whopping
40,300 conversions per millisecond!
nearly TWICE as fast an any routine so far! And it is C, NOT ASM!
Pavlik, if you unroll that first loop, you can get a minor increase. I'm surprized you didn't do it.
Here is the code he submitted (with my minor changes to meet spec):
static const long digits[256] =
{
0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330, 0x6430, 0x6530, 0x6630,
0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331, 0x6431, 0x6531, 0x6631,
0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332, 0x6432, 0x6532, 0x6632,
0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333, 0x6433, 0x6533, 0x6633,
0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334, 0x6434, 0x6534, 0x6634,
0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335, 0x6435, 0x6535, 0x6635,
0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336, 0x6436, 0x6536, 0x6636,
0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337, 0x6437, 0x6537, 0x6637,
0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338, 0x6438, 0x6538, 0x6638,
0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339, 0x6439, 0x6539, 0x6639,
0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361, 0x6461, 0x6561, 0x6661,
0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362, 0x6462, 0x6562, 0x6662,
0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363, 0x6463, 0x6563, 0x6663,
0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364, 0x6464, 0x6564, 0x6664,
0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365, 0x6465, 0x6565, 0x6665,
0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366, 0x6466, 0x6566, 0x6666
};
static const unsigned char chars[256] = //<<----- [DR] fixed (was signed: char)
{
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'\x20', '\x21','\x22','\x23','\x24
'\x30', '\x31','\x32','\x33','\x34
'\x40', '\x41','\x42','\x43','\x44
'\x50', '\x51','\x52','\x53','\x54
'\x60', '\x61','\x62','\x63','\x64
'\x70', '\x71','\x72','\x73','\x74
'\x80', '\x81','\x82','\x83','\x84
'\x90', '\x91','\x92','\x93','\x94
'\xa0', '\xa1','\xa2','\xa3','\xa4
'\xb0', '\xb1','\xb2','\xb3','\xb4
'\xc0', '\xc1','\xc2','\xc3','\xc4
'\xd0', '\xd1','\xd2','\xd3','\xd4
'\xe0', '\xe1','\xe2','\xe3','\xe4
'\xf0', '\xf1','\xf2','\xf3','\xf4
};
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
BYTE* dump = pSrc;
long* pBuf = (long*) pszBuf;
for (int i = 4; i > 0; --i) {
*pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
++pBuf;
*pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
++pBuf;
*pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
++pBuf;
dump += 4;
}
dump -= 16;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
++pBuf;
*pBuf = 0;
pszBuf[23]= '-'; // <<----- [DR] Added
}
=-=-=-=-=-=-=-=-
I'm admitting here and now that I may not be able to beat that by much.
-- Dan
Screw EE! I cannot post my new source, I only get that stupid "no text" thing.
Here's my new version. No tables, no jumps, all memory access is 32-bit on 32-bit boundaries.
mov edi,pszBuf
mov esi,pSrc
mov byte ptr [edi+64],0
mov eax,[esi]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+4],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+8],eax
mov eax,[esi+4]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+12],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+16],eax
mov al,cl
mov ah,45
shl eax,16
mov al,32
mov ah,dh
mov [edi+20],eax
mov eax,[esi+8]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+24],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+28],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+32],eax
mov eax,[esi+12]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+36],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+40],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+44],eax
mov eax,[esi]
mov ebx,eax
mov ecx,eax
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ecx,[esi+4]
and ebx,eax
not eax
and eax,774778414
or eax,ebx
mov [edi+48],eax
mov eax,ecx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ebx,[esi+8]
and ecx,eax
not eax
and eax,774778414
or eax,ecx
mov [edi+52],eax
mov eax,ebx
mov ecx,ebx
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ecx,[esi+12]
and ebx,eax
not eax
and eax,774778414
or eax,ebx
mov [edi+56],eax
mov eax,ecx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
and ecx,eax
not eax
and eax,774778414
or eax,ecx
mov [edi+60],eax
Hope you like it...
mov edi,pszBuf
mov esi,pSrc
mov byte ptr [edi+64],0
mov eax,[esi]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+4],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+8],eax
mov eax,[esi+4]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+12],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+16],eax
mov al,cl
mov ah,45
shl eax,16
mov al,32
mov ah,dh
mov [edi+20],eax
mov eax,[esi+8]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+24],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+28],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+32],eax
mov eax,[esi+12]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+36],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+40],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+44],eax
mov eax,[esi]
mov ebx,eax
mov ecx,eax
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ecx,[esi+4]
and ebx,eax
not eax
and eax,774778414
or eax,ebx
mov [edi+48],eax
mov eax,ecx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ebx,[esi+8]
and ecx,eax
not eax
and eax,774778414
or eax,ecx
mov [edi+52],eax
mov eax,ebx
mov ecx,ebx
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ecx,[esi+12]
and ebx,eax
not eax
and eax,774778414
or eax,ebx
mov [edi+56],eax
mov eax,ecx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
and ecx,eax
not eax
and eax,774778414
or eax,ecx
mov [edi+60],eax
Hope you like it...
ASKER
AvonWyss,
That second submission is unusual to say the least. Like Pavlik, you are always reading 32-bit aligned values and writing 32-bit aligned values. But rather than looking up the hex-digit values, you are calculating them.
I suppose that I am expected to be mystified here. The calculation technique appears to be rather obscure, but only because you have intentionally made it so by using decimal values and interleaved the opcodes (possibly for pipeline optimization). Lets see what it is actually doing:
=-=--==-=-=-=-=-=-=-=-=-=- =-=-=-=-=- =-=-=-=-=-
CODE ANALYSIS
In the main sequence, you are working with four nibbles simultaneously. To simpliy the discusiion, lets look at one nibble at a time:
assume the first input is either 'W' (0x57) or 'Z' (0x5A)
n1= one nibble // examples: 7 or 10 (examples in decimal)
n2= n1 +6 // examples: 13 or 16
n3= 1 if n2>16, else n3=0
n4 = n1 + (n3*4)+(n3*2)+ n3
i.e. n4= n1+7 // examples: 7 (n1+0) or 17(n1+7)
n4 = n4 + 0x30 // examples: 55 (0x37='7') or 65 (0x41='A')
So your code adds 6 to obtain a temp result of 1 or a 0 (depending upon whether the sum > 10) Then it multiplies that temp result by 7 resulting in 0 or 7, then it adds that to '0' returning '0'-'9' or 'A'-'F'
The ASCII output, I have not figured out. Maybe you or someone else can explain it (I show an unobfuscated version of it below).
=-=--==-=-=-=-=-=-=-=-=-=- =-=-=-=-=- =-=-=-=-=-
TIMING ANALYSIS
Alas, as clever as this code is -- avoiding all JMPs and interleaving opcodes to maximize pipelining -- it cannot complete with simple lookup logic. I think there are two downfalls:
1) the code itself is nearly twice as many opcode bytes as Pavlik's (708 vs. 393) so there is just more processing going on.
2) The use of the MUL instruction is a killer (at least it used to be when I kept track of these things).
Your code was getting about
30,200 convertsions per millisecond.
Nice try, interesting code, but Sorry! not the fastest.
=-=-=-=-=-=-=-=-=-=-=-=-=-
CODE Footnotes
mov eax,[esi]
mov edx,eax
and eax,0x0F0F0F0F // eax is lo nibbles (four of them; use ah for example)
mov ecx,eax // save a copy
add eax,0x06060606 // ah is lo nibble + 6
and eax,0x10101010 // ah is 1 if a low nibble was > 9
shr eax,2 // ah is 4 if a low nibble was > 9
add ecx,eax // ah is low nibble + 4 if a low nibble was > 9
shr eax,1 // += 2 if a low nibble was > 9
add ecx,eax // += 1 if a low nibble was > 9 (total +7)
add edx,0x30303030 // += '0' (convert to 0-9 and A-F
// (omitted: repeat for hi nibbles)
// (omitted) rearrange the bytes into eax for 'big endian' architecture
// repeat four times for 16 bytes
///////// ASCII output portion -- who can explain this!
mov eax,[esi]
mov ebx,eax
mov ecx,eax
not ecx
and eax,0x60606060
add eax,0x60606060
and eax,ecx
and eax,0x80808080
shr eax,7 // 0 or 1 for each byte
mov edx,0xff // 00 or FF for each byte
mul edx // killer time penalty here
and ebx,eax
not eax
and eax,0x2E2E2E2E // '....'
or eax,ebx
mov [edi+48],eax
-- Dan
That second submission is unusual to say the least. Like Pavlik, you are always reading 32-bit aligned values and writing 32-bit aligned values. But rather than looking up the hex-digit values, you are calculating them.
I suppose that I am expected to be mystified here. The calculation technique appears to be rather obscure, but only because you have intentionally made it so by using decimal values and interleaved the opcodes (possibly for pipeline optimization). Lets see what it is actually doing:
=-=--==-=-=-=-=-=-=-=-=-=-
CODE ANALYSIS
In the main sequence, you are working with four nibbles simultaneously. To simpliy the discusiion, lets look at one nibble at a time:
assume the first input is either 'W' (0x57) or 'Z' (0x5A)
n1= one nibble // examples: 7 or 10 (examples in decimal)
n2= n1 +6 // examples: 13 or 16
n3= 1 if n2>16, else n3=0
n4 = n1 + (n3*4)+(n3*2)+ n3
i.e. n4= n1+7 // examples: 7 (n1+0) or 17(n1+7)
n4 = n4 + 0x30 // examples: 55 (0x37='7') or 65 (0x41='A')
So your code adds 6 to obtain a temp result of 1 or a 0 (depending upon whether the sum > 10) Then it multiplies that temp result by 7 resulting in 0 or 7, then it adds that to '0' returning '0'-'9' or 'A'-'F'
The ASCII output, I have not figured out. Maybe you or someone else can explain it (I show an unobfuscated version of it below).
=-=--==-=-=-=-=-=-=-=-=-=-
TIMING ANALYSIS
Alas, as clever as this code is -- avoiding all JMPs and interleaving opcodes to maximize pipelining -- it cannot complete with simple lookup logic. I think there are two downfalls:
1) the code itself is nearly twice as many opcode bytes as Pavlik's (708 vs. 393) so there is just more processing going on.
2) The use of the MUL instruction is a killer (at least it used to be when I kept track of these things).
Your code was getting about
30,200 convertsions per millisecond.
Nice try, interesting code, but Sorry! not the fastest.
=-=-=-=-=-=-=-=-=-=-=-=-=-
CODE Footnotes
mov eax,[esi]
mov edx,eax
and eax,0x0F0F0F0F // eax is lo nibbles (four of them; use ah for example)
mov ecx,eax // save a copy
add eax,0x06060606 // ah is lo nibble + 6
and eax,0x10101010 // ah is 1 if a low nibble was > 9
shr eax,2 // ah is 4 if a low nibble was > 9
add ecx,eax // ah is low nibble + 4 if a low nibble was > 9
shr eax,1 // += 2 if a low nibble was > 9
add ecx,eax // += 1 if a low nibble was > 9 (total +7)
add edx,0x30303030 // += '0' (convert to 0-9 and A-F
// (omitted: repeat for hi nibbles)
// (omitted) rearrange the bytes into eax for 'big endian' architecture
// repeat four times for 16 bytes
///////// ASCII output portion -- who can explain this!
mov eax,[esi]
mov ebx,eax
mov ecx,eax
not ecx
and eax,0x60606060
add eax,0x60606060
and eax,ecx
and eax,0x80808080
shr eax,7 // 0 or 1 for each byte
mov edx,0xff // 00 or FF for each byte
mul edx // killer time penalty here
and ebx,eax
not eax
and eax,0x2E2E2E2E // '....'
or eax,ebx
mov [edi+48],eax
-- Dan
Note that I "obfuscated" the hex values because my ASM uses other hex constants than the one in VC++ (you certainly have noticed the $FF one which I forgot to convert). The rest of the obfuscation is for CPU pairing, trying to avoid using one register twice in a row. But this only as sidenote.
I'll explain the ASCII output portion:
Let's split a char into its 8 bits to explain. We have
76543210 Bits
ABBCCCCC Name
Now, your condition >=32 or <128, means that a char is valid exactly when any bit B is set (>31) but not bit A (>127). Therefore, I first mask the char to only reflect the B bits and add 01100000. The highest bit will become 1 in every case where any of the bits B were set and remain 0 if none was set. So I now have a mask telling me if the B bits were OK or not.
Next step is an AND NOT with the original value to kill the "OK" bit if the bit A is set. Finally I only keep the A bit and shift it right, so that I now have a valid mask with a 1 for every valid char. Unfortunately, I need a 0xFF mask in order to filter the chars using AND, and that's why I multiply my mask.
The rest is easy, I do a AND with the original values, which gives me only the valid chars, inverse my mask, convert the mask to "." chars, and add it to the valid chars. That's it.
I'll explain the ASCII output portion:
Let's split a char into its 8 bits to explain. We have
76543210 Bits
ABBCCCCC Name
Now, your condition >=32 or <128, means that a char is valid exactly when any bit B is set (>31) but not bit A (>127). Therefore, I first mask the char to only reflect the B bits and add 01100000. The highest bit will become 1 in every case where any of the bits B were set and remain 0 if none was set. So I now have a mask telling me if the B bits were OK or not.
Next step is an AND NOT with the original value to kill the "OK" bit if the bit A is set. Finally I only keep the A bit and shift it right, so that I now have a valid mask with a 1 for every valid char. Unfortunately, I need a 0xFF mask in order to filter the chars using AND, and that's why I multiply my mask.
The rest is easy, I do a AND with the original values, which gives me only the valid chars, inverse my mask, convert the mask to "." chars, and add it to the valid chars. That's it.
The MUL can be replaced by adding an additional statement:
mov eax,[esi] // EAX: 4 source chars
mov ebx,eax // EBX: copy of them
mov ecx,eax
not ecx // ECX: negated copy (only highest bit of interest)
and eax,$60606060
add eax,$60606060 // highest bit is 1 if bits 5/6 were OK
and eax,ecx // remove the highest bit if set in source
shr eax,7
and eax,$01010101 // mask to 1 for valid, 0 for invalid
mov edx,$80808080 // set up subtraction buffer
sub edx,eax // subtract. mask 0 -> 80, 1 -> 7F
and edx,$80808080 // mask 0 -> 00, 1 -> 7F
and ebx,edx // filter valid chars. invalid chars may be $80
not edx // invert mask
and edx,$2E2E2E2E // make all invalid chars 2E ('.')
or edx,ebx // combine valid and invalid chars
mov [edi+48],edx // write result
mov eax,[esi] // EAX: 4 source chars
mov ebx,eax // EBX: copy of them
mov ecx,eax
not ecx // ECX: negated copy (only highest bit of interest)
and eax,$60606060
add eax,$60606060 // highest bit is 1 if bits 5/6 were OK
and eax,ecx // remove the highest bit if set in source
shr eax,7
and eax,$01010101 // mask to 1 for valid, 0 for invalid
mov edx,$80808080 // set up subtraction buffer
sub edx,eax // subtract. mask 0 -> 80, 1 -> 7F
and edx,$80808080 // mask 0 -> 00, 1 -> 7F
and ebx,edx // filter valid chars. invalid chars may be $80
not edx // invert mask
and edx,$2E2E2E2E // make all invalid chars 2E ('.')
or edx,ebx // combine valid and invalid chars
mov [edi+48],edx // write result
Ok, here's a MUL-free version of the part that does the ASCII:
mov eax,[esi]
mov ebx,eax
mov ecx,eax
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ecx,[esi+4]
and ebx,edx
not edx
and edx,774778414
or edx,ebx
mov eax,ecx
mov [edi+48],edx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ebx,[esi+8]
and ecx,edx
not edx
and edx,774778414
or edx,ecx
mov eax,ebx
mov [edi+52],edx
mov ecx,ebx
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ecx,[esi+12]
and ebx,edx
not edx
and edx,774778414
or edx,ebx
mov eax,ecx
mov [edi+56],edx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
and ecx,edx
not edx
and edx,774778414
or edx,ecx
mov [edi+60],edx
mov eax,[esi]
mov ebx,eax
mov ecx,eax
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ecx,[esi+4]
and ebx,edx
not edx
and edx,774778414
or edx,ebx
mov eax,ecx
mov [edi+48],edx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ebx,[esi+8]
and ecx,edx
not edx
and edx,774778414
or edx,ecx
mov eax,ebx
mov [edi+52],edx
mov ecx,ebx
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ecx,[esi+12]
and ebx,edx
not edx
and edx,774778414
or edx,ebx
mov eax,ecx
mov [edi+56],edx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
and ecx,edx
not edx
and edx,774778414
or edx,ecx
mov [edi+60],edx
i don't know if anybody read the code snippet i posted earlier. but it was exactly written for the purpose of avoiding MUL in the algorithm similar to AvonWyss. The method is, when 9 is subtracted by a nibble, its higher nibble will become 1111b if the nibble is greater than 9, otherwise it will become 0000b. without multiplying this already gives us the mask needed to distinguish 0-9 from A-F.
here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.
mov esi, pSrc;
mov edi, pszBuf;
// let edx store original
mov eax, [esi+00]
mov edx, eax
mov ebx, 0x0a0a0a0a
and eax, 0x0f0f0f0f
sub ebx, eax
shr ebx, 4
and ebx, 0x07070707
or ebx, 0x30303030
add eax, ebx
// repeat with higher nibbles
// mov edx, [esi+00]
shr edx, 4
mov ebx, 0x0a0a0a0a
and edx, 0x0f0f0f0f
sub ebx, edx
and ebx, 0x70707070
shr ebx, 4
or ebx, 0x30303030
add edx, ebx
// display
mov [edi+0], al
shr eax, 8
mov [edi+1], dl
shr edx, 8
mov [edi+3], al
shr eax, 8
mov [edi+4], dl
shr edx, 8
mov [edi+6], al
shr eax, 8
mov [edi+7], dl
shr edx, 8
mov [edi+9], al
mov [edi+10], dl
here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.
mov esi, pSrc;
mov edi, pszBuf;
// let edx store original
mov eax, [esi+00]
mov edx, eax
mov ebx, 0x0a0a0a0a
and eax, 0x0f0f0f0f
sub ebx, eax
shr ebx, 4
and ebx, 0x07070707
or ebx, 0x30303030
add eax, ebx
// repeat with higher nibbles
// mov edx, [esi+00]
shr edx, 4
mov ebx, 0x0a0a0a0a
and edx, 0x0f0f0f0f
sub ebx, edx
and ebx, 0x70707070
shr ebx, 4
or ebx, 0x30303030
add edx, ebx
// display
mov [edi+0], al
shr eax, 8
mov [edi+1], dl
shr edx, 8
mov [edi+3], al
shr eax, 8
mov [edi+4], dl
shr edx, 8
mov [edi+6], al
shr eax, 8
mov [edi+7], dl
shr edx, 8
mov [edi+9], al
mov [edi+10], dl
abson, unfortnuately, I think that your code does not work correctly in all situations:
mov eax, 0x0A090A09 // sample
mov ebx, 0x0a0a0a0a
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"
Your previous version was better, but also failed:
mov edx, 0x0A090A09
mov ebx, edx
sub ebx, 0x0a0a0a0a // ebx=0xFFFEFFFF
and ebx, 0xf0f0f0f0 // ebx=0xF0F0F0F0
mov ecx, ebx
shr ecx, 4 // ecx=0x0F0F0F0F
or ebx, ecx // ebx=0xFFFFFFFF
// at this point, ebx should contain the mask for each nibble
It's obvious that this mask cannot be right. The correct mask would be 0x00FF00FF.
mov eax, 0x0A090A09 // sample
mov ebx, 0x0a0a0a0a
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"
Your previous version was better, but also failed:
mov edx, 0x0A090A09
mov ebx, edx
sub ebx, 0x0a0a0a0a // ebx=0xFFFEFFFF
and ebx, 0xf0f0f0f0 // ebx=0xF0F0F0F0
mov ecx, ebx
shr ecx, 4 // ecx=0x0F0F0F0F
or ebx, ecx // ebx=0xFFFFFFFF
// at this point, ebx should contain the mask for each nibble
It's obvious that this mask cannot be right. The correct mask would be 0x00FF00FF.
Forgot to mention: The MUL was only in used in the ASCII part of the string, not the HEX conversion. See my code for details.
Sorry for all the posts. Sometimes I just don't wait long enough before posting... ;-)
abson, what's going wron in you code is the following:
0x00000000 - 0x00000001 == 0xFFFFFFFF
however, you are assuming that only the nibble changes:
0x00000000 - 0x00000001 != 0x000000FF
Your (original) code can be fixed by adding an additional bit which makes sure that the subtraction ends after the nibble:
mov edx, 0x0A090A09
mov ebx, edx
or ebx, 0x01010100
sub ebx, 0x0a0a0a0a // ebx=0x00FF00FF
and ebx, 0xf0f0f0f0 // ebx=0x00F000F0
mov ecx, ebx
shr ecx, 4 // ecx=0x000F000F
or ebx, ecx // ebx=0x00FF00FF
Which is the correct mask for the following operations.
abson, what's going wron in you code is the following:
0x00000000 - 0x00000001 == 0xFFFFFFFF
however, you are assuming that only the nibble changes:
0x00000000 - 0x00000001 != 0x000000FF
Your (original) code can be fixed by adding an additional bit which makes sure that the subtraction ends after the nibble:
mov edx, 0x0A090A09
mov ebx, edx
or ebx, 0x01010100
sub ebx, 0x0a0a0a0a // ebx=0x00FF00FF
and ebx, 0xf0f0f0f0 // ebx=0x00F000F0
mov ecx, ebx
shr ecx, 4 // ecx=0x000F000F
or ebx, ecx // ebx=0x00FF00FF
Which is the correct mask for the following operations.
i don't know if anybody read the code snippet i posted earlier. but it was exactly written for the purpose of avoiding MUL in the algorithm similar to AvonWyss. The method is, when 9 is subtracted by a nibble, its higher nibble will become 1111b if the nibble is greater than 9, otherwise it will become 0000b. without multiplying this already gives us the mask needed to distinguish 0-9 from A-F.
here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.
mov esi, pSrc;
mov edi, pszBuf;
// let edx store original
mov eax, [esi+00]
mov edx, eax
mov ebx, 0x0a0a0a0a
and eax, 0x0f0f0f0f
sub ebx, eax
shr ebx, 4
and ebx, 0x07070707
or ebx, 0x30303030
add eax, ebx
// repeat with higher nibbles
// mov edx, [esi+00]
shr edx, 4
mov ebx, 0x0a0a0a0a
and edx, 0x0f0f0f0f
sub ebx, edx
and ebx, 0x70707070
shr ebx, 4
or ebx, 0x30303030
add edx, ebx
// display
mov [edi+0], al
shr eax, 8
mov [edi+1], dl
shr edx, 8
mov [edi+3], al
shr eax, 8
mov [edi+4], dl
shr edx, 8
mov [edi+6], al
shr eax, 8
mov [edi+7], dl
shr edx, 8
mov [edi+9], al
mov [edi+10], dl
here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.
mov esi, pSrc;
mov edi, pszBuf;
// let edx store original
mov eax, [esi+00]
mov edx, eax
mov ebx, 0x0a0a0a0a
and eax, 0x0f0f0f0f
sub ebx, eax
shr ebx, 4
and ebx, 0x07070707
or ebx, 0x30303030
add eax, ebx
// repeat with higher nibbles
// mov edx, [esi+00]
shr edx, 4
mov ebx, 0x0a0a0a0a
and edx, 0x0f0f0f0f
sub ebx, edx
and ebx, 0x70707070
shr ebx, 4
or ebx, 0x30303030
add edx, ebx
// display
mov [edi+0], al
shr eax, 8
mov [edi+1], dl
shr edx, 8
mov [edi+3], al
shr eax, 8
mov [edi+4], dl
shr edx, 8
mov [edi+6], al
shr eax, 8
mov [edi+7], dl
shr edx, 8
mov [edi+9], al
mov [edi+10], dl
sorry for the double post, non-intentional ...
AvonWyss, yes, there is an error to my code, but the fix is actually simpler
original code:
mov eax, 0x0A090A09 // sample
mov ebx, 0x0a0a0a0a
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"
new code:
mov eax, 0x0A090A09 // sample
mov ebx, 0x09090909
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"
that is, subtract each nibble from 9 instead of from 10. other than that, i don't think there is anything wrong with the logic, therefore there is no need for the extra bit.
AvonWyss, yes, there is an error to my code, but the fix is actually simpler
original code:
mov eax, 0x0A090A09 // sample
mov ebx, 0x0a0a0a0a
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"
new code:
mov eax, 0x0A090A09 // sample
mov ebx, 0x09090909
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"
that is, subtract each nibble from 9 instead of from 10. other than that, i don't think there is anything wrong with the logic, therefore there is no need for the extra bit.
ASKER
AvonWyss,
Thanks for the clarification on the ASCII conversion. A very clever bit of binary logic. I salute you.
I also retested your code after removing the MUL opcode form the ASCII part of the code. The change does increase the speed. I misstated the original timing test. Sorry. The two timing results:
20,200 conversions per ms -- version with MUL (sorry 'bout my earlier typo)
21,600 conversions per ms -- version without MUL
So replacing the MUL provided a respectable 6% speed increase. However, as clever as the code it, it does not really even approach the speeds of the pure lookup table algorithm.
-- Dan
Thanks for the clarification on the ASCII conversion. A very clever bit of binary logic. I salute you.
I also retested your code after removing the MUL opcode form the ASCII part of the code. The change does increase the speed. I misstated the original timing test. Sorry. The two timing results:
20,200 conversions per ms -- version with MUL (sorry 'bout my earlier typo)
21,600 conversions per ms -- version without MUL
So replacing the MUL provided a respectable 6% speed increase. However, as clever as the code it, it does not really even approach the speeds of the pure lookup table algorithm.
-- Dan
Dan,
May I ask why you are ignoring my comment?
As I mentioned my code is very similar to AvonWyss's code, as it also operates on 4 bytes at a time. But it uses the special characteristic of 2s complement math so it only needs 8 instructions to calculate the HEX representation of 4 nibbles.
I have tested out the full version of it against Pavlik's code, and my version is about 10% slower. I am not claiming my version will ever be faster than the lookup table, but I really think, as an author of an Assembly book, you would be interested.
May I ask why you are ignoring my comment?
As I mentioned my code is very similar to AvonWyss's code, as it also operates on 4 bytes at a time. But it uses the special characteristic of 2s complement math so it only needs 8 instructions to calculate the HEX representation of 4 nibbles.
I have tested out the full version of it against Pavlik's code, and my version is about 10% slower. I am not claiming my version will ever be faster than the lookup table, but I really think, as an author of an Assembly book, you would be interested.
ASKER
absong,
I appologize for ignoring you. I was testing the other two functions becasue they are complete functions and therefore testable. Then I had to use the bathroom and well, I'd rather not describe that particular incident.
I am very interested in your code. And I just spent about 30 minutes trying to figure out what is going on. I wanted to make some intelligent comments about it. Here is what I've been testing (I replaced your comments with my results):
mov eax, 0x0A090A09 // sample
mov ebx, 0x09090909
sub ebx, eax // ebx is now FEFFFF00
shr ebx, 4 // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or ebx, 0x30303030 // ebx is now 37373730
add eax, ebx // eax is now 41404139 = ")()9"
The problem is that it doesn't do what you say in your comments and I don't understand how it is supposed to work so I can't fix it. As is, it outputs nonsense. I'm sure that there must be something that I am missing. Could you explain it in detail and/or show the code using single bytes (al, bl, etc?) so I can extrapolate to the 4-at-a-time logic?
-- Dan
I appologize for ignoring you. I was testing the other two functions becasue they are complete functions and therefore testable. Then I had to use the bathroom and well, I'd rather not describe that particular incident.
I am very interested in your code. And I just spent about 30 minutes trying to figure out what is going on. I wanted to make some intelligent comments about it. Here is what I've been testing (I replaced your comments with my results):
mov eax, 0x0A090A09 // sample
mov ebx, 0x09090909
sub ebx, eax // ebx is now FEFFFF00
shr ebx, 4 // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or ebx, 0x30303030 // ebx is now 37373730
add eax, ebx // eax is now 41404139 = ")()9"
The problem is that it doesn't do what you say in your comments and I don't understand how it is supposed to work so I can't fix it. As is, it outputs nonsense. I'm sure that there must be something that I am missing. Could you explain it in detail and/or show the code using single bytes (al, bl, etc?) so I can extrapolate to the 4-at-a-time logic?
-- Dan
Dan, yes, using single bytes is a good idea.
To convert a nibble N into its HEX representation, we can do the following:
- if the nibble is between 0 and 9, add the nibble by 0x30
- if the nibble is between A and F, add the nibble by 0x37
Notice the difference between 0x30 and 0x37 is minimal. If we can produce a mask, so that it is all 0's when the nibble N is less than or equal to 9, and all 1's when the nibble is greater than 9, we can mask out the 7.
I noticed that subtraction can do exactly that. If N <= 9, R = 9 - N will produce a positive number, therefore the higher nibble of the result R will be all 0's. If N > 9, R = 9 - N will produce a negative number, therefore the higher nibble of the result R will be all 1's.
scenario 1 - assume lower nibble of al is 0x0B, and we want to convert it into hex:
and al, 0x0f ; first clear the high nibble
mov bl, 0x09
sub bl, al ; bl = 9 - al = -2 (binary: 11111110)
; notice at this point, the higher nibble of bl is all 1's. we can use it as a mask.
shr bl, 4
and bl, 0x07 ; bl = 0x0f & 0x07 = 0x07
or bl, 0x30 ; bl = 0x07 | 0x30 = 0x37
add al, bl ; al = 0x0B + 0x37 = 0x42 = 'B'
scenario 2 - assume lower nibble of al is 0x04
and al, 0x0f
mov bl, 0x09
sub bl, al ; bl = 9 - al = 5 (binary: 00000101)
; notice that the higher nibble is all 0's, instead of all 1's
shr bl, 4
and bl, 0x07 ; bl = 0x00 & 0x07 = 0x00
or bl, 0x30 ; bl = 0x00 | 0x30 = 0x30
add al, bl ; al = 0x04 + 0x30 = 0x34 = '4'
To convert a nibble N into its HEX representation, we can do the following:
- if the nibble is between 0 and 9, add the nibble by 0x30
- if the nibble is between A and F, add the nibble by 0x37
Notice the difference between 0x30 and 0x37 is minimal. If we can produce a mask, so that it is all 0's when the nibble N is less than or equal to 9, and all 1's when the nibble is greater than 9, we can mask out the 7.
I noticed that subtraction can do exactly that. If N <= 9, R = 9 - N will produce a positive number, therefore the higher nibble of the result R will be all 0's. If N > 9, R = 9 - N will produce a negative number, therefore the higher nibble of the result R will be all 1's.
scenario 1 - assume lower nibble of al is 0x0B, and we want to convert it into hex:
and al, 0x0f ; first clear the high nibble
mov bl, 0x09
sub bl, al ; bl = 9 - al = -2 (binary: 11111110)
; notice at this point, the higher nibble of bl is all 1's. we can use it as a mask.
shr bl, 4
and bl, 0x07 ; bl = 0x0f & 0x07 = 0x07
or bl, 0x30 ; bl = 0x07 | 0x30 = 0x37
add al, bl ; al = 0x0B + 0x37 = 0x42 = 'B'
scenario 2 - assume lower nibble of al is 0x04
and al, 0x0f
mov bl, 0x09
sub bl, al ; bl = 9 - al = 5 (binary: 00000101)
; notice that the higher nibble is all 0's, instead of all 1's
shr bl, 4
and bl, 0x07 ; bl = 0x00 & 0x07 = 0x00
or bl, 0x30 ; bl = 0x00 | 0x30 = 0x30
add al, bl ; al = 0x04 + 0x30 = 0x34 = '4'
ASKER
Now I see where the problem lies:
It works great to 1-byte values, but when you try to do four at once, a borrow is subtracted from the higher bytes. For instance:
mov eax, 0x0A090A09 // sample
mov ebx, 0x09090909
sub ebx, eax // ebx is now FEFFFF00
shr ebx, 4 // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or ebx, 0x30303030 // ebx is now 37373730
add eax, ebx // eax is now 41404139 = "A@A9"
after
sub ebx, eax
ebx is
FEFFFF00 (actual)
FF00FF00 (desired)
That is, it works for the rightmost byte, but when a nibble is > 9, it causes a borrow from the byte to the left, screwing up the rest of the mask.
==-=-=-=-=-=-=-
You already mentioned that your output logic is not too efficient (those single-byte writes). But I think we might be able to improve upon AvonWyss's output as well.
Coming into the output part, we have four ASCII characters in two 32-bit registers (edx and ecx). We want to output:
_dl _cl space
_dh _ch space
edl ecl space (the lo byte of the hi word)
edh ech space (the hi byte of the hi word)
That is twelve bytes and we aim to do it by writing three four-byte values.
output #1: _dl _cl _20 _dh
output #2: _ch _20 edl ecl
output #3: _20 edh ech _20
the data needs to be reversed for output so:
regout #1 must be: _dh _20 _cl _dl
regout #2 must be: ecl edl _20 _ch
regout #3 must be: _20 ech edh _20
I'm too tired right now to finish this thought... but it seems that there might be a way to set this up a little more efficiently than AvonWyss's 21-instruction sequence ... perhaps using some combination of XCHG and BSWAP. Or perhaps shifting the data into a FPU register.
-- Dan
It works great to 1-byte values, but when you try to do four at once, a borrow is subtracted from the higher bytes. For instance:
mov eax, 0x0A090A09 // sample
mov ebx, 0x09090909
sub ebx, eax // ebx is now FEFFFF00
shr ebx, 4 // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or ebx, 0x30303030 // ebx is now 37373730
add eax, ebx // eax is now 41404139 = "A@A9"
after
sub ebx, eax
ebx is
FEFFFF00 (actual)
FF00FF00 (desired)
That is, it works for the rightmost byte, but when a nibble is > 9, it causes a borrow from the byte to the left, screwing up the rest of the mask.
==-=-=-=-=-=-=-
You already mentioned that your output logic is not too efficient (those single-byte writes). But I think we might be able to improve upon AvonWyss's output as well.
Coming into the output part, we have four ASCII characters in two 32-bit registers (edx and ecx). We want to output:
_dl _cl space
_dh _ch space
edl ecl space (the lo byte of the hi word)
edh ech space (the hi byte of the hi word)
That is twelve bytes and we aim to do it by writing three four-byte values.
output #1: _dl _cl _20 _dh
output #2: _ch _20 edl ecl
output #3: _20 edh ech _20
the data needs to be reversed for output so:
regout #1 must be: _dh _20 _cl _dl
regout #2 must be: ecl edl _20 _ch
regout #3 must be: _20 ech edh _20
I'm too tired right now to finish this thought... but it seems that there might be a way to set this up a little more efficiently than AvonWyss's 21-instruction sequence ... perhaps using some combination of XCHG and BSWAP. Or perhaps shifting the data into a FPU register.
-- Dan
Aha, now I understand where the problem is. Can't believe I was so blind to it. AvonWyss, sorry, I believe you also mentioned the same thing, but at that time I thought that problem only arise because I was subtracting each nibble from 10, instead of 9.
Yes, we'll spend our energy optimizing AvonWyss's code more then.
Yes, we'll spend our energy optimizing AvonWyss's code more then.
Since this Q thread has become pretty large, I suggest that we continue in the 0 point followup I have set up for this purpose:
https://www.experts-exchange.com/assembly/Q.20280946.html
https://www.experts-exchange.com/assembly/Q.20280946.html
ASKER
ASKER
This thread seems to have run its course.
I am accepting this answer, mainly for able's excellent work in the two continuation threads:
http:Q.20280946.html
http:Q.20283475.html
I am also posting some points to AvonWyss in
http:Q.20283475.html
I hope we all had fun! I made a new challenge here:
http:Q.20288519.html
-- Dan
I am accepting this answer, mainly for able's excellent work in the two continuation threads:
http:Q.20280946.html
http:Q.20283475.html
I am also posting some points to AvonWyss in
http:Q.20283475.html
I hope we all had fun! I made a new challenge here:
http:Q.20288519.html
-- Dan
Yes, it was definitely fun. Thanks!
I was not yet finished, but I took a slower pace. I started reading Inner Loops from Rick Booth, which showed me that I have been thinking often in the wrong direction.
There is so much to do with pairing on the Pentium and even more about reading cache lines and processing micro ops (which is done in quite an undeterminable manner) in the Pentium Pro that I think that there is still much more to gain if it comes to speed. And I am talking about all different flavors of the algorithm and all different kinds of processors.
But now on with http:Q.20288519.html!
Cheers!
Abel
I was not yet finished, but I took a slower pace. I started reading Inner Loops from Rick Booth, which showed me that I have been thinking often in the wrong direction.
There is so much to do with pairing on the Pentium and even more about reading cache lines and processing micro ops (which is done in quite an undeterminable manner) in the Pentium Pro that I think that there is still much more to gain if it comes to speed. And I am talking about all different flavors of the algorithm and all different kinds of processors.
But now on with http:Q.20288519.html!
Cheers!
Abel
I'm intrigued.