Start Free Trial

asked on

How fast is your ASM hex converter?

Note [3/24/2002]: Continuing at https://www.experts-exchange.com/assembly/Q.20280946.html
Note [3/31/2002]: And then continued here: http:Q.20283475.html

Just throwing out the gauntlet:
Write an 80x86 routine that converts 16 bytes of data to hex and ASCII format. The ASCII 16 bytes should show . (period) for characters <32 or >127. Output format is 65 bytes:

XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX aaaaaaaaaaaaaaaa\0
//----------------------------------------
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm{
// your code here!
}
}

Prove that your code is the fastest on earth. Show that it is faster than similar code written in straight (optimized) C. We could run tests by converting a 10 MB buffer and timing the results.

Is anyone game?

Are you saying it should look like a DEBUG or (HEXED) output window?

I'm intrigued.

ASKER

Sure! It could be used in a debugger. I chose it because I'm petty sure I can dig up some binary-to-hex tricks that the C optimizer won't know about :-)

Here is a not-t0o-efficient C version of the target routine:

void HexLineOut( BYTE* pSrc, char* pszBuf )
{
int j;
BYTE c, d;
for( j=0; j<16; j++ ) {
c= *pSrc++;
d= c >> 4; // get the hi nybble
if ( d < 10 ) *pszBuf++ = d+'0';
else *pszBuf++ = d+'0' + ('A' - ('9'+1)); // tip: this is d+0x37

d= c & 0x0f; // get the lo nybble
if ( d < 10 ) *pszBuf++ = d+'0';
else *pszBuf++ = d+'0' + ('A' - ('9'+1));

if ( j == 7 ) *pszBuf++ = '-'; // space betwee bytes
else *pszBuf++ = ' '; // dash betwee bytes
}
*pszBuf++ = ' '; // space before ascii

pSrc -= 16; // back to the start

for( j=0; j<16; j++ ) {
c= *pSrc++;
if ( c<32 || c> 127 ) {
c= '.';
}
*pszBuf++ = c;
}
*pszBuf= '\0';
}

Give it a go!

-- Dan

>>Prove that your code is the fastest on earth.

I'm not a bad assambly programmer and might try to write the fastest.
But how in gods name can you prove any code is the fastest other than not being able to find faster code. Wich could also mean you're bad at searching or you have bad competition.

Would it not be more easy to understand to write:

void HexLineOut(BYTE pSrc[16], char pszBuf[65])

ASKER

OK, I'll up the ante...

You must prove that your code is the fastest in the known Universe, and all Not-Yet-Found Parallel Universes, and Whatever Higher Dimensions Nobody Has Ever Imagined, and in any Arbitrary Namespaces That Begin with Z or Feature the Word Omega.

Anything less and you don't win! Supporting evidence such as: "Well, my Mommy said it was really fast when she wasn't playing Quake" will be given more weight in the final judging than "I heard it say 'Zoom' a coupla times".

So get to work or you won't have a chance at the golden apple! The Judges' Decision is Familial!

-- Dan

ASKER CERTIFIED SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER

Smarter than what? who?

If you want to write an Eiffel version, then by all means, do it! It's hard to imagine that it could be faster than *well-written* hand-optimized ASM code, but perhaps it could make use of multiple processors or something.

-- Dan

> Smarter than what? who?
Big blue against the people: computer-optimized c and opt code against the postings here of course :)

Let's keep it fair: one processor. But I guess you're right, especially with an (apparently) trivial task as this, how can a computer outperform a human being?

But I'm getting curious. Let's get to work.

Cheers

ASKER

Here is some helper code that I am placing in a dialog-based app for testing purposes. So far, my best benchmark is around 20,000 characters per ms (a 450MHz P3), but I really haven't started with the hand-optimization yet.

BOOL FileToBuf( LPCSTR sFilename, BYTE* pBuf, int nMax )
{
int nLen= 0;
HFILE hFile= _lopen(sFilename, OF_READ );
if (hFile == HFILE_ERROR ) {
return( FALSE );
}
int nFileLen= _llseek( hFile, 0L, SEEK_END );
_llseek( hFile, 0L, SEEK_SET );

for (int j=0; j<nFileLen; j += 1024) {
if (nLen + 1024 > nMax ) {
break;
}
int nActual= _lread( hFile, pBuf, 1024 );
pBuf += nActual;
nLen += nActual;
}
_lclose( hFile );
return( nLen );
}

void CHexConvertDlg::DumpBuf( BYTE* p, int nLen, BOOL fDisplay /*=TRUE*/ )
{
char szBuf[68];
for (int j= 0; j< nLen; j += 16 ) {
HexLineOut( &p[j], szBuf );
if ( fDisplay ) {
szBuf[65]='\r'; szBuf[66]='\n'; szBuf[67]='\0';
m_ctlEdit.SetSel(-1,-1);
m_ctlEdit.ReplaceSel(szBuf);
m_ctlEdit.SendMessage(EM_SCROLLCARET,0,0);
}
}
}

void CHexConvertDlg::OnButton1()
{
UpdateData( TRUE ); // get the value of m_fShowOutput amd m_nIterCnt

BYTE* pBuf= new BYTE[800000];

int nFileLen= FileToBuf("c:\\temp\\test.txt", pBuf, 800000 );

nFileLen &= 0xfffffff0; // round to closest 16

int nTotalBytes= 0;

timeBeginPeriod(1);
DWORD nStartTick= timeGetTime();

for (int j=0; j< m_nIterCnt; j++ ) {
m_ctlEdit.SetWindowText("");
DumpBuf( pBuf, nFileLen, m_fShowOutput );
nTotalBytes += nFileLen;
}

DWORD nTotalTicks= timeGetTime()-nStartTick;
timeEndPeriod(1);

DWORD nCntPerTick= -1;
if ( nTotalTicks != 0 ) nCntPerTick= nTotalBytes/nTotalTicks;

CString sMsg;
sMsg.Format( "%d bytes converted in %d ms\n\n %d conversions per ms", nTotalBytes, nTotalTicks, nCntPerTick );
MessageBox(sMsg);

delete pBuf;
}

Something like this should already be pretty good (no jump/loops, does some pairing, tries to avoid excessive memory access, uses aligned addresses where easily possible):

// Assuming:
// esi = binary data buffer
// edi = destination char buffer, size 66 bytes

      sub ebx,ebx
      mov edx,[esi+00]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+00],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+02],32
      shr edx,8
      mov [edi+03],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+05],32
      shr edx,8
      mov [edi+06],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+08],32
      mov [edi+09],ax
      mov byte ptr [edi+11],32
      mov edx,[esi+04]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+12],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+14],32
      shr edx,8
      mov [edi+15],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+17],32
      shr edx,8
      mov [edi+18],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+20],32
      mov [edi+21],ax
      mov byte ptr [edi+23],45
      mov edx,[esi+08]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+24],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+26],32
      shr edx,8
      mov [edi+27],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+29],32
      shr edx,8
      mov [edi+30],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+32],32
      mov [edi+33],ax
      mov byte ptr [edi+35],32
      mov edx,[esi+12]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+36],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+38],32
      shr edx,8
      mov [edi+39],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+41],32
      shr edx,8
      mov [edi+42],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+44],32
      mov [edi+45],ax
      mov byte ptr [edi+47],32
      mov edx,[esi+00]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov [edi+48],eax
      mov edx,[esi+04]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov [edi+52],eax
      mov edx,[esi+08]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov [edi+56],eax
      mov edx,[esi+12]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov byte ptr [edi+65],0
      ret

@hex:
      db '000102030405060708090A0B0C0D0E0F'
      db '101112131415161718191A1B1C1D1E1F'
      db '202122232425262728292A2B2C2D2E2F'
      db '303132333435363738393A3B3C3D3E3F'
      db '404142434445464748494A4B4C4D4E4F'
      db '505152535455565758595A5B5C5D5E5F'
      db '606162636465666768696A6B6C6D6E6F'
      db '707172737475767778797A7B7C7D7E7F'
      db '808182838485868788898A8B8C8D8E8F'
      db '909192939495969798999A9B9C9D9E9F'
      db 'A0A1A2A3A4A5A6A7A8A9AAABACADAEAF'
      db 'B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF'
      db 'C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF'
      db 'D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF'
      db 'E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF'
      db 'F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF'
@ascii:
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 032,033,034,035,036,037,038,039,040,041,042,043,044,045,046,047
      db 048,049,050,051,052,053,054,055,056,057,058,059,060,061,062,063
      db 064,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079
      db 080,081,082,083,084,085,086,087,088,089,090,091,092,093,094,095
      db 096,097,098,099,100,101,102,103,104,105,106,107,108,109,110,111
      db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046

ASKER

AvonWyss,
Excellent! Cool! I'm glad there is some action here!

Just a little sloppy... I needed to fix two bugs (see code below)

The code tests out at *very nearly as fast* as a 'reference' C routine I wrote that uses a lookup table for the 'Hex' data (I did not implement the lookup table for the ASCII output -- though that is clearly going to improve the speed when implemented.).

That you are unrolling the loops and collecting the four bytes before outputing is a definite win, but I'm sure there is pleny of room for improvement.

I think that all of the lines that use indirect addressing are costing you: For instance,

      mov ax,word ptr [ebx*2+hex]
or
      mov al,byte ptr [ebx+ascii]

are probably expensive in comparison to a straight lookup through the index register and an increment of that register each time (I could be wrong though...).

Here is the VC++ -ized version of your code, with my changes inidcated:

char hex[]=
      "000102030405060708090A0B0C0D0E0F"
      "101112131415161718191A1B1C1D1E1F"
      "202122232425262728292A2B2C2D2E2F"
      "303132333435363738393A3B3C3D3E3F"
      "404142434445464748494A4B4C4D4E4F"
      "505152535455565758595A5B5C5D5E5F"
      "606162636465666768696A6B6C6D6E6F"
      "707172737475767778797A7B7C7D7E7F"
      "808182838485868788898A8B8C8D8E8F"
      "909192939495969798999A9B9C9D9E9F"
      "A0A1A2A3A4A5A6A7A8A9AAABACADAEAF"
      "B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF"
      "C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF"
      "D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF"
      "E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF"
      "F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF";

char ascii[]= {
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99, 100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46
};

void AvonWysHexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm {
      mov esi, pSrc; ; <<--- Added
      mov edi, pszBuf; ; <<--- Added
      sub ebx,ebx
      mov edx,[esi+00]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+00],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+2],32
      shr edx,8
      mov [edi+03],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+5],32
      shr edx,8
      mov [edi+6],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+8],32
      mov [edi+9],ax
      mov byte ptr [edi+11],32
      mov edx,[esi+4]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+12],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+14],32
      shr edx,8
      mov [edi+15],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+17],32
      shr edx,8
      mov [edi+18],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+20],32
      mov [edi+21],ax
      mov byte ptr [edi+23],45
      mov edx,[esi+8]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+24],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+26],32
      shr edx,8
      mov [edi+27],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+29],32
      shr edx,8
      mov [edi+30],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+32],32
      mov [edi+33],ax
      mov byte ptr [edi+35],32
      mov edx,[esi+12]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+36],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+38],32
      shr edx,8
      mov [edi+39],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+41],32
      shr edx,8
      mov [edi+42],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+44],32
      mov [edi+45],ax
      mov byte ptr [edi+47],32
      mov edx,[esi+00]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+48],eax
      mov edx,[esi+04]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+52],eax
      mov edx,[esi+8]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+56],eax
      mov edx,[esi+12]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+60],eax ;<<<--- added
      mov byte ptr [edi+64],0 ;<<<--- fixed
      }
}

-- Dan

P.S. I'm still looking for a winner...

Dan, thanks for trying my code. Note that the bugs are due to the helper code I used when writing it, which I then stopped again and obviously broke my code.

Also, a test like this makes it obvious that nowadays processors and compilers already do a pretty good job on generating fast code. For optimal use of the processor pipelines, lots of simple commands are about the best code you can generate, and the compilers do exactly this.

Anyways, there are pretty huge differences with the processors available by means of pairing and other things. However, indirect addressing should no longer be costy on any processor since the Pentium (1 cycle allways, but may impact pairing due to the additional registers used). What could be a real neckbreaker are my 16-bit reads and writes on uneven addresses. I believe these are not very good, but I didn't really have a better idea. Maybe I could also collect 32 bits of data to write before doing so (e.g. "XX X","X XX","XX X" etc. pattern). I I have some time again, I'll look into it.

Is there a method to align ASM data or code (the compiler also does this) in memory in the inline assembler of VC++?

ASKER

>>Is there a method to align ASM data or code
If you posted this Q in the Assembly TA, I could get my first Assembly TA points!

The VC++ _asm assembler supports the EVEN and ALIGN directives. But only for code (it inserts NOPs as needed).

It does not support DB, DW, etc. for defining data (as you may have inferred from my mods to your code). But since you can define it outside of the _asm block, I think it will automatically be aligned in the same manner as other C data structures (it is a project setting that defaults to 8 bytes).

>>indirect addressing should no longer be costy on any
>> processor since the Pentium
I can't make heads-nor-tails of opcode timings in the later 80x86 documentation. It used to be there were nice simple charts showing the clock-count for each addressing mode and opcode.

>>Neckbreaker...16-bit reads and writes on uneven addresses
I agree, this must be costing something.

>>Maybe I could also collect 32 bits of data to write
>>before doing so (e.g. "XX X","X XX","XX X" etc. pattern).

This shouldn't be too hard since the loops are unrolled anyway... as soon as you have four bytes accumulated, output'em.

-- Dan

You should also not count the time it takes to display on the screen. At this point, you are at the mercy of the video display to see how fast it can display a line of text in the edit control.

Instead, turn off the edit control's redraw before you start. Then turn it back on when you are done:

timeBeginPeriod(1);
DWORD nStartTick= timeGetTime();
m_ctlEdit.SetRedraw(FALSE);

// make calls to dump hex

DWORD nTotalTicks= timeGetTime()-nStartTick;
timeEndPeriod(1);
m_ctlEdit.SetRedraw(TRUE);

Of course, by doing this, you will no longer see the updates on the screen until it is finished. If that is a part of your design, then you cannot do this. But realize that you are timing the output as well as the conversion.

ASKER

>>You should also not count the time it takes to display on the screen.

You are right thui!

Notice that I have a flag named m_fShowOutput that get passed into the DumpBuf routine. When I benchmark other people's code, I turn this ON. No matter how well-written the code, everybody always scores out at about 9 conversions per millisconed. But when I test my own code, I 'accidentlally' turn it off. Longshot, but perhaps that is why my code is currently yielding about 21,000 conversions per millisecond.

-- Dan

P.S.
I am about to publish my own Hex-to-ASCII routine and when I do that all challengers will cry alligator tears of frustration, knowing that the Universe will never see faster code. So now's your chance.

DanRollins, in this case, please do test the code again without output at all, so that the conditions are the same for both applications... and post the resultsd again :-)

ASKER

Oh PUUUULLLEEEEEASE.

How stupid is this Mr. DanRollins. Hm, he was able to convert my raw ASM into runnnable VC++ code and he fixed two of my coding errors. He suggested some rather subtle improvements to my code and helped me learn about code alignment in an _asm block. Even so, he is probably so dense that he could not recognoize the difference between 9 conversions per millisecond and 20,000 conversions per milliseconfd. I'd better remind him to not display the lines as they are converted.

sheesh.

-- Dan

ASKER

Here is jonin's submission from the C++ TA. It boils down to:

inline void f()
{
printf("%x %x %x %x %x %x %x %x-%x %x %x %x %x %x %x %x",
data[0],data[1],data[2],data[3],data[4],data[5],data[6],
data[7],data[8],data[9],data[10],data[11],data[12],data[13],
data[14],data[15]);
printf(" %c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c\n",
ascii[data[0]],
ascii[data[1]],
ascii[data[2]],
ascii[data[3]],
ascii[data[4]],
ascii[data[5]],
ascii[data[6]],
ascii[data[7]],
ascii[data[8]],
ascii[data[9]],
ascii[data[10]],
ascii[data[11]],
ascii[data[12]],
ascii[data[13]],
ascii[data[14]],
ascii[data[15]]);
}
=--=-=-=-=-=-=-=-=-==-
I will convert from printf to the sprintf and run a timing test on it tonight, but I'm afraid that it will be considerably slower than some other tries. Also, I fear that it won't meet the specs in that if any of the bytes < 0x10, they will display as a single hex digit rather than the two digits in the spec.

jonin says that even bad ASM will be 10 times faster than fast C, but I disagree. My reference C fn is just slightly faster than AvonWyss's ASM code. But it avoids the (relatively) massive overhead involved in pushing 32 data values onto the stack and calling printf.

-- Dan

I'm sorry to say but I don't see 1 single assembly command.

And I clearly remember reading:

_asm{

// your code here!

}

So in my humble opinion you do not qualify to get the points.

ASKER

Hi NicoLan...
I invited some C/C++ experts to try their hand at this challenge. I think that it will be interesting to see how well a C-language optimizer -- with a cleverly implemented algorithm -- can do against raw ASM.

-- Dan

ASKER

Here is Pavlik's C/C++ submission (from http:Q.20280145.html )

I will test it tonight and post the timings! Note that I am not certain that it meets all criteria in the specification, but I won't hold that against him if it can be modified to meet the spec without slowing down.

static const long digits[256] =
{
0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330,
0x6430, 0x6530, 0x6630,
0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331,
0x6431, 0x6531, 0x6631,
0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332,
0x6432, 0x6532, 0x6632,
0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333,
0x6433, 0x6533, 0x6633,
0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334,
0x6434, 0x6534, 0x6634,
0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335,
0x6435, 0x6535, 0x6635,
0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336,
0x6436, 0x6536, 0x6636,
0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337,
0x6437, 0x6537, 0x6637,
0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338,
0x6438, 0x6538, 0x6638,
0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339,
0x6439, 0x6539, 0x6639,
0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361,
0x6461, 0x6561, 0x6661,
0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362,
0x6462, 0x6562, 0x6662,
0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363,
0x6463, 0x6563, 0x6663,
0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364,
0x6464, 0x6564, 0x6664,
0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365,
0x6465, 0x6565, 0x6665,
0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366,
0x6466, 0x6566, 0x6666
};
static const char chars[256] =
{
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'\x20', '\x21','\x22','\x23','\x24','\x25','\x26','\x27','\x28','\x29','\x2a','\x2b','\x2c','\x2d','\x2e','\x2f',
'\x30', '\x31','\x32','\x33','\x34','\x35','\x36','\x37','\x38','\x39','\x3a','\x3b','\x3c','\x3d','\x3e','\x3f',
'\x40', '\x41','\x42','\x43','\x44','\x45','\x46','\x47','\x48','\x49','\x4a','\x4b','\x4c','\x4d','\x4e','\x4f',
'\x50', '\x51','\x52','\x53','\x54','\x55','\x56','\x57','\x58','\x59','\x5a','\x5b','\x5c','\x5d','\x5e','\x5f',
'\x60', '\x61','\x62','\x63','\x64','\x65','\x66','\x67','\x68','\x69','\x6a','\x6b','\x6c','\x6d','\x6e','\x6f',
'\x70', '\x71','\x72','\x73','\x74','\x75','\x76','\x77','\x78','\x79','\x7a','\x7b','\x7c','\x7d','\x7e','\x7f',
'\x80', '\x81','\x82','\x83','\x84','\x85','\x86','\x87','\x88','\x89','\x8a','\x8b','\x8c','\x8d','\x8e','\x8f',
'\x90', '\x91','\x92','\x93','\x94','\x95','\x96','\x97','\x98','\x99','\x9a','\x9b','\x9c','\x9d','\x9e','\x9f',
'\xa0', '\xa1','\xa2','\xa3','\xa4','\xa5','\xa6','\xa7','\xa8','\xa9','\xaa','\xab','\xac','\xad','\xae','\xaf',
'\xb0', '\xb1','\xb2','\xb3','\xb4','\xb5','\xb6','\xb7','\xb8','\xb9','\xba','\xbb','\xbc','\xbd','\xbe','\xbf',
'\xc0', '\xc1','\xc2','\xc3','\xc4','\xc5','\xc6','\xc7','\xc8','\xc9','\xca','\xcb','\xcc','\xcd','\xce','\xcf',
'\xd0', '\xd1','\xd2','\xd3','\xd4','\xd5','\xd6','\xd7','\xd8','\xd9','\xda','\xdb','\xdc','\xdd','\xde','\xdf',
'\xe0', '\xe1','\xe2','\xe3','\xe4','\xe5','\xe6','\xe7','\xe8','\xe9','\xea','\xeb','\xec','\xed','\xee','\xef',
'\xf0', '\xf1','\xf2','\xf3','\xf4','\xf5','\xf6','\xf7','\xf8','\xf9','\xfa','\xfb','\xfc','\xfd','\xfe','\xff'
};

long* pBuf = (long*) buf;

for (int i = 4; i > 0; --i)
{
*pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
++pBuf;
*pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
++pBuf;
*pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
++pBuf;
dump += 4;
}

dump -= 16;

*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;
dump += 4;
*pBuf =
chars[*dump] |
(chars[*(dump+1)] << 8) |
(chars[*(dump+2)] << 16) |
(chars[*(dump+3)] << 24);
++pBuf;

*pBuf = 0;
}

I do the following:

1. read data 4 bytes at a time
2. operate on lower nibbles of each byte
3. make a mask that corresponds to each nibble. if the nibble is less than 10, mask is 0xff, else mask is 0x00
4. add (mask & 0x30303030) to original nibbles
5. add (mask & 0x37373737) to original nibbles
6. at this point, the register should contain 4 bytes, each a ASCII representation of the corresponding lower nibble.
7. store these values, and repeat with higher nibbles.

code:
mov edx, [esi+00]
and edx, 0x0f0f0f0f
mov ebx, edx
sub ebx, 0x0a0a0a0a
and ebx, 0xf0f0f0f0
mov ecx, ebx
shr ecx, 4
or ebx, ecx

// at this point, ebx should contain the mask for each nibble

mov ecx, ebx
and ecx, 0x30303030
add edx, ecx
mov ecx, ebx
not ecx
and ecx, 0x37373737
add edx, ecx

// at this point, edx contains ASCII representation of each nibble.

// my ASM is too rusty, so don't know how to populate the result buffer.

ASKER

First, analysis of jonnin's code. I had to modify his format string to enforce two-hex digits per byte and I use sprintf to output to the buffer as specified in the challenge.

This is processing a measly 450 conversions per millisecond -- fully optimized and wNOT displaying the output. He mentions in his post that ASM is likely to be 100 times as fast, and he is right.

Here is the working version (same ascii[] array as in AvonWyss entry) :

void JonninHexLineOut( BYTE* pSrc, char* pszBuf )
{
BYTE* data= pSrc;
sprintf(pszBuf,
"%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x-"
"%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x",
data[0],data[1],data[2],data[3],data[4],data[5],data[6],
data[7],data[8],data[9],data[10],data[11],data[12],data[13],
data[14],data[15]);
sprintf(&pszBuf[47], " %c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c",
ascii[data[0]], ascii[data[1]], ascii[data[2]], ascii[data[3]],
ascii[data[4]], ascii[data[5]], ascii[data[6]], ascii[data[7]],
ascii[data[8]], ascii[data[9]], ascii[data[10]], ascii[data[11]],
ascii[data[12]], ascii[data[13]], ascii[data[14]], ascii[data[15]]);
}

ASKER

Now Pavlik submitted a C-language entry that is pretty awesome. This fn uses two lookup tables and it process the input in 32-bit units and outputs in 32-bit units (using a clever masking scheme for the hex part, I might add).

The original code had these (relatively minor) problems
1) it did not put the dash after the 8th set of hex digits
2) it outputs characters > 127 (rather than '.' as specified).
3) A subtle err occurs because of signed vs unsigned values. The error goes away when I modify his chars[256] array to type 'unsigned char' (it was 'char').

(anyone care to predict how the bug manifest itself -- what the output looked like before my fix?)

Anyway, when compiled with full optimization, this is processing a whopping

40,300 conversions per millisecond!

nearly TWICE as fast an any routine so far! And it is C, NOT ASM!

Pavlik, if you unroll that first loop, you can get a minor increase. I'm surprized you didn't do it.

Here is the code he submitted (with my minor changes to meet spec):

static const long digits[256] =
{
0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330, 0x6430, 0x6530, 0x6630,
0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331, 0x6431, 0x6531, 0x6631,
0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332, 0x6432, 0x6532, 0x6632,
0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333, 0x6433, 0x6533, 0x6633,
0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334, 0x6434, 0x6534, 0x6634,
0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335, 0x6435, 0x6535, 0x6635,
0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336, 0x6436, 0x6536, 0x6636,
0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337, 0x6437, 0x6537, 0x6637,
0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338, 0x6438, 0x6538, 0x6638,
0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339, 0x6439, 0x6539, 0x6639,
0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361, 0x6461, 0x6561, 0x6661,
0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362, 0x6462, 0x6562, 0x6662,
0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363, 0x6463, 0x6563, 0x6663,
0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364, 0x6464, 0x6564, 0x6664,
0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365, 0x6465, 0x6565, 0x6665,
0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366, 0x6466, 0x6566, 0x6666
};
static const unsigned char chars[256] = //<<----- [DR] fixed (was signed: char)
{
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
'\x20', '\x21','\x22','\x23','\x24','\x25','\x26','\x27','\x28','\x29','\x2a','\x2b','\x2c','\x2d','\x2e','\x2f',
'\x30', '\x31','\x32','\x33','\x34','\x35','\x36','\x37','\x38','\x39','\x3a','\x3b','\x3c','\x3d','\x3e','\x3f',
'\x40', '\x41','\x42','\x43','\x44','\x45','\x46','\x47','\x48','\x49','\x4a','\x4b','\x4c','\x4d','\x4e','\x4f',
'\x50', '\x51','\x52','\x53','\x54','\x55','\x56','\x57','\x58','\x59','\x5a','\x5b','\x5c','\x5d','\x5e','\x5f',
'\x60', '\x61','\x62','\x63','\x64','\x65','\x66','\x67','\x68','\x69','\x6a','\x6b','\x6c','\x6d','\x6e','\x6f',
'\x70', '\x71','\x72','\x73','\x74','\x75','\x76','\x77','\x78','\x79','\x7a','\x7b','\x7c','\x7d','\x7e','\x7f',
'\x80', '\x81','\x82','\x83','\x84','\x85','\x86','\x87','\x88','\x89','\x8a','\x8b','\x8c','\x8d','\x8e','\x8f',
'\x90', '\x91','\x92','\x93','\x94','\x95','\x96','\x97','\x98','\x99','\x9a','\x9b','\x9c','\x9d','\x9e','\x9f',
'\xa0', '\xa1','\xa2','\xa3','\xa4','\xa5','\xa6','\xa7','\xa8','\xa9','\xaa','\xab','\xac','\xad','\xae','\xaf',
'\xb0', '\xb1','\xb2','\xb3','\xb4','\xb5','\xb6','\xb7','\xb8','\xb9','\xba','\xbb','\xbc','\xbd','\xbe','\xbf',
'\xc0', '\xc1','\xc2','\xc3','\xc4','\xc5','\xc6','\xc7','\xc8','\xc9','\xca','\xcb','\xcc','\xcd','\xce','\xcf',
'\xd0', '\xd1','\xd2','\xd3','\xd4','\xd5','\xd6','\xd7','\xd8','\xd9','\xda','\xdb','\xdc','\xdd','\xde','\xdf',
'\xe0', '\xe1','\xe2','\xe3','\xe4','\xe5','\xe6','\xe7','\xe8','\xe9','\xea','\xeb','\xec','\xed','\xee','\xef',
'\xf0', '\xf1','\xf2','\xf3','\xf4','\xf5','\xf6','\xf7','\xf8','\xf9','\xfa','\xfb','\xfc','\xfd','\xfe','\xff'
};

void HexLineOut( BYTE* pSrc, char* pszBuf )
{
      BYTE* dump = pSrc;
      long* pBuf = (long*) pszBuf;

      for (int i = 4; i > 0; --i) {
            *pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
            ++pBuf;
            *pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
            ++pBuf;
            *pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
            ++pBuf;
            dump += 4;
      }
      dump -= 16;

      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;
      dump += 4;
      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;
      dump += 4;
      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;
      dump += 4;
      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;

      *pBuf = 0;

pszBuf[23]= '-'; // <<----- [DR] Added
}

=-=-=-=-=-=-=-=-
I'm admitting here and now that I may not be able to beat that by much.

-- Dan

Screw EE! I cannot post my new source, I only get that stupid "no text" thing.

Here's my new version. No tables, no jumps, all memory access is 32-bit on 32-bit boundaries.

mov edi,pszBuf
mov esi,pSrc
mov byte ptr [edi+64],0
mov eax,[esi]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+4],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+8],eax
mov eax,[esi+4]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+12],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+16],eax
mov al,cl
mov ah,45
shl eax,16
mov al,32
mov ah,dh
mov [edi+20],eax
mov eax,[esi+8]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+24],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+28],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+32],eax
mov eax,[esi+12]
mov edx,eax
and eax,252645135
shr edx,4
mov ecx,eax
and edx,252645135
add eax,101058054
mov ebx,edx
and eax,269488144
add ebx,101058054
shr eax,2
and ebx,269488144
add ecx,eax
shr ebx,2
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
shr eax,1
add edx,ebx
add ecx,eax
shr ebx,1
add ecx,808464432
add edx,ebx
add edx,808464432
mov al,32
mov ah,dh
shl eax,16
mov al,dl
mov ah,cl
shr edx,16
mov [edi+36],eax
shr ecx,8
mov al,dl
mov ah,ch
shl eax,16
mov al,cl
mov ah,32
shr ecx,16
mov [edi+40],eax
mov al,cl
mov ah,32
shl eax,16
mov al,32
mov ah,dh
mov [edi+44],eax
mov eax,[esi]
mov ebx,eax
mov ecx,eax
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ecx,[esi+4]
and ebx,eax
not eax
and eax,774778414
or eax,ebx
mov [edi+48],eax
mov eax,ecx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ebx,[esi+8]
and ecx,eax
not eax
and eax,774778414
or eax,ecx
mov [edi+52],eax
mov eax,ebx
mov ecx,ebx
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
mov ecx,[esi+12]
and ebx,eax
not eax
and eax,774778414
or eax,ebx
mov [edi+56],eax
mov eax,ecx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$FF
mul edx
and ecx,eax
not eax
and eax,774778414
or eax,ecx
mov [edi+60],eax

Hope you like it...

ASKER

AvonWyss,
That second submission is unusual to say the least. Like Pavlik, you are always reading 32-bit aligned values and writing 32-bit aligned values. But rather than looking up the hex-digit values, you are calculating them.

I suppose that I am expected to be mystified here. The calculation technique appears to be rather obscure, but only because you have intentionally made it so by using decimal values and interleaved the opcodes (possibly for pipeline optimization). Lets see what it is actually doing:
=-=--==-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
CODE ANALYSIS
In the main sequence, you are working with four nibbles simultaneously. To simpliy the discusiion, lets look at one nibble at a time:

assume the first input is either 'W' (0x57) or 'Z' (0x5A)

n1= one nibble // examples: 7 or 10 (examples in decimal)
n2= n1 +6 // examples: 13 or 16
n3= 1 if n2>16, else n3=0
n4 = n1 + (n3*4)+(n3*2)+ n3
i.e. n4= n1+7 // examples: 7 (n1+0) or 17(n1+7)
n4 = n4 + 0x30 // examples: 55 (0x37='7') or 65 (0x41='A')

So your code adds 6 to obtain a temp result of 1 or a 0 (depending upon whether the sum > 10) Then it multiplies that temp result by 7 resulting in 0 or 7, then it adds that to '0' returning '0'-'9' or 'A'-'F'

The ASCII output, I have not figured out. Maybe you or someone else can explain it (I show an unobfuscated version of it below).

=-=--==-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TIMING ANALYSIS
Alas, as clever as this code is -- avoiding all JMPs and interleaving opcodes to maximize pipelining -- it cannot complete with simple lookup logic. I think there are two downfalls:

1) the code itself is nearly twice as many opcode bytes as Pavlik's (708 vs. 393) so there is just more processing going on.

2) The use of the MUL instruction is a killer (at least it used to be when I kept track of these things).

Your code was getting about

30,200 convertsions per millisecond.

Nice try, interesting code, but Sorry! not the fastest.

=-=-=-=-=-=-=-=-=-=-=-=-=-
CODE Footnotes

mov eax,[esi]
mov edx,eax
and eax,0x0F0F0F0F // eax is lo nibbles (four of them; use ah for example)
mov ecx,eax // save a copy
add eax,0x06060606 // ah is lo nibble + 6
and eax,0x10101010 // ah is 1 if a low nibble was > 9
shr eax,2 // ah is 4 if a low nibble was > 9
add ecx,eax // ah is low nibble + 4 if a low nibble was > 9
shr eax,1 // += 2 if a low nibble was > 9
add ecx,eax // += 1 if a low nibble was > 9 (total +7)
add edx,0x30303030 // += '0' (convert to 0-9 and A-F
// (omitted: repeat for hi nibbles)
// (omitted) rearrange the bytes into eax for 'big endian' architecture
// repeat four times for 16 bytes

///////// ASCII output portion -- who can explain this!

mov eax,[esi]
mov ebx,eax
mov ecx,eax
not ecx
and eax,0x60606060
add eax,0x60606060
and eax,ecx
and eax,0x80808080
shr eax,7 // 0 or 1 for each byte
mov edx,0xff // 00 or FF for each byte
mul edx // killer time penalty here
and ebx,eax
not eax
and eax,0x2E2E2E2E // '....'
or eax,ebx
mov [edi+48],eax

-- Dan

Note that I "obfuscated" the hex values because my ASM uses other hex constants than the one in VC++ (you certainly have noticed the $FF one which I forgot to convert). The rest of the obfuscation is for CPU pairing, trying to avoid using one register twice in a row. But this only as sidenote.

I'll explain the ASCII output portion:

Let's split a char into its 8 bits to explain. We have
76543210 Bits
ABBCCCCC Name

Now, your condition >=32 or <128, means that a char is valid exactly when any bit B is set (>31) but not bit A (>127). Therefore, I first mask the char to only reflect the B bits and add 01100000. The highest bit will become 1 in every case where any of the bits B were set and remain 0 if none was set. So I now have a mask telling me if the B bits were OK or not.

Next step is an AND NOT with the original value to kill the "OK" bit if the bit A is set. Finally I only keep the A bit and shift it right, so that I now have a valid mask with a 1 for every valid char. Unfortunately, I need a 0xFF mask in order to filter the chars using AND, and that's why I multiply my mask.

The rest is easy, I do a AND with the original values, which gives me only the valid chars, inverse my mask, convert the mask to "." chars, and add it to the valid chars. That's it.

The MUL can be replaced by adding an additional statement:

mov eax,[esi] // EAX: 4 source chars
mov ebx,eax // EBX: copy of them
mov ecx,eax
not ecx // ECX: negated copy (only highest bit of interest)
and eax,$60606060
add eax,$60606060 // highest bit is 1 if bits 5/6 were OK
and eax,ecx // remove the highest bit if set in source
shr eax,7
and eax,$01010101 // mask to 1 for valid, 0 for invalid
mov edx,$80808080 // set up subtraction buffer
sub edx,eax // subtract. mask 0 -> 80, 1 -> 7F
and edx,$80808080 // mask 0 -> 00, 1 -> 7F
and ebx,edx // filter valid chars. invalid chars may be $80
not edx // invert mask
and edx,$2E2E2E2E // make all invalid chars 2E ('.')
or edx,ebx // combine valid and invalid chars
mov [edi+48],edx // write result

Ok, here's a MUL-free version of the part that does the ASCII:

mov eax,[esi]
mov ebx,eax
mov ecx,eax
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ecx,[esi+4]
and ebx,edx
not edx
and edx,774778414
or edx,ebx
mov eax,ecx
mov [edi+48],edx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ebx,[esi+8]
and ecx,edx
not edx
and edx,774778414
or edx,ecx
mov eax,ebx
mov [edi+52],edx
mov ecx,ebx
and eax,1616928864
not ecx
add eax,1616928864
and eax,ecx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
mov ecx,[esi+12]
and ebx,edx
not edx
and edx,774778414
or edx,ebx
mov eax,ecx
mov [edi+56],edx
mov ebx,ecx
and eax,1616928864
not ebx
add eax,1616928864
and eax,ebx
and eax,2155905152
shr eax,7
mov edx,$80808080
sub edx,eax
and edx,$7F7F7F7F
and ecx,edx
not edx
and edx,774778414
or edx,ecx
mov [edi+60],edx

i don't know if anybody read the code snippet i posted earlier. but it was exactly written for the purpose of avoiding MUL in the algorithm similar to AvonWyss. The method is, when 9 is subtracted by a nibble, its higher nibble will become 1111b if the nibble is greater than 9, otherwise it will become 0000b. without multiplying this already gives us the mask needed to distinguish 0-9 from A-F.

here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.

mov esi, pSrc;
mov edi, pszBuf;

// let edx store original
mov eax, [esi+00]
mov edx, eax
mov ebx, 0x0a0a0a0a
and eax, 0x0f0f0f0f
sub ebx, eax
shr ebx, 4
and ebx, 0x07070707
or ebx, 0x30303030
add eax, ebx

// repeat with higher nibbles
// mov edx, [esi+00]
shr edx, 4
mov ebx, 0x0a0a0a0a
and edx, 0x0f0f0f0f
sub ebx, edx
and ebx, 0x70707070
shr ebx, 4
or ebx, 0x30303030
add edx, ebx

// display
mov [edi+0], al
shr eax, 8
mov [edi+1], dl
shr edx, 8
mov [edi+3], al
shr eax, 8
mov [edi+4], dl
shr edx, 8
mov [edi+6], al
shr eax, 8
mov [edi+7], dl
shr edx, 8
mov [edi+9], al
mov [edi+10], dl

abson, unfortnuately, I think that your code does not work correctly in all situations:

mov eax, 0x0A090A09 // sample

mov ebx, 0x0a0a0a0a
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"

Your previous version was better, but also failed:

mov edx, 0x0A090A09
mov ebx, edx
sub ebx, 0x0a0a0a0a // ebx=0xFFFEFFFF
and ebx, 0xf0f0f0f0 // ebx=0xF0F0F0F0
mov ecx, ebx
shr ecx, 4 // ecx=0x0F0F0F0F
or ebx, ecx // ebx=0xFFFFFFFF
// at this point, ebx should contain the mask for each nibble

It's obvious that this mask cannot be right. The correct mask would be 0x00FF00FF.

Forgot to mention: The MUL was only in used in the ASCII part of the string, not the HEX conversion. See my code for details.

Sorry for all the posts. Sometimes I just don't wait long enough before posting... ;-)
abson, what's going wron in you code is the following:
0x00000000 - 0x00000001 == 0xFFFFFFFF

however, you are assuming that only the nibble changes:
0x00000000 - 0x00000001 != 0x000000FF

Your (original) code can be fixed by adding an additional bit which makes sure that the subtraction ends after the nibble:

mov edx, 0x0A090A09
mov ebx, edx
or ebx, 0x01010100
sub ebx, 0x0a0a0a0a // ebx=0x00FF00FF
and ebx, 0xf0f0f0f0 // ebx=0x00F000F0
mov ecx, ebx
shr ecx, 4 // ecx=0x000F000F
or ebx, ecx // ebx=0x00FF00FF

Which is the correct mask for the following operations.

i don't know if anybody read the code snippet i posted earlier. but it was exactly written for the purpose of avoiding MUL in the algorithm similar to AvonWyss. The method is, when 9 is subtracted by a nibble, its higher nibble will become 1111b if the nibble is greater than 9, otherwise it will become 0000b. without multiplying this already gives us the mask needed to distinguish 0-9 from A-F.

here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.

mov esi, pSrc;
mov edi, pszBuf;

// let edx store original
mov eax, [esi+00]
mov edx, eax
mov ebx, 0x0a0a0a0a
and eax, 0x0f0f0f0f
sub ebx, eax
shr ebx, 4
and ebx, 0x07070707
or ebx, 0x30303030
add eax, ebx

// repeat with higher nibbles
// mov edx, [esi+00]
shr edx, 4
mov ebx, 0x0a0a0a0a
and edx, 0x0f0f0f0f
sub ebx, edx
and ebx, 0x70707070
shr ebx, 4
or ebx, 0x30303030
add edx, ebx

// display
mov [edi+0], al
shr eax, 8
mov [edi+1], dl
shr edx, 8
mov [edi+3], al
shr eax, 8
mov [edi+4], dl
shr edx, 8
mov [edi+6], al
shr eax, 8
mov [edi+7], dl
shr edx, 8
mov [edi+9], al
mov [edi+10], dl

sorry for the double post, non-intentional ...
AvonWyss, yes, there is an error to my code, but the fix is actually simpler

original code:
mov eax, 0x0A090A09 // sample

mov ebx, 0x0a0a0a0a
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"

new code:
mov eax, 0x0A090A09 // sample

mov ebx, 0x09090909
sub ebx, eax // ebx is now 00010001
shr ebx, 4 // ebx is now 00001000
and ebx, 0x07070707 // ebx is now 00000000
or ebx, 0x30303030 // ebx is now 30303030
add eax, ebx // ebx is now 3A393A39 = ":9:9"

that is, subtract each nibble from 9 instead of from 10. other than that, i don't think there is anything wrong with the logic, therefore there is no need for the extra bit.

ASKER

AvonWyss,
Thanks for the clarification on the ASCII conversion. A very clever bit of binary logic. I salute you.

I also retested your code after removing the MUL opcode form the ASCII part of the code. The change does increase the speed. I misstated the original timing test. Sorry. The two timing results:

20,200 conversions per ms -- version with MUL (sorry 'bout my earlier typo)
21,600 conversions per ms -- version without MUL

So replacing the MUL provided a respectable 6% speed increase. However, as clever as the code it, it does not really even approach the speeds of the pure lookup table algorithm.

-- Dan

Dan,
May I ask why you are ignoring my comment?
As I mentioned my code is very similar to AvonWyss's code, as it also operates on 4 bytes at a time. But it uses the special characteristic of 2s complement math so it only needs 8 instructions to calculate the HEX representation of 4 nibbles.
I have tested out the full version of it against Pavlik's code, and my version is about 10% slower. I am not claiming my version will ever be faster than the lookup table, but I really think, as an author of an Assembly book, you would be interested.

ASKER

absong,
I appologize for ignoring you. I was testing the other two functions becasue they are complete functions and therefore testable. Then I had to use the bathroom and well, I'd rather not describe that particular incident.

I am very interested in your code. And I just spent about 30 minutes trying to figure out what is going on. I wanted to make some intelligent comments about it. Here is what I've been testing (I replaced your comments with my results):

mov eax, 0x0A090A09 // sample

mov ebx, 0x09090909
sub ebx, eax // ebx is now FEFFFF00
shr ebx, 4 // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or ebx, 0x30303030 // ebx is now 37373730
add eax, ebx // eax is now 41404139 = ")()9"

The problem is that it doesn't do what you say in your comments and I don't understand how it is supposed to work so I can't fix it. As is, it outputs nonsense. I'm sure that there must be something that I am missing. Could you explain it in detail and/or show the code using single bytes (al, bl, etc?) so I can extrapolate to the 4-at-a-time logic?

-- Dan

Dan, yes, using single bytes is a good idea.
To convert a nibble N into its HEX representation, we can do the following:
- if the nibble is between 0 and 9, add the nibble by 0x30
- if the nibble is between A and F, add the nibble by 0x37
Notice the difference between 0x30 and 0x37 is minimal. If we can produce a mask, so that it is all 0's when the nibble N is less than or equal to 9, and all 1's when the nibble is greater than 9, we can mask out the 7.
I noticed that subtraction can do exactly that. If N <= 9, R = 9 - N will produce a positive number, therefore the higher nibble of the result R will be all 0's. If N > 9, R = 9 - N will produce a negative number, therefore the higher nibble of the result R will be all 1's.

scenario 1 - assume lower nibble of al is 0x0B, and we want to convert it into hex:

and al, 0x0f ; first clear the high nibble
mov bl, 0x09
sub bl, al ; bl = 9 - al = -2 (binary: 11111110)
; notice at this point, the higher nibble of bl is all 1's. we can use it as a mask.
shr bl, 4
and bl, 0x07 ; bl = 0x0f & 0x07 = 0x07
or bl, 0x30 ; bl = 0x07 | 0x30 = 0x37
add al, bl ; al = 0x0B + 0x37 = 0x42 = 'B'

scenario 2 - assume lower nibble of al is 0x04

and al, 0x0f
mov bl, 0x09
sub bl, al ; bl = 9 - al = 5 (binary: 00000101)
; notice that the higher nibble is all 0's, instead of all 1's
shr bl, 4
and bl, 0x07 ; bl = 0x00 & 0x07 = 0x00
or bl, 0x30 ; bl = 0x00 | 0x30 = 0x30
add al, bl ; al = 0x04 + 0x30 = 0x34 = '4'

ASKER

Now I see where the problem lies:

It works great to 1-byte values, but when you try to do four at once, a borrow is subtracted from the higher bytes. For instance:

mov eax, 0x0A090A09 // sample

mov ebx, 0x09090909
sub ebx, eax // ebx is now FEFFFF00
shr ebx, 4 // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or ebx, 0x30303030 // ebx is now 37373730
add eax, ebx // eax is now 41404139 = "A@A9"

after
sub ebx, eax
ebx is
FEFFFF00 (actual)
FF00FF00 (desired)

That is, it works for the rightmost byte, but when a nibble is > 9, it causes a borrow from the byte to the left, screwing up the rest of the mask.

==-=-=-=-=-=-=-
You already mentioned that your output logic is not too efficient (those single-byte writes). But I think we might be able to improve upon AvonWyss's output as well.

Coming into the output part, we have four ASCII characters in two 32-bit registers (edx and ecx). We want to output:

_dl _cl space
_dh _ch space
edl ecl space (the lo byte of the hi word)
edh ech space (the hi byte of the hi word)

That is twelve bytes and we aim to do it by writing three four-byte values.

output #1: _dl _cl _20 _dh
output #2: _ch _20 edl ecl
output #3: _20 edh ech _20

the data needs to be reversed for output so:

regout #1 must be: _dh _20 _cl _dl
regout #2 must be: ecl edl _20 _ch
regout #3 must be: _20 ech edh _20

I'm too tired right now to finish this thought... but it seems that there might be a way to set this up a little more efficiently than AvonWyss's 21-instruction sequence ... perhaps using some combination of XCHG and BSWAP. Or perhaps shifting the data into a FPU register.

-- Dan

Aha, now I understand where the problem is. Can't believe I was so blind to it. AvonWyss, sorry, I believe you also mentioned the same thing, but at that time I thought that problem only arise because I was subtracting each nibble from 10, instead of 9.
Yes, we'll spend our energy optimizing AvonWyss's code more then.

Since this Q thread has become pretty large, I suggest that we continue in the 0 point followup I have set up for this purpose:
https://www.experts-exchange.com/assembly/Q.20280946.html

ASKER

Continuation (3) continued here: http:Q.20283475.html

-- Dan

ASKER

This thread seems to have run its course.

I am accepting this answer, mainly for able's excellent work in the two continuation threads:

http:Q.20280946.html
http:Q.20283475.html

I am also posting some points to AvonWyss in
http:Q.20283475.html

I hope we all had fun! I made a new challenge here:

http:Q.20288519.html

-- Dan

Yes, it was definitely fun. Thanks!
I was not yet finished, but I took a slower pace. I started reading Inner Loops from Rick Booth, which showed me that I have been thinking often in the wrong direction.

There is so much to do with pairing on the Pentium and even more about reading cache lines and processing micro ops (which is done in quite an undeterminable manner) in the Pentium Pro that I think that there is still much more to gain if it comes to speed. And I am talking about all different flavors of the algorithm and all different kinds of processors.

But now on with http:Q.20288519.html!

Cheers!

Abel