Link to home
Start Free TrialLog in
Avatar of DanRollins
DanRollinsFlag for United States of America

asked on

How fast is your ASM hex converter?

Note [3/24/2002]: Continuing at https://www.experts-exchange.com/assembly/Q.20280946.html
Note [3/31/2002]: And then continued here:  http:Q.20283475.html

Just throwing out the gauntlet:
Write an 80x86 routine that converts 16 bytes of data to hex and ASCII format.  The ASCII 16 bytes should show . (period) for characters <32 or >127.  Output format is 65 bytes:

XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX aaaaaaaaaaaaaaaa\0
//----------------------------------------
void HexLineOut( BYTE* pSrc, char* pszBuf )
{
       _asm{
        // your code here!
        }      
}

Prove that your code is the fastest on earth.  Show that it is faster than similar code written in straight (optimized) C.  We could run tests by converting a 10 MB buffer and timing the results.  

Is anyone game?
Avatar of Triskelion
Triskelion
Flag of United States of America image

Are you saying it should look like a DEBUG or (HEXED) output window?

I'm intrigued.
Avatar of DanRollins

ASKER

Sure!  It could be used in a debugger.  I chose it because I'm petty sure I can dig up some binary-to-hex tricks that the C optimizer won't know about :-)

Here is a not-t0o-efficient C version of the target routine:

void HexLineOut( BYTE* pSrc, char* pszBuf )
{
     int  j;
     BYTE c, d;
     for( j=0; j<16; j++ ) {
          c= *pSrc++;
          d= c >> 4;   // get the hi nybble
          if ( d < 10 ) *pszBuf++ = d+'0';
          else          *pszBuf++ = d+'0' + ('A' - ('9'+1)); // tip: this is  d+0x37

          d= c & 0x0f; // get the lo nybble
          if ( d < 10 ) *pszBuf++ = d+'0';
          else          *pszBuf++ = d+'0' + ('A' - ('9'+1));

          if ( j == 7 ) *pszBuf++ = '-'; // space betwee bytes
          else          *pszBuf++ = ' '; // dash betwee bytes
     }
     *pszBuf++ = ' '; // space before ascii

     pSrc -= 16;     // back to the start

     for( j=0; j<16; j++ ) {
          c= *pSrc++;
          if ( c<32 || c> 127 ) {
               c= '.';
          }
          *pszBuf++ = c;
     }
     *pszBuf= '\0';
}

Give it a go!

-- Dan
Avatar of NicoLaan
NicoLaan

>>Prove that your code is the fastest on earth.

I'm not a bad assambly programmer and might try to write the fastest.
But how in gods name can you prove any code is the fastest other than not being able to find faster code. Wich could also mean you're bad at searching or you have bad competition.

Would it not be more easy to understand to write:

void HexLineOut(BYTE pSrc[16], char pszBuf[65])
OK, I'll up the ante...

You must prove that your code is the fastest in the known Universe, and all Not-Yet-Found Parallel Universes, and Whatever Higher Dimensions Nobody Has Ever Imagined, and in any Arbitrary Namespaces That Begin with Z or Feature the Word Omega.  

Anything less and you don't win!  Supporting evidence such as: "Well, my Mommy said it was really fast when she wasn't playing Quake" will be given more weight in the final judging than "I heard it say 'Zoom' a coupla times".  

So get to work or you won't have a chance at the golden apple!  The Judges' Decision is Familial!  

-- Dan
ASKER CERTIFIED SOLUTION
Avatar of abel
abel
Flag of Netherlands image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Smarter than what? who?

If you want to write an Eiffel version, then by all means, do it!  It's hard to imagine that it could be faster than *well-written* hand-optimized ASM code, but perhaps it could make use of multiple processors or something.

-- Dan
> Smarter than what? who?
Big blue against the people: computer-optimized c and opt code against the postings here of course :)

Let's keep it fair: one processor. But I guess you're right, especially with an (apparently) trivial task as this, how can a computer outperform a human being?

But I'm getting curious. Let's get to work.

Cheers
Here is some helper code that I am placing in a dialog-based app for testing purposes.  So far, my best benchmark is around 20,000 characters per ms (a 450MHz P3), but I really haven't started with the hand-optimization yet.

BOOL FileToBuf( LPCSTR sFilename, BYTE* pBuf, int nMax )
{
     int nLen= 0;
     HFILE hFile= _lopen(sFilename, OF_READ );
     if (hFile == HFILE_ERROR ) {
          return( FALSE );
     }
     int nFileLen= _llseek( hFile, 0L, SEEK_END );
     _llseek( hFile, 0L, SEEK_SET );

     for (int j=0; j<nFileLen; j += 1024) {
          if (nLen + 1024 > nMax ) {
               break;
          }
          int nActual= _lread( hFile, pBuf, 1024 );
          pBuf += nActual;
          nLen += nActual;
     }
     _lclose( hFile );
     return( nLen );
}

void CHexConvertDlg::DumpBuf( BYTE* p, int nLen, BOOL fDisplay /*=TRUE*/ )
{
     char szBuf[68];
     for (int j= 0; j< nLen; j += 16 ) {
          HexLineOut( &p[j], szBuf );
          if ( fDisplay ) {
               szBuf[65]='\r';     szBuf[66]='\n'; szBuf[67]='\0';
               m_ctlEdit.SetSel(-1,-1);
               m_ctlEdit.ReplaceSel(szBuf);
               m_ctlEdit.SendMessage(EM_SCROLLCARET,0,0);
          }
     }
}

void CHexConvertDlg::OnButton1()
{
     UpdateData( TRUE );  // get the value of m_fShowOutput amd m_nIterCnt

     BYTE* pBuf= new BYTE[800000];
     
     int nFileLen= FileToBuf("c:\\temp\\test.txt", pBuf, 800000 );

     nFileLen &= 0xfffffff0; // round to closest 16

     int nTotalBytes= 0;

     timeBeginPeriod(1);
     DWORD nStartTick= timeGetTime();

     for (int j=0; j< m_nIterCnt; j++ ) {
          m_ctlEdit.SetWindowText("");
          DumpBuf( pBuf, nFileLen, m_fShowOutput );
          nTotalBytes += nFileLen;
     }

     DWORD nTotalTicks= timeGetTime()-nStartTick;
     timeEndPeriod(1);

     DWORD nCntPerTick= -1;
     if ( nTotalTicks != 0 ) nCntPerTick= nTotalBytes/nTotalTicks;

     CString sMsg;
     sMsg.Format( "%d bytes converted in %d ms\n\n %d conversions per ms", nTotalBytes, nTotalTicks, nCntPerTick );
     MessageBox(sMsg);

     delete pBuf;
}
Something like this should already be pretty good (no jump/loops, does some pairing, tries to avoid excessive memory access, uses aligned addresses where easily possible):

// Assuming:
// esi = binary data buffer
// edi = destination char buffer, size 66 bytes

      sub ebx,ebx
      mov edx,[esi+00]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+00],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+02],32
      shr edx,8
      mov [edi+03],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+05],32
      shr edx,8
      mov [edi+06],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+08],32
      mov [edi+09],ax
      mov byte ptr [edi+11],32
      mov edx,[esi+04]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+12],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+14],32
      shr edx,8
      mov [edi+15],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+17],32
      shr edx,8
      mov [edi+18],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+20],32
      mov [edi+21],ax
      mov byte ptr [edi+23],45
      mov edx,[esi+08]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+24],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+26],32
      shr edx,8
      mov [edi+27],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+29],32
      shr edx,8
      mov [edi+30],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+32],32
      mov [edi+33],ax
      mov byte ptr [edi+35],32
      mov edx,[esi+12]
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      shr edx,8
      mov [edi+36],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+38],32
      shr edx,8
      mov [edi+39],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+@hex]
      mov byte ptr [edi+41],32
      shr edx,8
      mov [edi+42],ax
      mov ax,word ptr [edx*2+@hex]
      mov byte ptr [edi+44],32
      mov [edi+45],ax
      mov byte ptr [edi+47],32
      mov edx,[esi+00]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov [edi+48],eax
      mov edx,[esi+04]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov [edi+52],eax
      mov edx,[esi+08]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov [edi+56],eax
      mov edx,[esi+12]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+@ascii]
      rol edx,8
      and edx,$000000FF
      shl eax,8
      mov al,byte ptr [edx+@ascii]
      mov byte ptr [edi+65],0
      ret

@hex:
      db '000102030405060708090A0B0C0D0E0F'
      db '101112131415161718191A1B1C1D1E1F'
      db '202122232425262728292A2B2C2D2E2F'
      db '303132333435363738393A3B3C3D3E3F'
      db '404142434445464748494A4B4C4D4E4F'
      db '505152535455565758595A5B5C5D5E5F'
      db '606162636465666768696A6B6C6D6E6F'
      db '707172737475767778797A7B7C7D7E7F'
      db '808182838485868788898A8B8C8D8E8F'
      db '909192939495969798999A9B9C9D9E9F'
      db 'A0A1A2A3A4A5A6A7A8A9AAABACADAEAF'
      db 'B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF'
      db 'C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF'
      db 'D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF'
      db 'E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF'
      db 'F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF'
@ascii:
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 032,033,034,035,036,037,038,039,040,041,042,043,044,045,046,047
      db 048,049,050,051,052,053,054,055,056,057,058,059,060,061,062,063
      db 064,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079
      db 080,081,082,083,084,085,086,087,088,089,090,091,092,093,094,095
      db 096,097,098,099,100,101,102,103,104,105,106,107,108,109,110,111
      db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
      db 046,046,046,046,046,046,046,046,046,046,046,046,046,046,046,046
AvonWyss,
Excellent! Cool!  I'm glad there is some action here!

Just a little sloppy...  I needed to fix two bugs (see code below)

The code tests out at *very nearly as fast* as a 'reference' C routine I wrote that uses a lookup table for the 'Hex' data (I did not implement the lookup table for the ASCII output -- though that is clearly going to improve the speed when implemented.).

That you are unrolling the loops and collecting the four bytes before outputing is a definite win, but I'm sure there is pleny of room for improvement.

I think that all of the lines that use indirect addressing are costing you:  For instance,

      mov ax,word ptr [ebx*2+hex]
or
      mov al,byte ptr [ebx+ascii]

are probably expensive in comparison to a straight lookup through the index register and an increment of that register each time (I could be wrong though...).

Here is the VC++ -ized version of your code, with my changes inidcated:

char hex[]=
      "000102030405060708090A0B0C0D0E0F"
      "101112131415161718191A1B1C1D1E1F"
      "202122232425262728292A2B2C2D2E2F"
      "303132333435363738393A3B3C3D3E3F"
      "404142434445464748494A4B4C4D4E4F"
      "505152535455565758595A5B5C5D5E5F"
      "606162636465666768696A6B6C6D6E6F"
      "707172737475767778797A7B7C7D7E7F"
      "808182838485868788898A8B8C8D8E8F"
      "909192939495969798999A9B9C9D9E9F"
      "A0A1A2A3A4A5A6A7A8A9AAABACADAEAF"
      "B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF"
      "C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF"
      "D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF"
      "E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF"
      "F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF";

char ascii[]= {
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
     48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
     64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
     80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
     96, 97, 98, 99, 100,101,102,103,104,105,106,107,108,109,110,111,
     112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
     46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46
};

void AvonWysHexLineOut( BYTE* pSrc, char* pszBuf )
{
_asm {      
      mov esi, pSrc;     ; <<--- Added
      mov edi, pszBuf;   ; <<--- Added
      sub ebx,ebx
      mov edx,[esi+00]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+00],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+2],32
      shr edx,8
      mov [edi+03],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+5],32
      shr edx,8
      mov [edi+6],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+8],32
      mov [edi+9],ax
      mov byte ptr [edi+11],32
      mov edx,[esi+4]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+12],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+14],32
      shr edx,8
      mov [edi+15],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+17],32
      shr edx,8
      mov [edi+18],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+20],32
      mov [edi+21],ax
      mov byte ptr [edi+23],45
      mov edx,[esi+8]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+24],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+26],32
      shr edx,8
      mov [edi+27],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+29],32
      shr edx,8
      mov [edi+30],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+32],32
      mov [edi+33],ax
      mov byte ptr [edi+35],32
      mov edx,[esi+12]
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      shr edx,8
      mov [edi+36],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+38],32
      shr edx,8
      mov [edi+39],ax
      mov bl,dl
      mov ax,word ptr [ebx*2+hex]
      mov byte ptr [edi+41],32
      shr edx,8
      mov [edi+42],ax
      mov ax,word ptr [edx*2+hex]
      mov byte ptr [edi+44],32
      mov [edi+45],ax
      mov byte ptr [edi+47],32
      mov edx,[esi+00]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+48],eax
      mov edx,[esi+04]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+52],eax
      mov edx,[esi+8]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+56],eax
      mov edx,[esi+12]
      rol edx,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      shl eax,8
      mov bl,dl
      mov al,byte ptr [ebx+ascii]
      rol edx,8
      and edx,0xFF
      shl eax,8
      mov al,byte ptr [edx+ascii]
      mov [edi+60],eax                ;<<<--- added
      mov byte ptr [edi+64],0         ;<<<--- fixed
      }
}

-- Dan

P.S.  I'm still looking for a winner...
Dan, thanks for trying my code. Note that the bugs are due to the helper code I used when writing it, which I then stopped again and obviously broke my code.

Also, a test like this makes it obvious that nowadays processors and compilers already do a pretty good job on generating fast code. For optimal use of the processor pipelines, lots of simple commands are about the best code you can generate, and the compilers do exactly this.

Anyways, there are pretty huge differences with the processors available by means of pairing and other things. However, indirect addressing should no longer be costy on any processor since the Pentium (1 cycle allways, but may impact pairing due to the additional registers used). What could be a real neckbreaker are my 16-bit reads and writes on uneven addresses. I believe these are not very good, but I didn't really have a better idea. Maybe I could also collect 32 bits of data to write before doing so (e.g. "XX X","X XX","XX X" etc. pattern). I I have some time again, I'll look into it.

Is there a method to align ASM data or code (the compiler also does this) in memory in the inline assembler of VC++?
>>Is there a method to align ASM data or code
If you posted this Q in the Assembly TA, I could get my first Assembly TA points!

The VC++ _asm assembler supports the EVEN and ALIGN directives.  But only for code (it inserts NOPs as needed).

It does not support DB, DW, etc. for defining data (as you may have inferred from my mods to your code).  But since you can define it outside of the _asm block, I think it will automatically be aligned in the same manner as other C data structures (it is a project setting that defaults to 8 bytes).  

>>indirect addressing should no longer be costy on any
>> processor since the Pentium
I can't make heads-nor-tails of opcode timings in the later 80x86 documentation.  It used to be there were nice simple charts showing the clock-count for each addressing mode and opcode.  

>>Neckbreaker...16-bit reads and writes on uneven addresses
I agree, this must be costing something.

>>Maybe I could also collect 32 bits of data to write
>>before doing so (e.g. "XX X","X XX","XX X" etc. pattern).

This shouldn't be too hard since the loops are unrolled anyway... as soon as you have four bytes accumulated, output'em.

-- Dan
You should also not count the time it takes to display on the screen. At this point, you are at the mercy of the video display to see how fast it can display a line of text in the edit control.

Instead, turn off the edit control's redraw before you start. Then turn it back on when you are done:
 
  timeBeginPeriod(1);
  DWORD nStartTick= timeGetTime();
  m_ctlEdit.SetRedraw(FALSE);

  // make calls to dump hex

  DWORD nTotalTicks= timeGetTime()-nStartTick;
  timeEndPeriod(1);
  m_ctlEdit.SetRedraw(TRUE);

Of course, by doing this, you will no longer see the updates on the screen until it is finished. If that is a part of your design, then you cannot do this. But realize that you are timing the output as well as the conversion.
>>You should also not count the time it takes to display on the screen.

You are right thui!  

Notice that I have a flag named m_fShowOutput that get passed into the DumpBuf routine.   When I benchmark other people's code, I turn this ON.  No matter how well-written the code, everybody always scores out at about 9 conversions per millisconed.  But when I test my own code, I 'accidentlally' turn it off.  Longshot, but perhaps that is why my code is currently yielding about 21,000 conversions per millisecond.

-- Dan

P.S.
I am about to publish my own Hex-to-ASCII routine and when I do that all challengers will cry alligator tears of frustration, knowing that the Universe will never see faster code.  So now's your chance.
DanRollins, in this case, please do test the code again without output at all, so that the conditions are the same for both applications... and post the resultsd again :-)
Oh PUUUULLLEEEEEASE.

How stupid is this Mr. DanRollins.  Hm, he was able to convert my raw ASM into runnnable VC++ code and he fixed two of my coding errors.  He suggested some rather subtle improvements to my code and helped me learn about code alignment in an _asm block.  Even so, he is probably so dense that he could not recognoize the difference between 9 conversions per millisecond and 20,000 conversions per milliseconfd.  I'd better remind him to not display the lines as they are converted.

sheesh.

-- Dan
Here is jonin's submission from the C++ TA.  It boils down to:

inline void f()
{
   printf("%x %x %x %x %x %x %x %x-%x %x %x %x %x %x %x %x",
       data[0],data[1],data[2],data[3],data[4],data[5],data[6],
       data[7],data[8],data[9],data[10],data[11],data[12],data[13],
       data[14],data[15]);
   printf("  %c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c\n",
       ascii[data[0]],
       ascii[data[1]],
       ascii[data[2]],
       ascii[data[3]],
       ascii[data[4]],
       ascii[data[5]],
       ascii[data[6]],
       ascii[data[7]],
       ascii[data[8]],
       ascii[data[9]],
       ascii[data[10]],
       ascii[data[11]],
       ascii[data[12]],
       ascii[data[13]],
       ascii[data[14]],
       ascii[data[15]]);
}
=--=-=-=-=-=-=-=-=-==-
I will convert from printf to the sprintf and run a timing test on it tonight, but I'm afraid that it will be considerably slower than some other tries.  Also, I fear that it won't meet the specs in that if any of the bytes < 0x10, they will display as a single hex digit rather than the two digits in the spec.

jonin says that even bad ASM will be 10 times faster than fast C, but I disagree.  My reference C fn is just slightly faster than AvonWyss's ASM code.  But it avoids the (relatively) massive overhead involved in pushing 32 data values onto the stack and calling printf.

-- Dan
I'm sorry to say but I don't see 1 single assembly command.

And I clearly remember reading:

      _asm{

       // your code here!

       }      

So in my humble opinion you do not qualify to get the points.
Hi NicoLan...
I invited some C/C++ experts to try their hand at this challenge.   I think that it will be interesting to see how well a C-language optimizer -- with a cleverly implemented algorithm -- can do against raw ASM.

-- Dan
Here is Pavlik's C/C++ submission (from http:Q.20280145.html )

I will test it tonight and post the timings!  Note that I am not certain that it meets all criteria in the specification, but I won't hold that against him if it can be modified to meet the spec without slowing down.

static const long digits[256] =
 {
   0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330,
0x6430, 0x6530, 0x6630,
   0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331,
0x6431, 0x6531, 0x6631,
   0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332,
0x6432, 0x6532, 0x6632,
   0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333,
0x6433, 0x6533, 0x6633,
   0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334,
0x6434, 0x6534, 0x6634,
   0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335,
0x6435, 0x6535, 0x6635,
   0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336,
0x6436, 0x6536, 0x6636,
   0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337,
0x6437, 0x6537, 0x6637,
   0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338,
0x6438, 0x6538, 0x6638,
   0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339,
0x6439, 0x6539, 0x6639,
   0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361,
0x6461, 0x6561, 0x6661,
   0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362,
0x6462, 0x6562, 0x6662,
   0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363,
0x6463, 0x6563, 0x6663,
   0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364,
0x6464, 0x6564, 0x6664,
   0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365,
0x6465, 0x6565, 0x6665,
   0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366,
0x6466, 0x6566, 0x6666
 };
 static const char chars[256] =
 {
   '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
   '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
   '\x20', '\x21','\x22','\x23','\x24','\x25','\x26','\x27','\x28','\x29','\x2a','\x2b','\x2c','\x2d','\x2e','\x2f',
   '\x30', '\x31','\x32','\x33','\x34','\x35','\x36','\x37','\x38','\x39','\x3a','\x3b','\x3c','\x3d','\x3e','\x3f',
   '\x40', '\x41','\x42','\x43','\x44','\x45','\x46','\x47','\x48','\x49','\x4a','\x4b','\x4c','\x4d','\x4e','\x4f',
   '\x50', '\x51','\x52','\x53','\x54','\x55','\x56','\x57','\x58','\x59','\x5a','\x5b','\x5c','\x5d','\x5e','\x5f',
   '\x60', '\x61','\x62','\x63','\x64','\x65','\x66','\x67','\x68','\x69','\x6a','\x6b','\x6c','\x6d','\x6e','\x6f',
   '\x70', '\x71','\x72','\x73','\x74','\x75','\x76','\x77','\x78','\x79','\x7a','\x7b','\x7c','\x7d','\x7e','\x7f',
   '\x80', '\x81','\x82','\x83','\x84','\x85','\x86','\x87','\x88','\x89','\x8a','\x8b','\x8c','\x8d','\x8e','\x8f',
   '\x90', '\x91','\x92','\x93','\x94','\x95','\x96','\x97','\x98','\x99','\x9a','\x9b','\x9c','\x9d','\x9e','\x9f',
   '\xa0', '\xa1','\xa2','\xa3','\xa4','\xa5','\xa6','\xa7','\xa8','\xa9','\xaa','\xab','\xac','\xad','\xae','\xaf',
   '\xb0', '\xb1','\xb2','\xb3','\xb4','\xb5','\xb6','\xb7','\xb8','\xb9','\xba','\xbb','\xbc','\xbd','\xbe','\xbf',
   '\xc0', '\xc1','\xc2','\xc3','\xc4','\xc5','\xc6','\xc7','\xc8','\xc9','\xca','\xcb','\xcc','\xcd','\xce','\xcf',
   '\xd0', '\xd1','\xd2','\xd3','\xd4','\xd5','\xd6','\xd7','\xd8','\xd9','\xda','\xdb','\xdc','\xdd','\xde','\xdf',
   '\xe0', '\xe1','\xe2','\xe3','\xe4','\xe5','\xe6','\xe7','\xe8','\xe9','\xea','\xeb','\xec','\xed','\xee','\xef',
   '\xf0', '\xf1','\xf2','\xf3','\xf4','\xf5','\xf6','\xf7','\xf8','\xf9','\xfa','\xfb','\xfc','\xfd','\xfe','\xff'
 };

 long* pBuf = (long*) buf;

 for (int i = 4; i > 0; --i)
 {
   *pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
   ++pBuf;
   *pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
   ++pBuf;
   *pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
   ++pBuf;
   dump += 4;
 }

 dump -= 16;

 *pBuf =
   chars[*dump] |
   (chars[*(dump+1)] << 8) |
   (chars[*(dump+2)] << 16) |
   (chars[*(dump+3)] << 24);
 ++pBuf;
 dump += 4;
 *pBuf =
   chars[*dump] |
   (chars[*(dump+1)] << 8) |
   (chars[*(dump+2)] << 16) |
   (chars[*(dump+3)] << 24);
 ++pBuf;
 dump += 4;
 *pBuf =
   chars[*dump] |
   (chars[*(dump+1)] << 8) |
   (chars[*(dump+2)] << 16) |
   (chars[*(dump+3)] << 24);
 ++pBuf;
 dump += 4;
 *pBuf =
   chars[*dump] |
   (chars[*(dump+1)] << 8) |
   (chars[*(dump+2)] << 16) |
   (chars[*(dump+3)] << 24);
 ++pBuf;

 *pBuf = 0;
}
I do the following:

1. read data 4 bytes at a time
2. operate on lower nibbles of each byte
3. make a mask that corresponds to each nibble. if the nibble is less than 10, mask is 0xff, else mask is 0x00
4. add (mask & 0x30303030) to original nibbles
5. add (mask & 0x37373737) to original nibbles
6. at this point, the register should contain 4 bytes, each a ASCII representation of the corresponding lower nibble.
7. store these values, and repeat with higher nibbles.

code:
   mov edx, [esi+00]
   and edx, 0x0f0f0f0f
   mov ebx, edx
   sub ebx, 0x0a0a0a0a
   and ebx, 0xf0f0f0f0
   mov ecx, ebx
   shr ecx, 4
   or ebx, ecx

   // at this point, ebx should contain the mask for each nibble

   mov ecx, ebx
   and ecx, 0x30303030
   add edx, ecx
   mov ecx, ebx
   not ecx
   and ecx, 0x37373737
   add edx, ecx

   // at this point, edx contains ASCII representation of each nibble.

   // my ASM is too rusty, so don't know how to populate the result buffer.
First, analysis of jonnin's code.  I had to modify his format string to enforce two-hex digits per byte and I use sprintf to output to the buffer as specified in the challenge.  

This is processing a measly 450 conversions per millisecond -- fully optimized and wNOT displaying the output.  He mentions in his post that ASM is likely to be 100 times as fast, and he is right.
 
Here is the working version (same ascii[] array as in AvonWyss entry) :

void JonninHexLineOut( BYTE* pSrc, char* pszBuf )
{
     BYTE* data= pSrc;
     sprintf(pszBuf,
          "%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x-"
          "%.2x %.2x %.2x %.2x %.2x %.2x %.2x %.2x",
          data[0],data[1],data[2],data[3],data[4],data[5],data[6],
          data[7],data[8],data[9],data[10],data[11],data[12],data[13],
          data[14],data[15]);
     sprintf(&pszBuf[47], " %c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c",
          ascii[data[0]], ascii[data[1]],     ascii[data[2]], ascii[data[3]],
          ascii[data[4]], ascii[data[5]],     ascii[data[6]], ascii[data[7]],
          ascii[data[8]], ascii[data[9]],     ascii[data[10]], ascii[data[11]],
          ascii[data[12]], ascii[data[13]], ascii[data[14]], ascii[data[15]]);
}

Now Pavlik submitted a C-language entry that is pretty awesome.  This fn uses two lookup tables and it process the input in 32-bit units and outputs in 32-bit units (using a clever masking scheme for the hex part, I might add).

The original code had these (relatively minor) problems
1) it did not put the dash after the 8th set of hex digits
2) it outputs characters > 127 (rather than '.' as specified).
3) A subtle err occurs because of signed vs unsigned values.  The error goes away when I modify his chars[256] array to type 'unsigned char' (it was 'char').

(anyone care to predict how the bug manifest itself -- what the output looked like before my fix?)

Anyway, when compiled with full optimization, this is processing a whopping

     40,300 conversions per millisecond!

nearly TWICE as fast an any routine so far!  And it is C, NOT ASM!

Pavlik, if you unroll that first loop, you can get a minor increase.  I'm surprized you didn't do it.

Here is the code he submitted (with my minor changes to meet spec):

static const long digits[256] =
{
  0x3030, 0x3130,0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930, 0x6130, 0x6230, 0x6330, 0x6430, 0x6530, 0x6630,
  0x3031, 0x3131,0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931, 0x6131, 0x6231, 0x6331, 0x6431, 0x6531, 0x6631,
  0x3032, 0x3132,0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932, 0x6132, 0x6232, 0x6332, 0x6432, 0x6532, 0x6632,
  0x3033, 0x3133,0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933, 0x6133, 0x6233, 0x6333, 0x6433, 0x6533, 0x6633,
  0x3034, 0x3134,0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x6134, 0x6234, 0x6334, 0x6434, 0x6534, 0x6634,
  0x3035, 0x3135,0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x6135, 0x6235, 0x6335, 0x6435, 0x6535, 0x6635,
  0x3036, 0x3136,0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x6136, 0x6236, 0x6336, 0x6436, 0x6536, 0x6636,
  0x3037, 0x3137,0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x6137, 0x6237, 0x6337, 0x6437, 0x6537, 0x6637,
  0x3038, 0x3138,0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938, 0x6138, 0x6238, 0x6338, 0x6438, 0x6538, 0x6638,
  0x3039, 0x3139,0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, 0x6139, 0x6239, 0x6339, 0x6439, 0x6539, 0x6639,
  0x3061, 0x3161,0x3261, 0x3361, 0x3461, 0x3561, 0x3661, 0x3761, 0x3861, 0x3961, 0x6161, 0x6261, 0x6361, 0x6461, 0x6561, 0x6661,
  0x3062, 0x3162,0x3262, 0x3362, 0x3462, 0x3562, 0x3662, 0x3762, 0x3862, 0x3962, 0x6162, 0x6262, 0x6362, 0x6462, 0x6562, 0x6662,
  0x3063, 0x3163,0x3263, 0x3363, 0x3463, 0x3563, 0x3663, 0x3763, 0x3863, 0x3963, 0x6163, 0x6263, 0x6363, 0x6463, 0x6563, 0x6663,
  0x3064, 0x3164,0x3264, 0x3364, 0x3464, 0x3564, 0x3664, 0x3764, 0x3864, 0x3964, 0x6164, 0x6264, 0x6364, 0x6464, 0x6564, 0x6664,
  0x3065, 0x3165,0x3265, 0x3365, 0x3465, 0x3565, 0x3665, 0x3765, 0x3865, 0x3965, 0x6165, 0x6265, 0x6365, 0x6465, 0x6565, 0x6665,
  0x3066, 0x3166,0x3266, 0x3366, 0x3466, 0x3566, 0x3666, 0x3766, 0x3866, 0x3966, 0x6166, 0x6266, 0x6366, 0x6466, 0x6566, 0x6666
};
static const unsigned char chars[256] =  //<<-----  [DR] fixed (was signed: char)
{
  '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
  '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
  '\x20', '\x21','\x22','\x23','\x24','\x25','\x26','\x27','\x28','\x29','\x2a','\x2b','\x2c','\x2d','\x2e','\x2f',
  '\x30', '\x31','\x32','\x33','\x34','\x35','\x36','\x37','\x38','\x39','\x3a','\x3b','\x3c','\x3d','\x3e','\x3f',
  '\x40', '\x41','\x42','\x43','\x44','\x45','\x46','\x47','\x48','\x49','\x4a','\x4b','\x4c','\x4d','\x4e','\x4f',
  '\x50', '\x51','\x52','\x53','\x54','\x55','\x56','\x57','\x58','\x59','\x5a','\x5b','\x5c','\x5d','\x5e','\x5f',
  '\x60', '\x61','\x62','\x63','\x64','\x65','\x66','\x67','\x68','\x69','\x6a','\x6b','\x6c','\x6d','\x6e','\x6f',
  '\x70', '\x71','\x72','\x73','\x74','\x75','\x76','\x77','\x78','\x79','\x7a','\x7b','\x7c','\x7d','\x7e','\x7f',
  '\x80', '\x81','\x82','\x83','\x84','\x85','\x86','\x87','\x88','\x89','\x8a','\x8b','\x8c','\x8d','\x8e','\x8f',
  '\x90', '\x91','\x92','\x93','\x94','\x95','\x96','\x97','\x98','\x99','\x9a','\x9b','\x9c','\x9d','\x9e','\x9f',
  '\xa0', '\xa1','\xa2','\xa3','\xa4','\xa5','\xa6','\xa7','\xa8','\xa9','\xaa','\xab','\xac','\xad','\xae','\xaf',
  '\xb0', '\xb1','\xb2','\xb3','\xb4','\xb5','\xb6','\xb7','\xb8','\xb9','\xba','\xbb','\xbc','\xbd','\xbe','\xbf',
  '\xc0', '\xc1','\xc2','\xc3','\xc4','\xc5','\xc6','\xc7','\xc8','\xc9','\xca','\xcb','\xcc','\xcd','\xce','\xcf',
  '\xd0', '\xd1','\xd2','\xd3','\xd4','\xd5','\xd6','\xd7','\xd8','\xd9','\xda','\xdb','\xdc','\xdd','\xde','\xdf',
  '\xe0', '\xe1','\xe2','\xe3','\xe4','\xe5','\xe6','\xe7','\xe8','\xe9','\xea','\xeb','\xec','\xed','\xee','\xef',
  '\xf0', '\xf1','\xf2','\xf3','\xf4','\xf5','\xf6','\xf7','\xf8','\xf9','\xfa','\xfb','\xfc','\xfd','\xfe','\xff'
};

void HexLineOut( BYTE* pSrc, char* pszBuf )
{
      BYTE* dump = pSrc;
      long* pBuf = (long*) pszBuf;

      for (int i = 4; i > 0; --i) {
            *pBuf = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
            ++pBuf;
            *pBuf = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
            ++pBuf;
            *pBuf = (digits[*(dump+3)] << 8) | 0x20000020L;
            ++pBuf;
            dump += 4;
      }
      dump -= 16;

      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;
      dump += 4;
      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;
      dump += 4;
      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;
      dump += 4;
      *pBuf =
      chars[*dump] | (chars[*(dump+1)] << 8) | (chars[*(dump+2)] << 16) | (chars[*(dump+3)] << 24);
      ++pBuf;

      *pBuf = 0;

    pszBuf[23]= '-'; // <<----- [DR] Added
}

=-=-=-=-=-=-=-=-
I'm admitting here and now that I may not be able to beat that by much.

-- Dan
Screw EE! I cannot post my new source, I only get that stupid "no text" thing.
Here's my new version. No tables, no jumps, all memory access is 32-bit on 32-bit boundaries.

    mov edi,pszBuf
    mov esi,pSrc
    mov byte ptr [edi+64],0
    mov eax,[esi]
    mov edx,eax
    and eax,252645135
    shr edx,4
    mov ecx,eax
    and edx,252645135
    add eax,101058054
    mov ebx,edx
    and eax,269488144
    add ebx,101058054
    shr eax,2
    and ebx,269488144
    add ecx,eax
    shr ebx,2
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    add ecx,808464432
    add edx,ebx
    add edx,808464432
    mov al,32
    mov ah,dh
    shl eax,16
    mov al,dl
    mov ah,cl
    shr edx,16
    mov [edi],eax
    shr ecx,8
    mov al,dl
    mov ah,ch
    shl eax,16
    mov al,cl
    mov ah,32
    shr ecx,16
    mov [edi+4],eax
    mov al,cl
    mov ah,32
    shl eax,16
    mov al,32
    mov ah,dh
    mov [edi+8],eax
    mov eax,[esi+4]
    mov edx,eax
    and eax,252645135
    shr edx,4
    mov ecx,eax
    and edx,252645135
    add eax,101058054
    mov ebx,edx
    and eax,269488144
    add ebx,101058054
    shr eax,2
    and ebx,269488144
    add ecx,eax
    shr ebx,2
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    add ecx,808464432
    add edx,ebx
    add edx,808464432
    mov al,32
    mov ah,dh
    shl eax,16
    mov al,dl
    mov ah,cl
    shr edx,16
    mov [edi+12],eax
    shr ecx,8
    mov al,dl
    mov ah,ch
    shl eax,16
    mov al,cl
    mov ah,32
    shr ecx,16
    mov [edi+16],eax
    mov al,cl
    mov ah,45
    shl eax,16
    mov al,32
    mov ah,dh
    mov [edi+20],eax
    mov eax,[esi+8]
    mov edx,eax
    and eax,252645135
    shr edx,4
    mov ecx,eax
    and edx,252645135
    add eax,101058054
    mov ebx,edx
    and eax,269488144
    add ebx,101058054
    shr eax,2
    and ebx,269488144
    add ecx,eax
    shr ebx,2
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    add ecx,808464432
    add edx,ebx
    add edx,808464432
    mov al,32
    mov ah,dh
    shl eax,16
    mov al,dl
    mov ah,cl
    shr edx,16
    mov [edi+24],eax
    shr ecx,8
    mov al,dl
    mov ah,ch
    shl eax,16
    mov al,cl
    mov ah,32
    shr ecx,16
    mov [edi+28],eax
    mov al,cl
    mov ah,32
    shl eax,16
    mov al,32
    mov ah,dh
    mov [edi+32],eax
    mov eax,[esi+12]
    mov edx,eax
    and eax,252645135
    shr edx,4
    mov ecx,eax
    and edx,252645135
    add eax,101058054
    mov ebx,edx
    and eax,269488144
    add ebx,101058054
    shr eax,2
    and ebx,269488144
    add ecx,eax
    shr ebx,2
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    shr eax,1
    add edx,ebx
    add ecx,eax
    shr ebx,1
    add ecx,808464432
    add edx,ebx
    add edx,808464432
    mov al,32
    mov ah,dh
    shl eax,16
    mov al,dl
    mov ah,cl
    shr edx,16
    mov [edi+36],eax
    shr ecx,8
    mov al,dl
    mov ah,ch
    shl eax,16
    mov al,cl
    mov ah,32
    shr ecx,16
    mov [edi+40],eax
    mov al,cl
    mov ah,32
    shl eax,16
    mov al,32
    mov ah,dh
    mov [edi+44],eax
    mov eax,[esi]
    mov ebx,eax
    mov ecx,eax
    and eax,1616928864
    not ecx
    add eax,1616928864
    and eax,ecx
    and eax,2155905152
    shr eax,7
    mov edx,$FF
    mul edx
    mov ecx,[esi+4]
    and ebx,eax
    not eax
    and eax,774778414
    or  eax,ebx
    mov [edi+48],eax
    mov eax,ecx
    mov ebx,ecx
    and eax,1616928864
    not ebx
    add eax,1616928864
    and eax,ebx
    and eax,2155905152
    shr eax,7
    mov edx,$FF
    mul edx
    mov ebx,[esi+8]
    and ecx,eax
    not eax
    and eax,774778414
    or  eax,ecx
    mov [edi+52],eax
    mov eax,ebx
    mov ecx,ebx
    and eax,1616928864
    not ecx
    add eax,1616928864
    and eax,ecx
    and eax,2155905152
    shr eax,7
    mov edx,$FF
    mul edx
    mov ecx,[esi+12]
    and ebx,eax
    not eax
    and eax,774778414
    or  eax,ebx
    mov [edi+56],eax
    mov eax,ecx
    mov ebx,ecx
    and eax,1616928864
    not ebx
    add eax,1616928864
    and eax,ebx
    and eax,2155905152
    shr eax,7
    mov edx,$FF
    mul edx
    and ecx,eax
    not eax
    and eax,774778414
    or  eax,ecx
    mov [edi+60],eax

Hope you like it...
AvonWyss,
That second submission is unusual to say the least.  Like Pavlik, you are always reading 32-bit aligned values and writing 32-bit aligned values.  But rather than looking up the hex-digit values, you are calculating them.  

I suppose that I am expected to be mystified here.  The calculation technique appears to be rather obscure, but only because you have intentionally made it so by using decimal values and interleaved the opcodes (possibly for pipeline optimization).   Lets see what it is actually doing:
=-=--==-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
CODE ANALYSIS
In the main sequence, you are working with four nibbles simultaneously.  To simpliy the discusiion, lets look at one nibble at a time:

assume the first input is either 'W' (0x57) or 'Z' (0x5A)

n1= one nibble // examples: 7 or 10 (examples in decimal)
n2= n1 +6      // examples: 13 or 16
n3= 1 if n2>16, else n3=0
n4 = n1 + (n3*4)+(n3*2)+ n3
     i.e. n4= n1+7  // examples: 7 (n1+0) or 17(n1+7)
n4 = n4 + 0x30      // examples: 55 (0x37='7') or 65 (0x41='A')

So your code adds 6 to obtain a temp result of 1 or a 0 (depending upon whether the sum > 10)  Then it multiplies that temp result by 7 resulting in 0 or 7, then it adds that to '0' returning '0'-'9' or 'A'-'F'

The ASCII output, I have not figured out.  Maybe you or someone else can explain it (I show an unobfuscated version of it below).

=-=--==-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TIMING ANALYSIS
Alas, as clever as this code is -- avoiding all JMPs and interleaving opcodes to maximize pipelining -- it  cannot complete with simple lookup logic.  I think there are two downfalls:

1) the code itself is nearly twice as many opcode bytes as Pavlik's (708 vs. 393) so there is just more processing going on.

2) The use of the MUL instruction is a killer (at least it used to be when I kept track of these things).

Your code was getting about

     30,200 convertsions per millisecond.

Nice try, interesting code, but  Sorry! not the fastest.

=-=-=-=-=-=-=-=-=-=-=-=-=-
CODE Footnotes

     mov eax,[esi]
     mov edx,eax
     and eax,0x0F0F0F0F // eax is lo nibbles (four of them; use ah for example)
     mov ecx,eax        // save a copy
     add eax,0x06060606 // ah is lo nibble + 6
     and eax,0x10101010 // ah is 1 if a low nibble was > 9
     shr eax,2          // ah is 4 if a low nibble was > 9
     add ecx,eax        // ah is low nibble + 4 if a low nibble was > 9
     shr eax,1          //   += 2 if a low nibble was > 9
     add ecx,eax        //   += 1 if a low nibble was > 9 (total +7)
     add edx,0x30303030 //   += '0' (convert to 0-9 and A-F
     //  (omitted: repeat for hi nibbles)
     //  (omitted) rearrange the bytes into eax for 'big endian' architecture
     // repeat four times for 16 bytes

///////// ASCII output portion -- who can explain this!

     mov eax,[esi]
     mov ebx,eax
     mov ecx,eax
     not ecx
     and eax,0x60606060
     add eax,0x60606060
     and eax,ecx
     and eax,0x80808080
     shr eax,7         // 0 or 1 for each byte
     mov edx,0xff      // 00 or FF for each byte
     mul edx           // killer time penalty here
     and ebx,eax
     not eax
     and eax,0x2E2E2E2E // '....'
     or  eax,ebx
     mov [edi+48],eax

-- Dan
Note that I "obfuscated" the hex values because my ASM uses other hex constants than the one in VC++ (you certainly have noticed the $FF one which I forgot to convert). The rest of the obfuscation is for CPU pairing, trying to avoid using one register twice in a row. But this only as sidenote.

I'll explain the ASCII output portion:

Let's split a char into its 8 bits to explain. We have
   76543210  Bits
   ABBCCCCC  Name

Now, your condition >=32 or <128, means that a char is valid exactly when any bit B is set (>31) but not bit A (>127). Therefore, I first mask the char to only reflect the B bits and add 01100000. The highest bit will become 1 in every case where any of the bits B were set and remain 0 if none was set. So I now have a mask telling me if the B bits were OK or not.

Next step is an AND NOT with the original value to kill the "OK" bit if the bit A is set. Finally I only keep the A bit and shift it right, so that I now have a valid mask with a 1 for every valid char. Unfortunately, I need a 0xFF mask in order to filter the chars using AND, and that's why I multiply my mask.

The rest is easy, I do a AND with the original values, which gives me only the valid chars, inverse my mask, convert the mask to "." chars, and add it to the valid chars. That's it.
The MUL can be replaced by adding an additional statement:

     mov eax,[esi]     // EAX: 4 source chars
     mov ebx,eax       // EBX: copy of them
     mov ecx,eax
     not ecx           // ECX: negated copy (only highest bit of interest)
     and eax,$60606060
     add eax,$60606060 // highest bit is 1 if bits 5/6 were OK
     and eax,ecx       // remove the highest bit if set in source
     shr eax,7
     and eax,$01010101 // mask to 1 for valid, 0 for invalid
     mov edx,$80808080 // set up subtraction buffer
     sub edx,eax       // subtract. mask 0 -> 80, 1 -> 7F
        and edx,$80808080 // mask 0 -> 00, 1 -> 7F
     and ebx,edx       // filter valid chars. invalid chars may be $80
     not edx           // invert mask
     and edx,$2E2E2E2E // make all invalid chars 2E ('.')
     or  edx,ebx       // combine valid and invalid chars
     mov [edi+48],edx  // write result
Ok, here's a MUL-free version of the part that does the ASCII:

      mov eax,[esi]
      mov ebx,eax
      mov ecx,eax
      and eax,1616928864
      not ecx
      add eax,1616928864
      and eax,ecx
      and eax,2155905152
      shr eax,7
      mov edx,$80808080
      sub edx,eax
      and edx,$7F7F7F7F
      mov ecx,[esi+4]
      and ebx,edx
      not edx
      and edx,774778414
      or  edx,ebx
      mov eax,ecx
      mov [edi+48],edx
      mov ebx,ecx
      and eax,1616928864
      not ebx
      add eax,1616928864
      and eax,ebx
      and eax,2155905152
      shr eax,7
      mov edx,$80808080
      sub edx,eax
      and edx,$7F7F7F7F
      mov ebx,[esi+8]
      and ecx,edx
      not edx
      and edx,774778414
      or  edx,ecx
      mov eax,ebx
      mov [edi+52],edx
      mov ecx,ebx
      and eax,1616928864
      not ecx
      add eax,1616928864
      and eax,ecx
      and eax,2155905152
      shr eax,7
      mov edx,$80808080
      sub edx,eax
      and edx,$7F7F7F7F
      mov ecx,[esi+12]
      and ebx,edx
      not edx
      and edx,774778414
      or  edx,ebx
      mov eax,ecx
      mov [edi+56],edx
      mov ebx,ecx
      and eax,1616928864
      not ebx
      add eax,1616928864
      and eax,ebx
      and eax,2155905152
      shr eax,7
      mov edx,$80808080
      sub edx,eax
      and edx,$7F7F7F7F
      and ecx,edx
      not edx
      and edx,774778414
      or  edx,ecx
      mov [edi+60],edx
i don't know if anybody read the code snippet i posted earlier. but it was exactly written for the purpose of avoiding MUL in the algorithm similar to AvonWyss. The method is, when 9 is subtracted by a nibble, its higher nibble will become 1111b if the nibble is greater than 9, otherwise it will become 0000b. without multiplying this already gives us the mask needed to distinguish 0-9 from A-F.

here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.

          mov esi, pSrc;
          mov edi, pszBuf;

          // let edx store original
          mov eax, [esi+00]
          mov edx, eax
          mov ebx, 0x0a0a0a0a
          and eax, 0x0f0f0f0f
          sub ebx, eax
          shr ebx, 4
          and ebx, 0x07070707
          or ebx, 0x30303030
          add eax, ebx

          // repeat with higher nibbles
          // mov edx, [esi+00]
          shr edx, 4
          mov ebx, 0x0a0a0a0a
          and edx, 0x0f0f0f0f
          sub ebx, edx
          and ebx, 0x70707070
          shr ebx, 4
          or ebx, 0x30303030
          add edx, ebx


          // display
          mov [edi+0], al
          shr eax, 8
          mov [edi+1], dl
          shr edx, 8
          mov [edi+3], al
          shr eax, 8
          mov [edi+4], dl
          shr edx, 8
          mov [edi+6], al
          shr eax, 8
          mov [edi+7], dl
          shr edx, 8
          mov [edi+9], al
          mov [edi+10], dl
abson, unfortnuately, I think that your code does not work correctly in all situations:


         mov eax, 0x0A090A09 // sample

         mov ebx, 0x0a0a0a0a
         sub ebx, eax        // ebx is now 00010001
         shr ebx, 4          // ebx is now 00001000
         and ebx, 0x07070707 // ebx is now 00000000
         or  ebx, 0x30303030 // ebx is now 30303030
         add eax, ebx        // ebx is now 3A393A39 = ":9:9"

Your previous version was better, but also failed:

         mov edx, 0x0A090A09
         mov ebx, edx
         sub ebx, 0x0a0a0a0a // ebx=0xFFFEFFFF
         and ebx, 0xf0f0f0f0 // ebx=0xF0F0F0F0
         mov ecx, ebx
         shr ecx, 4          // ecx=0x0F0F0F0F
         or ebx, ecx         // ebx=0xFFFFFFFF
  // at this point, ebx should contain the mask for each nibble

It's obvious that this mask cannot be right. The correct mask would be 0x00FF00FF.
Forgot to mention: The MUL was only in used in the ASCII part of the string, not the HEX conversion. See my code for details.
Sorry for all the posts. Sometimes I just don't wait long enough before posting... ;-)
abson, what's going wron in you code is the following:
0x00000000 - 0x00000001 == 0xFFFFFFFF

however, you are assuming that only the nibble changes:
0x00000000 - 0x00000001 != 0x000000FF

Your (original) code can be fixed by adding an additional bit which makes sure that the subtraction ends after the nibble:

        mov edx, 0x0A090A09
        mov ebx, edx
        or  ebx, 0x01010100
        sub ebx, 0x0a0a0a0a // ebx=0x00FF00FF
        and ebx, 0xf0f0f0f0 // ebx=0x00F000F0
        mov ecx, ebx
        shr ecx, 4          // ecx=0x000F000F
        or  ebx, ecx        // ebx=0x00FF00FF

Which is the correct mask for the following operations.
i don't know if anybody read the code snippet i posted earlier. but it was exactly written for the purpose of avoiding MUL in the algorithm similar to AvonWyss. The method is, when 9 is subtracted by a nibble, its higher nibble will become 1111b if the nibble is greater than 9, otherwise it will become 0000b. without multiplying this already gives us the mask needed to distinguish 0-9 from A-F.

here is a slightly optimized version, it's the code for the first four bytes. it's still slightly slower than lookup table.

          mov esi, pSrc;
          mov edi, pszBuf;

          // let edx store original
          mov eax, [esi+00]
          mov edx, eax
          mov ebx, 0x0a0a0a0a
          and eax, 0x0f0f0f0f
          sub ebx, eax
          shr ebx, 4
          and ebx, 0x07070707
          or ebx, 0x30303030
          add eax, ebx

          // repeat with higher nibbles
          // mov edx, [esi+00]
          shr edx, 4
          mov ebx, 0x0a0a0a0a
          and edx, 0x0f0f0f0f
          sub ebx, edx
          and ebx, 0x70707070
          shr ebx, 4
          or ebx, 0x30303030
          add edx, ebx


          // display
          mov [edi+0], al
          shr eax, 8
          mov [edi+1], dl
          shr edx, 8
          mov [edi+3], al
          shr eax, 8
          mov [edi+4], dl
          shr edx, 8
          mov [edi+6], al
          shr eax, 8
          mov [edi+7], dl
          shr edx, 8
          mov [edi+9], al
          mov [edi+10], dl
sorry for the double post, non-intentional ...
AvonWyss, yes, there is an error to my code, but the fix is actually simpler

original code:
        mov eax, 0x0A090A09 // sample

        mov ebx, 0x0a0a0a0a
        sub ebx, eax        // ebx is now 00010001
        shr ebx, 4          // ebx is now 00001000
        and ebx, 0x07070707 // ebx is now 00000000
        or  ebx, 0x30303030 // ebx is now 30303030
        add eax, ebx        // ebx is now 3A393A39 = ":9:9"

new code:
        mov eax, 0x0A090A09 // sample

        mov ebx, 0x09090909
        sub ebx, eax        // ebx is now 00010001
        shr ebx, 4          // ebx is now 00001000
        and ebx, 0x07070707 // ebx is now 00000000
        or  ebx, 0x30303030 // ebx is now 30303030
        add eax, ebx        // ebx is now 3A393A39 = ":9:9"

that is, subtract each nibble from 9 instead of from 10. other than that, i don't think there is anything wrong with the logic, therefore there is no need for the extra bit.
AvonWyss,
Thanks for the clarification on the ASCII conversion.  A very clever bit of binary logic.  I salute you.

I also retested your code after removing the MUL opcode form the ASCII part of the code.  The change does increase the speed. I misstated the original timing test.  Sorry.  The two timing results:

20,200 conversions per ms -- version with MUL (sorry 'bout my earlier typo)
21,600 conversions per ms -- version without MUL

So replacing the MUL provided a respectable 6% speed increase.   However, as clever as the code it, it does not really even approach the speeds of the pure lookup table algorithm.

-- Dan
Dan,
May I ask why you are ignoring my comment?
As I mentioned my code is very similar to AvonWyss's code, as it also operates on 4 bytes at a time. But it uses the special characteristic of 2s complement math so it only needs 8 instructions to calculate the HEX representation of 4 nibbles.
I have tested out the full version of it against Pavlik's code, and my version is about 10% slower. I am not claiming my version will ever be faster than the lookup table, but I really think, as an author of an Assembly book, you would be interested.
absong,
I appologize for ignoring you.  I was testing the other two functions becasue they are complete functions and therefore testable.  Then I had to use the bathroom and well,  I'd rather not describe that particular incident.

I am very interested in your code.  And I just spent about 30 minutes trying to figure out what is going on.  I wanted to make some intelligent comments about it.  Here is what I've been testing (I replaced your comments with my results):

mov eax, 0x0A090A09 // sample

mov ebx, 0x09090909
sub ebx, eax        // ebx is now FEFFFF00
shr ebx, 4          // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or  ebx, 0x30303030 // ebx is now 37373730
add eax, ebx        // eax is now 41404139 = ")()9"

The problem is that it doesn't do what you say in your comments and I don't understand how it is supposed to work so I can't fix it.  As is, it outputs nonsense.  I'm sure that there must be something that I am missing.  Could you explain it in detail and/or show the code using single bytes (al, bl, etc?) so I can extrapolate to the 4-at-a-time logic?

-- Dan
Dan, yes, using single bytes is a good idea.
To convert a nibble N into its HEX representation, we can do the following:
- if the nibble is between 0 and 9, add the nibble by 0x30
- if the nibble is between A and F, add the nibble by 0x37
Notice the difference between 0x30 and 0x37 is minimal. If we can produce a mask, so that it is all 0's when the nibble N is less than or equal to 9, and all 1's when the nibble is greater than 9, we can mask out the 7.
I noticed that subtraction can do exactly that. If N <= 9, R = 9 - N will produce a positive number, therefore the higher nibble of the result R will be all 0's. If N > 9, R = 9 - N will produce a negative number, therefore the higher nibble of the result R will be all 1's.


scenario 1 - assume lower nibble of al is 0x0B, and we want to convert it into hex:

and al, 0x0f   ; first clear the high nibble
mov bl, 0x09
sub bl, al     ; bl = 9 - al = -2 (binary: 11111110)
; notice at this point, the higher nibble of bl is all 1's. we can use it as a mask.
shr bl, 4
and bl, 0x07   ; bl = 0x0f & 0x07 = 0x07
or bl, 0x30    ; bl = 0x07 | 0x30 = 0x37
add al, bl     ; al = 0x0B + 0x37 = 0x42 = 'B'


scenario 2 - assume lower nibble of al is 0x04

and al, 0x0f
mov bl, 0x09
sub bl, al   ; bl = 9 - al = 5 (binary: 00000101)
; notice that the higher nibble is all 0's, instead of all 1's
shr bl, 4
and bl, 0x07   ; bl = 0x00 & 0x07 = 0x00
or bl, 0x30    ; bl = 0x00 | 0x30 = 0x30
add al, bl     ; al = 0x04 + 0x30 = 0x34 = '4'

Now I see where the problem lies:

It works great to 1-byte values, but when you try to do four at once, a borrow is subtracted from the higher bytes.  For instance:

mov eax, 0x0A090A09 // sample

mov ebx, 0x09090909
sub ebx, eax        // ebx is now FEFFFF00
shr ebx, 4          // ebx is now 0FEFFFF0
and ebx, 0x07070707 // ebx is now 07070700
or  ebx, 0x30303030 // ebx is now 37373730
add eax, ebx        // eax is now 41404139 = "A@A9"

after
   sub ebx, eax
ebx is
FEFFFF00 (actual)
FF00FF00 (desired)

That is, it works for the rightmost byte, but when a nibble is > 9, it causes a borrow from the byte to the left, screwing up the rest of the mask.

==-=-=-=-=-=-=-
You already mentioned that your output logic is not too efficient (those single-byte writes).  But I think we might be able to improve upon AvonWyss's output as well.

Coming into the output part, we have four ASCII characters in two 32-bit registers (edx and ecx).  We want to output:

_dl  _cl  space
_dh  _ch  space
edl  ecl  space  (the lo byte of the hi word)
edh  ech  space  (the hi byte of the hi word)

That is twelve bytes and we aim to do it by writing three four-byte values.  

output #1: _dl  _cl  _20  _dh
output #2: _ch  _20  edl  ecl
output #3: _20  edh  ech  _20

the data needs to be reversed for output so:

regout #1 must be: _dh  _20  _cl  _dl
regout #2 must be: ecl  edl  _20  _ch
regout #3 must be: _20  ech  edh  _20

I'm too tired right now to finish this thought... but it seems that there might be a way to set this up a little more efficiently than AvonWyss's 21-instruction sequence ... perhaps using some combination of XCHG and BSWAP.  Or perhaps shifting the data into a FPU register.  

-- Dan
Aha, now I understand where the problem is. Can't believe I was so blind to it. AvonWyss, sorry, I believe you also mentioned the same thing, but at that time I thought that problem only arise because I was subtracting each nibble from 10, instead of 9.
Yes, we'll spend our energy optimizing AvonWyss's code more then.
Since this Q thread has become pretty large, I suggest that we continue in the 0 point followup I have set up for this purpose:
https://www.experts-exchange.com/assembly/Q.20280946.html
Continuation (3) continued here: http:Q.20283475.html

-- Dan
This thread seems to have run its course.

I am accepting this answer, mainly for able's excellent work in the two continuation threads:

http:Q.20280946.html
http:Q.20283475.html

I am also posting some points to AvonWyss in
http:Q.20283475.html

I hope we all had fun!  I made a new challenge here:

   http:Q.20288519.html

-- Dan
Yes, it was definitely fun. Thanks!
I was not yet finished, but I took a slower pace. I started reading Inner Loops from Rick Booth, which showed me that I have been thinking often in the wrong direction.

There is so much to do with pairing on the Pentium and even more about reading cache lines and processing micro ops (which is done in quite an undeterminable manner) in the Pentium Pro that I think that there is still much more to gain if it comes to speed. And I am talking about all different flavors of the algorithm and all different kinds of processors.

But now on with http:Q.20288519.html!

Cheers!

Abel