Solved

# How fast is your ASM hex converter? (continuation)

Posted on 2002-03-24
1,518 Views
Question by:AvonWyss

LVL 14

Author Comment

ID: 6892495
absong, in fact, this was exactly the point of the comment where I proposed adding those additional bits to make sure that the subtraction does not influence more than one nibble at a time.

dan, I already did think about faster methods to assemble the hex bytes and put them in a 32 bit "block". However, since the nibbles are interleaved in two registers, it's pretty hard to find a better way of assembling them than to use byte access. Of course, if anyone has ideas, I'll be glad to hear about them, but I'm afraid that reaching a speed comparable or even better than the simple lookup method will be pretty hard to achieve.

LVL 14

Author Comment

ID: 6892613
So far, I was able to optimize the nibbles-to-hex converter some:

The original was:
and eax,0x0F0F0F0F // eax is lo nibbles (four of them; use ah for example)
mov ecx,eax        // save a copy
add eax,0x06060606 // ah is lo nibble + 6
and eax,0x10101010 // ah is 1 if a low nibble was > 9
shr eax,2          // ah is 4 if a low nibble was > 9
add ecx,eax        // ah is low nibble + 4 if a low nibble was > 9
shr eax,1          //   += 2 if a low nibble was > 9
add ecx,eax        //   += 1 if a low nibble was > 9 (total +7)
add ecx,0x30303030 //   += '0' (convert to 0-9 and A-F)

The new one is (identical functionality):
lea ecx,eax+$80808080-$0A0A0A0A // ecx = (eax | $80808080) - $0A0A0A0A
shr ecx,4  // get part of the subtraction which is significant for us
not ecx // negate the offset nibbles
lea ecx,$30303030+ecx+eax // ecx = $30303030 + eax + ecx

More to come ;-)

LVL 49

Expert Comment

ID: 6892664
>>I'm afraid that reaching a speed comparable or even better than the simple lookup method will be pretty hard to achieve.

I agree.  But there might be an interesting challenge in trying to squeeze a cycle or two out of your non-lookup algorithm code.  Summary (repeated from my last post)

regout #1 must be: _dh  _20  _cl  _dl
regout #2 must be: ecl  edl  _20  _ch
regout #3 must be: _20  ech  edh  _20

For instance, in the first dword output, your code uses:

mov al,32        // xx xx xx 20
mov ah,dh        // xx xx dh 20
shl eax,16       // dh 20 xx xx
mov al,dl        // dh 20 xx dl
mov ah,cl        // dh 20 cl dl  <<-- desired output
mov [edi],eax

...but this should also work...

mov eax, edx  // xx xx dh dl
bswap eax     // dl dh xx xx
mov ah,32     // dl dh 20 xx
mov al,cl     // dl dh 20 cl
rol eax,8     // dh 20 cl dl  <<-- desired output
mov [edi],eax

I can't say if it is any faster, but it is slightly more interesting :)

There are some similar challenges in optimizing the output even after using a lookup table.

I just noticed that in Pavlik's lookup table, the hex-digit pairs are preset in reverse order to avoid the post-lookup shuffling.  I'm impressed.

-- Dan

LVL 14

Author Comment

ID: 6892705
I have optimized the third one pretty well (where ecx and edx must already have been shifted by 8 and 16 bits respectively for the 2nd block):
and ecx,$00FFFF00
or  ecx,$2D000020
mov ch,dh
mov [edi+20],ecx

Also, I have used your proposition for the 1st one because it looks pretty and it avoids some byte-ops, possibly making it faster on modern CPUs.

I'll try to get rid of some instructions in the ASCII part as well. Will post my results when I get something.

An interesting fact is that this sort of code (no lookups, and operating with full registers) should help to get pretty awesome performance on processors with larger registers (soon 64, or later on maybe even 128 bits or more). For instance, with the 6 instructions to convert nibbles to ASCII values, the double or even quadruple amount would be processed in the same timespan. The only thing I don't yet find satisfying in the HEX part is the merging of the interleaved values.

LVL 14

Author Comment

ID: 6892725
My current hex-to-chars implementation converts a block of 8 input bytes (12 chars output) in a total of 33 assembler instructions. This is a ratio of 2.75 instructions per byte output. Not bad, huh?

LVL 14

Author Comment

ID: 6892726
Er, 8 input nibbles of course... (or 4 input bytes) *g*

LVL 14

Author Comment

ID: 6892772
Here's an enhanced version of the ASCII converter. It needs 15 instead of 17 assembler instructions per 4 bytes (ratio: 3.75 instead of 4.25 instructions per byte), uses two fewer 32-bit constants, and will be easier to optimize for pipelining than the old one.

mov ecx,[esi]         // ABC, ECX = source chars
mov ebx,ecx
mov edx,$80808080     // EDX: mask for high bits (one per byte)
not ecx
and ecx,edx           // ECX: negated copy of the high bits
and ebx,$7F7F7F7F     // EBX: source w/o highest bits, prevents carry across bytes
lea eax,$60606060+ebx
and eax,ecx
shr eax,7             // EAX: lowest bit of each byte is 1 if OK (source bits 5,6 != 0,0 and top bit == 0)
sub edx,eax           // EDX: $7F if OK, else $80
and ebx,edx           // EBX: only valid chars remain
not edx
and edx,$2E2E2E2E     // EDX: make all invalid chars 2E ('.')
or  edx,ebx           // EDX: combine valid and invalid chars
mov [edi+48],edx      // write result

LVL 49

Expert Comment

ID: 6892783
I can't get your new hex nibbler to work.  You said:

mov eax,[esi]
lea ecx,eax+$80808080-$0A0A0A0A // ecx = (eax | $80808080) - $0A0A0A0A
shr ecx,4
not ecx // negate the offset nibbles
lea ecx,$30303030+ecx+eax // ecx = $30303030 + eax + ecx

which I converted to VC++ as:

mov eax,[esi]
and eax,0x0F0F0F0F
lea ecx,[eax+0x80808080-0x0A0A0A0A]
shr ecx,4
not ecx
and ecx,0x70707070
lea ecx,[0x30303030+ecx+eax]

But at the end, ECX does not contain anything usable. Assuming a problem with the LEA code, I tried this:

mov eax,[esi]
and eax,0x0F0F0F0F
mov ecx, eax
or ecx,0x80808080;
sub ecx,0x0A0A0A0A;
shr ecx,4
not ecx
and ecx,0x70707070

with the same result.  In both cases, the low four bits of each byte in the resulting ECX are correct, but the high four bits are wrong.  Can you confirm?

(When working, I think this will have a measurable speed increase).

>> The only thing I don't yet find satisfying in the HEX part is the merging of the interleaved values.

One way to attack that (as I've seen that you already know) is to say, for instance, "if only the value in CL were in the upper byte of edx, then I could just..."  and then rearrange the initial conditions so that when the calculations are done, everything is spot on for efficient output.  Maybe I can find something on that front.

-- Dan

LVL 14

Author Comment

ID: 6892796
Yeah, I had a typo in one of the bottom lines.
and ecx,0x70707070
should be
and ecx,0x07070707

Like this, it should work just fine. Send me an email at avw@gmx.ch if you want me to send the complete version; I don't think it makes sense to post hundreds of lines of code that are still in progress right now.

LVL 14

Author Comment

ID: 6892830
Funny. The current code has 195 instructions to output a total of 65 bytes, which makes exactly 3 instructions per byte output, or 12 instructions per 32-bit value. :-)

LVL 49

Expert Comment

ID: 6893557
I've been beating on arranging the hex nibbles for optimum output and can't seem to make any progress.

Are you interested in optimizing pavlik's code? (I'm pretty sure that it will end up with the best speed.)  Here is the disassembly of the optimized C code:

//   for (int i = 4; i > 0; --i) {
//   *pBuf++ = (digits[(*dump)]) | 0x200000L | ((digits[*(dump+1)] & 0xffL) << 24);
xor  edx,edx
xor  ebx,ebx
mov  dl,byte ptr [eax+1]
mov  bl,byte ptr [eax]
//  *pBuf++ = (digits[*(dump+1)] >> 8 ) | (0x2000L) | (digits[*(dump+2)] << 16);
//  *pBuf++ = (digits[*(dump+3)] << 8) | 0x20000020L;
//  dump += 4;
mov  edx,dword ptr [edx*4+4033C0h]
mov  ebp,dword ptr [ebx*4+4033C0h]
shl  edx,18h
or   edx,ebp
xor  ebx,ebx
or   edx,200000h
mov  dword ptr [esi-4],edx
mov  bl,byte ptr [eax-2]
xor  edx,edx
mov  dl,byte ptr [eax-3]
mov  ebx,dword ptr [ebx*4+4033C0h]
shl  ebx,10h
mov  edx,dword ptr [edx*4+4033C0h]
or   edx,200000h
sar  edx,8
or   edx,ebx
mov  dword ptr [esi-8],edx
xor  edx,edx
mov  dl,byte ptr [eax-1]
mov  edx,dword ptr [edx*4+4033C0h]
shl  edx,8
or   edx,20000020h
dec  ecx
mov  dword ptr [esi-4],edx
jne  HexLineOut+13h (00401b43)
}

Aside from the obvious loop unrolling, I can see several ways to speed this up.

-- Dan

LVL 1

Expert Comment

ID: 6895001
Hi.

Thanks Dan for fixing my code.

Actually I've profiled the version with the loop and without. I also thought that it would increase performance, but surprisingly it didn't.
As for the current version of the code, there is one little improvement. Instead of writing *pBuf++=... on every line, it is better to write *pBuf=..., *(pBuf+1)=... and *(pBuf+2)=..., with pBuf+=3 at the end of the loop. That replaces three "add esi,4" instructions with one "add esi,12" and saves a whole 2 processor ticks :-).
I also thought about loading 4 bytes of dump into a register and manipulating that register inside the loop. But since I cannot guarantee that the parameter will point to aligned memory, this could lower performance.

LVL 49

Expert Comment

ID: 6895288
Pavlik,

>>*pBuf++=...
I assumed that direct addressing through a pointer (no offset) would be faster.  E.g.,  Isn't
mov  dword ptr [esi],edx
...faster than...
mov  dword ptr [esi+4],edx
?  It used to be --- back in the 'golden age' of 8086 code when we had to program in binary with our teeth.  I'd not be surprised if the index+offset addressing modes are just as fast nowadays (or even faster because one can skip incrementing the index, as you point out).

>> But since i can not guarantee that the parameter will
>> point to the alligned memory
Let's examine two cases:
1) The output pointer is aligned on a DWORD boundary:
There will be a measurable performance gain, won't there?

2) The output pointer is at some oddball offset (not DWORD aligned).  Performance is certainly lost.  But isn't the penalty about the same as doing one-byte moves?

Anyway, I've found that the compiler always places standard allocations -- including auto stack variables and new'd pointers, and structure fields -- at DWORD offsets; one needs to go out of one's way to do otherwise.  For instance, if the caller does:

char szBuf[100];
HexLineOut( pSrc, &szBuf[1] );

then the destination pointer would be unaligned and performance would suffer. In that case, the caller is at fault and should expect a slowdown. In all other cases, wouldn't the potential performance gain for the 'smart programmer' be worth the risk of slowdown for the idiots amongst us? :)

In time-critical situations, I have seen programmers check the alignment and then 'special-case' the first and last (up to) 3 bytes of unaligned data.  (I think I saw this in the C library source for memmove.)

Anyway, in my testing code, I'll ensure that both pointers will be aligned on DWORD boundaries.  So if you want to add that optimization, there is no risk.  In fact, it would be interesting to profile the routine, passing in an unaligned pointer, just to satisfy curiosity...

>>i've profiled version with loop and without...
One would think that there would be a (small) improvement with unrolled loops.  Perhaps it is so small that it is 'lost in the noise' of the byte-oriented output.

-- Dan

LVL 1

Expert Comment

ID: 6895461
Hi Dan.

I'm actually not that good in assembler. I mostly write in C++.

About the offset in the instruction: it doesn't matter to the processor whether it accesses memory indexed by a register or by register+offset. In the latter case the offset is encoded in the instruction right after the ModRM/SIB bytes and is handled during instruction decoding. (I've just read this in the "i486 commands description".)

I've tried to optimize access to dump, but as soon as I add one more register variable, the compiler starts to generate code which stores this variable in memory. It looks like there are not enough registers for it to hold all the intermediate results. So for the C version of the procedure I don't know how to make it optimize the code this way. I'm almost sure that it is possible in assembler.

Thanks, Pavlik.

LVL 39

Expert Comment

ID: 6897106
Hi all!

Here's a non-optimized version that works with MMX registers. When Dan first started this thread I realized that shifting bits around and working on byte registers wouldn't get us far in speed. MMX can do what normal x86 code cannot: work on individual bytes within one register. And it is 64 bit!

It took me quite some time to figure out how MMX works and what instructions it has, but now I can finally contribute a different approach. I hope you like it. If anybody wants some comments on the code, I'll be happy to give them.

Please note, I am not good at assembler; my last attempt was on an 80286, years ago. This version is 3.3 times faster than the original C++ized assembler posted in the other thread. I hope it still beats the timings! And I know that even with some easy optimizations, this MMX version can be sped up a lot more.

void HexLineOut(unsigned char * pSrc, char * pszDest)
{
__asm
{
mov            esi, pSrc
mov            edi, pszDest

//Lo nibble
movq      mm0, [esi]
pand      mm0, mm1            ;mask each byte against 0x0F
movq      mm6, mm0            ;store temporarily
pcmpgtb      mm0, mm4            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm0, mm5            ;mask compare-result with 0x07
movq      mm7, mm0            ;store

//Hi nibble
movq      mm0, [esi]
pand      mm0, mm2            ;mask each byte against 0xF0
psrlq      mm0, 4                  ;move hinibble to lonibble for each byte
movq      mm6, mm0            ;store temporarily
pcmpgtb      mm0, mm4            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm0, mm5            ;mask compare-result with 0x07
movq      mm6, mm0            ;store

punpcklbw      mm0, mm7      ;unpack  2468+1357 -> 12345678
punpckhbw      mm6, mm7

movq      qword ptr [edi], mm0
movq      qword ptr [edi+8], mm6

/////////////////////////////////////////////////
//Lo nibble
movq      mm0, [esi+8]
pand      mm0, mm1            ;mask each byte against 0x0F
movq      mm6, mm0            ;store temporarily
pcmpgtb      mm0, mm4            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm0, mm5            ;mask compare-result with 0x07
movq      mm7, mm0            ;intermediate result

//Hi nibble
movq      mm0, [esi+8]
pand      mm0, mm2            ;mask each byte against 0xF0
psrlq      mm0, 4                  ;move hinibble to lonibble for each byte
movq      mm6, mm0            ;store temporarily
pcmpgtb      mm0, mm4            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm0, mm5            ;mask compare-result with 0x07
movq      mm6, mm0            ;store

punpcklbw      mm0, mm7      ;unpack  2468+1357 -> 12345678
punpckhbw      mm6, mm7

mov            byte ptr [edi+16], 0x2d
movq      qword ptr [edi+17], mm0
movq      qword ptr [edi+25], mm6
mov            byte ptr [edi+33], 0x20

//Create the stringized part

movq      mm0, [esi]
movq      mm1, mm0            ;store original
//under 32
pcmpgtb      mm0, mm4            ;SIGNED compare, everything between 127+ and 31- will be masked
pandn      mm0, mm6
movq      mm3, mm0            ;store dot-result
pand      mm1, mm2            ;original AND original mask

movq      qword ptr [edi+34], mm1

movq      mm0, [esi+8]
movq      mm1, mm0            ;store original
pcmpgtb      mm0, mm4            ;SIGNED compare, everything between 127+ and 31- will be masked
pandn      mm0, mm6
movq      mm3, mm0            ;store dot-result
pand      mm1, mm2            ;original AND original mask

movq      qword ptr [edi+42], mm1
mov            byte ptr [edi+50], 0x00

emms                              ;clear mmx registers
}
}

LVL 39

Expert Comment

ID: 6897107
Sorry, forgot to include the masks:

unsigned char mask0F[8] = {0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F};  //to get lonibble
unsigned char maskF0[8] = {0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0};  //to get hinibble
unsigned char mask30[8] = {0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30};  //to make digits
unsigned char mask39[8] = {0x39, 0x39, 0x39, 0x39, 0x39, 0x39, 0x39, 0x39};  //to differ betw. digits/A-F
unsigned char mask07[8] = {0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07};  //to make A-F
unsigned char mask1f[8] = {0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f};  //to differ betw. chars < asc(32)
unsigned char mask46[8] = {0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e};  //the dot (0x2e = 46 dec)

LVL 39

Expert Comment

ID: 6897118
I will update with a version that includes the spaces in the output string. I had "forgotten" that, I was just too eager to share this solution.

LVL 1

Expert Comment

ID: 6897271
Hi folks!

Since I am new in this thread: I wrote a 100% assembler version of your quest, but since I have a fast machine, I can complete 65,535 iterations well under 1 second! Since I doubt that my 1st attempt rivals yours, is it possible to know, in cycles, how fast you are completing 1 iteration without the output string process? This way I can compare apples to apples and find out where my code really stands! If I am in the ballpark, I will post my code!

Thanks, Serge!

LVL 39

Expert Comment

ID: 6897327
I don't know about the cycles, but there are complete listings here; you can easily compare them. I do the timing using QueryPerformanceCounter; other machines probably have similar facilities. You can also find Dan's C++ code here, which does the timing in a dialog-based application counting the conversions per millisecond.

Reference C++ fn: 21,000 conversions per millisecond
Pavlik's C fn: 40,300 conversions per millisecond
AvonWyss asm: 30,200 conversions per millisecond
Jonnin's sprintf: 450 conversions per millisecond
My code: not tested by Dan yet, but 2.06x pavlik on my machine.

All these versions are tested on Dan's machine, which is a PIII 450.

Dan, if you are doing timings, you may want to set the thread priority to time critical.

That way you avoid as many unfair interruptions from context switches and the like as possible.

Cheers!
Abel

LVL 49

Expert Comment

ID: 6897363
Hi sergelabel,
Give it a shot!  Try to work out a single clean function as described earlier.  In that format, I can easily plug it into my testing app.  All of the overhead should be similar for all submissions.  I run it with the output turned ON just once so that I can verify correct formating.

-- Dan

LVL 49

Expert Comment

ID: 6897366
abel,
I'm glad you're back!  Bearing gifts!  I also thought that MMX would kick ass on std opcodes, but I would have had to start from scratch learning how to use it.  That didn't stop YOU though!  Great work.

You are getting 80,000+ conversions per millisecond?  AWESOME!  I can't wait to get home so I can test it (and maybe learn something brand new!)  I'm so excited -- if I were a doggy, I'd be wagging my tail like mad!

Good Idea. I did add a Sleep(0) call right before starting to try to get a full timeslice.

-- Dan

LVL 14

Author Comment

ID: 6897543
abel, I'm afraid that your version will suffer a substantial slowdown when you add the code for the spacing. In fact, I believe that this is one of the more challenging parts. Also note that the (original) timings for my ASM were not 30'200, but only 21'600. But I will post the current version for benchmarking; I believe that it should already do quite a bit better (even though it will most likely not come even near the table approaches).

mov edi,pszBuf
mov esi,pSrc
mov byte ptr [edi+64],0
mov eax,[esi]
mov ebx,eax
shr eax,4
and eax,$0F0F0F0F
and ebx,$0F0F0F0F
lea edx,ebx+$80808080-$0A0A0A0A
lea ecx,eax+$80808080-$0A0A0A0A // ecx = (eax | $80808080) - $0A0A0A0A
shr edx,4
shr ecx,4                       // get part of the subtraction which is significant for us
not edx
not ecx                         // negate the offset nibbles
and edx,$07070707
and ecx,$07070707
lea edx,$30303030+edx+ebx
lea ecx,$30303030+ecx+eax       // ecx = $30303030 + eax + ecx
mov eax,edx   // xx xx dh dl
bswap eax     // dl dh xx xx
mov ah,$20    // dl dh 20 xx
mov al,cl     // dl dh 20 cl
shr edx,16
rol eax,8     // dh 20 cl dl  <<-- desired output
shr ecx,8
mov [edi],eax
mov bl,ch
mov bh,dl
bswap ebx
mov bl,cl
and ecx,$00FF0000
mov bh,32
or  ecx,$20000020
mov [edi+4],ebx
mov eax,[esi+4]
and edx,$0000FF00
mov ebx,eax
or  ecx,edx
shr eax,4
mov [edi+8],ecx
and eax,$0F0F0F0F
and ebx,$0F0F0F0F
lea ecx,eax+$80808080-$0A0A0A0A
lea edx,ebx+$80808080-$0A0A0A0A
shr ecx,4
shr edx,4
not ecx
not edx
and ecx,$07070707
and edx,$07070707
lea ecx,$30303030+ecx+eax
lea edx,$30303030+edx+ebx
mov eax,edx
bswap eax
mov ah,$20
shr edx,16
mov al,cl
shr ecx,8
rol eax,8
mov bl,ch
mov [edi+12],eax
mov bh,dl
bswap ebx
mov bl,cl
and ecx,$00FFFF00
mov bh,32
or  ecx,$2D000020
mov [edi+16],ebx
mov eax,[esi+8]
mov ch,dh
mov ebx,eax
mov [edi+20],ecx
shr eax,4
and ebx,$0F0F0F0F
and eax,$0F0F0F0F
lea ecx,eax+$80808080-$0A0A0A0A
lea edx,ebx+$80808080-$0A0A0A0A
shr edx,4
shr ecx,4
not edx
not ecx
and edx,$07070707
and ecx,$07070707
lea edx,$30303030+edx+ebx
lea ecx,$30303030+ecx+eax
mov eax,edx
bswap eax
shr edx,16
mov ah,$20
mov al,cl
shr ecx,8
rol eax,8
mov bl,ch
mov [edi+24],eax
mov bh,dl
bswap ebx
mov bl,cl
and ecx,$00FFFF00
mov bh,32
or  ecx,$20000020
mov eax,[esi+12]
mov [edi+28],ebx
mov ebx,eax
mov ch,dh
shr eax,4
mov [edi+32],ecx
and ebx,$0F0F0F0F
and eax,$0F0F0F0F
lea edx,ebx+$80808080-$0A0A0A0A
lea ecx,eax+$80808080-$0A0A0A0A
shr edx,4
shr ecx,4
not edx
not ecx
and edx,$07070707
and ecx,$07070707
lea edx,$30303030+edx+ebx
lea ecx,$30303030+ecx+eax
mov eax,edx
bswap eax
shr edx,16
mov ah,$20
mov al,cl
shr ecx,8
rol eax,8
mov bl,ch
mov [edi+36],eax
mov bh,dl
bswap ebx
mov bl,cl
and ecx,$00FFFF00
mov bh,32
or  ecx,$20000020
mov [edi+40],ebx
mov ch,dh
mov [edi+44],ecx
mov ecx,[esi]         // ABC, ECX = source chars
mov ebx,ecx
mov edx,$80808080     // EDX: mask for high bits (one per byte)
not ecx
and ecx,edx           // ECX: negated copy of the high bits
and ebx,$7F7F7F7F     // EBX: source w/o highest bits, prevents carry across bytes
lea eax,$60606060+ebx
and eax,ecx
mov ecx,[esi+4]
shr eax,7             // EAX: lowest bit of each byte is 1 if OK (source bits 5,6 != 0,0 and top bit == 0)
sub edx,eax           // EDX: $7F if OK, else $80
and ebx,edx           // EBX: only valid chars remain
not edx
and edx,$2E2E2E2E     // EDX: make all invalid chars 2E ('.')
or  edx,ebx           // EDX: combine valid and invalid chars
mov ebx,ecx
mov [edi+48],edx      // write result
mov edx,$80808080
not ecx
and ecx,edx
and ebx,$7F7F7F7F
lea eax,$60606060+ebx
and eax,ecx
mov ecx,[esi+8]
shr eax,7
sub edx,eax
and ebx,edx
not edx
and edx,$2E2E2E2E
or  edx,ebx
mov ebx,ecx
mov [edi+52],edx
mov edx,$80808080
not ecx
and ecx,edx
and ebx,$7F7F7F7F
lea eax,$60606060+ebx
and eax,ecx
mov ecx,[esi+12]
shr eax,7
sub edx,eax
and ebx,edx
not edx
and edx,$2E2E2E2E
or  edx,ebx
mov ebx,ecx
mov [edi+56],edx
mov edx,$80808080
not ecx
and ecx,edx
and ebx,$7F7F7F7F
lea eax,$60606060+ebx
and eax,ecx
shr eax,7
sub edx,eax
and ebx,edx
not edx
and edx,$2E2E2E2E
or  edx,ebx
mov [edi+60],edx

LVL 39

Expert Comment

ID: 6897554
Hi Dan,

> You are getting 80,000+ conversions per millisecond?
Perhaps on your system... But I'm wondering about your timings. If Pavlik's code executes 40,300 times per millisecond on a PIII 450, then that makes 450,000,000 / 40,300,000 = 11.17 cycles per call. As a comparison, the lowest timing is on an empty function (with one asm statement); I used it for calculating the overhead (out of curiosity: the overhead is, as you said, the same for every function). My machine is a PIII 800 MHz. That means that an empty function already takes 800,000,000 / 72,180,000 = 11.08 cycles on my machine! I think you should check the length of a millisecond on your machine; I think it's cheating on you!

Note on my timings: Ticks is what QueryPerformanceCounter returns, and Reference is relative to my code, which counts as a reference of one. A reference of 5.5 means that the code is 5.5 times slower than mine; a reference below 1.0 means code faster than mine. These are not average timings, just a snapshot.

MINE
Iterations: 1000000
Ticks:      319623
Seconds:    0.0893
Loops/msec: 11199
Reference:  1.0000

AVONWYSS
Iterations: 1000000
Ticks:      1061417
Seconds:    0.2965
Loops/msec: 3372
Reference:  3.3208

PAVLIK
Iterations: 1000000
Ticks:      705361
Seconds:    0.1971
Loops/msec: 5075
Reference:  2.2069

EMPTYFUNCTION
Iterations: 1000000
Ticks:      49588
Seconds:    0.0139
Loops/msec: 72186
Reference:  0.1313

LVL 39

Expert Comment

ID: 6897558
> I'm afraid that your version will suffer a substantial slowdown
> I'm afraid that your version will suffer a substantial slowdown
> I'm afraid that your version will suffer a substantial slowdown

I'm afraid so too, but I'm working on it, I have some neat ideas :-)
I'm afraid so too, but I'm working on it, I have some neat ideas :-)
I'm afraid so too, but I'm working on it, I have some neat ideas :-)

Cheers (3x)
Abel

PS: Thanks for the new code. Indeed, I did not have it. Sorry for the wrong benchmark posting in the last post. I will re-evaluate it.

LVL 14

Author Comment

ID: 6897582
Sorry for the multiple post. I don't know how this could happen, EE showed me that annoying, well-known "Down for maintenance" page. Anyways, if the thread gets too long, we'll just open a new one, right? :-)

LVL 39

Expert Comment

ID: 6897662
Yep, room enough in the Assembler TA!

> I'm afraid that your version will suffer a substantial slowdown
You were right, but even with the ugliest change of the code I made so far, it is still the fastest :)

ABEL
Loops/msec: 8036
Reference:  1.0000

AVONWYSS
Loops/msec: 3375
Reference:  2.3813

PAVLIK
Loops/msec: 5095
Reference:  1.5772

Why don't we just easily fill the memory with spaces? That will be my next hack. This one comes down to repeating this a lot of times and adjusting the ESI pointers:

movd     dword ptr [edi], mm0
mov          byte ptr [edi+2], 0x20
psrlq     mm0, 16

New version coming up.

LVL 49

Expert Comment

ID: 6897664
abel,
I'm also beginning to suspect my timing logic.  I think that QueryPerformanceCounter would be better.  I have been seeing a lot of variation in the times.  I read a large datafile (100KB I think) and run 1000 passes at it.  It gets done in around three seconds on fast code.

I did not do the math, but 30,000 calls per millisecond does seem wrong.  The PAVLIK code is over 500 bytes of opcodes, just for the hex lookup and output.  So 11 cycles is absurd.  I may be overflowing my counter or something; I'll check into it.  However, as one datapoint, your Pavlik-to-AvonWyss ratio looks at least roughly similar to what I've seen.

Incidentally, I don't think I have ever seen numbers better than about 30,000/ms.  If I said 40,000 somewhere, I apologise.

AvonWyss,
EE has been rather flaky lately.  I've gotten into the habit of at least copying the post to the clipboard before sending.  Then if EE chokes, I can paste it into a text file and save it for resend sometime later.  I always check the page before reposting...

On another note:
I am having trouble getting good performance out of (my first pass at) a hand-optimized version of Pavlik's lookup code.  It's possible that the C compiler is outputting better code than ME.  How DARE IT!  But I have not given up.  Maybe I need to interleave the opcodes to avoid pipeline stalls.  Or maybe I am experiencing a brain fart.  Hard to tell.

-- Dan

LVL 39

Expert Comment

ID: 6897665
Sorry, EDI pointers :)

LVL 49

Expert Comment

ID: 6897688
>>Why don't we just easily fill the memory with spaces?
>> ...
We still would need to do a bunch of unaligned two-byte writes.  I think that is goodbye to performance.  I think it is your use of (e.g.)
movq     mm0, [esi]
that is giving you the upper hand here.  You are reading and writing 64-bit chunks while the rest of us are writing 32 (at best).

-- Dan

LVL 14

Author Comment

ID: 6897726
Dan, I found an interesting document which explains pretty well the different stages of instruction processing in the CPU, along with useful information about what to do or avoid to get good performance. It also points out differences between new processors and the classic Pentium and Pentium MMX, and goes up to the PIII. In fact, a lot has changed since 286/386/486 assembly programming, and the bottlenecks have shifted somewhat.

http://www.agner.org/assem/

I found this really educational. But let me warn you: even though it's very well explained, I have found it very hard to use it to actually produce better code. It's just too complex for my little assembly knowledge to find working code that is also optimal in terms of micro-ops etc.

LVL 39

Expert Comment

ID: 6897793
> your Pavlik-to-AvonWyss ratio looks at least roughly similar to what I've seen
Yes, I noticed that. I just considered your timings relative ones, which still make them pretty usable.

> I read a large datafile (100KB I think) and run 1000 passes at it.
You will also have overhead of the memory calls to get that data. I time with only 16 bytes and loop 100,000 times or more on it. That way it will run mostly on data in the 1st level cache, I hope. If you want my timing-code to compare it to yours, just scream.

>If I said 40,000 somewhere

> You are reading and writing 64-bit chunks while the rest of us are writing 32 (at best).
In the new version, which is still faster (see above), I use an ugly unaligned algorithm (if you can call it an algorithm at all!) to place the spaces where they belong.

I was thinking of REP STOSD, which should be the fastest way of initializing memory according to Intel, but it didn't help me much. It slowed down from 8050 to 6500 loops!

Cheers
Abel

LVL 49

Expert Comment

ID: 6897929
abel,
>>...40,300 times per millisecond on a PIII 450, then it
>> makes 450,000,000 / 40,300,000 = 11.17 cycles per call.

Here's the discrepancy:  My times are in conversions-per-time-unit while you are talking in calls-per-unit.  Thus, if Pavlik's code gets 40,000/ms, that is 11.17 cycles per character processed, which might be more likely.

Adjusting down to 30,000/ms, and with that in mind, then
450,000,000 / 30,000,000 = 15 cycles per BYTE,
15*16 = 240 cycles per call
which still seems too fast, but it is in the realm of feasibility.

BTW, I think my use of a large input buffer is a better workout than your repeated use of a 16-byte buffer.  It is a stress test on memory access as well as on the efficiency of the conversion code.  Inefficient memory accessing techniques (such as single-byte writes) should carry a significant time penalty, as they do in reality.  I can imagine this routine being called by VC++'s 'binary file' viewer (I once loaded a 12 MB file into it and, while sluggish, it did do the job).

-- Dan

LVL 49

Expert Comment

ID: 6898753
abel,
Indeed your code is converting 67,000/ms.  If you can solve the problem of putting spaces in between the digit-pairs, you will have a monster on your hands.

That might not be too hard.  Code like this:

movd edx,mm0       // get the current digit pair
or  edx,0x00200000 // set (sp)XX above it
movd mm7,edx       // back into an MMX register

por  mm1, mm7 // start setting the output register
psllq mm1,24  // make room for next three bytes

psrlq mm0,16  // prepare to get the next two digits
movd edx,mm0  // get the next two bytes
... etc...

I think there is an issue of the byte order getting reversed in the above code, but I think it is solvable.  Also, there is no question that this will add many cycles to your code, but it will likely remain the fastest.

-- Dan

LVL 39

Expert Comment

ID: 6904095
thanx ;)

>  but it will likely remain the fastest.
On my own tests: about thirty percent above Pavlik :) I indeed lost a lot like AvonWyss predicted, but I'm again working my way up to making the speed-gap bigger.

> If you can solve the problem of putting spaces in between the digit-pairs
I have resolved that. See my prev. post for a temporary solution that is still very fast. But in my current application (still working full-time on it!) I squeeze another 20% out of it by modifying the ugly writes from above into something a little prettier. It now takes 12 opcodes for half a string. Wait a moment, I'll post the update.

> think there is an issue of the byte order getting reversed
Yes, I ran into that while working out a solution like that. But I now use a different approach, because I ended up slower than Pavlik.

Remember what I said about putting spaces in the target string up front? I gained about 10 percent extra speed by no longer trying to get the spaces into the MMX registers. I now use maskmovq (SSE, I think, but you have a PIII, so that's all right) and some shifting to get it done. Maskmovq leaves the bytes alone that must not be written. I'm still working out a variant of this solution that might be a little faster, because the shuffle and mask instructions cost extra cycles if you use them too quickly after one another.
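For readers without the Intel manual handy: maskmovq performs a byte-granular masked store, writing only those source bytes whose corresponding mask byte has its high bit set. A scalar C model of that behavior (a sketch of the semantics, not the instruction itself; the real instruction stores implicitly to [edi], while this model takes an explicit pointer):

```c
#include <assert.h>
#include <string.h>

/* Scalar model of MASKMOVQ: store the 8 source bytes to dst, but only
   where the corresponding mask byte has bit 7 set; the remaining
   destination bytes (the prefilled spaces) are left untouched. */
void maskmovq_model(unsigned char *dst, const unsigned char src[8],
                    const unsigned char mask[8])
{
    int i;
    for (i = 0; i < 8; i++)
        if (mask[i] & 0x80)
            dst[i] = src[i];
}
```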

And now for the code in the next post.

Cheers
Abel
0

LVL 39

Expert Comment

ID: 6904107
Place this in the first write-cycle. Replace the two movq instructions with this.

pshufw    mm1, mm0, 10000000B
maskmovq  mm1, mm7             ;pos 1, 2, 5 and 6
pshufw    mm1, mm0, 11000001B  ;[12345678] --> [34121278]
pshufw    mm1, mm6, 10000000B
maskmovq  mm1, mm7             ;pos 1, 2, 5 and 6
pshufw    mm1, mm6, 11000001B  ;[12345678] --> [34121278]

Place the following codesnippet in the second write-cycle. Replace the mov-movq-movq-mov instructions with this:

pshufw    mm1, mm0, 10000000B
maskmovq  mm1, mm7             ;pos 1, 2, 5 and 6
pshufw    mm1, mm0, 11000001B  ;[12345678] --> [34121278]
pshufw    mm1, mm6, 10000000B
maskmovq  mm1, mm7             ;pos 1, 2, 5 and 6
pshufw    mm1, mm6, 11000001B  ;[12345678] --> [34121278]
sub       edi, 39

As far as I know, I left the rest untouched. I will probably post a faster version in a few days, because there is some redundancy in my code.

Important note: I have not included the code for setting the destination string to all spaces; I cheat and use memset. The idea is that the positions of the spaces do not change. For correct functionality, I guess it would be best to use a static variable (I know statics are slow, but it's faster than writing all those spaces!) to check for a changed target pointer, like this:

if (newDest != oldDest)
{
    memset(newDest, 0x20, 64);    // use 64: fastest
    oldDest = newDest;
}

Or in asm, if you wish :)

Alas, if your test code uses an output buffer that is filled with constantly changing pointers, this may incur a penalty greater than the speed gained. But if you only have to print it to the screen (the original requirement!) then there's no need for buffering.
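The pointer-caching idea above can be sketched in plain C (names are illustrative, not from the posted code): the space prefill runs only when the destination pointer differs from the previous call, so repeated calls on the same buffer skip the memset entirely.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the caching idea: prefill the 64-byte output line with
   spaces only when the destination pointer changes between calls. */
static char *oldDest = 0;
static int prefillCount = 0;   /* counts how often the memset actually ran */

void PrefillIfNeeded(char *newDest)
{
    if (newDest != oldDest) {
        memset(newDest, 0x20, 64);   /* 0x20 == ' ' */
        oldDest = newDest;
        ++prefillCount;
    }
}
```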
0

LVL 39

Expert Comment

ID: 6904137
On timings: I decided to take my own advice and call SetThreadPriority just before and after I start and stop the clocks. It makes a HUGE difference on my machine. Previously, I wasn't able to find the real difference in performance between the first and latest versions of AvonWyss's code. Now I can say it is the difference between 1.05 megaticks and 1.09 megaticks. And the variance between timings is minimal now.

Abel
0

LVL 14

Author Comment

ID: 6904528
Abel, pretty interesting stuff you're doing. Unfortunately, requiring SSE puts me out of the race, since I use an AMD Athlon (1st generation, 800 MHz). Also, your very correct observation that the space bytes remain in the same place, and thus wouldn't need to be set in a real-world application, does make your approach less comparable to the others, which always generate a complete string.

Also, it may be interesting to see how the different versions perform on slow, uncached memory (that is, writing directly to display memory). However, for optimal direct display output, the buffer would need to have color bytes interleaved, so we would end up with a buffer of 128 chars, each text char using 16 bits. This would change the task somewhat.
0

LVL 49

Expert Comment

ID: 6904610
abel,
I'm having trouble with that last code.  The output I'm getting looks like:

41 42 14 44 45 46 54 48 49 4A 94 4ABCDEFGHIJKLMNOP

in which the third, seventh, and eleventh hex-digit pairs are incorrect and the ASCII part is not positioned correctly.

This may easily be my fault, since I had to guess at the value for maskMOV and my compiler chokes on two of the MMX opcodes (but the disassembly window shows that I have hacked the opcode bytes correctly).  I used:

unsigned char maskMOV[8] ={0x80, 0x80, 0x00, 0x80, 0x80, 0x00, 0x80, 0x80};

#define M_MaskMovQ_mm1mm7  __asm _emit 0x0F __asm _emit 0xF7 __asm _emit 0xCF
#define M_PShufW(mmrmmr,nImm) __asm _emit 0x0f __asm _emit 0x70 __asm _emit mmrmmr __asm _emit nImm

M_PShufW (0xC8,10000000b) //mm1, mm0, 10000000B
M_PShufW (0xC8,11000001b) //mm1, mm0, 11000001B
M_PShufW (0xCE,10000000b) //mm1, mm6, 10000000B
M_PShufW (0xCE,11000001b)  //mm1, mm6, 11000001B

etc.  Any idea what's going wrong?
=--==-=-=-=-
Preliminary timing shows that this additional code causes a very significant slowdown, but it is still (maybe/probably) somewhat faster than Pavlik's.  I think the unaligned writes are doing the damage, because the code itself is very tight.

-- Dan
0

LVL 39

Expert Comment

ID: 6906222
Hi Dan,

Your mask is what's wrong. I'm sorry, I forgot to include it. The maskMOV mask should set ONLY the outer bytes. I have defined it as follows (though other values work as well):

unsigned char maskMOV[8]= {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff};

Your opcodes are correct, I think. This is what the disassembly shows in VC++ .NET (yes, yes, I upgraded, if only to be able to use maskmovq & co. without trouble! but it sure made debugging assembly much easier!). Go ahead and compare your opcodes, but I think they are correct.

pshufw          mm1, mm0, 10000000B
00412A4C 0F 70 C8 80      pshufw      mm1,mm0,80h
maskmovq     mm1, mm3                    ;pos 1, 2, 5 and 6
00412A50 0F F7 CB         maskmovq    mm1,mm3
pshufw          mm1, mm0, 11000001B          ;[12345678] --> [34121278]
00412A53 0F 70 C8 C1      pshufw      mm1,mm0,0C1h
00412A57 83 C7 03         add         edi,3
00412A5A 0F F7 CB         maskmovq    mm1,mm3
pshufw          mm1, mm2, 10000000B
00412A5D 0F 70 CA 80      pshufw      mm1,mm2,80h
00412A61 83 C7 09         add         edi,9
maskmovq     mm1, mm3                    ;pos 1, 2, 5 and 6
00412A64 0F F7 CB         maskmovq    mm1,mm3
pshufw          mm1, mm2, 11000001B          ;[12345678] --> [34121278]
00412A67 0F 70 CA C1      pshufw      mm1,mm2,0C1h
00412A6B 83 C7 03         add         edi,3
00412A6E 0F F7 CB         maskmovq    mm1,mm3

0

LVL 39

Expert Comment

ID: 6906230
I've done some more tweaking. I agree with AvonWyss that the code is not completely comparable: if not for the specialized processor, then for the different approaches. To address that, I included two static variables to check whether the data has changed, which brings it back in line with the original requirements.

Better tweaks include making the code denser and getting rid of some redundant memory accesses. On my machine it runs at about 1.52 times the speed of Pavlik's code. Even without the if-statement at the beginning it is still faster, but then it drops to about 1.19 times Pavlik.

Well, nuff said, here's the code:

unsigned char mask0F[8] = {0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F};  //to get hinibble
unsigned char mask30[8] = {0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30};  //to make digits
unsigned char mask39[8] = {0x39, 0x39, 0x39, 0x39, 0x39, 0x39, 0x39, 0x39};  //to differ betw. digits/A-F
unsigned char mask07[8] = {0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07};  //to make A-F
unsigned char mask1f[8] = {0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f};  //for differ betw. < asc(32)
unsigned char mask2e[8] = {0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e};  //the dash
unsigned char maskMOV[8]= {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff};

void HexLineOut(unsigned char * pSrc, char * pszDest)
{
static char * pOldDest = NULL;
static bool bFirstTime = true;
__asm
{
//test for equality of old and new pointer
mov     eax,dword ptr [pOldDest]
cmp     eax,dword ptr [pszDest]
je      lblTheEnd

//store new dest in old dest
mov     eax,dword ptr [pszDest]
mov     dword ptr [pOldDest],eax

//Fill only the first time this pointer is used with spaces
mov            edi, pszDest
mov            ecx, 16
mov            eax, 0x20202020
rep            stosd

//Store the mask only once. This can be done smarter, check the mask or something like that?
test      bFirstTime, 0x0                        //should be better to check for a checksum of mm4-mm7
jne            lblTheEnd
mov            bFirstTime, 0x0
lblTheEnd:
}

__asm
{
mov            esi, pSrc
mov            edi, pszDest
prefetcht0      [esi]            ;wins between 0.5-1.5 percent performance

//Lo nibbles

movq      mm0, [esi]
movq      mm1, mm0            ;we need a copy (see hi nibbles part)
pand      mm1, mm4            ;mask each byte against 0x0F
movq      mm3, mm1            ;store temporarily
pcmpgtb      mm1, mm6            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm1, mm7            ;mask compare-result with 0x07
paddb      mm1, mm3            ;add the bytes that were masked to our data -> mm1 holds eight lo nibbles

//Hi nibbles
movq      mm2, mm0
psrlq      mm0, 4                  ;move hinibble to lonibble for each byte
pand      mm0, mm4            ;mask each byte against 0x0F
movq      mm3, mm0            ;store temporarily
pcmpgtb      mm0, mm6            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm0, mm7            ;mask compare-result with 0x07
paddb      mm0, mm3            ;add the bytes that were masked to our data -> mm0 holds eight hi nibbles
movq      mm2, mm0            ;store, we need this copy

//unpack bytes 2468+1357 -> 12345678. Now mm0 and mm2 hold the resulting converted 8 bytes
punpcklbw      mm0, mm1      ;[xxxxyyyy] and [xxxxzzzz] --> [zyzyzyzy]
punpckhbw      mm2, mm1      ;[yyyyxxxx] and [zzzzxxxx] --> [zyzyzyzy]

//movq2dq, pslldq, pand(xmm), movdqa, movdqu and other integer operations on xmm are SSE2 instructions!

//store
pshufw    mm1, mm0, 10000000B
maskmovq  mm1, mm3             ;pos 1, 2, 5 and 6
pshufw    mm1, mm0, 11000001B  ;[12345678] --> [34121278]
pshufw    mm1, mm2, 10000000B
maskmovq  mm1, mm3             ;pos 1, 2, 5 and 6
pshufw    mm1, mm2, 11000001B  ;[12345678] --> [34121278]

/////////////////////////////////////////////////
//Lo nibbles
// NO SECOND PREFETCH! That slows down a lot!
movq      mm0, [esi+8]
movq      mm1, mm0            ;we need a copy (see hi nibbles part)
pand      mm1, mm4            ;mask each byte against 0x0F
movq      mm3, mm1            ;store temporarily
pcmpgtb      mm1, mm6            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm1, mm7            ;mask compare-result with 0x07
paddb      mm1, mm3            ;add the bytes that were masked to our data -> mm1 holds eight lo nibbles

//Hi nibbles
movq      mm2, mm0
psrlq      mm0, 4                  ;move hinibble to lonibble for each byte
pand      mm0, mm4            ;mask each byte against 0x0F
movq      mm3, mm0            ;store temporarily
pcmpgtb      mm0, mm6            ;compare each byte with 0x39 and if greater, store 0xFF in dest. byte
pand      mm0, mm7            ;mask compare-result with 0x07
paddb      mm0, mm3            ;add the bytes that were masked to our data -> mm0 holds eight hi nibbles
movq      mm2, mm0            ;store, we need this copy

//unpack bytes 2468+1357 -> 12345678. Now mm0 and mm2 hold the resulting converted 8 bytes
punpcklbw      mm0, mm1      ;[xxxxyyyy] and [xxxxzzzz] --> [zyzyzyzy]
punpckhbw      mm2, mm1      ;[yyyyxxxx] and [zzzzxxxx] --> [zyzyzyzy]

//movq2dq, pslldq, pand(xmm), movdqa, movdqu and other 128 bit integer operations on are SSE2 instructions!

//store
pshufw    mm1, mm0, 10000000B
maskmovq  mm1, mm3             ;pos 1, 2, 5 and 6
pshufw    mm1, mm0, 11000001B  ;[12345678] --> [34121278]
pshufw    mm1, mm2, 10000000B
maskmovq  mm1, mm3             ;pos 1, 2, 5 and 6
pshufw    mm1, mm2, 11000001B  ;[12345678] --> [34121278]

sub                  edi, 39                              ;restore

//Create the stringized part

//DO NOT: prefetcht0      [edi+32]
movq      mm0, [esi]
movq      mm1, mm0            ;make a copy
pcmpgtb      mm0, mm3            ;SIGNED compare, everything between 127+ and 31- will be masked in mm0
pandn      mm0, mm4
pand      mm1, mm2            ;original AND original mask

movq      qword ptr [edi+48], mm1

movq      mm0, [esi+8]
movq      mm1, mm0            ;store original
pcmpgtb      mm0, mm3            ;SIGNED compare, everything between 127+ and 31- will be masked
pandn      mm0, mm4
pand      mm1, mm2            ;original AND original mask

movq      qword ptr [edi+56], mm1
mov            byte ptr [edi+65], 0x00

//I thought removing emms (I can do it later) would speed up, but it slows down!
//and it is not in the generated opcode: that is just how it should be, but what does the processor do?
emms                              ;clear mmx registers
}
}
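The digit arithmetic that the pcmpgtb / pand / paddb sequence above implements has a simple scalar form: add 0x30 ('0'), then add 7 more for any value past 0x39 ('9') so it lands in 'A'..'F'. A per-byte model of the technique (a sketch, not the MMX code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the nibble conversion: add '0' (0x30), then add 7
   more for values greater than '9' (0x39).  The ?: mirrors the
   pcmpgtb all-ones/all-zeros result, masked down to 7 as in pand 0x07. */
char nibble_to_hex(uint8_t n)                      /* n in 0..15 */
{
    uint8_t c  = (uint8_t)(n + 0x30);              /* '0'.. and beyond */
    uint8_t gt = (uint8_t)(c > 0x39 ? 0xFF : 0x00);/* compare mask     */
    return (char)(c + (gt & 0x07));                /* +7 only for A..F */
}
```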
0

LVL 39

Expert Comment

ID: 6906238
AvonWyss, a word on the AMD. Does it support the PEXTRW instruction? I have a version that uses it; it's somewhat slower, but still pretty fast. And does the AMD support prefetch?

Just replace the storing-parts with this and remove the sub edi instruction:

pextrw  eax, mm0, 0x0
mov     [edi+24], ax
pextrw  eax, mm0, 0x1
mov     [edi+27], ax
pextrw  eax, mm0, 0x2
mov     [edi+30], ax
pextrw  eax, mm0, 0x3
mov     [edi+33], ax
pextrw  eax, mm2, 0x0
mov     [edi+36], ax
pextrw  eax, mm2, 0x1
mov     [edi+39], ax
pextrw  eax, mm2, 0x2
mov     [edi+42], ax
pextrw  eax, mm2, 0x3
mov     [edi+45], ax
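The pattern behind this pextrw variant is simple: each 16-bit extract holds one two-digit pair, stored at a stride of 3 bytes so the prefilled space between pairs stays untouched. A scalar, endian-neutral model of that store pattern (illustrative names, not the MMX code):

```c
#include <assert.h>
#include <string.h>

/* Scalar model of the pextrw store pattern: copy two-character digit
   pairs into a space-prefilled line at a 3-byte stride, leaving the
   separator byte between each "XX" pair alone. */
void store_pairs(char *line, const char pairs[8][2])
{
    int i;
    for (i = 0; i < 8; i++) {
        line[3 * i]     = pairs[i][0];   /* like: mov [edi+3*i], ax */
        line[3 * i + 1] = pairs[i][1];
        /* line[3*i + 2] stays a prefilled space */
    }
}
```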

0

LVL 39

Expert Comment

ID: 6906244
I will be away over Easter. If you are going to test the code on your machines, note the use of PREFETCH. It may slow things down or speed them up. The current use of prefetch speeds things up a little on my machine, but I wonder about yours.

Cheers!
Abel
0

LVL 49

Accepted Solution

DanRollins earned 300 total points
ID: 6906778
abel,
I wrote this code to eliminate unaligned writes.  I'll be testing it against that new code later.

#define mm2_ax 0xD0 // for PINSRW
#define mm2_bx 0xD3
#define mm2_cx 0xD1
#define mm2_dx 0xD2

#define eax_mm0 0xC0 // for PEXTRW
#define ebx_mm0 0xD8
#define ecx_mm0 0xC8
#define edx_mm0 0xD0

#define eax_mm1 0xC1
#define ebx_mm1 0xD9
#define ecx_mm1 0xC9
#define edx_mm1 0xD1

#define M_PINSRW(mm_r16,nImm) __asm _emit 0x0f __asm _emit 0xC4 __asm _emit mm_r16 __asm _emit nImm
#define M_PEXTRW(r32_mm,nImm) __asm _emit 0x0f __asm _emit 0xC5 __asm _emit r32_mm __asm _emit nImm

//----------------------------------------
// expects:
//   mm0 and mm1 have 8 bytes of hex digits eg mm0=0x31323334
//   edi is output buf (will receive 24 bytes)
//   cx is 0x20000020

#define M_Out24( cDashOrSpace ) __asm {\
__asm M_PEXTRW(eax_mm0,0x00) \
__asm M_PEXTRW(ebx_mm0,0x01) \
__asm shl ebx,8 \
__asm or  ebx, ecx \
__asm M_PINSRW(mm2_ax,0x00) \
__asm M_PINSRW(mm2_bx,0x01) \
__asm shr ebx,16 \
__asm M_PINSRW(mm2_bx,0x02) \
__asm M_PEXTRW(edx_mm0,0x02) \
__asm M_PINSRW(mm2_dx, 0x03) \
__asm movq [edi], mm2 /* save 8 bytes*/ \
__asm M_PEXTRW(ebx_mm0,0x03) \
__asm shl ebx,8 \
__asm or  ebx, ecx \
__asm M_PINSRW(mm2_bx,0x00) \
__asm shr ebx,16 \
__asm M_PINSRW(mm2_bx,0x01) \
__asm M_PEXTRW(eax_mm1,0x00) \
__asm M_PEXTRW(ebx_mm1,0x01) \
__asm M_PEXTRW(edx_mm1,0x02) \
__asm M_PINSRW(mm2_ax, 0x02) \
__asm shl ebx,8 \
__asm or  ebx, ecx \
__asm M_PINSRW(mm2_bx, 0x03) \
__asm movq [edi+8], mm2  /* save 8 bytes*/ \
__asm shr ebx,16 \
__asm pxor mm2,mm2 \
__asm M_PINSRW(mm2_bx, 0x00) \
__asm M_PINSRW(mm2_dx, 0x01) \
__asm M_PEXTRW(ebx_mm1,0x03) \
__asm shl ebx,8 \
__asm or  ebx, ecx \
__asm M_PINSRW(mm2_bx, 0x02) \
__asm shr ebx,16 \
__asm mov bh, cDashOrSpace \
__asm M_PINSRW(mm2_bx, 0x03) \
__asm movq [edi+16], mm2 /* save 8 bytes*/ \
}

.... then to output the hexdigit strings...

mov esi, psz
mov edi, pszOut
movq mm0,[esi]    // test: get 8 bytes of source data
movq mm1,[esi+8]
mov  ecx,0x20000020

M_Out24( '-' )

movq mm0,[esi+16]    // get 8 bytes of source data
movq mm1,[esi+24]

M_Out24( ' ' )

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This is vastly more code than yours, so the timing test will be interesting:  Just how important is writing to QWORD-aligned boundaries?

---==-=-=-=-=-=-=-=-=-
Also, I don't think that you dare assume that the masks are in place at the start of the fn.  After all, it is quite important to use EMMS at the end of the fn (or so I've read).  I think that needing to reload the mask is just a part of doing business with MMX.

-- Dan

0

LVL 39

Expert Comment

ID: 6906846
> Also, I don't think that you dare assume that the masks are in place at the start of the fn.  After
all, it is quite important to use EMMS at the end of the fn (or so I've read).  I think that needing

Yes, I agree with that. Note, though, that EMMS does not clear the MMX registers; it only marks the x87 FPU registers as available for floating-point operations again. And I was thinking that, since the function will mostly be called repeatedly in a loop, the MMX registers will remain untouched. If floating-point operations occur in between (which is not uncommon, of course!), this becomes an issue. To resolve that, I think comparing one or two registers against a checksum should help. This would mean that using the function with a floating-point counter in a loop would drastically decrease performance, but using it with only integers would make it very fast.

Another option is to change the requirements a bit. I was thinking of a function call with an arbitrary buffer length, like this:

void HexLineOut(unsigned char * pSrc, char * pszDest, int cSize)
//cSize is length in bytes of source buffer.
//pszDest must be allocated to receive cSize * 4 + 1 bytes at least.
//This includes room for the terminating null character

That way we can optimize our memory accesses to a much higher level, I think.
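A plain-C reference version of that proposed interface might look like this (a non-optimized sketch under the stated contract: each source byte yields "XX " plus one ASCII character, i.e. four output bytes, plus the terminating null; non-printable bytes become '.', as in the MMX version):

```c
#include <assert.h>
#include <string.h>

/* Reference (non-optimized) implementation of the proposed interface.
   pszDest receives cSize*4 + 1 bytes: the "XX XX ... " hex part,
   then the ASCII rendering, then a terminating null. */
void HexLineOutRef(const unsigned char *pSrc, char *pszDest, int cSize)
{
    static const char digits[] = "0123456789ABCDEF";
    int i;
    for (i = 0; i < cSize; i++) {
        pszDest[3 * i]     = digits[pSrc[i] >> 4];
        pszDest[3 * i + 1] = digits[pSrc[i] & 0x0F];
        pszDest[3 * i + 2] = ' ';
        pszDest[3 * cSize + i] =                   /* ASCII column */
            (pSrc[i] >= 0x20 && pSrc[i] < 0x7F) ? (char)pSrc[i] : '.';
    }
    pszDest[4 * cSize] = '\0';
}
```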

Abel
0

LVL 49

Expert Comment

ID: 6907877
I used this macro to replace the M_Out24 above:

#define M_Out24_able __asm {\
__asm M_PEXTRW(eax_mm0,0x00) __asm mov [edi], ax \
__asm M_PEXTRW(eax_mm0,0x01) __asm mov [edi+3], ax \
__asm M_PEXTRW(eax_mm0,0x02) __asm mov [edi+6], ax \
__asm M_PEXTRW(eax_mm0,0x03) __asm mov [edi+9], ax \
__asm M_PEXTRW(eax_mm1,0x00) __asm mov [edi+12], ax \
__asm M_PEXTRW(eax_mm1,0x01) __asm mov [edi+15], ax \
__asm M_PEXTRW(eax_mm1,0x02) __asm mov [edi+18], ax \
__asm M_PEXTRW(eax_mm1,0x03) __asm mov [edi+21], ax \
}

Then I timed some variations:

43,600  abel's version of hex-digit output
42,900  DanRollins' hex-digit output (no space prefill needed)

86,200  no hexdigit output (and no prefill)
63,600  none (using prefill 64 -- see below)
59,500  none (using prefill Stosd -- see below)

First, abel's code is clearly emerging as the fastest.  The (clever) use of MMX regs and opcodes is the key.  I've been playing with an MMX version that uses lookups, but MMX makes it possible to CALCULATE the values faster than they can be looked up.  I would NEVER have predicted that.  Of course, in retrospect, that's why Intel put them there...

Also notice that the hexdigit output consumes HALF of the overall time (in my tests, I collected all of the data and prepared to do the output, then skipped it).  That's a direct pointer to the best place to look for further optimization.

Next, abel's unaligned-write technique for outputting the hex digits is (on my PC at least) faster, but not tremendously so.  Consider that the CPU needs to execute only 59 opcode bytes for abel's version, while Dan's version is 127 bytes long.  Now, if memory access is the bottleneck and remains similar across many PCs, then the 127-byte version should be quite a bit faster on fast PCs.  I wouldn't be surprised to see my variation run faster, even on an 800 MHz CPU, let alone a 1.6 GHz one.

One drawback of abel's code is that it requires the buffer to be prefilled with spaces.  So what's the fastest way to do that?  I found that REP STOSD is significantly slower than a specific sequence of MOVQ writes.  Here are the two variations I tested:

#define M_PreFill64 __asm {\
__asm movq mm0,     qword ptr [spcs] \
__asm movq [edi],   mm0 \
__asm movq [edi+8], mm0 \
__asm movq [edi+16],mm0 \
__asm movq [edi+24],mm0 \
__asm movq [edi+32],mm0 \
__asm movq [edi+40],mm0 \
}

#define M_PreFillStosd __asm {\
__asm mov ebx, edi \
__asm mov ecx, 16 \
__asm mov eax, 0x20202020 \
__asm rep stosd \
__asm mov edi,ebx \
}

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
abel,

That is the type of 'outside the box' thinking that ends up producing a highly optimized overall system.  And were I to implement this in a working program, I'd do exactly as you describe.  However, my feeling is that these suggestions are not part of the search for the fastest implementation of the fastest algorithm for a narrowly defined task.  Differing opinions are welcome...

-- Dan
0

LVL 49

Expert Comment

ID: 6907890
Where to go from here?

*  EMMX, anyone?

*  I wonder if a prefetch could give new life to the lookup algorithm?

*  How much more performance can be squeezed out by shuffling the opcodes around to avoid pipeline stalls?  So far, all of us except AvonWyss have tried to keep the code order natural (well, the ASM of Pavlik's optimized C was kinda scrambled by the compiler, presumably on purpose).  But 'interleaving' some operations can result in two opcodes being executed simultaneously by eliminating dependencies.

At first look, abel's code needs to run in that sequence.  But what if we could interleave portions of the lo-nibble code with the hi-nibble code, or something?

I know that my hexdigit code has some potential in this regard.  For instance, I think that replacing:

pextrw eax,mm0,0x00
pextrw ebx,mm0,0x01
shl ebx,8
or  ebx, ecx

with

pextrw ebx,mm0,0x01
pextrw eax,mm0,0x00  // interleave the EAX access
shl ebx,8
or  ebx, ecx

might save me one CPU cycle.  And the hex-digit output routine

pextrw eax,mm0,0x00
mov [edi], ax
pextrw eax,mm0,0x01
mov [edi+3], ax

(etc) might run faster as:

pextrw eax,mm0,0x00
pextrw ebx,mm0,0x01
mov [edi], ax
mov [edi+3], bx

There are some complex rules regarding which combinations of opcodes can be paired, but simply avoiding storing a register directly after loading it should save one stall.

If anyone is getting bored, let me know.  I'll distribute some points....

-- Dan
0

LVL 14

Author Comment

ID: 6907921
Dan, have you had a chance to look into the document at the link I posted for you? You'll learn a lot of interesting things, including when stalls are to be expected.

For instance, one such thing is the register renaming of newer CPUs. There are about 40 temporary registers in the CPU, which are used for calculations instead of the "normal" (permanent) registers. Whenever you load or write a value from/to a register, one of these temporary registers gets assigned for that operation. Writing to a register always allocates a new temporary register for the result (thereby avoiding any dependency problems with renamed registers). Because of that, writing to a register avoids a stall on a later read, since the value already has a temporary register assigned and doesn't have to be loaded from the permanent registers first.

There are many such complex behaviours that can greatly influence the CPU's runtime behaviour. Reading that document may really be interesting for those of you who don't know all the CPU details yet (just like me). I strongly recommend it!

Dan, shall we open a third thread? This one is also pretty long already.
0

LVL 39

Expert Comment

ID: 6908017
I'd like to comment, but I'll have to wait till Tuesday. One thought, though: what about loading the masks from immediate values into eax and from eax into the MMX registers? Although that requires more opcodes, it might run faster.

> If anyone is getting bored
nop. It's only getting more and more interesting ;)

Till tuesday!
0

LVL 49

Expert Comment

ID: 6909239
I have opened up a new thread here:
http://www.experts-exchange.com/jsp/qManageQuestion.jsp?ta=assembly&qid=20283475
or perhaps here:
http:Q.20283475.html (I'm not sure these abbreviations always work)

-- Dan
0

LVL 14

Author Comment

ID: 6938903
Dan, since you actively participated in finding solutions, I feel that you also deserve some points. And since I have thousands of points to spare, I'll do just that and award some to you. Thank you!
0

LVL 49

Expert Comment

ID: 6939064
u r a gentleman and a scholar.  Thanks!
-- Dan
0
