Solved

Comparing dwords, words, and bytes

Posted on 2004-04-23
Last Modified: 2012-05-04
I've seen that replacing cmp ax, dx with cmp eax, edx can give a huge speed improvement (I measured 40% faster in one of my algorithms).

Of course you need to zero the unused bytes of the larger register before moving anything into them for comparison.

What about cmp al, ah? Would it be better to switch to using the full registers? What if, as in this case, there's no free register available? I can save one register to the stack while I make my comparisons (anywhere from 1 to n, where the average might be about 10). Would this still be worthwhile?

I have some benchmarking code from someone on this board, but I couldn't figure out how to get it to work:

mov eax, 0
cpuid   ; to serialize the instructions
rdtsc
mov [timeLo], eax
mov [timeHi], edx

...  ; your code

mov eax, 0
cpuid   ; to serialize the instructions
rdtsc
sub [timeLo], eax
sbc [timeHi], edx

I replaced [timeLo] and [timeHi] with other registers, but the compiler doesn't like the last line of code, the one with sbc.

Thanks,
-Sandra
Question by:Sandra-24

Author Comment

by:Sandra-24
ID: 10906000
What about for add/sub/inc/dec ops?

Accepted Solution

by: grg99 (earned 250 total points)
ID: 10907110
The main thing to remember is:  in 32-bit mode, ANY reference to a 16-bit quantity is going to cost you.  Any 16-bit operation is flagged by an extra prefix byte (0x66).  This prefix byte has serious repercussions:

(1)  It's an extra op-code byte, so it increases the instruction length.

(2)  It prevents the instruction from "pairing" and running concurrently with another instruction (on most Pentiums).

So that can be up to a 50% penalty.
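
To make the prefix concrete, here is a rough sketch of how the three compares from the question can be encoded in a 32-bit code segment. The byte values follow the standard opcode tables (39 /r for cmp r/m32, r32 and 38 /r for cmp r/m8, r8). An assembler is free to pick the equivalent 3B /r or 3A /r forms, so the exact bytes in your listing may differ, but the instruction lengths should not.

```asm
cmp eax, edx    ; 39 D0       2 bytes, no prefix
cmp ax, dx      ; 66 39 D0    3 bytes: 0x66 operand-size prefix + the same opcode
cmp al, ah      ; 38 E0       2 bytes: byte compares have their own opcode
```

The listing only shows code size. How much the prefix or the byte registers cost in cycles still depends on the CPU model and has to be measured.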

So you're correct, use 32-bit instructions in 32-bit code, 16 in 16-bit code, as much as possible.  

*BUT* there's a whole other set of rules regarding byte-sized registers. Accessing these doesn't require a prefix byte, but that doesn't mean it's cheap either. It varies with CPU model, but at least on the old Pentiums, accessing a byte part of a register can cause all kinds of strange delays. For example, there is some bizarre rule that accessing a byte register stalls some CPU actions up to two cycles away!

So there too I'd stay away from accessing byte registers. But depending on the frequency of access, it may not be worthwhile wasting time clearing or sign-extending bytes to words or dwords. Each case is different, and it's also different across CPU models, so you'll just have to time the code and see.

I don't see anything obviously wrong with the timing code. Perhaps you could give more info?


Assisted Solution

by: dimitry (earned 250 total points)
ID: 10908576
1) It should be sbb (sub with borrow)...
--------------------------------------------------------------------------------
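mov eax, 0
cpuid   ; serialize the instructions (clobbers eax, ebx, ecx, edx)
rdtsc   ; edx:eax = starting time-stamp count
mov [timeLo], eax
mov [timeHi], edx

...  ; your code

mov eax, 0
cpuid   ; serialize again before the second reading
rdtsc   ; edx:eax = ending time-stamp count
sub [timeLo], eax   ; low dword: start - end, sets the carry flag on a borrow
sbb [timeHi], edx   ; high dword minus the borrow; the result is the negated elapsed count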
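
Note that the listing above subtracts the end count from the start count, so [timeHi]:[timeLo] ends up holding the negated elapsed time. A minimal variant that leaves a positive cycle count in edx:eax might look like the sketch below. It assumes timeLo and timeHi are dword variables, as in the question, and it makes no attempt to subtract the overhead of cpuid and rdtsc themselves.

```asm
xor eax, eax
cpuid               ; serialize; clobbers eax, ebx, ecx, edx
rdtsc               ; edx:eax = start count
mov [timeLo], eax
mov [timeHi], edx

; ... code under test ...

xor eax, eax
cpuid               ; serialize again before the second reading
rdtsc               ; edx:eax = end count
sub eax, [timeLo]   ; low dword: end - start, sets the carry flag on a borrow
sbb edx, [timeHi]   ; high dword: end - start - borrow
; edx:eax = elapsed cycles, including the cpuid/rdtsc overhead
```

Timing an empty block the same way gives a rough figure for that overhead, which can then be subtracted from real measurements.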

2) Rick Booth, in his book "Inner Loops", recommends the following, for example:
  Replace
    movzx eax, bl
  with
    xor eax, eax
    mov al, bl
So I agree 100% with grg99: try to use 32-bit instructions with 32-bit registers
and 16-bit instructions with 16-bit registers, and don't mix them.

Author Comment

by:Sandra-24
ID: 10910715
Interesting. So using byte ops is iffy, and should be measured in each scenario where it matters. I never would have guessed that movzx is inferior to the xor/mov combo; I've used movzx in a few inner loops that I could change.

Thanks also for fixing that benchmark code.

-Sandra