Solved

Float multiplication - difference between assembly code with and without optimizer

Posted on 2004-03-29
26
1,087 Views
Last Modified: 2008-03-10
I've recently encountered the following problem. I have a program written in C using Visual Studio 6.0. When built in the Release configuration without optimization, the results of float multiplication are slightly different from those in the Release configuration with Maximize Speed.

I tried to look at assembly code, but was unable to find a difference. The problem is that when the program is compiled with optimization it is impossible to debug it, otherwise I would've looked at the registers.

So: is it true that Max Speed Optimizer affects the assembly code? If so, can I turn this specific feature off?

If it would be any help, I can post the relevant pieces of source/assembly code on request.

P.S. This is a copy of the same question I posted in other sections. I'm aware of that, so please don't make special comments about it.
0
Comment
Question by:Lescha
26 Comments
 
LVL 12

Expert Comment

by:stefan73
ID: 10711465
Hi Lescha,
> is it true that Max Speed Optimizer affects the assembly code?
Of course! Otherwise there would be no improvement.
But the optimizer should still create FP code that fully complies with IEEE-754. Some compilers have options which explicitly create non-compliant code (such as Sun cc with -fast), but the documentation should say so.

Cheers,
Stefan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711490
Okay, okay, I see how my formulation of the question was misleading.
Let me rephrase:

Is it true that Max Speed Optimizer affects the assembly code which concerns arithmetic operations, and multiplication in particular? If so, can I turn this specific feature off?

0
 
LVL 12

Expert Comment

by:stefan73
ID: 10711795
Lescha,
Before I say "of course" again, perhaps let me re-phrase your question - I think I know what you're aiming at:

Is it true that Max Speed Optimizer affects the behavior of arithmetic operations, so that the result can differ from non-optimized code?

If that's what you mean: The behavior of floating-point operations is defined in the IEEE-754 standard (read more at http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html).

This standard regulates how floating-point operations are handled. A typical example of a non-compliant optimization starts from code like this:

double x=12345.56789;
for(i=0;i<99;i++)
    array[i] /= x;

Since multiplication is usually faster than division, the optimizer replaces the division by x with a multiplication by its reciprocal x1:

double x=12345.56789;
double x1=1.0/x;
for(i=0;i<99;i++)
    array[i] *= x1;
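
The two loops are algebraically equivalent, but not bit-for-bit identical: 1.0/x is itself rounded, so a/x and a*x1 can disagree in the last bit. A small standalone check (the values are arbitrary and only for illustration; the exact count depends on the platform):

#include <stdio.h>

int main(void)
{
    double x  = 12345.56789;
    double x1 = 1.0 / x;          /* the reciprocal is itself rounded once */
    int diffs = 0;
    int i;

    for (i = 1; i <= 1000; i++) {
        double a = (double)i * 0.1;
        if (a / x != a * x1)      /* true quotient vs. reciprocal-multiply */
            diffs++;
    }
    printf("%d of 1000 quotients differ in the last bit(s)\n", diffs);
    return 0;
}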

The MS documentation MUST mention non-compliant optimizations. Here's an example from Sun's cc man page:
          -fsimple=0
          Permits no simplifying assumptions. Preserves strict
          IEEE 754 conformance.

          -fsimple=1
          Allows conservative simplifications. The resulting code
          does not strictly conform to IEEE 754, but numeric
          results of most programs are unchanged.

          With -fsimple=1, the optimizer can assume the following:
          o  The IEEE 754 default rounding/trapping modes do not
             change after process initialization.
          o  Computations producing no visible result other than
             potential floating-point exceptions may be deleted.
          o  Computations with Infinity or NaNs as operands need
             not propagate NaNs to their results. For example, x*0
             may be replaced by 0.
          o  Computations do not depend on sign of zero.

          With -fsimple=1, the optimizer is not allowed to
          optimize completely without regard to roundoff or
          exceptions. In particular, a floating-point computation
          cannot be replaced by one that produces different
          results with rounding modes held constant at run time.

          -fsimple=2
          Permits aggressive floating point optimizations that
          may cause many programs to produce different numeric
          results due to changes in rounding. For example,
          -fsimple=2 permits the optimizer to attempt replacing
          computations of x/y in a given loop where y and z are
          known to have constant values, with x*z, where z=1/y is
          computed once and saved in a temporary, thereby
          eliminating costly divide operations.

          Even with -fsimple=2, the optimizer still is not
          permitted to introduce a floating point exception in a
          program that otherwise produces none.

This very clearly defines the boundaries of optimizer behaviour.
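
To make the "x*0 may be replaced by 0" item concrete: under strict IEEE 754 that replacement is not valid for all inputs, which is why it is only allowed at -fsimple=1 and above. A minimal sketch (it assumes HUGE_VAL is +infinity, as it is on IEEE-754 systems):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = HUGE_VAL;            /* +infinity on IEEE-754 systems */
    double y = x * 0.0;             /* IEEE 754: Inf * 0 is NaN, not 0 */

    printf("Inf * 0.0 = %f\n", y);  /* prints nan (older MS runtimes show -1.#IND) */
    return 0;
}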


Stefan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711872
Yeah, okay, so I guess what I am actually asking is this: how can I retain *most* of the Maximize Speed options, but bar the optimizer from touching the arithmetic?
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711907
I think I'll just post both assembly codes here for you.
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711909
WITHOUT OPTIMIZER

; 346  :                               CurInd.Range = (CurValue.Range - AmbResInput->StripData[NStrip].MinRange)*AmbResInput->MapGrid.InvStep.Range;

      mov      ecx, DWORD PTR ?NStrip@@3KA            ; NStrip
      imul      ecx, 12                              ; 0000000cH
      mov      edx, DWORD PTR _AmbResInput$[ebp]
      fld      DWORD PTR _CurValue$[ebp]
      fsub      DWORD PTR [edx+ecx+3145996]
      mov      eax, DWORD PTR _AmbResInput$[ebp]
      fmul      DWORD PTR [eax+3145984]
      fstp      DWORD PTR _CurInd$[ebp]

; 347  :                               // Calculate the cell index

; 348  :                               IndR = (DWORD)CurInd.Range;

      fld      DWORD PTR _CurInd$[ebp]
      call      __ftol
      mov      DWORD PTR _IndR$[ebp], eax
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711913
WITH OPTIMIZER

; 346  :                               CurInd.Range = (CurValue.Range - AmbResInput->StripData[NStrip].MinRange)*AmbResInput->MapGrid.InvStep.Range;

      fld      DWORD PTR _CurValue$[esp+3192]
      fsub      DWORD PTR [esi+3145996]
      fld      ST(0)
      fmul      DWORD PTR [edi+3145984]

; 347  :                               // Calculate the cell index
; 348  :                               IndD = (DWORD)CurInd.Doppler;

      fld      DWORD PTR _CurInd$[esp+3196]
      call      __ftol

; 349  :                               IndR = (DWORD)CurInd.Range;

      fld      ST(0)
      mov      edi, eax
      call      __ftol

0
 
LVL 1

Author Comment

by:Lescha
ID: 10711915
Do you see any significant difference?
Or do you need more data?
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10711919
Lescha,
Optimizing is OK, as long as you get the same results. Are yours different?

Stefan
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 10711937
And what does it look like *with* optimization?    And... why does it matter?  What is it about the optimized code that offends you?

-- Dan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711955
Yes! That's what my question is about! The "random fluctuations" beyond the decimal places of real significance are different!

For example, I can get 25611.2345 without the optimizer and 25611.2367 with the optimizer. This would not matter much, but, of course, sometimes it is 123456.9999 in one case and 123457.0001 in the other, and this, when floored to an integer, gives a different result.

So, again: why are the arithmetical ops different with and without the optimizer?
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10711972
AFAIK, that looks OK. Keep in mind that Intel FPUs use a register stack, so instructions like fld ST(0) are faster than accessing memory through a pointer. The relevant instructions are fsub, fmul and the __ftol call; fld and fstp are just loads and stores.
0
 
LVL 12

Assisted Solution

by:stefan73
stefan73 earned 30 total points
ID: 10711989
Lescha,
> The "random fluctuation" beyond the decimal places of real significance
Floating-point arithmetic is not an "exact" science... If you need tighter control over precision, use a third-party numeric library such as GSL:

http://www.gnu.org/directory/science/math/GNUsl.html

Stefan
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10712028
Here is an article about floating point optimizations with VC++:

http://www.microsoft.com/indonesia/msdn/floapoint.asp
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10712035
Extract: The compiler needs to be called with

cl -fp:precise source.cpp

Check whether Optimize for Speed uses -fp:fast.
0
 
LVL 30

Expert Comment

by:Zoppo
ID: 10712628
Maybe another option would be to use 'double' instead of 'float' in general ... IMO the problems
you see come from the fact that values are taken from memory (as float) in the unoptimized code,
while they may be taken from the FPU's stack (as double or better) in the optimized code ... I think
the results won't differ so much if the values in memory are doubles as well.

ZOPPO
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10712845
Zoppo,
> use 'double' instead of 'float'
Good point. But I think the decision here went against doubles to save space. The ASM code above shows that there are pretty big objects on the stack already, so using doubles might not be an option. Or is it, Lescha?

You'd get less noise with doubles.

BTW: Have a look at this nice page here:
http://babbage.cs.qc.edu/courses/cs341/IEEE-754.html

It shows you the exact binary layout of any double or float you enter.

Stefan
0
 
LVL 30

Expert Comment

by:Zoppo
ID: 10712873
hm ... yes, 'space' is one argument ... but dealing with numbers like '25611.2345' will lead to problems anyway,
given float's precision of at most about 7 significant digits.
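
A quick way to see this (just an illustrative snippet; the printed value is whatever the nearest representable float happens to be on an IEEE-754 system):

#include <stdio.h>

int main(void)
{
    float  f = 25611.2345f;   /* only about 7 significant decimal digits survive */
    double d = 25611.2345;

    printf("float : %.10f\n", f);   /* prints the nearest float, not ...2345 exactly */
    printf("double: %.10f\n", d);
    return 0;
}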

ZOPPO
0
 
LVL 22

Assisted Solution

by:grg99
grg99 earned 40 total points
ID: 10712992
Here's what may be going on:

The floating point hardware uses 80-bit numbers internally in its calculations and in its temporary variables in its floating point "stack".

BUT: Standard float variables are either 32 or 64 bits long. Code that stores intermediate results into such variables loses anywhere from 16 to 48 bits of precision at each store.

The code that keeps the intermediate results on the floating-point register "stack" will maintain the full 80 bits of precision.
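
A way to see this effect in isolation (a rough sketch, not the original program; the 'volatile' forces the intermediate out to a 32-bit float, mimicking the unoptimized fstp/fld sequence, while the other cast may be fed straight from the 80-bit register):

#include <stdio.h>

int main(void)
{
    /* Arbitrary illustrative values, not taken from the original program. */
    float curValue = 123456.98f;
    float minRange = 0.0105f;
    float invStep  = 1.0003f;

    /* Unoptimized path: result stored to a 32-bit float, reloaded, converted. */
    volatile float stored = (curValue - minRange) * invStep;
    unsigned long viaMemory = (unsigned long)stored;

    /* Optimized path: the conversion may be fed straight from the FPU register,
       which can still hold the value at extended precision. */
    unsigned long viaRegister =
        (unsigned long)((curValue - minRange) * invStep);

    /* Whether the two agree depends on compiler, flags and FPU mode --
       which is exactly the discrepancy discussed in this thread. */
    printf("via memory  : %lu\n", viaMemory);
    printf("via register: %lu\n", viaRegister);
    return 0;
}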


But let's step back a bit -- there are very very very few programs that need more than 32 bits of precision, and almost none that need 64 bits.

What are you doing, and do you really need all that precision? There are darned few physical quantities that are known to be that precise. If you're bothered by the truncation error, use rounding instead; it's going to be closer.

 
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10713146
grg99,
> there are very very very few programs that need more than 32-bits of precision

For a single value, that's true in most cases. But think about error propagation...

Stefan
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 10713335
Stefan, with respect to http://www.microsoft.com/indonesia/msdn/floapoint.asp

> Beginning with version 8.0 (Visual C++® "Whidbey"),

That's part of the yet-to-be-released Visual Studio 2005 - see the road map at: http://msdn.microsoft.com/vstudio/productinfo/roadmap.aspx
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 10713363
Odd that the version displayed by cl.exe in .NET 2003 is...

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

...but we all know it as 7.1.
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10714199
rstaveley,
> version 8.0
Ouch, you're right!

Hmm, maybe the option also works in earlier versions.

Regarding the compiler version: That's probably just the compiler itself, not the Studio.

Stefan
0
 
LVL 49

Assisted Solution

by:DanRollins
DanRollins earned 180 total points
ID: 10716497
I think grg99 hit the nail on the head... in the non-optimized version, an intermediate value is saved to memory and then reloaded.  But in the optimized one, it looks like it remains an 80-bit internal FPU register value.

There are two approaches:

1) Use a #pragma to turn off optimization just above the function and turn it back on afterwards.

2) Add code that compensates for "random-seeming" tiny errors in the floating-point calculations.

#1 is a kludge, but it actually answers your question.

#2 is the correct way to fix this problem.  It indicates that you understand that floating-point calculations are prone to rounding errors (we all learned about this in DP 101) and so your code compensates... regardless of the hardware or platform or optimization level...

    double d = CurValue.Range - AmbResInput->StripData[NStrip].MinRange;
    d *= AmbResInput->MapGrid.InvStep.Range;
    d += 0.5;        // bias so the truncating cast rounds to nearest (for d >= 0)
    IndR = (DWORD)d;

-- Dan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10716619
Wow! That's a hell of a lot of comments! I just increased the points, otherwise I wouldn't have enough to split between all of you guys.

Now, to answer your questions:

1a) I switched to using floats because it saves both space and time. Space is obvious; time because on a 32-bit machine the same calculations done in floats are much faster than in doubles. That's also the reason why, for instance, I use DWORD or long where a byte or a short might have sufficed.
1b) For me, time is the decisive factor here: I'm talking about a pretty heavy algorithm, and I managed to get it down to about 13 ms even without the optimization. I cannot go to double; it will (he-he) almost double the time. And, for the same reason, I cannot add to-double and from-double conversion lines.

2) I don't think I can replace truncation with rounding. That would solve the problem, of course, but, unfortunately, it would give rise to other edge-effect problems.

3) Dan, what pragma would that be? Can you spell it out for me? Thanks!

4) Can I use a non-optimized DLL with an optimized EXE? Won't it create problems during the link stage?
0
 
LVL 49

Accepted Solution

by:DanRollins
DanRollins earned 180 total points
ID: 10716983
3) See http://msdn.microsoft.com/library/en-us/vccore98/html/_predir_optimize.asp
I suggest trying:

    #pragma optimize( "p", on )

above the procedure. You could also change the settings for a particular module using the GUI.
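
For instance (a minimal sketch; the function and its parameters are made up for illustration, only the pragmas themselves come from the MSDN page above):

/* Improve floating-point consistency for just this one function. */
#pragma optimize( "p", on )
float ComputeIndex(float value, float minRange, float invStep)
{
    return (value - minRange) * invStep;
}
/* An empty string resets optimization to the command-line (/O) settings
   for the rest of the module. */
#pragma optimize( "", on )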

4) Optimization settings will not affect external access such as exported DLL functions.

-- Dan
0
