Solved

Float multiplication - difference between assembly code with and without optimizer

Posted on 2004-03-29
26
1,087 Views
Last Modified: 2008-03-10
I've recently encountered the following problem. I have a program written in C using Visual Studio 6.0. When built in the Release configuration without optimization, the results of float multiplication are slightly different from those in the Release configuration with Maximize Speed.

I tried to look at assembly code, but was unable to find a difference. The problem is that when the program is compiled with optimization it is impossible to debug it, otherwise I would've looked at the registers.

So: is it true that Max Speed Optimizer affects the assembly code? If so, can I turn this specific feature off?

If it would be any help, I can post the relevant pieces of source/assembly code on request.

P.S. This is a copy of the same question I posted in other sections. I'm aware of that, so please don't make special comments about it.
0
Comment
Question by:Lescha
26 Comments
 
LVL 12

Expert Comment

by:stefan73
ID: 10711465
Hi Lescha,
> is it true that Max Speed Optimizer affects the assembly code?
Of course! Otherwise there would be no improvement.
But the optimizer should still create FP code that fully complies with IEEE-754. Some compilers have options which explicitly create non-compliant code (such as Sun cc with -fast), but the documentation should say so.

Cheers,
Stefan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711490
Okay, okay, I see how my formulation of the question was misleading.
Let me rephrase:

Is it true that Max Speed Optimizer affects the assembly code which concerns arithmetic operations, and multiplication in particular? If so, can I turn this specific feature off?

0
 
LVL 12

Expert Comment

by:stefan73
ID: 10711795
Lescha,
Before I say "of course" again, perhaps let me re-phrase your question - I think I know what you're aiming at:

Is it true that Max Speed Optimizer affects the behavior of arithmetic operations, so that the result can differ from non-optimized code?

If that's what you mean: The behavior of floating-point operations is defined in the IEEE-754 standard (read more at http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html).

This standard regulates how floating-point operations are handled. A typical example of a non-compliant optimization starts from code like this:

double x=12345.56789;
for(i=0;i<99;i++)
    array[i] /= x;

Since multiplication is usually faster than division, the optimizer replaces the division by x with a multiplication by its reciprocal x1:

double x=12345.56789;
double x1=1.0/x;
for(i=0;i<99;i++)
    array[i] *= x1;
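
The two loops are algebraically equivalent, but not bit-for-bit identical: 1.0/x is itself rounded, so a/x and a*x1 can disagree in the last bit. A small standalone check (the values are arbitrary and only for illustration; the exact count depends on the platform):

#include <stdio.h>

int main(void)
{
    double x  = 12345.56789;
    double x1 = 1.0 / x;          /* the reciprocal is itself rounded once */
    int diffs = 0;
    int i;

    for (i = 1; i <= 1000; i++) {
        double a = (double)i * 0.1;
        if (a / x != a * x1)      /* true quotient vs. reciprocal-multiply */
            diffs++;
    }
    printf("%d of 1000 quotients differ in the last bit(s)\n", diffs);
    return 0;
}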

The MS documentation MUST mention non-compliant optimizations. Here's an example from Sun's cc man page:
          -fsimple=0
          Permits no simplifying assumptions. Preserves strict
          IEEE 754 conformance.

          -fsimple=1
          Allows conservative simplifications. The resulting code
          does not strictly conform to IEEE 754, but numeric
          results of most programs are unchanged.

          With -fsimple=1, the optimizer can assume the following:
          o  The IEEE 754 default rounding/trapping modes do not
             change after process initialization.
          o  Computations producing no visible result other than
             potential floating-point exceptions may be deleted.
          o  Computations with Infinity or NaNs as operands need
             not propagate NaNs to their results. For example, x*0
             may be replaced by 0.
          o  Computations do not depend on sign of zero.

          With -fsimple=1, the optimizer is not allowed to
          optimize completely without regard to roundoff or
          exceptions. In particular, a floating-point computation
          cannot be replaced by one that produces different
          results with rounding modes held constant at run time.

          -fsimple=2
          Permits aggressive floating point optimizations that
          may cause many programs to produce different numeric
          results due to changes in rounding. For example,
          -fsimple=2 permits the optimizer to attempt replacing
          computations of x/y in a given loop where y and z are
          known to have constant values, with x*z, where z=1/y is
          computed once and saved in a temporary, thereby
          eliminating costly divide operations.

          Even with -fsimple=2, the optimizer still is not
          permitted to introduce a floating point exception in a
          program that otherwise produces none.

This very clearly defines the boundaries of optimizer behaviour.
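
To make the "x*0 may be replaced by 0" item concrete: under strict IEEE 754 that replacement is not valid for all inputs, which is why it is only allowed at -fsimple=1 and above. A minimal sketch (it assumes HUGE_VAL is +infinity, as it is on IEEE-754 systems):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = HUGE_VAL;            /* +infinity on IEEE-754 systems */
    double y = x * 0.0;             /* IEEE 754: Inf * 0 is NaN, not 0 */

    printf("Inf * 0.0 = %f\n", y);  /* prints nan (older MS runtimes show -1.#IND) */
    return 0;
}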


Stefan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711872
Yeah, okay, so I guess what I am actually asking is this: how can I retain *most* of the Maximize Speed options, but bar the optimizer from touching the arithmetic?
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711907
I think I'll just post both assembly codes here for you.
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711909
WITHOUT OPTIMIZER

; 346  :                               CurInd.Range = (CurValue.Range - AmbResInput->StripData[NStrip].MinRange)*AmbResInput->MapGrid.InvStep.Range;

      mov      ecx, DWORD PTR ?NStrip@@3KA            ; NStrip
      imul      ecx, 12                              ; 0000000cH
      mov      edx, DWORD PTR _AmbResInput$[ebp]
      fld      DWORD PTR _CurValue$[ebp]
      fsub      DWORD PTR [edx+ecx+3145996]
      mov      eax, DWORD PTR _AmbResInput$[ebp]
      fmul      DWORD PTR [eax+3145984]
      fstp      DWORD PTR _CurInd$[ebp]

; 347  :                               // Calculate the cell index

; 348  :                               IndR = (DWORD)CurInd.Range;

      fld      DWORD PTR _CurInd$[ebp]
      call      __ftol
      mov      DWORD PTR _IndR$[ebp], eax
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711913
WITH OPTIMIZER

; 346  :                               CurInd.Range = (CurValue.Range - AmbResInput->StripData[NStrip].MinRange)*AmbResInput->MapGrid.InvStep.Range;

      fld      DWORD PTR _CurValue$[esp+3192]
      fsub      DWORD PTR [esi+3145996]
      fld      ST(0)
      fmul      DWORD PTR [edi+3145984]

; 347  :                               // Calculate the cell index
; 348  :                               IndD = (DWORD)CurInd.Doppler;

      fld      DWORD PTR _CurInd$[esp+3196]
      call      __ftol

; 349  :                               IndR = (DWORD)CurInd.Range;

      fld      ST(0)
      mov      edi, eax
      call      __ftol

0
 
LVL 1

Author Comment

by:Lescha
ID: 10711915
Do you see any significant difference?
Or do you need more data?
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10711919
Lescha,
Optimizing is OK, as long as you get the same results. Are yours different?

Stefan
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 10711937
And what does it look like *with* optimization?    And... why does it matter?  What is it about the optimized code that offends you?

-- Dan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10711955
Yes! That's what my question is about! The "random fluctuations" beyond the decimal places of real significance are different!

For example, I can get 25611.2345 without the optimizer and 25611.2367 with the optimizer. This would not matter much, but, of course, sometimes it is 123456.9999 in one case and 123457.0001 in the other, and this, when floored to an integer, gives a different result.

So, again: why are the arithmetical ops different with and without the optimizer?
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10711972
AFAIK, that looks OK. Keep in mind that Intel FPUs use a register stack, so instructions like fld ST(0) are faster than accessing memory through a pointer. The relevant instructions are fsub, fmul and the __ftol call; fld and fstp are just loads and stores.
0
 
LVL 12

Assisted Solution

by:stefan73
stefan73 earned 30 total points
ID: 10711989
Lescha,
> The "random fluctuation" beyond the decimal places of real significance
Floating-point arithmetic is not an "exact" science... If you need tighter control over precision, use a third-party numeric library such as GSL:

http://www.gnu.org/directory/science/math/GNUsl.html

Stefan
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10712028
Here is an article about floating point optimizations with VC++:

http://www.microsoft.com/indonesia/msdn/floapoint.asp
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10712035
Extract: The compiler needs to be called with

cl -fp:precise source.cpp

Check whether Optimize for Speed uses -fp:fast.
0
 
LVL 30

Expert Comment

by:Zoppo
ID: 10712628
Maybe another option would be to use 'double' instead of 'float' in general ... IMO the problems
you see come from the fact that values are taken from memory (as float) in the unoptimized code,
while they may be taken from the FPU's stack (as double or better) in the optimized code ... I think
the results won't differ so much if the values in memory are doubles as well.

ZOPPO
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10712845
Zoppo,
> use 'double' instead of 'float'
Good point. But I think the decision here went against doubles to save space. The ASM code above shows that there are pretty big objects on the stack already, so using doubles might not be an option. Or is it, Lescha?

You'd get less noise with doubles.

BTW: Have a look at this nice page here:
http://babbage.cs.qc.edu/courses/cs341/IEEE-754.html

It shows you the exact binary layout of any double or float you enter.

Stefan
0
 
LVL 30

Expert Comment

by:Zoppo
ID: 10712873
hm ... yes, 'space' is one argument ... but dealing with numbers like '25611.2345' will lead to problems anyway,
given float's precision of at most about 7 significant digits.
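
A quick way to see this (just an illustrative snippet; the printed value is whatever the nearest representable float happens to be on an IEEE-754 system):

#include <stdio.h>

int main(void)
{
    float  f = 25611.2345f;   /* only about 7 significant decimal digits survive */
    double d = 25611.2345;

    printf("float : %.10f\n", f);   /* prints the nearest float, not ...2345 exactly */
    printf("double: %.10f\n", d);
    return 0;
}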

ZOPPO
0
 
LVL 22

Assisted Solution

by:grg99
grg99 earned 40 total points
ID: 10712992
Here's what may be going on:

The floating point hardware uses 80-bit numbers internally in its calculations and in its temporary variables in its floating point "stack".

BUT: Standard float variables are either 32 or 64 bits long. Code that stores intermediate results into such variables loses anywhere from 16 to 48 bits of precision at each store.

The code that keeps the intermediate results on the floating-point register "stack" will maintain the full 80 bits of precision.
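
A way to see this effect in isolation (a rough sketch, not the original program; the 'volatile' forces the intermediate out to a 32-bit float, mimicking the unoptimized fstp/fld sequence, while the other cast may be fed straight from the 80-bit register):

#include <stdio.h>

int main(void)
{
    /* Arbitrary illustrative values, not taken from the original program. */
    float curValue = 123456.98f;
    float minRange = 0.0105f;
    float invStep  = 1.0003f;

    /* Unoptimized path: result stored to a 32-bit float, reloaded, converted. */
    volatile float stored = (curValue - minRange) * invStep;
    unsigned long viaMemory = (unsigned long)stored;

    /* Optimized path: the conversion may be fed straight from the FPU register,
       which can still hold the value at extended precision. */
    unsigned long viaRegister =
        (unsigned long)((curValue - minRange) * invStep);

    /* Whether the two agree depends on compiler, flags and FPU mode --
       which is exactly the discrepancy discussed in this thread. */
    printf("via memory  : %lu\n", viaMemory);
    printf("via register: %lu\n", viaRegister);
    return 0;
}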


But let's step back a bit -- there are very very very few programs that need more than 32 bits of precision, and almost none that need 64 bits.

What are you doing, and do you really need all that precision? There are darned few physical quantities that are known to be that precise. If you're bothered by the truncation error, use rounding instead; it's going to be closer.

 
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10713146
grg99,
> there are very very very few programs that need more than 32-bits of precision

For a single value, that's true in most cases. But think about error propagation...

Stefan
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 10713335
Stefan, with respect to http://www.microsoft.com/indonesia/msdn/floapoint.asp

> Beginning with version 8.0 (Visual C++® "Whidbey"),

That's part of the yet-to-be-released Visual Studio 2005 - see the road map at: http://msdn.microsoft.com/vstudio/productinfo/roadmap.aspx
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 10713363
Odd that the version displayed by cl.exe in .NET 2003 is...

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

...but we all know it as 7.1.
0
 
LVL 12

Expert Comment

by:stefan73
ID: 10714199
rstaveley,
> version 8.0
Ouch, you're right!

Hmm, maybe the option also works in earlier versions.

Regarding the compiler version: That's probably just the compiler itself, not the Studio.

Stefan
0
 
LVL 49

Assisted Solution

by:DanRollins
DanRollins earned 180 total points
ID: 10716497
I think grg99 hit the nail on the head... in the non-optimized version, an intermediate value is saved to memory and then reloaded.  But in the optimized one, it looks like it remains an 80-bit internal FPU register value.

There are two approaches:

1) Use a #pragma to turn off optimization just above the function and turn it back on afterwards.

2) Add code that compensates for "random-seeming" tiny errors in the floating-point calculations.

#1 is a kludge, but it actually answers your question.

#2 is the correct way to fix this problem.  It indicates that you understand that floating-point calculations are prone to rounding errors (we all learned about this in DP 101) and so your code compensates... regardless of the hardware or platform or optimization level...

    double d = CurValue.Range - AmbResInput->StripData[NStrip].MinRange;
    d *= AmbResInput->MapGrid.InvStep.Range;
    d += 0.5;        // bias so the truncating cast rounds to nearest (for d >= 0)
    IndR = (DWORD)d;

-- Dan
0
 
LVL 1

Author Comment

by:Lescha
ID: 10716619
Wow! That's a hell of a lot of comments! I just increased the points, otherwise I wouldn't have enough to split between all of you guys.

Now, to answer your questions:

1a) I switched to using floats because it saves both space and time. Space is obvious; time because on a 32-bit machine the same calculations done in floats are much faster than in doubles. That's also the reason why, for instance, I use DWORD or long where a byte or a short might have sufficed.
1b) For me, time is the decisive factor here: I'm talking about a pretty heavy algorithm, and I managed to get it down to about 13 ms even without the optimization. I cannot go to double; it will (he-he) almost double the time. And, for the same reason, I cannot add to-double and from-double conversion lines.

2) I don't think I can replace truncation with rounding. That would solve the problem, of course, but, unfortunately, it would give rise to other edge-effect problems.

3) Dan, what pragma would that be? Can you spell it out for me? Thanks!

4) Can I use a non-optimized DLL with an optimized EXE? Won't it create problems during the link stage?
0
 
LVL 49

Accepted Solution

by:DanRollins
DanRollins earned 180 total points
ID: 10716983
3) See http://msdn.microsoft.com/library/en-us/vccore98/html/_predir_optimize.asp
I suggest trying:

    #pragma optimize( "p", on )

above the procedure. You could also change the settings for a particular module using the GUI.
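
For instance (a minimal sketch; the function and its parameters are made up for illustration, only the pragmas themselves come from the MSDN page above):

/* Improve floating-point consistency for just this one function. */
#pragma optimize( "p", on )
float ComputeIndex(float value, float minRange, float invStep)
{
    return (value - minRange) * invStep;
}
/* An empty string resets optimization to the command-line (/O) settings
   for the rest of the module. */
#pragma optimize( "", on )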

4) Optimization settings will not affect external access such as exported DLL functions.

-- Dan
0
