floating point numbers

I'd like to know what affects floating point operations. Are math operations on float values different from those on integer values? What must be changed to be able to do operations with float values?
In my program:
float f;
f = f/10;   on this line an error sometimes occurs and the program breaks, but if I write
int x;
x = x/10;   there is no problem.


You have to set f to something before you try to divide it!

Also, you will lose precision in some cases by using int.

For example:

float f=95;
f=f/10;        // result is 9.5

int x=95;
x=x/10;     //  result is 9
Also, it's a good idea to tell your compiler that 10 is a float value:


If you use the 'f' suffix you still need the decimal point in C, so write 10.f or 10.0f.

Note that this makes the constant a float, not a double; plain 10.0 is a double value.

Perhaps it is time to remove some of the mysticism around floating point and reveal a little secret to you.

Floating point numbers are really integers in disguise. The disguise is so good though that they can often pass for
real numbers.

Here's how it works:

struct double_bits {                 /* illustrative only; 'double' is a keyword */
      unsigned long long   M : 51;  /* mantissa */
      unsigned long long   X : 12;  /* exponent */
      unsigned long long   S : 1;   /* sign */
};
A few notes here. I show the layout from one specific type of hardware, but it is very common to do it this way, and I will point out variations where appropriate. I assume the struct is ordered so that the most significant bits come last: float formats are usually arranged with the sign bit as the most significant bit, then the exponent bits, and finally the mantissa bits as the least significant bits.

The sizes of these fields may vary, but 51 bits for the mantissa isn't uncommon and neither is 12 bits for the exponent. Note that the fields sum to 64 bits, since double is 64 bits on many machines.

All of these fields are unsigned fields. Some of them could have been signed but typically are not.

S is the sign bit and is usually 0 for positive numbers and 1 for negative numbers. If we introduce pow(x,n) as the function that raises x to the nth power, then we can write the sign factor as:

pow(-1,S);   if S is 0 this is 1 and if S is 1 the result is -1. We could also have written it as
1 - 2S and get the same result. Either way this produces a factor which is multiplied with the rest of the number.

In fact a floating point number can be broken down into three factors:

floatval = sign * mantissa * pow(BASE,exponent);

The sign is controlled by the S bit and is pow(-1,S) or (1 - 2S) (either expression is the same result).

The exponent is controlled by the X field but unlike X we want the exponent to be signed.

The mantissa is controlled by the M field we simply want to make a simple adjustment so that we force
the mantissa into a certain range.

The only value which isn't represented by the float bits is the BASE. This is fixed. On most machines it is 2, but you can find some hardware where the base is 16 or even 10. BASE is never changed but stays constant at all times for a given floating point type on a given hardware.

The exponent is simply an integer and it is made signed by subtracting a certain BIAS from X. This BIAS value is
also fixed and is typically chosen so that approximately half of the values are positive and half of them are
negative so BIAS is typically near half the maximum value for X which is pow(2,XW)- 1 where XW is the number
of bits in the exponent field (12 in my example).

Thus BIAS is typically pow(2,XW-1) or pow(2,XW-1) - 1. This is also chosen by the hardware manufacturer and
is fixed for the given type just as BASE is.

The mantissa is chosen so that it is typically in a range from a minimum value to a maximum value:

min <= mantissa < max

and max/min == BASE or max == min * BASE. This is because if mantissa is ever less than min you can multiply mantissa with BASE and decrease the exponent by 1 or if mantissa is above max you can divide mantissa with BASE and increase the exponent by 1. Either min or max is fixed at 1.0 and thus we want the mantissa to be

1.0 <= mantissa < BASE

or

1/BASE <= mantissa < 1.0

Which one of these we choose is again fixed by the hardware but I have seen both on various machines.

Also, since mantissa is always in the range from min to max we do not need to store the mantissa in the M bits, we only need to store mantissa - min in some way. This value is thus in the range:

0 <= mantissa - 1.0 < BASE - 1.0   or 0 <= mantissa - 1/BASE < 1.0 - 1.0/BASE

If BASE is not 2 then this isn't very useful and so you typically store mantissa without subtracting min but if BASE is exactly 2 then this is very useful since you get an extra bit which you don't have to store. If BASE is 2 the
two equations above become:

0 <= mantissa - 1.0 < 1.0 or 0 <= mantissa - 0.5 < 0.5

Either case you gain an extra bit.

Back to our not so generic example with XW = 12 and MW = 51 we simply get the mantissa to the chosen range
by dividing it by pow(2,MW) thus the value is in the range 0 <= mantissa < 1.0 for the first case and we divide it by pow(2,MW+1) if we want it to be in the second range.

Thus for our example and I will use the form with mantissa in the range 1.0 <= mantissa < 2.0 so we get:

mantissa = 1.0 + M / pow(2,MW)

Thus we get the complete value as:

FLOATVAL = pow(-1,S) * (1.0 + M/pow(2,MW)) * pow(2,X - BIAS)

Here I have fixed BASE at 2, since the '1.0' etc. doesn't really make sense unless BASE is 2, as explained above. Otherwise MW and BIAS aren't fixed at any particular values, although for the double struct I showed earlier MW is of course 51, and BIAS is typically 2047 or 2048. Most systems choose 2047, since you are usually more interested in a larger range of high values than of low values.

However, this floating point value has one serious problem and one not so serious but still annoying problem.

The first, serious problem is that there is no way to represent 0.0 using this formula. That is pretty bad; we often need 0. This is solved by treating a floating point number with an exponent field X = 0 as special.

Bluntly speaking, when X = 0 we treat it just as if X were 1, so the exponent factor is the same as when X is 1, but we do not add the implied 1.0. Thus the lowest range of values forms a full range from 0.0 up to the smallest normal value, with all steps equal to the smallest possible step, which is pow(2,1 - BIAS - MW). Thus when X = 0 we use the following formula:

FLOATVAL = pow(-1,S) * (M / pow(2,MW)) * pow(2,1 - BIAS) = pow(-1,S) * M * pow(2,1 - BIAS - MW);

Thus we can represent 0 with X = 0 and M = 0. We even get two different zeroes, +0 and -0; this is unlike integer arithmetic, which on most systems is two's complement, where only one bit pattern represents 0.

One nice feature is that a floating point value of all bits 0 is the floating point value of +0.0000...

The other problem, which is not fatal but still annoying, is that with floating point we often need to represent a value which isn't a number. For example, in math you frequently need to represent infinity, both positive and negative, and sometimes, when you try to take the square root of a negative number, you simply need a value that represents "not a real number", or "Not a Number". Further, you might even want two of those kinds of values. A "quiet NaN" just silently propagates through further computations without any fuss; the programmer can test for it and take appropriate action. A "signaling NaN" instead raises a hardware exception, or a software exception if the language supports that, as soon as you feed it as input to another operation. For example, if you square a square root you expect to get the original number back, right? Well, if you give -1 as input you won't: the square root of -1 returns a quiet NaN, and unless your code checks for it, that NaN quietly propagates through the squaring as well.

To represent infinity and NaNs we simply also treat the value X = pow(2,XW) - 1  (all bits set) as a special value. When X has this value we do not use the formula above but we simply say that the value is one of these special values, which one is selected by M and S.

On the x87 hardware a value of M = 0 will indicate infinity (+inf if S is 0 and -inf if S is 1) and M != 0 will indicate NaN. The exact value of M (actually the most significant bit of M) will decide if the NaN is quiet or signaling.

So there you have it. To decode a floating point number we first check the exponent field X. If it has the special value X = 11111....1111 (all bits one) then we decode the value as a special value (infinity or NaN). If X is 0 we use the formula:

VALUE = pow(-1,S) * M * pow(2,1 - MW - BIAS)

and if X is neither all bits set nor all bits 0 we use the formula:

VALUE = pow(-1,S) * (1.0 + M / pow(2,MW)) * pow(2,X - BIAS)

How do we implement the juicy arithmetic operations on those numbers?

I take multiplication first, because it is easiest:

1. If any operand is a signaling NaN we trap it. Otherwise, if any operand is a quiet NaN we return a quiet NaN as the result. Otherwise, no operand is a NaN at all.

The above test is a standard test which we do for almost all our operations.

2. If any operands are infinity we return infinity with proper sign (CS = AS ^ BS, i.e. we XOR the sign bits of input).

The step above is also a "standard tests" which we do (with slight modifications) on all our operations.

3. OK, when we get here none of the values has an exponent field with all 1's. We then transform them into an internal format (if they aren't in that format already) so that they are easier to work with. The two formats described above are compact for storing floating point in memory variables, but they aren't really nice to work with when you do math, so the internal format is typically a slight modification of them.

a) If X != 0 then we have an implied 1.0; we make it explicit by adding pow(2,MW) to M. The internal format keeps M and X separated, so the internal variable used to hold M for each operand has room for this extra bit. Also, if X == 0 we really have an exponent equal to 1, so we set X = 1 in this case.

We do the above for both operands. We then have AM, AX, AS and BM, BX, BS we need to compute CS, CX and CM for the result. Remember that all values are integers.

4. Multiply: CM = AM * BM. Remember that AM really represents AM / pow(2,MW) and BM really represents BM / pow(2,MW), so this makes CM really represent CM / pow(2,MW + MW), but we want it to represent CM / pow(2,MW). We therefore "ignore" the MW lower bits and keep the higher bits. Since AM is at most MW + 1 bits and BM is the same, the result is at most 2 * MW + 2 bits, so by "ignoring" the lower MW bits we get the result we want; we really compute:

CM = AM * BM;

and then extract the MW + 1 high bits of the result. The lower bits are ignored except that the most significant of these are used to round the result properly, we simply add the most significant of those bits to the result or
ignore it depending upon our rounding mode. However, we cannot do it quite yet so for now we leave the
result as it is.

We also add the exponents together but keep in mind that both are biased and we want the result to have only one bias so we do:

CX = AX + BX - BIAS;

The sign bits are the easy part, we just XOR them as I already said:

CS = AS ^ BS;

Now, all we need to do is collect all this together properly:

1. If the most significant bits of CM are 0 and we want a set bit in the most significant position, we can get that by shifting CM left (multiply by 2) and decreasing CX by 1. This of course assumes that BASE equals 2, which is the dominant choice of BASE on modern computers. If BASE is 16 you would check the 4 most significant bits, and if they are all 0 you would shift CM left 4 places (multiply by 16) and decrease CX by 1.

Of course, if CM == 0 there are no bits so shifting won't help, therefore we first check if CM equals 0 and if it is we return 0 as result. (we could of course do that in the early part and check for 0 for any operands and return 0 if that is the case).

If CM != 0 then there is at least one bit so shifting will get it up as the most significant and we can do the shifting as indicated above.

Now, it may happen that CX has fallen off the range of valid X range, CX might even have become 0 or negative
(something it should never be) or it might equal the special high X value or even higher. Simply said, we need
to force CX into the range 1..pow(2,XW) - 2.

If CX is too high we simply have a value too high for our processor so we set the result to infinity (using CS as
sign bit) and return that value. I.e. we set CM to 0 and CX to pow(2,XW) - 1 and return CS,CX,CM;

If CX is too low we need to increase CX until it equals 1; each time we increase CX we shift CM to the right (divide by 2). This will cause the most significant bit to become 0, but that is OK in this case.

When CX reaches 1 we set CX to 0 indicating that this value is of the form with CX with all bits 0. We then return CS, 0, CM (using bits MW..2*MW-1 of the result). Since the most significant bit is 0 this uses that
formula which does not add 1.0 so CX is 0. The bits 0..MW-1 of CM is ignored except that the highest
bit of these (bit MW-1) might be added to the result for rounding.

If CX is neither too high nor too low it is in the range 1..pow(2,XW) - 2 and CX can be used as is and we
pick up bits MW..2*MW-1 of the result. The highest bit 2*MW is 1 and correspond to the implied 1.0
and the lower bits 0..MW-1 are ignored except for rounding.

Addition and subtraction is somewhat more complicated compared to multiplication. The basic rule is this:

If the exponent are equal then we can add by simply adding the mantissa but if exponents are different we cannot, we need to adjust the exponent.

Step 1 is identical to the similar step for multiplication.
Step 2 is also almost identical, except for the following situations:

If you have +inf + -inf we return quiet nan since the result is indeterminate. Similarly for -inf + +inf.
If you have z + +inf or z + -inf, where z is neither nan nor inf, we return inf of appropriate sign.
If you have +inf - -inf we return +inf;
if you have -inf - +inf we return -inf;
if you have -inf - -inf or +inf - +inf we return qnan.

When we reach step 3 we have two operands neither of which are nan or inf.

We do exactly the same unpacking so we get three different variables (registers) for each of the operands.

Now we can do slightly different things for different platforms but in general we can process the sign of
the result and the operation (+ or -) based on the signs of the operands and the operation as follows:
+x + +y        -> add and CS = 0.
+x + -y         -> sub and CS = 0 (+x + -y == x - y).
+x - +y         -> sub and CS = 0 (+x - +y == x - y).
+x - -y          -> add and CS = 0 (+x - -y == x + y)
-x + +y         -> sub and CS = 1 (-x + +y == -x + y = - (x - y))
-x + -y          -> add and CS = 1 (-x + -y == - (x + y))
-x - +y          -> add and CS = 1 (-x - +y == -(x + y))
-x - -y           -> sub and CS = 1(-x - -y == -x + y = -(x - y))

so the signs of the operands and the operation translate to a add/sub and CS equal to 1 or 0.

Next we want to do the operation but we cannot unless AX == BX.

If AX > BX then we need to make them the same; we can do that by shifting BM to the right (divide by 2) and increasing BX until it equals AX. In fact we need to do that AX - BX times. If AX - BX > MW + 1 then this shifting will cause BM to become 0, since the value is at most MW+1 bits (MW from M and 1 bit from the implied 1.0).
This translates to the case that B is so small that adding it to A have no influence. If AX - BX is exactly equal
to MW+1 it is also too small but in this case the bit may influence rounding so we might do it anyway (we might
extend AM and BM a bit on the least significant side in order to handle rounding, thus the M field may be
wider than MW+1 bits). If B thus is too small we can simply return A as the result of the operation.

If AX < BX then we need to do the reverse, adjusting A this time. If BX - AX > MW + 1 then A is similarly too small and we can return B as the result: if the operation is sub we set CS = ~BS, and if the operation is add we set CS = BS.

If AX == BX we can then do the operation CM = AM + BM or CM = AM - BM, depending on whether the operation is add or sub as determined above. If BM > AM and we did CM = AM - BM, we would get a 'negative' result; we fix that by inverting CS and negating the result (inverting all bits of CM and adding 1), so CM = -CM and CS = ~CS (CS = 1 - CS). We also set CX = AX (or BX, since they are equal).

Because the final CX equals either the original AX or the original BX, we never have the issue of CX falling outside the valid range as we did for multiplication, but we do need to check the high bit.

If the high bit (bit number MW) of CM is not 1 then we need to decrease CX by one and shift CM one bit to the left (multiply by 2). We shouldn't decrease CX lower than 1, if CX reaches 1 we stop.

If CX == 1 we check if high bit (bit number MW) of CM is 1 or not. If it is not 1 we have a "small" number and so
we set CX = 0. If CX > 1 we always have the high bit set (if that bit wasn't set we would decrease CX until we reached 1 or until we get a bit set in that position).

We then return CS,CX and CM (bit number 0 to MW-1, we ignore bit MW which is either 1 if CX > 0 or 0 if CX == 0).

Division is perhaps the toughest operation. Again, we first go through the nan and inf cases and return whatever is appropriate; inf / inf should probably return qnan. Note that for division there are situations where we have an inf operand but do not return inf: x / inf, where x is neither inf nor nan, should return 0.0.

So the interesting situation is when neither operand is NaN or inf. Here we should also check for operand B == 0: if BX == 0 and BM == 0 then we have an x/0 situation. If A is also 0 (AX == 0 and AM == 0) we have a 0/0 situation, which should probably return qnan; for any other value x we should return inf of the same sign as operand A.

So if neither operand is nan or inf and operand B is not zero we can move on to the "general" algorithm.

We could also check if A is 0 or not at this point (AX == 0 and AM == 0) since if A is zero we don't need to divide at all. The result is simply 0.

First do an integer divide of AM / BM

Note that both these values are scaled by the same scale factor (pow(2,-MW)), so the quotient is not scaled; to fix that we also multiply in pow(2,MW), so we compute:

CM = AM * pow(2,MW) / BM

The scale is simply done by placing AM in the bit positions MW..2*MW-1 and fill 0..MW-1 with 0 bits before
we divide.

One good thing is that in most cases BM will have the high bit set, so the division algorithm can assume a normalized divisor. If BM does not, we need to adjust BM first. We know that BM is not 0: if BX wasn't 0 the high bit is set already, and if BX was 0 we must have BM != 0, or else B == 0, which is handled above. We simply shift BM up until the high bit becomes 1 and decrease BX accordingly.

With integer division we usually get a quotient and a remainder. We're not really interested in the remainder itself (an frem() function might be, but float division isn't), but we need it to determine rounding. Again, a simple way to do this is to add a least significant bit below, initialized to 0 for both operands. This bit can then be used to determine rounding and we can ignore the remainder. Otherwise we keep the remainder, but we're only interested in whether it is less than, greater than, or equal to half the value of BM. It is probably easiest to just add those extra bits and ignore the remainder per se.

The exponent is simply computed as CX = AX - BX.

Next we need to make sure that the quotient is normalized: if the high bit of CM is not 1 we need to shift CM to the left (multiply by 2) and decrease CX by 1 until that high bit is set.

The sign bit is simply XOR between AS and BS: CS = AS ^ BS;

Again we need to check that CX is in proper range. If CX is in the range 1..pow(2,XW)-2 then CX is fine as is
and we can return CS, CX, CM.

If CX is too high we return inf of proper sign (CS, CX=pow(2,XW)-1, CM = 0)

If CX is too low we increase CX until it equals 1, and each time we increase it we shift CM to the right (divide by 2). If CM reaches 0 before CX reaches 1 we have an underflow situation and we return 0.

If we reach CX == 1 we set CX = 0, since the high bit is 0 in this case (we have shifted right at least once).

We then return CS,CX,CM.

There are of course some refinements to this. For example none of my algorithms sets overflow or underflow flags even though they clearly detect such situations. It is easy to set such flags for those situations though so I left
that out of the explanation.

So, if you have those functions, how can you compute juicy functions such as pow(x,y), exp(x), log(x), sin(x),
cos(x) etc?

Well, that has actually little to do with the floating point format per se, once you can represent floating point
numbers such problems are more math problems than hardware problems. Yes, some processors can compute
such functions in hardware but that is simply a matter of translating the mathematical equations to hardware.

pow(x,y) is usually computed through the following identity:

log(pow(x,y)) == y * log(x) so pow(x,y) = exp(y * log(x)).

pow(x,y) is therefore more inaccurate than pow(x,n) for integer number n if you have such a specialized function.
In particular, using pow(x,2) instead of x * x is typically something that is rather silly to do.

exp(x) is actually fairly fast to compute on computers, returning to our representation we have that

if y = exp(x) and we want to find y then we find y as a floating point number that is:

y = sign * mantissa * pow(2, exponent)

Here the sign must be positive, since exp(x) > 0 for any real x, so sign == 1.0; the mantissa and exponent are left to find. Consider:

log(y) = x = log(sign * mantissa * pow(2,exponent)) = log(1.0 * mantissa * pow(2,exponent))
          = log(mantissa) + log(pow(2,exponent)) = log(mantissa) + exponent * log(2)

log(2) is a fixed number which we can store in some ROM memory in the CPU.
What we have here is that:

x = log(mantissa) + exponent * log(2)

so if we divide x by log(2) we get a quotient exponent and a remainder equal to log(mantissa). The exponent
is an integer number while log(mantissa) is not but it is a small number and it is 0 <= log(mantissa) < log(2)
or we can easily adjust it to be in that range, if log(mantissa) is negative we simply add log(2) to it and decrease
exponent by 1.

We can then simply compute mantissa = exp(log(mantissa)) where log(mantissa) is a small number
so we can use the following formula:

mantissa = 1.0 + z + z*z/2.0 + z*z*z/6.0....+ pow(z,k)/k!  for a few terms where z == log(mantissa).

We now have sign = 1.0 so sign bit is 0, mantissa and exponent and we can make our floating point number y which is the result of exp(x).

log(x) is a bit tougher. There is a formula to compute log(1 + x) and it is:

log(1 + x) = x - x*x/2 + pow(x,3)/3 - pow(x,4)/4 .....

You can find this formula by using the sequence 1/(1+x) = 1 - x + pow(x,2) - pow(x,3)... and integrate on both sides.
Similarly you can find that:

log(1 - x) = -x - x*x/2 - pow(x,3)/3 - pow(x,4)/4 - pow(x,5)/5

You can find this in a similar manner or by simply replacing x by -x in the previous formula.

Putting these two together by adding them doesn't really give you something new:

log((1+x)(1-x)) = log(1 - x*x) = -x*x - pow(x,4)/2 - pow(x,6)/3 ... = -(x*x) - pow(x*x,2)/2 - pow(x*x,3)/3 ...

but if you subtract them you do get a new sequence:

log((1+x)/(1-x)) = 2 * (x + pow(x,3)/3 + pow(x,5)/5 + pow(x,7)/7 ...)

Another beautiful feature of this formula is that although x is restricted to the range fabs(x) < 1.0, within that range (1+x)/(1-x) maps onto the whole range of real numbers from 0.0 to +inf. Since log is only valid for positive input arguments anyway, this covers the whole range of valid input.

Thus for any x > 0 we can find a z such that (1+z)/(1-z) = x and fabs(z) < 1.0 we can then put z into
our formula for log and compute.

One problem is that while the formula converges for any valid input, it can converge very slowly: for large values of x, or for values of x close to 0, you get a z close to +1 or -1, and then the formula is very inaccurate unless you use many terms of the sequence. This sequence is much worse than the corresponding one for exp(x). Thus, for example, the GNU C library doesn't use this formula directly: it first processes the input in a way similar to what we did for exp(x), and when the value is small enough it uses the formula above. If the value is really close to 1 we could even use the original sequence:

log(1 + z) = z - z*z/2 + z*z*z/3 - pow(z,4)/4....

for a couple of terms to get a reasonable value.

sin(x) and cos(x) are essentially similar to the above. The sequence for sin(x) and cos(x) is:

sin(x) = x - pow(x,3)/3! + pow(x,5)/5! - pow(x,7)/7! ...
cos(x) = 1 - pow(x,2)/2! + pow(x,4)/4! - pow(x,6)/6! ...

where k! is k * (k-1)*(k-2)...*3*2*1 with 0! = 1, 1! = 1, 2! = 2, 3! = 6, 4! = 24 etc.

Just as for exp(x), this sequence will in theory work for any x, but in practice we don't want x to be too large. Luckily, sin(x) and cos(x) satisfy several identities, for example sin(x + 2*pi) = sin(x) for any x, so we can divide x by 2*pi and keep only the remainder. In addition we have sin(x + pi) = sin(x)*cos(pi) + cos(x)*sin(pi) = -sin(x), so if that remainder is >= pi we can simply compute sin(x) = -sin(x - pi): if x is in the range pi <= x < 2*pi then x - pi will be in the range 0 <= x - pi < pi. Thus we can force x to always be in the range 0 <= x < pi.

We can do even better: if x >= pi/2, write y = x - pi/2; then sin(x) = sin(y + pi/2) = sin(y)*cos(pi/2) + cos(y)*sin(pi/2) = cos(y), so we can always subtract pi/2 from x and compute cos(y) instead.

We can do even better than that: if pi/4 <= x < pi/2, write y = pi/2 - x, so that 0 < y <= pi/4; then

sin(x) = sin(pi/2 - y) = sin(pi/2)cos(y) - cos(pi/2)sin(y) = cos(y)

Thus we can force the argument to be in the range 0 <= x < pi/4. Similar equations works for cos(x) as well
so both functions take arguments in that range and when we have x in this range we can use the above sequence
to compute cos(x) or sin(x). tan(x) can then simply be computed as sin(x)/cos(x).

For the inverse functions we start with atan(x). If you differentiate atan(x) you get the following:

atan(x) = y, so x = tan(y) = sin(y)/cos(y), or x * cos(y) = sin(y). If we differentiate both sides with respect to x we get:

cos(y) - x*sin(y)*y' = cos(y)*y', or cos(y) = (cos(y) + x*sin(y))*y', or:

y' = cos(y) / (cos(y) + x * sin(y)) = 1 / (1 + x * tan(y)) = 1 / (1 + x*x)

1 / (1 + x*x) is our old friend 1 - x*x + pow(x,4) - pow(x,6) + pow(x,8) ... with -x*x in place of x.

Integrating this gives:

y = atan(x) = C + x - pow(x,3)/3 + pow(x,5)/5 - pow(x,7)/7 ....

The only unknown is C, which is easily fixed: since tan(0) == 0 we can set atan(0) to 0 as well (atan is actually a multivalued function, but the computer version of atan typically chooses the branch where atan(0) == 0), so C is 0 and we get:

y = x - pow(x,3)/3 + pow(x,5)/5 - pow(x,7)/7 ...

This COULD be used to compute pi/4, since atan(1) = pi/4, but I won't recommend it: there are much better ways to find pi, and even if you let the program run for a long time you would get a very inaccurate value of pi as the result.

However, the formula works well for small values of x so if we can find ways to get x to become a small value we should be just fine.

Let me wrap this up by giving formula for sqrt(x). This could be done by computing pow(x,0.5) but that wouldn't
be very accurate. There's a better way:

if y = sqrt(x), then y*y == x and y*y + x == 2*y*y, so (y*y + x)/(2*y) == (y + x/y)/2 == y.

In a way we simply compute the average of y and x/y and expect to get y back. If we do not have y yet (but wish to compute it) we have instead a guess value z, and we can put that into the same formula:

(z + x/z)/2 = v

Since z != y we get neither y nor z as the result, but a new guess value v which is closer to y. In fact, after the first step every guess is at or above the true root, and each further step moves steadily down towards the correct square root, roughly doubling the number of correct digits each time.

Note that there are faster ways to compute square root for integer values using integer arithmetic but for floating point this formula although old is actually still good. The starting guess isn't really important but a good guess is average between input variable (which must be positive) and 1.0, so:

z = (1.0 + x)/2.0;

Then repeat until the difference between two successive guesses becomes smaller than EPSILON where EPSILON is a suitably chosen small value (it determines the accuracy).

while (fabs((v = (z + x/z)/2) - z) >= EPSILON)
     z = v;

When you're done with the loop you can use v as the final guess.
return v;

Now, you can easily extend this to do complex arithmetic and define sin(z), cos(z), exp(z), log(z) etc. for complex values z, but that is pure math and has little to do with floating point per se.

As a rule I only use doubles. You lose some precision when going from double to int.
For your example, try the following:

double f = 95.40;   // or some other number, since f is undefined otherwise
f = f/10.00;        // keep the precision by dividing by another floating point value
f = f / (double) 10;
f /= 10.00;

Jim's "problem" is that the floating point type typically uses a BASE equal to 2. If BASE had been 10 he would have been OK. However, BASE equal to 10 is only fast and easy to implement in hardware if the mantissa is stored in BCD format (actually packed BCD, with one nibble per digit). For an ordinary binary representation it is difficult to extract decimal digits from a sequence of bits, so using a BASE other than 2, 16 or some other power of 2 is problematic.

Note that BASE = 10 _is_ frequently used for fixed point numbers (scaled numbers) such as the decimal type in C# but they operate slightly differently.

The problem, to put it plainly, is that 0.1 is not representable as a binary floating point number. You can represent 0.5 exactly, and 0.25 exactly, and 0.125 exactly; in fact any power of 2 can be represented exactly, but a number such as 0.1 gets an infinite sequence, much like 1/3 does when you try to divide it out in decimal arithmetic (you get 0.33333333...). The point is that since 10 has the factors 2 and 5, any combination of a power of 2 and a power of 5 can be represented exactly in decimal format, but since BASE equal to 2 has only 2 as a factor, only powers of 2 (and their sums) can be represented exactly.

0.1 in double format can be found like this: 0.1 is 1/10. In the format from my previous posting we write:

0.1 = m.mmmmmmmmmmmmm

Since the integer part is 0 we set the first m to 0. We then multiply by 2:
0.2 = m.mmmmmmmmm       mantissa = 0.

again we have a 0 bit, multiply by 2 again.
0.4 = m.mmmmmmmm         mantissa = 0.0
again, multiply by 2.
0.8 = m.mmmmmmm      mantissa = 0.00
again, multiply by 2.
1.6 = m.mmmmmmm       mantissa = 0.000

This time we got a 1, add it to mantissa and subtract it from 1.6 so we get 0.6 and multiply by 2.
1.2 = m.mmmm     mantissa = 0.0001
Again, we got a 1, same procedure, subtract from 1.2 to get 0.2 and multiply by 2.
0.4 = m.mmmmm mantissa = 0.00011

We now see that we are in the same situation as previously so we will get 0011 in a cycle from here on.

so 0.1 in binary is 0.00011001100110011001100110011....

Since the sequence doesn't terminate it is impossible to represent 0.1 exactly as a floating point number.

Thus 0.3 will also have an infinite sequence; in fact any fraction whose denominator is not a power of two cannot be represented exactly.

0.5, 0.25, 0.125, 0.0625 etc., finite sums of such values (0.75 = 0.5 + 0.25, for example) and all integers that fit in the mantissa are the only types of values possible to represent exactly.

An integer such as 3 can be represented exactly even though it isn't a power of two:

3.0 has mantissa 11.000000 (binary), which is divided by 2 to get it into the range 1.0 <= mantissa < 2.0,
so we get 1.100000 and an exponent of 1, i.e. M = pow(2,MW-1), X = BIAS + 1 and S = 0.

So the above problem with values that aren't powers of two applies only to non-integers.

Also, as an alternative to dividing, a floating point hardware manufacturer can make use of the fact that:

1/(1 - x) = 1 + x + pow(x,2) + pow(x,3) + ...

If you take a sufficient number of terms for accuracy, and the value of x is scaled to a suitable range, then multiply and add might be more efficient than actually dividing.

Hope this explains.
