
Setting precision

I've been learning C for a couple of weeks.
I have to write a program that performs calculations with many math
functions (mainly sin and cos), using a 3-byte floating point format: 8 bits
for the exponent, 1 for the sign and 15 for the mantissa.
How can I set this precision?

Many thanks.

Why not just use float or double as your data types?

pietropaolo (Author) commented:
My goal is to calculate the error I make by using a 3-byte floating
point in a sine/cosine equation instead of an 8-byte floating point.
The online help of my compiler reports that the sin and cos functions accept only an 8-byte double as argument.
So, what should I do?


The math routines typically return a double, and take one (or more) arguments that are also doubles.

You can use casting to convert your 3-byte floating point number to the double argument required by the math routines, and use casting to force the result to a 3-byte floating point number.

What your question does not indicate is how you intend to store the 3-byte floating point value. Assume you have typedef'd it as FLOAT_3BYTE.

Then the equation y = sin (x) would be evaluated in 3-byte floating point in C as:

FLOAT_3BYTE     x, y;

y = sin ((double) x);

This means that x (the 3-byte float you construct) would be converted to a double, the sine routine called, and a double result returned. Since y is of type FLOAT_3BYTE, C would automatically convert the double result to a FLOAT_3BYTE for you. (These conversions are only available if FLOAT_3BYTE is a typedef for an arithmetic type; a struct or byte array needs explicit conversion functions.)

If you don't like all this casting, then an alternative is to write 'wrapper' functions around the math routines that deal only with the FLOAT_3BYTE data type. For example, here is a sine routine that takes and returns the 3-byte floating type:

FLOAT_3BYTE sin_3byte (FLOAT_3BYTE x)
{
    return (FLOAT_3BYTE) sin ((double) x);
}

I hope this helps. MK

pietropaolo (Author) commented:
Thank you very much for your answer, Mjkajen.

Now the problem is: how can I define the 3-byte floating point
type FLOAT_3BYTE?

Many thanks again!


Here is some code that constructs a 3-byte floating point number and stores it in a double. Please see the notes at the end.

// precision.c

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <float.h>
#include <math.h>

/* Prototypes */
double Build3ByteFloat (int iSign, int iExponent, int iMantissa);
void   report (int iSign, int iExponent, int iMantissa);

int main (void)
{
      report (1, 0, 0);
      report (1, 1, 0);
      report (1, 0, 1);

      report (1, 1, 10);
      report (1, 1, 15);

      return 0;
}

/* Routine to build a 3-byte floating point number that is
** stored in a double.
** Input args:
** iSign = 1 bit (-1 or 1)
** iExponent = 8 bit number
** iMantissa = 15 bit number
** The result is returned as a double.
*/
double Build3ByteFloat (int iSign, int iExponent, int iMantissa)
{
      /*
      ** Check that the input arguments will, in fact, fit into
      ** a double for this machine.
      */
      assert (15 <= DBL_MANT_DIG);
      assert (iMantissa >= 0 && iMantissa <= 0x7FFF);
      assert (iSign == -1 || iSign == 1);
      assert ((double) abs (iExponent) <= DBL_MAX_EXP);

      /*
      ** Use the ldexp () function to do all the work:
      ** the result is iSign * iMantissa * 2^iExponent.
      */
      return iSign * ldexp ((double) iMantissa, iExponent);
}

/*
** Routine to compute and display a 3-byte floating point
** number for debugging purposes.
*/
void report (int iSign, int iExponent, int iMantissa)
{
      double            dResult;

      dResult = Build3ByteFloat (iSign, iExponent, iMantissa);
      printf ("Result for %d, %d, %d, is %e\n",
                  iSign, iExponent, iMantissa, dResult);
}


The above code demonstrates simulating a 3-byte float inside a double. There are subtle problems with this approach. Basically, on the machine I'm using, a double is "bigger" than the 3-byte float. Suppose I add together the two largest 3-byte floating point numbers possible. Technically, this should cause an overflow. However, since these numbers are stored in doubles, an overflow will probably NOT occur. The simulation will therefore give you range and accuracy that are not possible with pure 3-byte floating point numbers.

So, although I've shown how to simulate 3-byte floating point numbers with doubles, this may not be suitable for your task.

To accurately simulate arithmetic that is different from the native machine, one must also provide the addition, subtraction, multiplication, and division operations. This is a lot of work. For example, one would provide an add function that would "know" how to add two 3-byte floating point numbers, and it would know all of the overflow rules. The same goes for subtraction, multiplication, and division (and, of course, sin, cos, etc.).

Again, this may not be what you need, but it's the 'correct' approach. Please don't think that when a 3-byte float is simulated using a double, arithmetic performed on these doubles will reflect the precision of the 3-byte floats: it will only reflect the precision available in the underlying simulation.

There are arithmetic packages available that simulate arbitrary precision, but these tend to be targeted at large, or exact, precision.



