Setting precision

Posted on 1997-07-04
Last Modified: 2008-03-03
I've been learning C for a couple of weeks.
I have to write a program that performs calculations with many math
functions (mainly sin and cos), using a 3-byte floating point format: 8 bits for
the exponent, 1 for the sign and 15 for the mantissa.
How can I set this precision?

Many thanks.

Question by:pietropaolo

Expert Comment

ID: 1251989
Why not just use float or double as your data types?


Author Comment

ID: 1251990
My goal is to calculate the error I make by using a 3-byte floating
point in a sine/cosine equation instead of an 8-byte floating point.
The online help of my compiler reports that the sin and cos functions accept only an 8-byte double as argument.
So, what should I do?


Accepted Solution

mjkajen earned 70 total points
ID: 1251991

The math routines typically return a double, and take one (or more) arguments that are also doubles.

You can use casting to convert your 3-byte floating point number to the double argument required by the math routines, and use casting to force the result to a 3-byte floating point number.

What your question does not indicate is how you intend to store the 3-byte floating point value. Assume you have typedef'd it as FLOAT_3BYTE.

Then the equation y = sin (x) would be evaluated in 3-byte floating points in C as:

FLOAT_3BYTE     x, y;

y = sin ((double) x);

This means that x (the 3-byte float you construct) would be converted to a double, the sine routine called, and a double result returned. Since y is of type FLOAT_3BYTE, C would automatically convert the double result to a FLOAT_3BYTE for you, provided FLOAT_3BYTE is an arithmetic type.

If you don't like all this casting, then an alternative is to write 'wrapper' functions around the math routines that deal only with the FLOAT_3BYTE data type. For example, here is a sine routine that takes and returns the 3-byte floating type:

FLOAT_3BYTE sin_3byte (FLOAT_3BYTE x)
{
    return sin ((double) x);
}

I hope this helps. MK


Author Comment

ID: 1251992
Thank you very much for your answer, Mjkajen.

Now the problem is: how can I define the 3-byte floating point
type FLOAT_3BYTE?

Many thanks again!



Expert Comment

ID: 1251993
Here is some code that constructs a 3-byte floating point number and stores it in a double. Please see the notes at the end.

// precision.c

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <float.h>
#include <math.h>

/* Prototypes */
double Build3ByteFloat (int iSign, int iExponent, int iMantissa);
void   report (int iSign, int iExponent, int iMantissa);

int main (void)
{
      report (1, 0, 0);
      report (1, 1, 0);
      report (1, 0, 1);

      report (1, 1, 10);
      report (1, 1, 15);

      return 0;
}

/* Routine to build a 3-byte floating point number that is
** stored in a double.
** Input args:
** iSign = 1 bit (-1 or 1)
** iExponent = 8 bit number
** iMantissa = 15 bit number
** The result is returned as a double.
*/
double Build3ByteFloat (int iSign, int iExponent, int iMantissa)
{
      /*
      ** Check that the input arguments will, in fact, fit into
      ** a double for this machine: the mantissa must fit in 15 bits,
      ** 15 bits must fit in a double's mantissa, and the scaled
      ** value must stay within the double's exponent range.
      */
      assert (15 <= DBL_MANT_DIG);
      assert (iMantissa >= 0 && iMantissa < (1 << 15));
      assert (iSign == -1 || iSign == 1);
      assert (abs (iExponent) <= DBL_MAX_EXP - 15);

      /*
      ** Use the ldexp () function to do all the work:
      ** it computes iMantissa * 2^iExponent.
      */
      return iSign * ldexp ((double) iMantissa, iExponent);
}

/*
** Routine to compute and display a 3-byte floating point
** number for debugging purposes.
*/
void report (int iSign, int iExponent, int iMantissa)
{
      double            dResult;

      dResult = Build3ByteFloat (iSign, iExponent, iMantissa);
      printf ("Result for %d, %d, %d, is %e\n",
                  iSign, iExponent, iMantissa, dResult);
}


The above code demonstrates simulating a 3-byte float inside a double. There are subtle problems with this approach. Basically, on the machine I'm using, a double is "bigger" than the 3-byte float. Suppose I add together the two largest 3-byte floating point numbers possible. Technically, this should cause an overflow; however, since these numbers are stored in doubles, an overflow will probably NOT occur. This will therefore give you accuracy that is not possible with pure 3-byte floating point numbers.
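
As a small illustration of that overflow case, here is a sketch in C. It assumes the layout used by Build3ByteFloat above (an integer mantissa of at most 15 bits scaled by a power of two) and, as a further assumption not stated in the question, a signed 8-bit exponent whose maximum is 127. The names max_3byte and overflow_goes_unnoticed are made up for this example, and isfinite requires a C99 compiler.

```c
#include <assert.h>
#include <math.h>

/* Largest value of the assumed format: mantissa 2^15 - 1, exponent 127. */
static double max_3byte (void)
{
      return ldexp (32767.0, 127);
}

/* Returns 1 when the double sum of the two largest 3-byte values is
** still finite, yet larger than anything the 3-byte format can hold.
** A real 3-byte implementation would have to report an overflow here. */
static int overflow_goes_unnoticed (void)
{
      double big = max_3byte ();
      double sum = big + big;

      return isfinite (sum) && sum > big;
}
```

On a machine whose double is an IEEE-754 64-bit value, overflow_goes_unnoticed () should return 1, because the sum is far below the largest double.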

So, although I've shown how to simulate 3-byte floating point numbers with doubles, this may not be suitable for your task.

To accurately simulate arithmetic that is different from the native machine, one must also provide the addition, subtraction, multiplication, and division operations. This is a lot of work. For example, one would provide an add function that would "know" how to add two 3-byte floating point numbers, and it would know all of the overflow rules. The same goes for subtraction, multiplication, and division (and, of course, sin, cos, etc.).

Again, this may not be what you need, but it's the 'correct' approach. Please don't think that when a 3-byte float is simulated using a double, arithmetic performed on these doubles will reflect the precision of the 3-byte floats: it will only reflect the precision available in the underlying simulation.

There are arithmetic packages available to simulate arbitrary precision, but these tend to be targeted at large, or exact, precision.



