How do I perform operations on the actual bits of a double?

I want to be able to change the actual bits of a given double value. How do I access the actual bits to change them?
Salte commented:

You must be joking, your text didn't make sense to me.

If the variable is of type double and is 10.5, how are you going to get (variable << 1) to make any form of sense? The << operator is defined for integers only. If the double were converted to the integer 10, it would be shifted to 20 (not 100), and there's no way you can get 105 from this in any meaningful way.

In addition, the conversion from double to int looks suspect to me: no standard C++ compiler would do that implicitly for <<. A double operand to << is simply rejected.

You must be smoking some form of illegal substance or something...


Why would you want to change the bits of the double value? If it is in order to transform them in some encryption scheme then just store the double value in an int buffer (int array or some such) and then convert it along with your strings, int and short values, etc. You might also use a char buffer if you like. The principle is the same: just stuff it all into the buffer and then encrypt the data.

struct buf {
   double x;
   int y;
   char z;
   char s[3];
};

union {
  buf b;
  unsigned char ub[sizeof(buf)];
} u;

u.b.x = 3.14;
u.b.y = 15;
u.b.z = 'a';
// note that the string u.b.s cannot hold more than 2 chars
// since the terminating null byte also needs 1 char.

u.ub is now an unsigned char buffer holding the data,
ready for encryption.
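A minimal sketch of the same "stuff it into a buffer" idea using memcpy instead of a union. The names Record, to_bytes, from_bytes and xor_bytes are made up for this illustration, and the XOR step is only a stand-in for whatever real encryption you would apply:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical record; the fields mirror the struct in the post.
struct Record {
    double x;
    int y;
    char z;
    char s[3];
};

// Copy the raw bytes of a Record (padding included) into a byte buffer.
std::vector<unsigned char> to_bytes(const Record& r) {
    std::vector<unsigned char> bytes(sizeof(Record));
    std::memcpy(bytes.data(), &r, sizeof(Record));
    return bytes;
}

// Rebuild a Record from a buffer produced by to_bytes.
Record from_bytes(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() == sizeof(Record));
    Record r;
    std::memcpy(&r, bytes.data(), sizeof(Record));
    return r;
}

// Placeholder "cipher": XOR every byte with a one-byte key.
// Applying it twice restores the input. This is NOT real encryption.
void xor_bytes(std::vector<unsigned char>& bytes, unsigned char key) {
    for (unsigned char& b : bytes) b ^= key;
}
```

The buffer layout (padding, byte order) is whatever your compiler and machine use, so it is only meaningful to a program built the same way.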

If you want to get at the bits in order to emulate the floating point type the easiest way is probably to define a struct for it.

Note that such a struct is very machine dependent. A floating point value is defined by 3 fields:

A one bit field S indicating the sign of the floating point value.

A k bit field being an integer X. Based on X you can compute an exponent value x.

A n bit field being an integer M. Based on M you can compute a mantissa value m.

1 + k + n is the number of bits in the double and on most machines that is 64. On the PC k is 11 and n is 52.
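As a rough sketch, the three fields can be pulled out of a double with shifts and masks once the bit pattern sits in a 64-bit integer. This assumes the PC layout mentioned above (k = 11, n = 52, sign in the top bit, i.e. IEEE 754). Fields, split and join are names invented for the example:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

struct Fields {
    unsigned S;        // 1-bit sign field
    unsigned X;        // k = 11 bit exponent field
    std::uint64_t M;   // n = 52 bit mantissa field
};

// Split a double into its S, X and M fields.
Fields split(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    Fields f;
    f.S = static_cast<unsigned>(bits >> 63);
    f.X = static_cast<unsigned>((bits >> 52) & 0x7FF);
    f.M = bits & ((std::uint64_t{1} << 52) - 1);
    return f;
}

// Put (possibly modified) fields back together into a double.
double join(const Fields& f) {
    assert(f.S <= 1 && f.X <= 0x7FF && f.M < (std::uint64_t{1} << 52));
    const std::uint64_t bits = (static_cast<std::uint64_t>(f.S) << 63) |
                               (static_cast<std::uint64_t>(f.X) << 52) | f.M;
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}
```

For example, split(10.5) gives S = 0, X = 1026 and M = 0x5000000000000 on such a machine, and setting S to 1 before calling join yields -10.5.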

The conversion from X to x varies from machine to machine even if they have the same number of bits for the X field. Similarly the conversion from M to m varies from machine to machine even if they have the same number of bits for the M field.

Once you have m and x the floating point value is:

v = s * m * 2^x where ^ is the power or exponent operator. (not the C XOR operator).

s is +1 if S == 0 and -1 if S == 1.
m is often scaled to be in the range 0.5 <= m < 1.0 but for special values of X it isn't. However, this is also very machine specific and other machines may have the range:

1.0 <= m < 2.0 or some such instead. Again for specific values of X this isn't the case.

If m was always in the range given above you wouldn't be able to express 0.0 in type double, so the value 0.0 is expressed using a special value for X such that the mantissa m is outside of the range given but is instead in some other range which includes the value 0.0.

The value x is then usually the same as X but with a bias so that you can represent negative values. If the bias is B then x = X - B. This means that an exponent of -1 can be represented by having X == B - 1. B is typically a large positive value so X is also a positive value even if the exponent x is negative. In fact B is made so large that no matter how small x is the value X is always positive.

X > 0 (usually X == 0 is the special value used to represent 0.0). Since x = X - B you then get:
X = B + x > 0 and so B must be large enough that the smallest possible exponent value x will still make B + x positive. Thus X is always a positive integer.

Of course, it's conceivable that some weird machine may appear that uses a signed exponent representation so that X is simply equal to x in two's complement fashion, but I have never heard of such a machine.

Similarly the mantissa field M is an integer and is used to represent the mantissa m. Since M is an integer and m is a fraction in the range 0.5 <= m < 1.0, this clearly involves some scaling. If you represent the value m in binary notation you will typically get something like:

0.1xxxxxx

The first 1 appears because the value is scaled to be in the range 0.5 <= m < 1.0. Because of this most representations simply remove that 1. It is always there so there's no point in storing it, and so the mantissa M is simply the bits xxxxxx, a total of n bits. This means that m and M are related by:

m = 0.5 + M / 2^(n+1) = (2^n + M)/2^(n+1)

Again, the ^ is the power or exponent operator and not the C or C++ XOR operator.

When X has the special value 0, you simply do not add that extra 0.5, this gives a mantissa m in the range:

0.0 <= m < 0.5 and with a fixed small exponent x can be used to represent very small floating point values close to 0.
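A small sketch of this special case, assuming an IEEE 754 double (n = 52) and the 0.5 <= m < 1.0 convention used here, for which the bias works out to B = 1022 and the fixed exponent to x = 1 - B. The function name decode_small is made up for the example:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Decode a double whose exponent field X is 0 (zero or a very small value).
// Assumes IEEE 754 binary64: n = 52, B = 1022 in the 0.5 <= m < 1.0 convention.
double decode_small(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    const std::uint64_t M = bits & ((std::uint64_t{1} << 52) - 1);
    const unsigned X = static_cast<unsigned>((bits >> 52) & 0x7FF);
    assert(X == 0);  // only valid for the special exponent value
    const double s = (bits >> 63) ? -1.0 : 1.0;
    // m = M / 2^(n+1): the extra 0.5 is NOT added here.
    const double m = std::ldexp(static_cast<double>(M), -53);
    // Fixed exponent x = 1 - B.
    return s * std::ldexp(m, 1 - 1022);
}
```

With M = 1 this gives 2^-1074, the smallest positive double on such a machine, and with M = 0 it gives 0.0.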

Also, floating point formats usually have a special exponent value X = 2^k-1, i.e. the value with all bits in the exponent equal to 1. This is used to represent the special values "not a number", +"infinity" and -"infinity".

(S = ?, X = 111111...11, M = non-zero) == "not a number"
(S = 0, X = 111......11, M = 0) == +infinity
(S = 1, X = 111......11, M = 0) == -infinity

It is possible that some machines switch the meaning of M non-zero and M == 0 so that infinity has a non-zero M and vice versa. These values are never calculated with as such; they are special values which require special testing by the floating point hardware.
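The special exponent values can be tested directly on the fields. This is only a sketch for IEEE 754 doubles (k = 11, n = 52), where M == 0 means infinity and M != 0 means "not a number". Kind and classify are names invented here:

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "sketch assumes IEEE 754 doubles");

enum class Kind { Normal, ZeroOrDenormal, Infinity, NotANumber };

// Classify a double by looking only at its X and M fields.
Kind classify(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    const unsigned X = static_cast<unsigned>((bits >> 52) & 0x7FF);
    const std::uint64_t M = bits & ((std::uint64_t{1} << 52) - 1);
    if (X == 0x7FF) return M == 0 ? Kind::Infinity : Kind::NotANumber;
    if (X == 0) return Kind::ZeroOrDenormal;
    return Kind::Normal;
}
```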

If X is not 0 and not 2^k-1 then X is in the range:

0 < X < 2^k-1

and then the number is a regular floating point value with:

s = S == 0 ? 1 : -1;
m = (pow(2,n) + M)/pow(2,n+1);
x = X - B;

v = s * m * pow(2,x);
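Put together for a real machine, the computation above might look like the sketch below. It assumes an IEEE 754 double (k = 11, n = 52), for which the bias in the 0.5 <= m < 1.0 convention is B = 1022. It uses std::ldexp for the exact scaling by powers of two instead of pow, and decode_normal is a made-up name:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Recompute the value of a regular double from its S, X and M fields.
double decode_normal(double v) {
    const int n = 52;
    const int B = 1022;  // assumption: IEEE 754 binary64, 0.5 <= m < 1.0
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    const unsigned S = static_cast<unsigned>(bits >> 63);
    const int X = static_cast<int>((bits >> n) & 0x7FF);
    const std::uint64_t M = bits & ((std::uint64_t{1} << n) - 1);
    assert(X > 0 && X < 0x7FF);  // regular values only

    const double s = (S == 0) ? 1.0 : -1.0;
    // m = (2^n + M) / 2^(n+1)
    const double m =
        std::ldexp(static_cast<double>((std::uint64_t{1} << n) + M), -(n + 1));
    const int x = X - B;
    return s * std::ldexp(m, x);  // v = s * m * 2^x
}
```

Feeding it any regular double should give back the same value, e.g. decode_normal(10.5) == 10.5.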

Of course, the floating point hardware never does that last computation but does the floating point operations on S, X and M directly.

It isn't as hard as you might think:

+ and -:

First add that implicit 1 bit above all the M bits, so we get 1mmmmmmmm where each m is a bit in the M field. Do this for both operands.

Second arrange that both operands have the same exponent X.

If a has the greater exponent then the operand b with the lower exponent is changed so that its mantissa is shifted one step to the right and its exponent X is increased by 1. This process continues until either the exponent field of b equals the exponent field of a, or all the bits have been shifted out, in which case b is so small compared to a that it makes no contribution and the result is a.

Also, if either a or b has a negative sign then its mantissa is negated before continuing. The operation is then done in two's complement arithmetic. Since the mantissa with its implicit bit already occupies n+1 bits, the working value needs at least one more bit for the sign.

When a and b have the same exponent, just add or subtract the mantissas and then rescale the result. If the addition produced a carry, that one bit carry is placed above the mantissa and the mantissa is shifted down (and the exponent is increased) so that the value again has n bits of mantissa plus the 1 bit at the top. If it was a subtraction and the 1 bit at the top disappeared, shift the bits to the left and decrease the exponent until you get a 1 bit at the top. The 1 bit at the top is then removed and the n remaining bits are the new mantissa of the result. The exponent computed is the exponent of the result.

The sign bit is similarly handled in the obvious manner. If the result is negative, the sign bit of the result is set to 1 and the mantissa must be negated again before the result is stored.
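Here is a deliberately simplified sketch of the addition steps, assuming IEEE 754 doubles. It only handles two positive regular operands, so the negation step is left out, and it truncates the bits shifted out instead of rounding, so it can differ from the hardware in the last bit. The names to_bits, from_bits and add_positive are invented for the example:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

std::uint64_t to_bits(double v) {
    std::uint64_t b;
    std::memcpy(&b, &v, sizeof b);
    return b;
}

double from_bits(std::uint64_t b) {
    double v;
    std::memcpy(&v, &b, sizeof v);
    return v;
}

// Add two positive regular doubles using integer operations only.
double add_positive(double a, double b) {
    const std::uint64_t implicit = std::uint64_t{1} << 52;
    const std::uint64_t mask = implicit - 1;
    std::uint64_t ba = to_bits(a), bb = to_bits(b);
    // For positive doubles the bit patterns order the same way as the values,
    // so this makes ba the operand with the greater (or equal) exponent.
    if (ba < bb) std::swap(ba, bb);
    std::uint64_t Xa = ba >> 52;
    const std::uint64_t Xb = bb >> 52;
    assert(Xa > 0 && Xa < 0x7FF && Xb > 0 && Xb < 0x7FF);  // positive, regular

    // Step 1: add the implicit 1 bit above the M bits.
    const std::uint64_t ma = (ba & mask) | implicit;
    std::uint64_t mb = (bb & mask) | implicit;

    // Step 2: shift the smaller operand right until the exponents match.
    const std::uint64_t shift = Xa - Xb;
    mb = (shift < 64) ? (mb >> shift) : 0;  // all bits shifted out: no contribution

    // Step 3: add, then rescale if there was a carry.
    std::uint64_t sum = ma + mb;
    if (sum >> 53) {
        sum >>= 1;
        ++Xa;
    }
    assert(Xa < 0x7FF);  // overflow to infinity is not handled in this sketch

    // Remove the implicit 1 bit again and pack the result.
    return from_bits((Xa << 52) | (sum & mask));
}
```

For inputs whose sum is exactly representable, such as 1.5 + 2.25, the result matches ordinary double addition.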

Multiplication is in many ways simpler than addition and subtraction. Just multiply the mantissas of the two operands and keep the top n+1 bits (discard the lower bits, although you might want to keep the most significant discarded bit in order to do proper rounding). Add the exponents, but remember that they are biased, so the bias is added twice if you just add the exponents. a.X + b.X - B should be the proper exponent. Watch out for special values and values overflowing the allowed range for exponents.
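A sketch of the multiplication, with two loud assumptions. First, it uses float rather than double (IEEE 754, k = 8, n = 23, B = 126 in the 0.5 <= m < 1.0 convention) purely so that the product of the two mantissas fits in a standard 64-bit integer. Second, it truncates instead of rounding and ignores special values. mul_sketch is a made-up name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Multiply two regular floats using integer operations only (truncating).
float mul_sketch(float a, float b) {
    const int n = 23;
    const int B = 126;  // assumption: IEEE 754 binary32, 0.5 <= m < 1.0
    const std::uint32_t mask = (std::uint32_t{1} << n) - 1;
    std::uint32_t ba, bb;
    std::memcpy(&ba, &a, sizeof ba);
    std::memcpy(&bb, &b, sizeof bb);
    const int Xa = static_cast<int>((ba >> n) & 0xFF);
    const int Xb = static_cast<int>((bb >> n) & 0xFF);
    assert(Xa > 0 && Xa < 0xFF && Xb > 0 && Xb < 0xFF);  // regular values only

    // Mantissas with the implicit 1 bit on top: n+1 = 24 bits each.
    const std::uint64_t ma = (ba & mask) | (std::uint32_t{1} << n);
    const std::uint64_t mb = (bb & mask) | (std::uint32_t{1} << n);
    std::uint64_t prod = ma * mb;  // 2^46 <= prod < 2^48

    int X = Xa + Xb - B;  // the bias would otherwise be counted twice
    if ((prod >> 47) == 0) {  // top bit missing: shift up, lower the exponent
        prod <<= 1;
        --X;
    }
    assert(X > 0 && X < 0xFF);  // exponent overflow/underflow not handled

    // Keep the top n+1 bits, then drop the implicit 1 bit.
    const std::uint32_t M = static_cast<std::uint32_t>(prod >> 24) & mask;
    const std::uint32_t S = (ba ^ bb) >> 31;  // signs combine by XOR
    const std::uint32_t bits =
        (S << 31) | (static_cast<std::uint32_t>(X) << n) | M;
    float r;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}
```

For a double the same steps apply with n = 52, but the mantissa product then needs up to 106 bits, so you need a wider integer type or a multiplication done in pieces.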

Division is also very similar: just divide the mantissas. You generally don't worry about the remainder, since the values are really floating point and not really integers. However, you will check the highest bits of the remainder if you want to round the result properly. The exponents are simply subtracted, and here you need to add the bias back in, so c.X = a.X - b.X + B;

Here too you need to rescale the quotient so that you get a 1 bit at the top. To fill in the bits that are shifted in you can use the remainder if you like: each time you shift the quotient up, the remainder is multiplied by 2. When you have shifted the quotient so that you get a 1 bit at the top, you have a factor f = 2^j, where j is the number of shifts you did. If you then multiply the remainder by 2^j and divide it by the mantissa of b (b.M), you get exactly j bits in that quotient, which are the lower j bits of the result mantissa.

Note that there are algorithms that combine this: you divide, and if the top bit is 0 you know you need to shift, so you shift immediately (and adjust the exponent as well). In this way, when the division is done you can discard the remainder (except for the rounding bit) and you have a mantissa with a top bit of 1.

The best way to handle the rounding bit is just to continue the loop one more time and compute an additional bit and then add that bit to the mantissa.

Now you can go and make your own floating point emulator if you want.

Oh yeah, how to get the values. Well, the union as shown earlier can be used or you can use:

double x;
unsigned int a[2];

memcpy(a, & x, sizeof(a));

assuming that sizeof(double) == sizeof(int)*2.

or : reinterpret_cast<int *>(& x)[0] etc..

if you have 64 bit integers you can even do:

unsigned long long y = *(unsigned long long *)(&x);

and you have the exact bit pattern of x in y. The integer can then have the bits extracted as you please.
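Of the variants above, the memcpy form is the one that stays within the language rules. The pointer-cast forms read a double through an integer pointer, which the C++ aliasing rules leave undefined, even though many compilers accept it. A minimal sketch of reading, changing and writing back the bits that way (double_to_bits, bits_to_double and flip_bit are names made up here, and the bit numbering assumes an IEEE 754 double):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Exact bit pattern of v as a 64-bit integer.
std::uint64_t double_to_bits(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return bits;
}

// The double that has exactly this bit pattern.
double bits_to_double(std::uint64_t bits) {
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Flip one bit of a double; bit 0 is the lowest mantissa bit,
// bit 63 is the sign bit on an IEEE 754 machine.
double flip_bit(double v, unsigned bit) {
    assert(bit < 64);
    return bits_to_double(double_to_bits(v) ^ (std::uint64_t{1} << bit));
}
```

For example flip_bit(10.5, 63) gives -10.5, and flip_bit(1.0, 52) changes the lowest exponent bit and gives 0.5.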

Note that the array etc. will have a layout that is very machine dependent. On some machines the exponent and sign appear in a[0], while the mantissa is in the low bits of a[0] and all of a[1]. On other machines there is some other arrangement, for example a[0] contains 32 bits of the mantissa and a[1] contains the sign bit, the exponent field and part of the mantissa.

So if you really want to interpret the bits you need to know the specific format on your hardware. If you just want to encrypt etc., so that you don't really care about the bits in themselves but just want to get the bit pattern, then using a struct or casting and copying the data to a buffer is the best way to go.
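One way to find out which arrangement your machine uses is to look at a value whose bit pattern you know. The sketch below assumes an IEEE 754 double and a 32-bit unsigned int, where -1.0 has the sign and exponent bits set and all mantissa bits zero. high_word_index is a made-up name:

```cpp
#include <cstring>

static_assert(sizeof(double) == 2 * sizeof(unsigned int),
              "sketch assumes sizeof(double) == sizeof(int)*2");

// Returns the index (0 or 1) of the array element that holds the sign bit
// and the exponent field on this machine.
int high_word_index() {
    const double probe = -1.0;  // only sign and exponent bits are non-zero
    unsigned int a[2];
    std::memcpy(a, &probe, sizeof a);
    return (a[0] != 0) ? 0 : 1;
}
```

On a typical little-endian PC this returns 1, i.e. a[1] holds the sign, the exponent and the top mantissa bits, and a[0] holds the low 32 mantissa bits.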

might be more able to help if you said what you want to do
but one trick i used once was to get rid of the decimal. i used it to XOR a float, as you know you can't normally do that so what i did was shift the bits over to get rid of the decimal point
so 10.5 was shifted once (variable = variable<<1)
then i put that into a long, xor'd it then shifted back
seemed to work fine
only really easy if you know that the number of decimal places will always be one or two. you can convert it to string and then check the decimal places that way, theres other ways also.
o....k i think you need to get off that HIGH horse you're on salte. i mis-stated what i meant by accident.
i meant 10.5*10 gets rid of the decimal point. as would
10.05*100 or #*(10^N), N being the number of decimal places
if you thought about it it was an obvious mistake.
I'm sure you don't make any mistakes though.
I don't mind when i make a mistake whether it be what i really thought to be right , a typo, or just a mis-statement. And i don't mind being told i was wrong the same as i would correct anyone else, you don't learn without making mistakes. But your reaction to a wrong statement whether meant lightly or not is sort of childish.
Multiplying by a power of 10 only gets you the floating point value to a certain number of decimals, either by truncating or rounding. Either way, that wasn't how I interpreted the original question, which was how one can manipulate the individual bits of a double. Regarding that original question there are two points which I tried to make clear in my posting:

1. You can easily enough store the double in some form of char or int buffer and do bit manipulation on it there.

2. Why would you do that? If you want to interpret the bits you probably want to make some software emulation, so I outlined how you can do that. If you want to do some encryption or other manipulation of the bits, I also explained that. If you want to do something else, I really wonder what that something else is. In most cases it probably wouldn't make sense. For example, if he really wanted to truncate or round the value then he doesn't really want to manipulate the bits but rather work on the value in decimal (base 10) form, and so the multiplication by 10 or 100 or 1000 etc. becomes part of the solution. However, this isn't the individual bits of the double. A double isn't stored in a decimal format.

The only language that uses decimal (base 10) format values that I know of is COBOL, which uses the so-called BCD numbers. BCD numbers are also supported on the Intel platform but are a separate type from double or int or any other type.

Btw, the decimal type in C# - despite its name - isn't decimal in that sense. It is more akin to the currency type of Visual Basic, which is an integer scaled by a power of 10. This way you can represent 10.01 exactly as the value 1001 scaled by -2, i.e. divided by 100. What is good about the decimal type of C# though is that it is 128 bits with 32 bit scaling and 96 bit integer value. That gives very high precision.

No comment has been added lately, so it's time to clean up this TA. I will
leave a recommendation in the Cleanup topic area that this question is:

Answered: Points to Salte

Please leave any comments here within the next seven days.

Experts: Silence means you don't care. Grading recommendations are made in light
of the posted grading guidelines (


-bcl (bcladd)
EE Cleanup Volunteer
