Floating point problem (converting binary to decimal with a fractional part)

Hello experts,
I am trying to convert binary to decimal (both the integer part and the fractional part),
but it is not working as I expect because of floating point issues.
When I do calculations with float, the results are approximated, and this is causing trouble.
Please check; I have attached the code (.cpp file), sample output and comments (.doc file).

Please help. What is the solution?
Sandeep Sood (Programmer) asked:

I recommend reading the input as a string, and then processing that string one character at a time. You will avoid overflows, rounding issues, and all the other problems you are currently experiencing.

Sandeep Sood (Programmer, author) commented:
Thanks infinity08.
Getting the input as a string is a great idea, and it will be much simpler too.

But I would like to learn: is there any solution along the lines of what I am already doing?
Please check.

>> is there any solution in the way as i am already doing it.

Not really. The float type has a limited precision, so you can't accurately read the input with it.
Sandeep Sood (Programmer, author) commented:
OK, thanks for your reply.

But it is really strange, and a big limitation of C / C++ (I mean, if this simple problem cannot be solved),
isn't it?
What do you say?

Your input is in binary, but you are reading it as if it were decimal, and then once it's read, you try to convert it.

You should convert BEFORE putting it in the target type.

Or, put differently : you cannot read binary values using scanf (with %d or %f or similar), and expect meaningful results.
probably you're better off with

      scanf("%d.%d", &bin_int, &bin_real);
instead of
      scanf("%f", &binary);
>> probably you're better off with

There are similar problems with that. Not all binary values can be read like that.

You're better off reading as a string. It's a lot easier, and you won't have to worry about any of these issues.
one additional problem is:

 1 == 01 == 001 == 0001 ...


0.1 != 0.01 != 0.001 ...
Sandeep Sood (Programmer, author) commented:
Thanks all for the replies.
DonConsolio, you wrote:

>> one additional problem is:
>>
>>  1 == 01 == 001 == 0001 ...
>>
>> 0.1 != 0.01 != 0.001 ...

How is this related to my problem? I couldn't understand it.
Please reply.
If you enter 101.0001 and 101.1 and read the part after the point as an integer:

101.0001: the fraction digits 0001 are read as the integer 1
101.1: the fraction digit 1 is also read as the integer 1

but 101.0001 is not equal to 101.1
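
The leading-zero problem can be reproduced in a few lines. This is only a sketch: the helper name split_with_percent_d is made up for illustration, and it uses sscanf on a string instead of scanf on standard input so that it is easy to try.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: splits "int.frac" using the "%d.%d" format
   suggested earlier in the thread. Returns the number of fields read. */
int split_with_percent_d(const char *text, int *int_part, int *frac_part)
{
    assert(text != NULL && int_part != NULL && frac_part != NULL);
    return sscanf(text, "%d.%d", int_part, frac_part);
}
```

Calling it with "101.0001" and with "101.1" stores 1 in frac_part both times, so the information about how many zeros preceded the 1 is gone before any conversion starts.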

Sandeep Sood (Programmer, author) commented:
Thanks for your reply, DonConsolio.
>> 101.0001 and convert to 0001 from binary it would be decimal 1
No, it wouldn't be 1 (in the decimal number system); it would be 1 x 2^(-4), i.e. pow(2, -4).
DonConsolio's point is that the way you're currently doing it has this problem, as well as other problems. The way to get around those problems is to get your input as a string (refer to my first post here).
>>>> But it is really strange and a big limitation of C / C++ (i mean, if this simple problem can be solved)

To add to above comments:

The float type has a precision of about 6 decimal digits. That means you could read a number like


with %f into a float and can expect to get the same number for output when you print it.

But assume you have


This number has 8 decimal digits, and reading it into a float could cause rounding issues at the 3rd position of the fraction, i.e. when you print it you might get


which surely isn't what you expect.

With double you have a precision of about 15 decimal digits, which is enough for the samples above but still isn't an acceptable way to solve the problem.

The limitations of float or double aren't really a "big C/C++ limitation", because 7 or 15 decimal digits is quite reasonable precision for many applications that use floating point numbers. You ran into problems because you read numbers in binary representation as if they were decimal, which requires huge decimal numbers for both the integer part and the fraction part. E.g. a target number of 65.5078125 has the binary representation 1000001.1000001; read as a decimal number, that needs 14 significant digits, which is well beyond float precision.
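
A quick way to see the precision limit is to read a digit string into a float or double and print it back. The sketch below assumes IEEE 754 float and double, as on most current platforms; the helper names survives_float and survives_double are invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads text with %f into a float, prints it back
   with the given number of decimals, and reports (1 or 0) whether the
   printed text is identical to the input. */
int survives_float(const char *text, int decimals)
{
    char out[64];
    float f = 0.0f;

    assert(text != NULL);
    if (sscanf(text, "%f", &f) != 1)
        return 0;
    snprintf(out, sizeof out, "%.*f", decimals, f);
    return strcmp(out, text) == 0;
}

/* Same check, but with %lf and a double. */
int survives_double(const char *text, int decimals)
{
    char out[64];
    double d = 0.0;

    assert(text != NULL);
    if (sscanf(text, "%lf", &d) != 1)
        return 0;
    snprintf(out, sizeof out, "%.*f", decimals, d);
    return strcmp(out, text) == 0;
}
```

Under that assumption, survives_float("101.011", 3) should report 1, while survives_float("1000001.1000001", 7) should report 0, because the nearest float to that value is 1000001.125. survives_double("1000001.1000001", 7) should report 1, but longer binary inputs would exceed double precision as well.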

>>>> You're better off reading as a string.
Even when you read the number as a string, you can still do the conversion from binary to decimal in a way similar to what you do now.

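   // needs <stdio.h> and <string.h>
   char s[128] = { '\0' };
   char * ps1 = s;
   char * ps2;
   size_t i;
   unsigned long sum1;
   if (scanf("%127s", s) != 1) return;  // error: no input (width keeps s from overflowing)
   ps2 = strchr(s, '.');
   if (ps2 == NULL) return;  // error
   *ps2 = '\0';      // set terminating zero after integer part
                           // ps1 points to integer part
   ps2++;    // points now to fraction part

   sum1 = 0;
   // ps1[0] is the most significant bit, so double the running sum
   // for every digit instead of adding (1<<i)
   for (i = 0; ps1[i] != '\0'; ++i)
       if (ps1[i] == '1')
           sum1 = sum1 * 2 + 1;    // shift previous bits left and append a 1
       else if (ps1[i] == '0')
           sum1 = sum1 * 2;        // shift previous bits left and append a 0
       else
            return;   // error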

Note that ps2 cannot be handled the same way, because the integer value of the fraction digits is not what you need to evaluate. Instead, for each '1' in the fraction you need to add 2^-(i+1) == 1/(2^(i+1)), which is a floating point number.

E. g. if you have a fraction of .011, your current loop would "convert" it to

    bin_real = 0.011 * 10 = 0.11
    sum2 = 0 + 0.11 * 2^(-1) = 0.11 * 0.5 = 0.055
    bin_real = 0.11 - 0.11 = 0.   //?????

but actually it is

i == 0:
    sum2 = 0;   // no calculation cause ps2[i] == '0'

i == 1:
    sum2 = 0 + 2^-(1 + 1) = 2^-2 = 1/(2^2) = 0.25
i == 2:
    sum2 = 0.25 + 2^-(2 + 1) = 0.25 + 2^-3 = 0.25 + 1/(2^3) = 0.25 + 0.125 = 0.375

Note that sum2 is a double, and all temporary terms are double as well.
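
Putting the pieces of this thread together, a minimal sketch of the string-based conversion might look like the following. The function name binary_to_double and its return convention are assumptions made for illustration, not code from the question. Instead of calling pow for each bit, it keeps a running weight that starts at 2^-1 and is halved after every fraction digit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts a string of binary digits with an
   optional '.', such as "101.011", to its value.
   Returns 0 on success, -1 on an invalid character, a second '.',
   or an input without any digit. */
int binary_to_double(const char *s, double *out)
{
    double value = 0.0;
    double weight = 0.5;   /* weight of the next fraction bit, starts at 2^-1 */
    int seen_point = 0;
    int digits = 0;

    assert(s != NULL && out != NULL);

    for (; *s != '\0'; ++s) {
        if (*s == '.') {
            if (seen_point)
                return -1;                 /* more than one '.' */
            seen_point = 1;
        } else if (*s == '0' || *s == '1') {
            if (!seen_point) {
                /* integer part: shift left, then add the new bit */
                value = value * 2.0 + (*s - '0');
            } else {
                /* fraction part: add 2^-(i+1) for every '1' */
                if (*s == '1')
                    value += weight;
                weight /= 2.0;
            }
            ++digits;
        } else {
            return -1;                     /* not a binary digit */
        }
    }
    if (digits == 0)
        return -1;
    *out = value;
    return 0;
}
```

For example, binary_to_double("101.011", &v) stores 5.375 in v (5 from the integer part plus 0.25 + 0.125 from the fraction), and "101.0001" and "101.1" give different results, 5.0625 and 5.5. The result is still a double, so inputs with more than about 53 significant bits will be rounded.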

