Link to home
Start Free TrialLog in
Avatar of littlecharva
littlecharva

asked on

Normal Distribution

Hi,

I'm a programmer who's in the process of converting a tool built in Excel into a C# program.  I've managed to port everything except for the NORMDIST() function, which is scaring the hell out of me.

I need to work out the cumulative normal distribution given x, the mean and the standard deviation.  I posed this question in one of the C# question areas and got what I thought was an answer to my question:

https://www.experts-exchange.com/questions/20936306/Normal-Distribution.html

But it turned out not to work, for instance:

x = 2.01490302054226
mean = 3.24088418968677
std dev = 0.427899387582919

Using NORMDIST(x,mean,stddev,TRUE) Excel gives me: 0.00208435088016279
The C# function gives me: 556735840.808829

Sometimes the C# function is spot on, sometimes it's way off like above.

I don't need anyone to write the C# function for me (although it would be nice if you did), I just need it explaining in english how I go about calculating it using basic maths.

If anyone can help, or point me in the right direction?

Cheers,
LittleC
Avatar of ChadClancy
ChadClancy

Look up NORMDIST in the Excel help.  It looks like the formula given in the help is somewhat different from the one computed in your C# function.  Specifically, the value that the exponent is raised to.  According to this formula, the exponent should be raised to:

-pow((x-u),2) / (2*pow(deviation,2))

the remainder of your formula looks fine.
LittleC,

Please disregard my last comment, that is only for the formula where cumulative - FALSE.  I'll see if I can find anything else.
The formula
  Math.Exp(-(Math.Pow((x - mean)/deviation, 2)/2))/Math.Sqrt(2*Math.PI)/deviation;
is perfectly correct for the normal distribution, but it gives the probability mass function.  That is, it behaves like NORMDIST with the 4th parameter hardcoded to FALSE.  We are talking about the bell-shaped curve that is shown in many places when we have to do with random events.

The cumulative function is what you obtain when you integrate the mass function from
-infinity to x.  That is, you calculate the area under the bell-shaped curve that is on the left of x.  This function is not available in analytical form.  Its values can only be obtained numerically...

I don't know how Excel calculates it.  I probably interpolates a table.  Fortunately this function is very well behaved: it starts very flat for far left of the mean (i.e. a few times the deviation; grows smoothly and flattens up again for x far on the right of the mean.  It is symmetrical with respect to the mean, where it reaches the value 0.5 .

It is therefore reasonable to use a table and interpolate it linearly or with a second order polynom.

Let's call t = (x - mean)/deviation and G(t) the cumulative function.  Here are some values:

G(0) = 0.5
G(0.1) = 0.539827896
G(0.2) = 0.579259687
G(0.3) = 0.617911357

If you estimate very crudely G(0.2)est = (G(0.3) + G(0.1))/2 = 0.578869627
you see that the estimate differs from the correct value only by 0.07% That is SEVEN PARTS OUT OF 10,000!

A "curvier" area is around t = 1:
G(0.5) = 0.691462467
G(1.0) = 0.84134474
G(1.5) = 0.933192771
This time the rough interpolation G(1.0)est = (G(0.5) + G(1.5))/2 gives an accuracy of about 3.5%.

How big a table you need to build really depends on the accuracy you need to have.

A linear interpolation (better than the rough one) is done by finding the equation of the straight line between the two points:
  A*t + B = f
You impose that it passes through G(0.5) and G(1.5):
  A*0.5 + B = 0.691462467
  A*1.5 + B = 0.933192771
and resolving for A and B you get:
  f = 0.241730304*t + 0.570597315
valid for 0.5 < t < 1.5

You have a better interpolation if you approximate the curve with a parabola:
  f=A*t^2 + B*t + C
instead of a straight line.  To do this you need three consecutive points (e.g. 0.5, 1.0, and 1.5):
  A*0.25 + B*0.5 + C = 0.691462467
  A*1.00 + B*1.0 + C = 0.933192771
  A*2.25 + B*1.5 + C = 0.691462467

I am not going to find the coefficients here, but it is straightforward to obtain values for A, B, and C.

You should build a grid of function values at intervals D of the variable t and pre-calculate A, B, and C for each triplet of points.  D should be small enough to give you enough accuracy (I would recommend that you use constant Ds instead of trying to optimise the density of points, because you would waste time determining what triplet to use).

The "curviest" part of the cumulative function is where its second derivative is largest, but that is the first derivative of the mass function and can be calculated.  If I didn't make mistakes (!) this should happen at  |t| = 3^(1/4) = 1.316 . Therefore, you should check your accuracy there.

When you estimate a set of A/B/C from a triplet T - D, T, and T + D, use them for
  T - D/2 < t <= T + D/2
to have the best results.

Build the table to cover the interval -5 < t < +5 and use some very coarse approximation outside that.  For t = -5 the cumulative function is about 3*10^ -7.  If you go to +/- 6 you will have 9 significant digits guaranteed and probably can just set the value to 0 outside that.

Only make a table for 1/2 of the curve: -6 < t < 0 and then calculate the other half with the formula f(t) = 1 - f(-t)
Avatar of ozo
#include <stdlib.h>
float NORMDIST(float mean,float stddev){
    static int iset=0;
    static float gset;
    float fac,rsq,v1,v2;
    if( iset==0 ){
        do{
            v1=2.0*rand()/RAND_MAX - 1.0;
            v2=2.0*rand()/RAND_MAX - 1.0;
            rsq = v1*v1+v2*v2;
       }while( rsq >= 1.0 || rsq == 0.0 );
        fac = sqrt(-2.0*log(rsq)/rsq);
        gset=v1*fac;
        iset = 0;
        return mean+stddev*v2*fac;
    }else{
        iset=0;
        return mean+stddev*gset;
    }
}
ASKER CERTIFIED SOLUTION
Avatar of ozo
ozo
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of littlecharva

ASKER

Cheers ozo, you get the points for making it the easiest for me, but thanks to everyone else for their assistance.

LittleC
You say you have a working formula that, when gives you back values between 0-1 matches the values in Excel, but it sometimes gives you values out of the range [0,1]... ok, first, the values have to be between 0-1, because what you are looking for is that...

Probabiliy of getting a value less than X in an experiment witch result are distribuited as a gaussian bell centered in "mean" value (the most usual experiment result) and of a "width" (the bell) of "stddev" (how far result use to be from the "mena" result).

...and this probabilities are always coded from 0 (imposible) to 1 (for sure).

So, if your formula is working great for some numbers "near" the center of the bell (the "mean") but worst for other values not so near, then you have to increase the number of iterartions from 16 to 32 or more (it will not be noticed), then give it another try on that numbers. If it worked for some number, should do for others.

If you don't want to play with 16 or 32, you could also make if undefined: Do all iterations needen UNTIL result is between 0,1 (inclusive) and difference with last iteration result is less than 0.0000001, this approach will give you a more reliable formula, but it is a work that can be replaced be using 999 iterations instead of 16!!!

Ofcurse it may not work, but it is very easy to try and I would bet for it!
Opps! I didn't read ozo before writting... great job, ozo!!
Funny. After all the effort of reasoning about interpolating tables, I just found a function which is very similar to ozo's.  You will find the C code in
  http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html

It only has 5th powers of t. It will teherfore require less computing.  You might like to have a look at it.