Solved

Normal Distribution

Posted on 2004-04-01
9
1,624 Views
Last Modified: 2012-08-13
Hi,

I'm a programmer who's in the process of converting a tool built in Excel into a C# program.  I've managed to port everything except for the NORMDIST() function, which is scaring the hell out of me.

I need to work out the cumulative normal distribution given x, the mean and the standard deviation.  I posed this question in one of the C# question areas and got what I thought was an answer to my question:

http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_20936306.html

But it turned out not to work, for instance:

x = 2.01490302054226
mean = 3.24088418968677
std dev = 0.427899387582919

Using NORMDIST(x,mean,stddev,TRUE) Excel gives me: 0.00208435088016279
The C# function gives me: 556735840.808829

Sometimes the C# function is spot on, sometimes it's way off like above.

I don't need anyone to write the C# function for me (although it would be nice if you did), I just need it explaining in english how I go about calculating it using basic maths.

If anyone can help, or point me in the right direction?

Cheers,
LittleC
0
Comment
Question by:littlecharva
  • 2
  • 2
  • 2
  • +2
9 Comments
 
LVL 5

Expert Comment

by:ChadClancy
ID: 10731809
Look up NORMDIST in the Excel help.  It looks like the formula given in the help is somewhat different from the one computed in your C# function.  Specifically, the value that the exponent is raised to.  According to this formula, the exponent should be raised to:

-pow((x-u),2) / (2*pow(deviation,2))

the remainder of your formula looks fine.
0
 
LVL 5

Expert Comment

by:ChadClancy
ID: 10731912
LittleC,

Please disregard my last comment, that is only for the formula where cumulative - FALSE.  I'll see if I can find anything else.
0
 
LVL 5

Expert Comment

by:PointyEars
ID: 10732864
The formula
  Math.Exp(-(Math.Pow((x - mean)/deviation, 2)/2))/Math.Sqrt(2*Math.PI)/deviation;
is perfectly correct for the normal distribution, but it gives the probability mass function.  That is, it behaves like NORMDIST with the 4th parameter hardcoded to FALSE.  We are talking about the bell-shaped curve that is shown in many places when we have to do with random events.

The cumulative function is what you obtain when you integrate the mass function from
-infinity to x.  That is, you calculate the area under the bell-shaped curve that is on the left of x.  This function is not available in analytical form.  Its values can only be obtained numerically...

I don't know how Excel calculates it.  I probably interpolates a table.  Fortunately this function is very well behaved: it starts very flat for far left of the mean (i.e. a few times the deviation; grows smoothly and flattens up again for x far on the right of the mean.  It is symmetrical with respect to the mean, where it reaches the value 0.5 .

It is therefore reasonable to use a table and interpolate it linearly or with a second order polynom.

Let's call t = (x - mean)/deviation and G(t) the cumulative function.  Here are some values:

G(0) = 0.5
G(0.1) = 0.539827896
G(0.2) = 0.579259687
G(0.3) = 0.617911357

If you estimate very crudely G(0.2)est = (G(0.3) + G(0.1))/2 = 0.578869627
you see that the estimate differs from the correct value only by 0.07% That is SEVEN PARTS OUT OF 10,000!

A "curvier" area is around t = 1:
G(0.5) = 0.691462467
G(1.0) = 0.84134474
G(1.5) = 0.933192771
This time the rough interpolation G(1.0)est = (G(0.5) + G(1.5))/2 gives an accuracy of about 3.5%.

How big a table you need to build really depends on the accuracy you need to have.

A linear interpolation (better than the rough one) is done by finding the equation of the straight line between the two points:
  A*t + B = f
You impose that it passes through G(0.5) and G(1.5):
  A*0.5 + B = 0.691462467
  A*1.5 + B = 0.933192771
and resolving for A and B you get:
  f = 0.241730304*t + 0.570597315
valid for 0.5 < t < 1.5

You have a better interpolation if you approximate the curve with a parabola:
  f=A*t^2 + B*t + C
instead of a straight line.  To do this you need three consecutive points (e.g. 0.5, 1.0, and 1.5):
  A*0.25 + B*0.5 + C = 0.691462467
  A*1.00 + B*1.0 + C = 0.933192771
  A*2.25 + B*1.5 + C = 0.691462467

I am not going to find the coefficients here, but it is straightforward to obtain values for A, B, and C.

You should build a grid of function values at intervals D of the variable t and pre-calculate A, B, and C for each triplet of points.  D should be small enough to give you enough accuracy (I would recommend that you use constant Ds instead of trying to optimise the density of points, because you would waste time determining what triplet to use).

The "curviest" part of the cumulative function is where its second derivative is largest, but that is the first derivative of the mass function and can be calculated.  If I didn't make mistakes (!) this should happen at  |t| = 3^(1/4) = 1.316 . Therefore, you should check your accuracy there.

When you estimate a set of A/B/C from a triplet T - D, T, and T + D, use them for
  T - D/2 < t <= T + D/2
to have the best results.

Build the table to cover the interval -5 < t < +5 and use some very coarse approximation outside that.  For t = -5 the cumulative function is about 3*10^ -7.  If you go to +/- 6 you will have 9 significant digits guaranteed and probably can just set the value to 0 outside that.

Only make a table for 1/2 of the curve: -6 < t < 0 and then calculate the other half with the formula f(t) = 1 - f(-t)
0
 
LVL 84

Expert Comment

by:ozo
ID: 10733162
#include <stdlib.h>
float NORMDIST(float mean,float stddev){
    static int iset=0;
    static float gset;
    float fac,rsq,v1,v2;
    if( iset==0 ){
        do{
            v1=2.0*rand()/RAND_MAX - 1.0;
            v2=2.0*rand()/RAND_MAX - 1.0;
            rsq = v1*v1+v2*v2;
       }while( rsq >= 1.0 || rsq == 0.0 );
        fac = sqrt(-2.0*log(rsq)/rsq);
        gset=v1*fac;
        iset = 0;
        return mean+stddev*v2*fac;
    }else{
        iset=0;
        return mean+stddev*gset;
    }
}
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 84

Accepted Solution

by:
ozo earned 250 total points
ID: 10737443
Sorry, I misunderstood. I thought NORMDIST was supposed to return a normally distributed random variable

This version should give the integral of the normal distribution, which is what I think you wanted

#include <math.h>
float NORMDIST(float x,float mean,float stddev,int cumulative){
    x = (x-mean)/stddev;
    if( cumulative ){
        float t,z,ans;
        z = fabs(x)/sqrt(2.0);
        t = 1.0/(1.0+0.5*z);
        ans = t*exp(-z*z-1.26551223+t*(1.00002368+t*(0.37409196+t*(0.09678418+
              t*(-0.18628806+t*(0.27886807+t*(-1.13520398+t*(1.48851587+
              t*(-0.82215223+t*0.17087277)))))))))/2.0;
        return x <= 0.0 ? ans : 1.0-ans;
    }else{
        return exp(-x*x/2.0)/sqrt(2.0*3.14159265358979);
    }
}
0
 

Author Comment

by:littlecharva
ID: 10739229
Cheers ozo, you get the points for making it the easiest for me, but thanks to everyone else for their assistance.

LittleC
0
 
LVL 6

Expert Comment

by:Sergio_Hdez
ID: 10739244
You say you have a working formula that, when gives you back values between 0-1 matches the values in Excel, but it sometimes gives you values out of the range [0,1]... ok, first, the values have to be between 0-1, because what you are looking for is that...

Probabiliy of getting a value less than X in an experiment witch result are distribuited as a gaussian bell centered in "mean" value (the most usual experiment result) and of a "width" (the bell) of "stddev" (how far result use to be from the "mena" result).

...and this probabilities are always coded from 0 (imposible) to 1 (for sure).

So, if your formula is working great for some numbers "near" the center of the bell (the "mean") but worst for other values not so near, then you have to increase the number of iterartions from 16 to 32 or more (it will not be noticed), then give it another try on that numbers. If it worked for some number, should do for others.

If you don't want to play with 16 or 32, you could also make if undefined: Do all iterations needen UNTIL result is between 0,1 (inclusive) and difference with last iteration result is less than 0.0000001, this approach will give you a more reliable formula, but it is a work that can be replaced be using 999 iterations instead of 16!!!

Ofcurse it may not work, but it is very easy to try and I would bet for it!
0
 
LVL 6

Expert Comment

by:Sergio_Hdez
ID: 10739249
Opps! I didn't read ozo before writting... great job, ozo!!
0
 
LVL 5

Expert Comment

by:PointyEars
ID: 10739297
Funny. After all the effort of reasoning about interpolating tables, I just found a function which is very similar to ozo's.  You will find the C code in
  http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html

It only has 5th powers of t. It will teherfore require less computing.  You might like to have a look at it.
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

How to Win a Jar of Candy Corn: A Scientific Approach! I love mathematics. If you love mathematics also, you may enjoy this tip on how to use math to win your own jar of candy corn and to impress your friends. As I said, I love math, but I gu…
Article by: Nicole
This is a research brief on the potential colonization of humans on Mars.
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…
In this tutorial you'll learn about bandwidth monitoring with flows and packet sniffing with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're interested in additional methods for monitoring bandwidt…

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now