mattososky

How to Determine Normal Distribution?

I'm generating a data set and calculating its standard deviation as I go. I want to know how to determine when my data set hits the normal distribution mark, that is, where 68% of the data set is within the first standard deviation, 95% within the second, and 99% within the third.

Wikipedia defines the normal distribution as having a variance of 1. I don't see the logic in this statement. My data set starts with some number of integers all equal to the mean. It then loops through a function, slowly and randomly expanding the values and recalculating the standard deviation. This goes on until the standard deviation reaches some predefined limit.

An example would be 100 integers. Say the mean is 100, so at the start all the values are set to 100. We then loop with these steps:

1) Randomly select some of the integers for movement outward from the mean, both positive and negative (the values also have a range of 100, so they cannot move beyond 0 or 200).

2) Recalculate the current standard deviation. If the deviation exceeds my limit, the loop breaks.

Graphing the results, I get nice distributions, but they definitely change based on the standard deviation threshold. What I want to know is how to calculate where my standard deviation threshold should be in order to get the "normal distribution" as specified above.

Thanks,
Matt

Markus Fischer

There are several problems here.

The normal distribution Wikipedia refers to is the distribution N(0,1), with mean 0 and SD 1. If SD = 1, the variance (SD squared) is also 1. This is trivial.

You want to generate a fake normal distribution of N(100,a), where a is the predetermined SD. But your method isn't guaranteed to generate a normal distribution. It can be shown that the sum of a large number of small errors will tend to yield a normal distribution. But that method isn't very well suited for a predetermined SD.

Try a google search with this line of keywords:
generate normally distributed random numbers

You will find many different methods to do that.

Cheers!
(°v°)
aburr
"That is where 68% of the dataset is with the first standard deviation, 95 in the second, and 99 in the third."
The above is a necessary condition for a normal curve, BUT it is NOT sufficient.
The curve you found on Wikipedia is a normalized normal curve with a mean of 0 and a standard deviation of 1.
Your procedure could calculate a normal distribution, given enough integers, every time depending on just how you randomly expand the variable. All you change is the standard deviation. The more random error you introduce the larger the standard deviation.
Just what are you trying to do? Create a normal curve? Get a bunch of random numbers normally distributed? Create an interesting computer program?
mattososky

ASKER

Yes, I am creating a simulation program. What I want to do is create the normal distribution as grounds for having certain events and configurations be 'random on a normal distribution curve'. That's it. I have been scouring the internet looking for an easy way to do it and didn't really come up with one. There are plenty of descriptions of how to calculate what the distribution is, but what I want is to fit a distribution to the normal distribution curve. Obviously I will probably want to vary it according to my needs to produce different results. But I was under the impression there was some 'magic' about the normal distribution having to do with its appearance in nature. I want to know when my distribution hits the magic spot.

I know the distribution will probably never be exactly the curve I want, given its random, limited nature, but I just want it to be very close. Like I said, I can graph the results, and depending on what my SD limit is I can get images that look very much like the normal distribution curve. I want to know mathematically how to determine it.

Wouldn't it have something to do with the SD and the number of data points?

No, in fact, it doesn't.

You can calculate the mean, the variance, and the standard deviation for any distribution. A preset value of the standard deviation does not serve as a test for normality. Try a web search with these keywords; Wikipedia gives a whole list of them: http://en.wikipedia.org/wiki/Normality_test

One simple approximation is to use the sum of 12 random numbers [0;1[ minus 6. This follows with sufficient precision a N(0,1) normal distribution.
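
A minimal Python sketch of this sum-of-12 approximation follows. The function name, the seed, and the sample size are only illustrative choices. Each uniform draw has mean 1/2 and variance 1/12, so the sum of 12 has mean 6 and variance 1, which is why subtracting 6 gives something close to N(0,1). Note that the result can never fall outside [-6, 6].

```python
import random
import statistics

def approx_standard_normal(rng):
    """Sum of 12 uniform [0, 1) draws minus 6: mean 0, variance 12 * (1/12) = 1."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(42)  # fixed seed, only so the run is repeatable
sample = [approx_standard_normal(rng) for _ in range(20000)]

print("mean:", round(statistics.fmean(sample), 3))
print("sd:  ", round(statistics.pstdev(sample), 3))
```

To get N(100, a) from this, multiply each value by the SD you want and add 100.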

Closer to your attempt would be to start with numbers all at 100 and add random values to them (any distribution). Once you have added a large number of "errors" to each number, you will approximate a normal distribution. You can then stretch it to conform to the standard deviation you want.
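
The approach just described could be sketched like this in Python. The uniform error in [-1, 1], the 50 steps, and the target SD of 15 are assumptions chosen for illustration, not values from the question. The SD is not used as a stopping rule here; it is set at the end by rescaling.

```python
import random
import statistics

def accumulate_errors(n_values, n_steps, rng, base=100.0):
    """Start every value at `base` and add one small random error per step."""
    values = [base] * n_values
    for _ in range(n_steps):
        values = [v + rng.uniform(-1.0, 1.0) for v in values]
    return values

def stretch(values, target_mean, target_sd):
    """Shift and scale so the sample has exactly the requested mean and SD."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [target_mean + (v - m) * target_sd / s for v in values]

rng = random.Random(1)  # fixed seed, only so the run is repeatable
raw = accumulate_errors(n_values=2000, n_steps=50, rng=rng)
data = stretch(raw, target_mean=100.0, target_sd=15.0)

print(round(statistics.fmean(data), 3), round(statistics.pstdev(data), 3))
```

The shape comes from the number of accumulated errors, while the SD is just a scale factor applied afterwards.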

Good luck!
(°v°)
Let me back up and say it another way.

I just want to produce a data set along a normal distribution. Instead of my current method, couldn't I simply do something like this:

1) Produce a function like y = (x*x).

2) Then, for a given size (let's say 100), I get the y values produced by the function. We'll call this FunctionDataSet.

3) From FunctionDataSet I randomly select a value. Let's call it DistributionValue.

4) From that value I create another number called DistributionValuePercentage. This would be the fraction of the maximum value of FunctionDataSet (we should know this to be 10000 currently). So if step 3 randomly selected index 10 from FunctionDataSet, I would have a DistributionValue of 100 and a DistributionValuePercentage of 0.01.

5) Invert the DistributionValuePercentage. Now we have 0.99. (InvertPercentage)

(Still following?)

6) Now we'll create yet another value called PointVariance. This value will be the InvertPercentage multiplied by the maximum variance for my target data set. We'll say this is 100 (again to keep it simple, hopefully). So with 0.99 of 100 we get 99. (We're using an integer.)

7) We now take the PointVariance value and either add it to or subtract it from (a random decision) the mean of my target data set (again a value of 100). So our final answer for our random selection from FunctionDataSet will be 199 or 1 (100+99 or 100-99).

__________

This whole process should produce final values which are mostly centered around the mean (100), because we are randomly moving away from the mean with values (PointVariance) that should mostly be small (relatively) compared with the possible range.

Graph this and we get some kind of bell curve centered around the mean.

I haven't checked, but I bet that the function y=(x*x) does not produce a normal distribution.

What function would produce a normal distribution in this scenario?

(Sorry I can't make this worth more points.)

matt
harfang said

"Try a google search with this line of keywords:
generate normally distributed random numbers

Have you tried this? do you like the result?"
His suggestion should solve your problem
I'm not finding an answer with a search, at least not in the context of what I am asking.
I guess I do not understand what you are trying to do.
If you want to produce a graph of a normal distribution, there is no need for random numbers or loops. Just use the equation.
If you want to produce a group of random numbers which have a normal distribution with a given mean and standard deviation, the procedure given at

http://www.quantitativeskills.com/sisa/calculations/randhlp.htm

shows an outline of how to do it. (Use a uniform random number generator and put the result into a normal curve equation.)
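
One common reading of "uniform random number into a normal curve equation" is inverse-transform sampling: feed a uniform draw into the inverse of the normal cumulative distribution. The sketch below assumes that reading (the linked page may describe a different variant) and uses `statistics.NormalDist` from the Python standard library. The helper name and the N(100, 15) parameters are illustrative.

```python
import random
import statistics
from statistics import NormalDist

def sample_by_inverse_cdf(dist, n, rng):
    """Draw n values by passing uniform (0, 1) numbers through dist.inv_cdf."""
    out = []
    while len(out) < n:
        u = rng.random()
        if u > 0.0:  # inv_cdf is defined only for 0 < p < 1
            out.append(dist.inv_cdf(u))
    return out

rng = random.Random(5)  # fixed seed, only so the run is repeatable
values = sample_by_inverse_cdf(NormalDist(mu=100.0, sigma=15.0), 10000, rng)

print(round(statistics.fmean(values), 2), round(statistics.pstdev(values), 2))
```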
I may understand your problem a bit better.

Do you want to create the smallest data set of random numbers which will have a given mean and standard deviation and which will have within 1, 2, and 3 sds of the mean the percentage of the data set that a normal distribution would have?
I'm still looking at the link you sent,

What I want to achieve is this: to be able to get a variable-length array of numbers which are centered around the mean and have a variance I choose. I assume that at a particular variance I will have a normal distribution.

One application: say I have a variable number of points which I want to represent angles in a complete circle. If the number of points was 6, the mean of my equation would be 60 degrees (360/6).

So I have a mean of 60 which I want 6 data points centered around, but there would also be a max and min variance. In this particular case, due to the limitations of defining a circle, say the min might be 1 and the max would be 180 (the limit of a triangle's largest angle).

So I want most of my values to be around 60, and to be able to define 'what percent of the time the value will be very large', like 175.

Since I know the range of possibilities, 1-180, I think I should be able to construct my data set (6 ints) so that it falls within a normal distribution, like where only 5% of the time I get an angle of at least 160 or something.

But I specifically want to know how to manipulate the variance to achieve a normal distribution.
ASKER CERTIFIED SOLUTION
Markus Fischer
Sorry, didn't follow your last posts... was busy writing mine! -- (^v°)
"But I specifically want to know how to manipulate the variance to achieve a normal distribution."

You CANNOT achieve a normal distribution by manipulating the variance only.

In your example you CANNOT achieve a normal distribution of angles about a mean with a lower limit of 1 and an upper limit of 175 degrees, because a normal distribution extends to +- infinity, BUT you can approach a normal distribution if the standard deviation is small enough.
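
One way to approach a normal distribution inside fixed limits is to draw from a normal distribution and reject anything outside the limits (a truncated normal). The sketch below assumes that technique; the mean of 60 and limits of 1 and 175 come from the example above, while the SD of 20, the function name, and the retry cap are illustrative assumptions.

```python
import random

def bounded_normal(mean, sd, low, high, rng, max_tries=10000):
    """Draw from N(mean, sd^2), redrawing until the value lies in [low, high]."""
    for _ in range(max_tries):
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x
    raise ValueError("limits are too far from the mean for rejection sampling")

rng = random.Random(11)  # fixed seed, only so the run is repeatable
angles = [bounded_normal(60.0, 20.0, 1.0, 175.0, rng) for _ in range(5000)]

print(round(min(angles), 1), round(max(angles), 1))
```

With a small SD almost nothing is rejected and the result is very close to normal. As the SD grows, more draws are rejected and the shape moves away from the normal curve.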


"I assume that at a particular variance I will have a Normal distribution."

That assumption does not hold. A sample with an assigned variance can be taken from an appropriate triangular distribution. It will have a mean and the desired variance but will not be normal. (But then again, maybe it is not essential that your distribution be normal.)
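
The triangular counterexample can be sketched in Python. A symmetric triangular distribution on [mean - w, mean + w] has variance w²/6, so choosing w = sd·√6 gives the assigned SD. The function names and the N = 20000, mean 100, SD 10 settings are illustrative assumptions.

```python
import math
import random
import statistics

def triangular_with_sd(mean, sd, rng):
    """Symmetric triangular draw whose variance is sd^2 (half-width = sd * sqrt(6))."""
    half_width = sd * math.sqrt(6.0)
    return rng.triangular(mean - half_width, mean + half_width, mean)

def fraction_within_one_sd(values):
    """Share of the sample within one sample SD of the sample mean."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(abs(v - m) <= s for v in values) / len(values)

rng = random.Random(2)  # fixed seed, only so the run is repeatable
tri = [triangular_with_sd(100.0, 10.0, rng) for _ in range(20000)]

print(round(statistics.fmean(tri), 2), round(statistics.pstdev(tri), 2))
print(round(fraction_within_one_sd(tri), 3))
```

The mean and SD match the targets, but about 65% of a triangular sample lies within one SD rather than the 68% of a normal curve, so the variance alone says nothing about normality.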


"What I want to achieve is this. Be able to get out a variable length array of numbers which are centered around the mean and have a variance I choose."

That can be done. (In several different ways, to varying degrees of accuracy.)

Again, I'm still looking at these answers. I do understand that a normal distribution extends to -+ infinity, but in reality it is not important for me to calculate the 1% or 1/2% of the data set on either side. In most cases any real values in those ranges would be less than 1 (in relation to my standard deviation) and could be ignored. I am just trying to get close.
Do you have the equation for a normal curve? (It is time consuming to write here but can be done).
Do you want the numbers in your data set TO BE TAKEN from a normal curve with a given mean and standard deviation or do you want the data set numbers TO HAVE a given mean and standard deviation?
I don't understand your statement. Either way you end up with a dataset with a given mean & standard deviation, right?

"either way you end up with a dataset with a given mean & standard deviation right?"   No

A few numbers taken from a normal curve might have a smaller (or larger) mean and will have a different standard deviation, but this is easy to do.

To get a given mean and standard deviation involves more numbers and the testing of results.

Do you have access to the normal function?
I do not want to type it in if you already have it.
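
For reference, the normal curve being discussed, with mean $\mu$ and standard deviation $\sigma$, is the standard density below. Setting $\mu = 0$ and $\sigma = 1$ gives the N(0,1) curve from Wikipedia.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```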
mattososky,

Actually, it's still not very clear for us. What do you want to do exactly?

* I want to generate 5000 numbers taken from X ~ N(100,4)

This means you want a method to create 5000 values which will have an expected mean of 100, with an expected standard deviation of 2. The actual sample will have a slightly different mean and a slightly different standard deviation each time, of course. If you were to generate smaller samples, the observed statistics would vary even more around the expected ones.

For this, all you really need is a way to create random numbers following any normal distribution. This has been explained above.
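
As a rough illustration of this first case, the sketch below draws from N(100,4) with Python's standard `random.gauss`, reading the 4 as the variance (SD 2) as described above. The helper name, the seed, and the small sample size of 20 are illustrative assumptions.

```python
import math
import random
import statistics

def draw_normal_sample(n, mean, variance, rng):
    """Draw n values from N(mean, variance); gauss() takes the SD, not the variance."""
    sd = math.sqrt(variance)
    return [rng.gauss(mean, sd) for _ in range(n)]

rng = random.Random(7)  # fixed seed, only so the run is repeatable
large = draw_normal_sample(5000, 100.0, 4.0, rng)
small = draw_normal_sample(20, 100.0, 4.0, rng)

for name, sample in (("n=5000", large), ("n=20", small)):
    print(name, round(statistics.fmean(sample), 3), round(statistics.stdev(sample), 3))
```

The observed mean and SD of the small sample will usually sit further from 100 and 2 than those of the large one.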

* I want to explore my own method for generating normally distributed numbers

This would be consistent with what you are repeating. You start with numbers at value 100 and then try different methods to generate an error around the base value 100, in the range [0;200].

In that case, you will want one (probably several) normality tests to decide whether the result is indeed normally distributed. You can consider it two ways:
- you are trying to smuggle in obviously non-normal distributions, in order to purposely break normality tests;
- you actually want to test whether your method creates normally distributed samples.
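
Before reaching for a formal normality test, a rough screening can be sketched in a few lines of Python. This is not one of the formal tests; it only compares the 68/95/99.7 shares and two moment-based shape statistics against what a normal sample would show. The function names and the two example samples are illustrative assumptions.

```python
import random
import statistics

def fractions_within_sd(values):
    """Shares of the sample within 1, 2 and 3 sample SDs of the sample mean."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return tuple(
        sum(abs(v - m) <= k * s for v in values) / len(values)
        for k in (1, 2, 3)
    )

def skew_and_excess_kurtosis(values):
    """Moment-based shape statistics; both are near 0 for a normal sample."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    z = [(v - m) / s for v in values]
    skew = statistics.fmean([x ** 3 for x in z])
    excess_kurtosis = statistics.fmean([x ** 4 for x in z]) - 3.0
    return skew, excess_kurtosis

rng = random.Random(3)  # fixed seed, only so the run is repeatable
normal_sample = [rng.gauss(100.0, 15.0) for _ in range(20000)]
uniform_sample = [rng.uniform(0.0, 200.0) for _ in range(20000)]

for name, sample in (("normal", normal_sample), ("uniform", uniform_sample)):
    print(name, fractions_within_sd(sample), skew_and_excess_kurtosis(sample))
```

A uniform sample fails this screening clearly: only about 58% of it lies within one SD, and its excess kurtosis is near -1.2. Passing the screening does not prove normality, which is why the formal tests exist.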

* This is only a small part of a much larger problem

You might have simplified too much. This would be consistent with "I have a variable number of points which i want to represent angles in a complete circle". Angles have their own problems, especially when the standard deviation of the angle error is in the range of 60° or higher.

The normal distribution may or may not apply to your problem. We cannot decide that with the information provided. Astronomical measurements are often angular, and use the normal distribution extensively, for instance. Models of the flight pattern of flies also involve angles, but these aren't normal at all.

* You just want something pretty to plot

Again, this may or may not involve the normal distribution. If you work with small samples, almost anything goes. You might even enforce a non-normal distribution that *looks* more normal graphically. There are reduction methods for this (use a large sample and generate fake results for a much smaller sample).

* You think you need the normal distribution, but don't know what it is

If that is the case, we can help. You could start by reading a simple description like the one at http://en.wikipedia.org/wiki/Normal_distribution and then ask us what you don't understand about it or how you want to use it.

If you are an Excel user, you can play with NORMDIST() and NORMINV(). If you have any statistical package or mathematical software, look up help on the normal distribution: they all implement it.
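
For readers without Excel, Python's standard `statistics.NormalDist` offers rough counterparts: `cdf` plays the role of a cumulative NORMDIST() and `inv_cdf` the role of NORMINV(). The sketch below uses them to recover the 68/95/99.7 figures from the original question. The helper name `share_within` and the N(100, 15) example are illustrative assumptions.

```python
from statistics import NormalDist

def share_within(k, dist):
    """Probability that a value lies within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

standard = NormalDist(mu=0.0, sigma=1.0)
shares = [round(share_within(k, standard), 4) for k in (1, 2, 3)]
print(shares)  # [0.6827, 0.9545, 0.9973]

# The value below which 97.5% of an N(100, 15^2) population falls.
cutoff = NormalDist(mu=100.0, sigma=15.0).inv_cdf(0.975)
print(round(cutoff, 1))  # 129.4
```

These shares are the same for every mean and SD, which is why no particular SD value makes a data set "hit" the normal distribution.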

Cheers!
(°v°)