• Status: Solved
• Priority: Medium
• Security: Public
• Views: 2982

Finding the lowest and upper point of expected variation

How do I find the lowest point of expected variation and the upper point for these numbers?
7.51
7.57
7.55
7.53
7.53
7.56
7.52
7.58
7.55
7.53
7.56
7.58
7.55
7.55
7.54
7.57
7.54
7.55
7.55
7.56
7.54
7.56
7.56
7.55
7.54
7.55
7.56
7.57
7.54
7.55
7.54
7.55
7.53
7.54
7.55
Jamie33
1 Solution

Commented:
What's wrong with just using the max and min, which are 7.58 and 7.51 respectively?

Author Commented:
Because it isn't that simple.

Commented:
What do the numbers mean, and how are they generated?

Commented:
How is it more complicated?  What is going on that you haven't told us?
How accurate do you have to be?

Commented:
The sample you posted has 35 numbers between 7.51 and 7.58.
How many more numbers will you be looking at?
What are the consequences of getting the max or min value wrong?

You could assume that this sample is generated by a normal process with some mean and standard deviation.

If you make that assumption, then you can find the probability of getting at least one value greater than or less than some limit in a particular number of tests.

Author Commented:
Mean for WIDTH:                  7.550
Standard deviation for WIDTH:    0.030
What is the lowest point of expected variation?
What is the upper point of expected variation?

What is expected variation?

We know that 68% of the data from a normal process are expected to fall within + or - 1 sigma (standard deviations) from the mean.

We know that 95% of the data from a normal process are expected to fall within + or - 2 sigma (standard deviations) from the mean.

We know that 99.7% of data from a normal process are expected to fall within + or - 3 sigma (standard deviations) from the mean.

So, the expected variation that we would likely see in any normally distributed process is between + and - 3 sigma (standard deviations) of the mean.
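That rule can be written out as a short script (a minimal sketch, using the rounded mean and standard deviation quoted above):

```python
# Expected variation as mean +/- 3 sigma, using the figures
# quoted in this thread (mean 7.550, SD 0.030 for WIDTH).
mean = 7.550
sd = 0.030

lowest = mean - 3 * sd   # lower point of expected variation
upper = mean + 3 * sd    # upper point of expected variation

print(f"Lowest: {lowest:.3f}")   # 7.460
print(f"Upper:  {upper:.3f}")    # 7.640
```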

Commented:
So do you just want 7.550 +/- 3*0.030 ?

Author Commented:
Yes

Commented:
Do you need a formula for Standard Deviation?
http://mathworld.wolfram.com/StandardDeviation.html

Commented:
That is all correct.

The expected variation depends on the mean and standard deviation of the process and the number of measurements in the sample.

But you have found the characteristics of the sample, not the process:
Mean                7.550
Std Dev             0.030

So you might want to say that 95% of the samples fall between 7.490 and 7.610.
And they certainly do.

But shouldn't 5% of the samples fall outside of this range?  It isn't happening.
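That check is easy to run on the posted values (a sketch; the band is computed from the quoted mean 7.550 and SD 0.030):

```python
# The 35 posted WIDTH values.
data = [7.51, 7.57, 7.55, 7.53, 7.53, 7.56, 7.52, 7.58, 7.55, 7.53,
        7.56, 7.58, 7.55, 7.55, 7.54, 7.57, 7.54, 7.55, 7.55, 7.56,
        7.54, 7.56, 7.56, 7.55, 7.54, 7.55, 7.56, 7.57, 7.54, 7.55,
        7.54, 7.55, 7.53, 7.54, 7.55]

mean, sd = 7.550, 0.030
low, high = mean - 2 * sd, mean + 2 * sd

outside = [x for x in data if x < low or x > high]
print(f"2 sigma band: {low:.3f} to {high:.3f}")
print(f"Values outside: {len(outside)} of {len(data)}")
print(f"Expected outside if normal: {0.05 * len(data):.2f}")
```

None of the 35 values falls outside the band, where roughly 1.75 would be expected.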

Author Commented:
No, thanks a lot.

Commented:
Are you familiar with the difference between the mean and SD for a sample and for a population?

http://www.isixsigma.com/tools-templates/sampling-data/basic-sampling-strategies-sample-vs-population-data/

Commented:
>>  But shouldn't 5% of the samples fall outside of this range?  It isn't happening.
So it seems highly likely that your process is not normally distributed.

Author Commented:
So I have both the Mean and the StDev. How did you come up with the 95%?

Commented:
>>  We know that 99.7% of data from a normal process are expected to fall within + or - 3 sigma (standard deviations) from the mean.
But if we don't know whether a set of numbers was generated from a normal process, this may not be relevant.

Author Commented:
I believe the following:

Lowest = StDev times -3.00 minus the Mean
Upper = StDev times 3.00 minus the Mean.

Is my formula wrong?

Commented:
>>  How did you come up with the 95%?

I was just trying out your ±2 sigma rule.

If all you have is the data, then that's all you have.
There is really no way to make any sort of reliable prediction.

If you are willing or able to make some assumptions, then you may be able to do more.

Do you have reason to believe your data are randomly selected from a much larger normal distribution?

Is there a specific question you would like to answer or a prediction you would like to make?

Commented:
There is no "lowest" or "upper".
If your process is normally distributed, then well over 99% should be within 3 standard deviations of the mean (as you suggest), but there could be a number that comes in 50 standard deviations above the mean; it's just very unlikely (unless your dataset is incredibly huge).

So maybe the answer is "There is no upper and lower".

Commented:
You can get upper and lower bounds for a confidence interval (95%, 98%, etc) but not for the whole thing. Of course that depends on what your distribution really is. I don't think any real life data is really perfectly normally distributed, just close enough that the numbers work out okay.

Commented:
The 3 sigma rule would be:
Lowest  =  Mean - 3*SD
Highest =  Mean + 3*SD

99.7% of the normal population should fall within these bounds.

If you take one measurement, there is a 99.7% chance it will be within the bounds.

So if you take 1000 measurements,  997 should be inside and 3 should be outside.

But your sample seems to have failed the 2 sigma test.  There isn't enough scatter in the data.
So either you don't have a normal distribution or you don't have a random sample.
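The "inside versus outside" arithmetic above can be sketched like this (a minimal example using the rounded 99.7% figure, i.e. a 0.3% chance of a single value landing outside the 3 sigma band):

```python
p_out = 0.003   # chance one normal value falls outside mean +/- 3*SD

# Expected number outside in 1000 measurements:
print(1000 * p_out)   # 3.0

# Chance that at least one of n measurements falls outside:
n = 35
p_at_least_one = 1 - (1 - p_out) ** n
print(f"P(at least one of {n} outside) = {p_at_least_one:.3f}")
```

So even with only 35 measurements there is roughly a 10% chance of seeing at least one value beyond the 3 sigma bounds.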

Commented:
Is this an academic exercise or a real world problem?
Are you actually measuring something?  If so, what?
What do the data mean and what do you hope to accomplish by setting boundaries?

How do you know that my first post isn't good enough?

>>  What's wrong with just using the max and min, which are 7.58 and 7.51 respectively.

Or going 0.01 beyond the min and max, and using the range from 7.50 to 7.59

Commented:
This is the Excel formula for a Normal Dist
=NORMINV(RAND(), 7.545, 0.03)

If you use that to generate a 35 element sample, you will see that it usually has more scatter than your posted sample.

7.51       7.53       7.53       7.49       7.56       7.50
7.59       7.55       7.57       7.51       7.53       7.57
7.48       7.56       7.48       7.56       7.53       7.53
7.53       7.51       7.49       7.57       7.52       7.52
7.55       7.55       7.56       7.52       7.51       7.53
7.57       7.59       7.57       7.57       7.55       7.57
7.53       7.49       7.55       7.53       7.54       7.55
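A rough Python equivalent of that Excel experiment (a sketch using only the standard library; `random.gauss` plays the role of NORMINV(RAND(), ...), and the seed is fixed only to make the run repeatable):

```python
import random

random.seed(1)  # fixed seed for a repeatable run

# Draw a 35-element sample from a normal process with the
# same mean and SD used in the Excel formula above.
sample = [random.gauss(7.545, 0.03) for _ in range(35)]

print(f"min = {min(sample):.2f}, max = {max(sample):.2f}")
print("range =", round(max(sample) - min(sample), 3))
```

Most runs give a spread noticeably wider than the 0.07 range (7.51 to 7.58) in the posted data.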

Commented:
But what does it all mean?

Maybe you are supposed to assume that your sample is representative of a normal distribution.
Then your 3*sigma rule would be correct at the 0.3% level.

Maybe you are supposed to notice that you don't have a normal distribution.

Maybe the instructor tried to generate a normal dist by hand and did a bad job.

Maybe there is a problem due to the small sample size (35) and the poor resolution (8 bins) of the data.

Maybe it is the sample versus population issue I mentioned earlier.

Commented:
As in any real world situation, you do not have a normal distribution. It may be close enough, though. You can use a bootstrap program (see iTunes store) to use your data to get a very close approximation to the mean and SD of an equivalent normal distribution.

Commented:
Hi there Jamie33,

Although you haven't exactly used the usual language of statistics
>>  lowest point of expected variation and the upper point for these numbers
I think I understand what you mean.

You want to know the lowest value (A) and highest value (B) that all future values will lie between, when the data are taken from the same source as the example numbers. With the extra condition that you want the largest possible value for A, and the smallest possible value for B.

Now the usual statistical case is that if you want to make inferences about population parameters (in this case a min and max value), then the sample you draw from that population to develop a statistic must be unbiased.  One way of getting an unbiased sample is random sampling.  If the sample could be biased in unknown ways, then any statistical analysis will be invalid!

That said, if you can provide an assurance that the population was normally distributed, then the answers above are the best you will be able to obtain. Note that with n = 35 you should be using the appropriate value from the t-tables with 34 degrees of freedom rather than the value from the normal tables.

Aside

- the problem here is that you are not using the true values from the distribution - those values you don't know - so you make do with estimating them from the sample. However, in estimating the standard deviation from the sample you are using an already estimated value for the mean, which means that unless adjusted by using a t value (and not a normal value) the resulting interval will be too small.
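As a sketch of that adjustment on the posted values: the two-sided 95% t value for 34 degrees of freedom is about 2.03 (an approximate table value, assumed here), versus 1.96 from the normal tables.

```python
import statistics

# The 35 posted WIDTH values.
data = [7.51, 7.57, 7.55, 7.53, 7.53, 7.56, 7.52, 7.58, 7.55, 7.53,
        7.56, 7.58, 7.55, 7.55, 7.54, 7.57, 7.54, 7.55, 7.55, 7.56,
        7.54, 7.56, 7.56, 7.55, 7.54, 7.55, 7.56, 7.57, 7.54, 7.55,
        7.54, 7.55, 7.53, 7.54, 7.55]

m = statistics.mean(data)
s = statistics.stdev(data)   # n-1 in the denominator

t = 2.032   # approx. two-sided 95% t value, 34 degrees of freedom (table value)
z = 1.960   # the corresponding value from the normal tables

print(f"normal interval: {m - z*s:.3f} to {m + z*s:.3f}")
print(f"t interval:      {m - t*s:.3f} to {m + t*s:.3f}")  # slightly wider
```

Incidentally, `statistics.stdev` on the posted values comes out near 0.016, noticeably smaller than the 0.030 quoted earlier in the thread, which is consistent with the comments above about the sample not having enough scatter.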

However there are other things we should consider.

1.

First, it could well be true that the data came from a triangle distribution. In that case you could estimate the required parameters using the maximum likelihood estimation technique. For the normal distribution, the uniform distribution, or the triangle distribution with A and B fixed (where you want to find the position of the mode, where it changes direction), it is possible to work out a formula in terms of values in the sample.  You just plug those values in and, as they say in the movies, Bob's your uncle.  However, to find the 3 important parameters of a triangle distribution (A, B, and C) in the general case there is no formula that I know of.  I strongly suspect it is provable that in general a unique analytical solution does not exist. You will need to use numerical mathematics to do some sort of hill climbing technique on the likelihood function to get the best estimates for A and B.  Alternatively, using the OpenBUGS program you could develop credible intervals for both A and B using the evidence (the individual values) from your sample.
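A minimal sketch of that hill-climbing idea, here as a coarse grid search over (A, B, C) on the triangle log-likelihood using the posted values (all names are illustrative, not from any particular package; a real fit would use a proper optimiser):

```python
import math

# The 35 posted WIDTH values.
data = [7.51, 7.57, 7.55, 7.53, 7.53, 7.56, 7.52, 7.58, 7.55, 7.53,
        7.56, 7.58, 7.55, 7.55, 7.54, 7.57, 7.54, 7.55, 7.55, 7.56,
        7.54, 7.56, 7.56, 7.55, 7.54, 7.55, 7.56, 7.57, 7.54, 7.55,
        7.54, 7.55, 7.53, 7.54, 7.55]

def triangle_loglik(a, b, c, xs):
    """Log-likelihood of a triangle distribution with support [a, b], mode c."""
    total = 0.0
    for x in xs:
        if a < x <= c:
            f = 2 * (x - a) / ((b - a) * (c - a))
        elif c < x < b:
            f = 2 * (b - x) / ((b - a) * (b - c))
        else:
            return -math.inf   # a value outside the support kills the fit
        total += math.log(f)
    return total

lo, hi = min(data), max(data)
best = (-math.inf, None)
# Search A a little below the sample min, B a little above the sample max.
for i in range(1, 40):
    a = lo - i * 0.002
    for j in range(1, 40):
        b = hi + j * 0.002
        for k in range(1, 20):
            c = a + k * (b - a) / 20
            ll = triangle_loglik(a, b, c, data)
            if ll > best[0]:
                best = (ll, (a, b, c))

ll, (a, b, c) = best
print(f"A ~ {a:.3f}, B ~ {b:.3f}, mode C ~ {c:.3f}")
```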

However if it is not critical then there are other easier ways to get estimates. They will not however be on a solid foundation.

2.

There is yet another attack. For any distribution we have Chebyshev's inequality, which says:
at least 1 - 1/(K * K) of the values must fall within K standard deviations of the mean
(K > 1).
See this discussion of Chebyshev's inequality.
In your sample data set (and in fact with any data set whatsoever) this applies. It calculates a wide enough margin that it is always TRUE.  If you want to transfer that knowledge to the unknown distribution that your data came from, you will be using estimates (the sample mean for the population mean, and the sample standard deviation - based on n-1 - for the population standard deviation), so there is a (probably very small) chance of making an assertion that is incorrect.
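The inequality is easy to check against the posted sample (a sketch, using the sample mean and the n-1 standard deviation as estimates):

```python
import statistics

# The 35 posted WIDTH values.
data = [7.51, 7.57, 7.55, 7.53, 7.53, 7.56, 7.52, 7.58, 7.55, 7.53,
        7.56, 7.58, 7.55, 7.55, 7.54, 7.57, 7.54, 7.55, 7.55, 7.56,
        7.54, 7.56, 7.56, 7.55, 7.54, 7.55, 7.56, 7.57, 7.54, 7.55,
        7.54, 7.55, 7.53, 7.54, 7.55]

m = statistics.mean(data)
s = statistics.stdev(data)   # n-1 based

for K in (1.5, 2, 3):
    bound = 1 - 1 / (K * K)   # Chebyshev: at least this fraction within K sd
    within = sum(1 for x in data if abs(x - m) <= K * s) / len(data)
    print(f"K={K}: Chebyshev guarantees >= {bound:.3f}, observed {within:.3f}")
```

As expected, the observed fraction within K standard deviations always meets or beats the Chebyshev bound.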

=====

Questions.

Do you know that the population the data comes from is normally distributed?
Or triangle distributed?
Or uniformly distributed - (I very much doubt this one)?
Do you know if the sampling technique for getting sample values is unbiased?

Ian

Commented:
>>  You want to know what are the lowest value (A) and highest value (B) that all future values will lie between.
Again, no such numbers exist. You say it is an academic problem, so did they give you a confidence interval to use? What wording does the problem use?

Commented:
Hi there Tommy,

Don't be too hasty.  If the process is truly normally distributed then arbitrarily large and small numbers are possible. However, this cannot be the case here, as the values are measurements of width, which cannot go negative.  Additionally, if they are from an industrial process then there is going to be some (maybe quite large) upper bound on the width so that the piece will actually fit in the machinery or whatever is producing the items. While such values would be absolute bounds, they would be next to useless - a zero minimum value being more than 250*sd from the mean.

Industrial processes can be influenced by many small perturbations which, if independent, will (by the central limit theorem) produce something approximately normal. However, it is in the tails where any approximate fit to a normal could start to break down, especially if influenced by absolute mins and maxes due to physical considerations.

It all boils down to a practical problem of knowing -

1.

How they want to specify the limits - eg no more than 1% chance of a new item being outside the limits, or a real absolute never-go-past limit

2.

The physical aspects of the process generating the items - for any absolute limits and the squeeze that will put on the distribution tails

3.

How the process behaves in the case of unusually large or small items - eg do they have manual overrides to prevent such values, thus generating a truncated distribution...

My feeling is that this is a poorly posed academic question, however Jamie33 may be able to provide more useful information.

Ian

Commented:
Hi there Modulus_Twelve,

I would reject (A). The asker has been forced to consider issues and hence has benefited from the combined help of the experts.

I would reject (B) as so far it doesn't appear that the asker has a value for these lowest and highest points as requested.

Hence (C) is my recommendation, however I don't believe that the asker has so far "found" an answer, or (probably more importantly in view of http:#a39189032, if it's academic) a method to produce an answer.  Maybe an option (D): continue to keep the Q open.

The question was poorly worded, but for someone naive about the subject area this is understandable. Hence the asker was asked many times (http:#a39188659 , http:#a39188808 , http:#a39188861 and http:#a39190031 ) about this.  Unfortunately there was no forthcoming answer.  Do you want to determine the minimum and maximum of all items coming from the process that this dataset was sampled from?  And is that process normally distributed?

There are many curly aspects to questions in this area (like given a probability of all data points in the next sample being within the bounds, then those bounds will depend on the size of that future sample: the smaller the sample the closer those bounds are together).

However so far, the asker has been given more than enough to consider in either nailing down the exact problem or calculating an answer.
For example  http:#a39188610 , http:#a39188717 , http:#a39188893 , http:#a39188910 , http:#a39190031

If the Q is homework then http:#a39189169 gives a good list of things we need to consider.

Ian