• Status: Solved
• Priority: Medium
• Security: Public
• Views: 202

# Calculating Standard Deviation inside Excel.

Hi

I need to calculate the Standard Deviation Values that would represent the SD Values of

+4
+3
+2
+1
0
-1
-2
-3
-4

based on the data that I have in Column J. The formulas needs to be kept inside the Orange Box in Column J itself. I have attached the sample excel sheet with the data.

https://www.experts-exchange.com/questions/28977905/What-is-the-Difference-between-Standard-Deviations-and-z-scores-calculated-within-Excel.html

I am using Excel 2016 Professional Plus x64

Thanks
Standard-Deviation.xlsx
0
Wanu San
• 3
• 2
1 Solution

Author Commented:
I have been able to calculate the Standard Deviation Values for each data set respectively, by following the suggestion given by crystal here -
https://www.experts-exchange.com/questions/28978630/Calculating-Z-SCORE-inside-Excel.html

But I am still stuck at the last part, now how do I get the Standard Deviation Values from "+4 to zero to -4", in the orange box ?

Thanks for any ideas.
0

PrincipalCommented:
The problem isn't very well stated.  So, one has to make some assumptions about what it means.  I believe it's reasonable to state the problem this way:

"Provide the value for the underlying distribution of the data at the number of standard deviations given in column F."

One might (or might not) decide that the data is "zero mean" effectively .. otherwise it is not.
Or, one might (or might not) decide that the underlying distribution of the data is "zero mean" .. it may well be as it's close to zero.  But what's "close" in this case?  It's up to the eye of the beholder.

So, a simplified view might be that one wants to know the values at +/- 1,2,3,4 standard deviations of the underlying distribution assumed or declared to have zero mean.  In this case, the formulas in Column J would be:
J12==F12*STDEV.P(J\$30:J\$141)
etc.

A more precise view might be:
j12=F12*STDEV.P(J\$30:J\$141)+AVERAGE(J\$30:J\$141)
etc.

The latter formulation biases the results according to the offset of the mean value of the data.
It still assumes a symmetric distribution around the mean which is implied by 1,2,3,4 standard deviations.  Is that reasonable in this case?
In that regard, it's interesting to note that there is NO data value that reaches +4 sigma but it might be said that the underlying distribution could reach that value.
After all, there are only 112 values given and the probability of a value occurring at 4 sigma is low.
0

Author Commented:
"Provide the value for the underlying distribution of the data at the number of standard deviations given in column F."
Thank you so much for your comment, Fred Marshall, you have understood the requirement 100 % correctly.

I really do not know if the data is zero mean or not. If we assume that the data is not symmetrically distributed around the mean, then does that means we cannot use Standard Deviations based concepts on this data, in order to separate the middle values from outlier values etc. ? I am just experimenting, I am not sure about all this.

In that regard, it's interesting to note that there is NO data value that reaches +4 sigma but it might be said that the underlying distribution could reach that value.
After all, there are only 112 values given and the probability of a value occurring at 4 sigma is low.

It is perfectly fine, if we do not have any value within this data sample, which would have a +4 sigma. We just need to know what value would have +4 sigma, even if it is not present within the current sample.

J12==F12*STDEV.P(J\$30:J\$141)
etc.
A more precise view might be:
j12=F12*STDEV.P(J\$30:J\$141)+AVERAGE(J\$30:J\$141)
Regarding these 2 different formulas, would you please tell what EXACT DIFFERENCE, does that makes. And which one is preferred over the other, if we are trying to find the middle values and the outlier values from a data sample, by using the concept of standard deviations.

Thanks again for giving it a shot.
0

PrincipalCommented:
I really do not know if the data is zero mean or not.
Well you already have the data in Excel and AVERAGE will give the mean and it comes out around 0.2..
Regarding these 2 different formulas, would you please tell what EXACT DIFFERENCE, does that makes. And which one is preferred over the other, if we are trying to find the middle values and the outlier values from a data sample, by using the concept of standard deviations.
The exact difference is a constant equal to AVERAGE(J\$30:J\$141).  The idea is that there is an UNDERLYING normal distribution around some mean value.  i.e. where does the peak of the normal curve reside?  Since the average of the data is only around 0.2 then one might choose to ignore it OR one might choose to *decide* for a variety of real-world reasons that the process would be zero mean.  So, expressed "in pictures", this added constant value shifts the normal distribution so the peak is located at the mean value of the DATA.
If a difference of 0.2 matters then you might include it.  If not, then you might not include it.  To me it hardly makes any difference but only you will know if it changes the desired outcome for you in any material way.
If we assume that the data is not symmetrically distributed around the mean, then does that means we cannot use Standard Deviations based concepts on this data,
No but it may mean that a normal distribution isn't perhaps the best UNDERLYING MODEL for the DATA.

Philosophically speaking:
You have the DATA.
You conjure up a MODEL that matches the DATA as well as you can or need.  This MODEL represents what I called earlier the UNDERLYING DISTRIBUTION.
It is very common to use a normal distribution as a model because the real world often behaves that way - or at least behaves that way well enough.  And there are good arguments for this.
But you could create all sorts of models using something like (for example) a Weibull distribution with the parameters for that model selected by matching it to the data.
[See: http://www.niar.wichita.edu/coe/NCAMP_Documents/Publications/NCAMP_Techincal_Presentations/Distribution_for_small_sample_sizes_and_fatigue_data.pdf]
But, as I mentioned before, with only 112 values given you are going to be making assumptions when trying to fit a function all the way out to +/-4 sigma.  So using a normal distribution is not only easier but may also be the best choice for matching and predicting reality.
1

Author Commented:
Thank you Fred Marshall
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.