# Standard Deviation question

Sometimes I browse Wikipedia to keep my education current and refreshed. I had a question that I hoped someone would be able to answer fairly easily, but it may require a detailed understanding of math and statistics.

When calculating the standard deviation, we find the average of the values' squared differences from the mean, then take the square root.

It would seem more obvious to me that we should average the differences from the mean, and that would be more representative of something.

In other words, why did we square the differences at all?
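To make the question concrete, here are the two computations in a short Python sketch (the numbers are an arbitrary made-up sample):

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # example data
mean = values.mean()

# Standard deviation: average the squared differences, then take the square root.
sd = np.sqrt(np.mean((values - mean) ** 2))

# Simply averaging the (signed) differences always gives 0, because
# deviations above and below the mean cancel out exactly.
avg_diff = np.mean(values - mean)

# Averaging the *absolute* differences does give a meaningful spread measure.
avg_abs_diff = np.mean(np.abs(values - mean))
```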

What mathematical principle makes this necessary and/or better?

There are a lot of similar constructs in math, such as fitting a curve to a set of data points and using the least squares algorithm to find how well the curve fits.

Once again, why use the square at all?

Meaningful, cogent explanations will get more points. A GENERAL description of this principle and why it works better, or at all, is what I'm looking for.

Thanks for any answers!

-Jeffrey Blayney
SOLUTION
phoffric


ASKER

>>For example, taking partial derivatives of functions having absolute value terms can get messy.

So you believe it is mostly for convenience of calculation?

I could see that.
phoffric

Convenience of analysis is important. Using absolute values is actually more convenient in terms of calculation.

I found some supportive comments from:
http://mathforum.org/library/drmath/view/52722.html

> The reason that squared values are used is so that the algebra is easier. For example, the variance (second central moment) is equal to the expected value of the square of the distribution (second non-central moment) minus the square of the mean of the distribution. This would not be true, in general, if the absolute value definition were used.
>
> This is not to say that the absolute value definition is without merit. It is quite reasonable for use as a measure of the spread of the distribution. In fact, I have heard of someone who used it in teaching a course in statistics. (I think he used it because he thought it was a more 'natural' way to measure the spread.)
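The identity from that quote is easy to check numerically. A small Python sketch (the data and seed are arbitrary choices, not from the thread):

```python
import numpy as np

# Arbitrary sample data for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Variance as the mean squared deviation from the mean (second central moment).
var_direct = np.mean((x - x.mean()) ** 2)

# The algebraic shortcut from the quote: E[x^2] - mean^2.
var_shortcut = np.mean(x ** 2) - x.mean() ** 2

# The two agree to floating-point precision; no comparably simple
# decomposition exists for the mean absolute deviation.
print(var_direct, var_shortcut)
```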

When choosing an analysis tool, it is important to understand the data. You mentioned "fitting a curve". If there are a few significant outlier points that can mess up the trend, sometimes using median and absolute values may be a better approach. I am not well versed in median deviation analysis, but I'll give you some links if you wish to pursue this topic in more depth:

Absolute deviation
http://en.wikipedia.org/wiki/Absolute_deviation

See "Why might we use the mean deviation?" under "The advantages of the mean deviation":
http://www.leeds.ac.uk/educol/documents/00003759.htm
The other thing is that squares accentuate the differences, so if there is more deviation in the deviation (if you know what I mean), the standard deviation will be higher. For example, suppose one data set has everything fairly close to the mean, while another has most values exactly at the mean but a few far away. If you averaged their distances from the mean, the two sets might come out the same, but the set with a few far-away items would have a bigger deviation when calculated with squares, because squaring big numbers makes them that much bigger.

Same sort of thing with using root-mean-square averages instead of regular averages.  You get different numbers, but for the right application one or the other is more accurate.
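That effect can be shown with two tiny made-up data sets, constructed so both have mean 0 and the same average absolute distance from the mean:

```python
import numpy as np

a = np.array([-1.0, 1.0, -1.0, 1.0])  # everything exactly 1 away from the mean
b = np.array([0.0, 0.0, -2.0, 2.0])   # mostly at the mean, two far-away points

# Mean absolute deviation: identical for both sets (1.0).
mad_a = np.mean(np.abs(a - a.mean()))
mad_b = np.mean(np.abs(b - b.mean()))

# Standard deviation: larger for b, because squaring weights the far points more.
sd_a = np.sqrt(np.mean((a - a.mean()) ** 2))  # 1.0
sd_b = np.sqrt(np.mean((b - b.mean()) ** 2))  # sqrt(2), about 1.414
```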
The sum of the absolute differences is minimized by the median.
The sum of the squared differences is minimized by the mean.
Both have their applicability.
The sum of squares is the appropriate estimator for data that have a Gaussian distribution.
The sum of absolute values is the appropriate estimator for data that have a Cauchy distribution.
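Those first two statements can be verified by brute force. A sketch with a made-up skewed sample, grid-searching for the center that minimizes each loss:

```python
import numpy as np

# A small skewed data set (made up for illustration).
x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

def l1_loss(c):  # sum of absolute differences from candidate center c
    return np.sum(np.abs(x - c))

def l2_loss(c):  # sum of squared differences from candidate center c
    return np.sum((x - c) ** 2)

# Search a fine grid of candidate centers for each minimizer.
grid = np.linspace(0.0, 12.0, 12001)
c1 = grid[int(np.argmin([l1_loss(c) for c in grid]))]
c2 = grid[int(np.argmin([l2_loss(c) for c in grid]))]

print(c1, np.median(x))  # the L1 minimizer lands on the median (2.0)
print(c2, np.mean(x))    # the L2 minimizer lands on the mean (3.6)
```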

Approximately Gaussian distributions are very common.  The average of many values from any distribution will approach a Gaussian.
But people will also often analyze data as if they were Gaussian just because the calculations work out so conveniently.
>  The average of many values from any distribution will approach a Gaussian.
I should qualify that to be any distribution with a finite mean and variance.
In particular, a Cauchy distribution does not have a finite variance.
That's why outliers can be a problem when analyzing a Cauchy distribution as if it were Gaussian, and why the median is a more robust estimator than the mean for such distributions.
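A quick simulation makes the robustness point vivid (the seed is arbitrary, chosen only for reproducibility):

```python
import numpy as np

# Draw a large sample from a standard Cauchy distribution (true center 0).
rng = np.random.default_rng(42)
x = rng.standard_cauchy(size=100_000)

# The sample median stays close to the true center...
med = np.median(x)

# ...while the heavy tails produce enormous outliers, so the sample mean
# is dominated by a handful of extreme values and does not settle down.
mean = np.mean(x)
biggest = np.max(np.abs(x))
print(med, mean, biggest)
```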

But when you do have a Gaussian distribution, it is entirely appropriate for outliers to have a large effect on the average value.
For Gaussian distributions, minimizing the sum of the squares is the same as minimizing the -log of the probability of the observed deviations, which is the maximum likelihood estimator.
For values in one dimension, this is the same as taking the average of the values, which is a simple, well-understood operation with nice regular properties.
Extensions to multiple dimensions are also much easier when you assume your distributions are Gaussian.
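To spell out that equivalence: for independent Gaussian observations with a common standard deviation, the negative log-likelihood is, up to constants, exactly the sum of squared deviations, so maximizing the likelihood and minimizing the sum of squares pick out the same center, which is the average:

```latex
-\ln L(\mu) = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} + n \ln\!\left(\sigma\sqrt{2\pi}\right)

\frac{d}{d\mu}\left[-\ln L(\mu)\right]
  = -\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i
```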

The squared values you mention are used to calculate the "variance" (a measure of the spread of the numbers). The standard deviation is then the square root of the variance.
It is correct that the average of the (signed) differences equates to zero, as phoffric stated.

What could be calculated is the average of the absolute differences from the mean, and I agree that that value could be useful. Where it becomes tricky is in the mathematical manipulation of probability distributions:

var = E[(x - mean)^2]

can be shown that

var = E[x^2] - mean^2

which is pretty useful. Squaring also automatically does the job of taking an absolute value. Take the concept of least-squares regression: it is much more easily calculated than least-absolute-deviation regression.
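Expanding the square shows where that identity comes from (writing mu for the mean, so E[x] = mu):

```latex
\operatorname{var} = E\big[(x - \mu)^2\big]
                   = E\big[x^2 - 2\mu x + \mu^2\big]
                   = E[x^2] - 2\mu\,E[x] + \mu^2
                   = E[x^2] - \mu^2
```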

ASKER CERTIFIED SOLUTION


ASKER

ozo, I am not very well versed in Cauchy distributions. I've only learned about statistics involving a normal (Gaussian) distribution, and tricks to guide me into assuming that a data set may be Gaussian.

ozo, in regard to:

>> For values in one dimension, this is the same as taking the average of the values, which is a simple well understood operation with nice regular properties.
>> Extensions to multiple dimensions are also much easier when you assume your distributions are Gaussian.

So, are you implying that if we drop the sum-of-squares thing going from two dimensions to one and just average the values, it would work the other way? For instance, for a set of points in an assumed Gaussian distribution in 3 dimensions, would we use a "sum of the absolute value of the cubes" curve estimation to best accompany the data?

If not, or if that was poorly stated, how would you ideally fit a curve to data that extends into deeper dimensions? Would you still use the sum of squares, or go to higher-order exponents?

-Jeff

ASKER

Just kidding, I just read your last comment (after I posted last, yes) and it makes complete sense. That is the answer I was looking for.

But still, what happens if we minimize to higher order exponents? Say:

X ∈ N and X >= 1, where we minimize the sum of (|difference|^X).

So, When

X=1: the average of the absolute differences is minimized by the median.
X=2: the average of the squared differences is minimized by the mean.
X=3: ??

I really hope I'm making some sort of sense with these questions. I won't bother y'all anymore after these, and I'll award points.

ASKER

The times when the average of a set of values is the best estimate for the center of the distribution are the same times when it is most appropriate to use the sum of squared deviations from the center.

The times when the median of a set of values is the best estimate for the center of the distribution are the same times when it is most appropriate to use the sum of absolute deviations from the center.
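As for what X=3 produces, one way to see it is numerically. A sketch with a made-up skewed sample, grid-searching the center that minimizes the sum of |difference|^X for X = 1, 2, 3:

```python
import numpy as np

# A made-up skewed sample, just to probe the question numerically.
x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

def minimizer(p):
    """Grid-search the center c that minimizes sum(|x - c|**p)."""
    grid = np.linspace(x.min(), x.max(), 90001)
    losses = [np.sum(np.abs(x - c) ** p) for c in grid]
    return grid[int(np.argmin(losses))]

c1, c2, c3 = minimizer(1), minimizer(2), minimizer(3)
print(c1)  # ~2.0, the median
print(c2)  # ~3.6, the mean
print(c3)  # the X=3 center is pulled even further toward the outlier
```

For this sample the X=3 minimizer lands beyond the mean, toward the far-away point; higher exponents weight outliers progressively more, but the result is not a standard named statistic the way the median and mean are.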

Higher dimensions can get more complicated. For example, you can have a Gaussian error in one dimension and a Cauchy error in another, and there can be various relationships between the errors in different dimensions, but in general, the above principles still tend to hold.
I think the Student's t-distribution with 3 degrees of freedom might be one for which you would want to use sum of (|difference|^3)
(but I might think that only because I'm not familiar enough with the Student's t-distribution to know better)
If you are not versed in Cauchy distributions, you might prefer to not go into that yet, and think more about understanding Gaussian first.