Suppose I have a series of numbers like this

-0.0289436618082665

-0.0322635297824615

0.0473380547993016

-0.0483053616147235

0.0561386651052217

-0.0546202231192121

3912478746624.73

-0.0570958411471398

-0.0406567550991673

-0.0191101260081410

-0.0178598058749180

5912378756654.12

-0.00649518615382946

-0.0569007033673227

0.00634860933789683

So most are within a certain range, say +/- 1.0, but there are a couple of numbers way out of this range.

Is there an algorithm to determine which numbers are these outliers?

I was thinking if I could detect these, then I could calculate the mean of the non-outliers and replace the outliers with that.

The threshold for what makes a point an outlier is really application dependent. In fact, for the data you posted, one of the obvious outliers is less than 3 standard deviations from the mean. Many applications would throw out the top and bottom 5-10% of the data before doing any calculations. If you do that, then your outliers will be over a billion standard deviations from the mean and would be outliers by almost any standard.
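As a rough sketch of the trimming idea, the snippet below drops a fraction of the sorted data at each end, computes the mean and standard deviation of what remains, and then scores every original point against those values. The function name `trim`, the 10% fraction, the rounding up with `ceil`, and the cutoff of 4 are assumptions chosen for illustration. With only 15 points, the two large values both sit at the top, so rounding down would trim just one of them.

```python
import math
import statistics

data = [
    -0.0289436618082665, -0.0322635297824615, 0.0473380547993016,
    -0.0483053616147235, 0.0561386651052217, -0.0546202231192121,
    3912478746624.73, -0.0570958411471398, -0.0406567550991673,
    -0.0191101260081410, -0.0178598058749180, 5912378756654.12,
    -0.00649518615382946, -0.0569007033673227, 0.00634860933789683,
]

def trim(values, fraction):
    """Drop ceil(fraction * n) values from each end of the sorted data."""
    ordered = sorted(values)
    k = math.ceil(len(ordered) * fraction)  # assumption: round up
    return ordered[k:len(ordered) - k]

# Statistics of the trimmed "core" only (11 of the 15 points here).
core = trim(data, 0.10)
center = statistics.mean(core)
spread = statistics.stdev(core)

# Score every original point against the core statistics.
z_scores = [(v - center) / spread for v in data]
outliers = [v for v, z in zip(data, z_scores) if abs(z) > 4]
print(outliers)
```

On this data only the two huge values are flagged, and their scores are far beyond a billion standard deviations, which matches the remark above.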

Physics is full of stories about people ignoring outliers and missing Nobel prizes.

Nevertheless people find it useful to establish algorithms to spot outliers. There is no standard algorithm to which objections cannot be raised.

Several popular ones have been given above.

Obviously, if you run your data through whatever algorithm you choose enough times, you will end up with one data point. You should not discard any data point without a non-statistical cause. Nevertheless, often the problem is not important enough to spend a lot of time on, so one of the algorithms mentioned above will usually be an improvement in the decision making process.

You may also want to consider determining the median instead. And if you can define from your model what an outlier is, then you might consider a % threshold error (or an absolute threshold error value - depends on your model), so that if the absolute value of the difference between the median and the data point exceeds the threshold, then that point will be considered an outlier.
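A minimal sketch of the median approach, combined with the asker's idea of replacing outliers by the mean of the remaining points, might look like this. The function name `replace_outliers` is hypothetical, and the absolute threshold of 1.0 is taken from the "+/- 1.0" range mentioned in the question; a real threshold would have to come from your model.

```python
import statistics

data = [
    -0.0289436618082665, -0.0322635297824615, 0.0473380547993016,
    -0.0483053616147235, 0.0561386651052217, -0.0546202231192121,
    3912478746624.73, -0.0570958411471398, -0.0406567550991673,
    -0.0191101260081410, -0.0178598058749180, 5912378756654.12,
    -0.00649518615382946, -0.0569007033673227, 0.00634860933789683,
]

def replace_outliers(values, threshold=1.0):
    """Replace points farther than `threshold` from the median
    with the mean of the points that are within the threshold."""
    med = statistics.median(values)
    is_outlier = [abs(v - med) > threshold for v in values]
    kept = [v for v, bad in zip(values, is_outlier) if not bad]
    if not kept:
        raise ValueError("threshold rejects every point")
    fill = statistics.mean(kept)  # mean of the non-outliers
    return [fill if bad else v for v, bad in zip(values, is_outlier)]

cleaned = replace_outliers(data, threshold=1.0)
print(cleaned)
```

The median is used as the reference point because, unlike the mean, it is barely moved by a couple of extreme values.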

As others have already alluded, you need to understand your model in order to define what an outlier is.

http://rdsrc.us/GyvkW7

http://rdsrc.us/U4sDl9

http://rdsrc.us/VrQc4e

The best way to do that is to get the mean and sd for the whole set and 'remove' the one in question.

If you have n data points and want to remove x, then you just do newmean = (mean*n - x)/(n - 1), and that gives you the mean without considering x.

For the sd, remember that the mean of the x^2 minus the mean^2 gives the variance, and the sd is its square root. So if you keep track of the mean of the squares, you can do the same thing for the sd.
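The update described above can be sketched as follows. The function name `leave_one_out` is hypothetical, the result is the population standard deviation (dividing by the number of remaining points), and the small example data set is made up purely to check the arithmetic.

```python
import math
import statistics

def leave_one_out(values, x):
    """Mean and population sd of `values` with one occurrence of x
    removed, using only the running mean and mean of squares."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(v * v for v in values) / n

    new_mean = (mean * n - x) / (n - 1)
    new_mean_sq = (mean_sq * n - x * x) / (n - 1)

    # Variance = mean of squares minus square of mean.
    # Clamp at zero to guard against tiny negative rounding errors.
    new_var = max(new_mean_sq - new_mean ** 2, 0.0)
    return new_mean, math.sqrt(new_var)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data
new_mean, new_sd = leave_one_out(sample, 9.0)

# Compare against a direct recomputation on the remaining points.
rest = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0]
print(new_mean, statistics.mean(rest))
print(new_sd, statistics.pstdev(rest))
```

One caution: subtracting two large, nearly equal quantities can lose floating-point precision, so on data that spans many orders of magnitude, like the series in the question, a direct recomputation on the remaining points may be more accurate.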

Then for each number you calculate the number of standard deviations from the mean.

If the number is more than 4 standard deviations from the mean, it can be considered an outlier.

I see your out-lying values are extremely outside the range of the others.
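The steps above can be sketched as below, using the sample standard deviation from Python's `statistics` module; the variable names and the use of the sample (rather than population) standard deviation are assumptions. It is worth running on the posted data, because the two extreme values inflate the standard deviation so much that they end up only about 1.8 and 3.0 standard deviations from the mean, and a cutoff of 4 flags nothing.

```python
import statistics

data = [
    -0.0289436618082665, -0.0322635297824615, 0.0473380547993016,
    -0.0483053616147235, 0.0561386651052217, -0.0546202231192121,
    3912478746624.73, -0.0570958411471398, -0.0406567550991673,
    -0.0191101260081410, -0.0178598058749180, 5912378756654.12,
    -0.00649518615382946, -0.0569007033673227, 0.00634860933789683,
]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# Number of standard deviations each point lies from the mean.
z_scores = [(v - mean) / sd for v in data]

cutoff = 4
flagged = [v for v, z in zip(data, z_scores) if abs(z) > cutoff]

for v, z in zip(data, z_scores):
    print(f"{v:>22.6g}  z = {z:+.3f}")
print("flagged:", flagged)
```

With n points, no value can be more than (n - 1)/sqrt(n) sample standard deviations from the mean, which is about 3.61 for n = 15. For a series this short, a lower cutoff, or statistics computed without the suspect points, would be needed for this rule to flag anything.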