Identify outliers through SQL

Hi- I have a large recordset of about 300,000 records and am looking to capture the outliers of a column within SQL.  

For the sake of simplicity, let's call my table mainTable and the data in question pData.

I understand that I'll need to capture the quartiles, but am not quite sure how to go about doing this.

Thanks in advance.
Andrew LuedkeAsked:
Who is Participating?
 
PortletPaulfreelancerCommented:
without so much as a table name or field name to go by all I can do is suggest NTILE()
e.g.

SELECT
   *
  , NTILE(10) OVER (PARTITION BY [Subject] ORDER BY Marks DESC) AS [TileNo]
FROM Students

You could perhaps also "do this in both directions" so use NTILE() twice , but order ASC in one and DESC in the other, then you can filter out the outliers that have 1 one either of those columns. Also note you can alter the number of "tiles" in my example I used 10

for ranking functions in SQL 2008 see: https://msdn.microsoft.com/en-us/library/ms189798(v=sql.100).aspx

---
if you had (or have) SQL 2012 you could use PERCENT_RANK()
0
 
AndyAinscowFreelance programmer / ConsultantCommented:
You could display the max and min values.  That will give you the range and then do a select query with >= or <= against the values you want.
SELECT MAX(x) as MaxX, MIN(x) as MinX FROM tbl1
then eg.
SELECT * FROM tbl1 WHERE (x >= (MaxX-5))
and
SELECT * FROM tbl1 WHERE (x <= (MinX+5))
0
 
AndyAinscowFreelance programmer / ConsultantCommented:
ps.  Exactly what you define as an 'outlier' is something you will have to decide.
0
Ultimate Tool Kit for Technology Solution Provider

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy now.

 
Andrew LuedkeAuthor Commented:
In terms of outliers, I'd like to grab the 1% and 99% percentiles to capture the extreme values.  Does this help to refine the algorithm?
0
 
AndyAinscowFreelance programmer / ConsultantCommented:
Do you mean the 300 highest and 300 lowest or those higher than (lowest + 0.99*range)
0
 
Andrew LuedkeAuthor Commented:
Apologies for the lack of clarity here.  The numbers could vary.

The formula should determine the general distribution.  Meaning that if you have a range of numbers, the formula should determine the thresholds and capture 1% of values below the normal range and the other 1% of values above.  This way, you can dynamically captures the extreme highs and lows of a set.  

For example, let's say we have a 200,000 numbers with the following characteristics:

Min:
-100

Max:
1.09

Avg:
.2

STDev:
.5

How do we go about capturing those outliers within the set?
0
 
AndyAinscowFreelance programmer / ConsultantCommented:
Do you want everything done in SQL or can you perform some calculations outside of the SQL to determine the limits which you then feed back into an SQL select query?
0
 
Andrew LuedkeAuthor Commented:
Given the size of the database, it would be best to try and accomplish everything via SQL.
0
 
Vitor MontalvãoMSSQL Senior EngineerCommented:
Andrew, you still have the issue or it's already solved?
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.