Improve company productivity with a Business Account.Sign Up

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 85
  • Last Modified:

Identify outliers through SQL

Hi- I have a large recordset of about 300,000 records and am looking to capture the outliers of a column within SQL.  

For the sake of simplicity, let's call my table mainTable and the data in question pData.

I understand that I'll need to capture the quartiles, but am not quite sure how to go about doing this.

Thanks in advance.
0
Andrew Luedke
Asked:
Andrew Luedke
1 Solution
 
AndyAinscowFreelance programmer / ConsultantCommented:
You could display the max and min values.  That will give you the range and then do a select query with >= or <= against the values you want.
SELECT MAX(x) as MaxX, MIN(x) as MinX FROM tbl1
then eg.
SELECT * FROM tbl1 WHERE (x >= (MaxX-5))
and
SELECT * FROM tbl1 WHERE (x <= (MinX+5))
0
 
AndyAinscowFreelance programmer / ConsultantCommented:
ps.  Exactly what you define as an 'outlier' is something you will have to decide.
0
 
Andrew LuedkeAuthor Commented:
In terms of outliers, I'd like to grab the 1% and 99% percentiles to capture the extreme values.  Does this help to refine the algorithm?
0
What Kind of Coding Program is Right for You?

There are many ways to learn to code these days. From coding bootcamps like Flatiron School to online courses to totally free beginner resources. The best way to learn to code depends on many factors, but the most important one is you. See what course is best for you.

 
AndyAinscowFreelance programmer / ConsultantCommented:
Do you mean the 300 highest and 300 lowest or those higher than (lowest + 0.99*range)
0
 
Andrew LuedkeAuthor Commented:
Apologies for the lack of clarity here.  The numbers could vary.

The formula should determine the general distribution.  Meaning that if you have a range of numbers, the formula should determine the thresholds and capture 1% of values below the normal range and the other 1% of values above.  This way, you can dynamically captures the extreme highs and lows of a set.  

For example, let's say we have a 200,000 numbers with the following characteristics:

Min:
-100

Max:
1.09

Avg:
.2

STDev:
.5

How do we go about capturing those outliers within the set?
0
 
AndyAinscowFreelance programmer / ConsultantCommented:
Do you want everything done in SQL or can you perform some calculations outside of the SQL to determine the limits which you then feed back into an SQL select query?
0
 
Andrew LuedkeAuthor Commented:
Given the size of the database, it would be best to try and accomplish everything via SQL.
0
 
PaulCommented:
without so much as a table name or field name to go by all I can do is suggest NTILE()
e.g.

SELECT
   *
  , NTILE(10) OVER (PARTITION BY [Subject] ORDER BY Marks DESC) AS [TileNo]
FROM Students

You could perhaps also "do this in both directions" so use NTILE() twice , but order ASC in one and DESC in the other, then you can filter out the outliers that have 1 one either of those columns. Also note you can alter the number of "tiles" in my example I used 10

for ranking functions in SQL 2008 see: https://msdn.microsoft.com/en-us/library/ms189798(v=sql.100).aspx

---
if you had (or have) SQL 2012 you could use PERCENT_RANK()
0
 
Vitor MontalvãoMSSQL Senior EngineerCommented:
Andrew, you still have the issue or it's already solved?
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Build your data science skills into a career

Are you ready to take your data science career to the next step, or break into data science? With Springboard’s Data Science Career Track, you’ll master data science topics, have personalized career guidance, weekly calls with a data science expert, and a job guarantee.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now