Solved

Identify outliers through SQL

Posted on 2015-02-24
Last Modified: 2016-06-19
Hi, I have a large recordset of about 300,000 records and want to identify the outliers in one column using SQL.

For the sake of simplicity, let's call my table mainTable and the data in question pData.

I understand that I'll need to capture the quartiles, but am not quite sure how to go about doing this.

Thanks in advance.
Question by:Andrew Luedke
10 Comments
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 40628453
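You could display the max and min values. That will give you the range, and you can then run a select query with >= or <= against the values you want.
SELECT MAX(x) AS MaxX, MIN(x) AS MinX FROM tbl1
then, e.g. (the aliases MaxX and MinX are not visible to a separate query, so the aggregates are repeated as subqueries):
SELECT * FROM tbl1 WHERE x >= (SELECT MAX(x) FROM tbl1) - 5
and
SELECT * FROM tbl1 WHERE x <= (SELECT MIN(x) FROM tbl1) + 5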
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 40628458
ps.  Exactly what you define as an 'outlier' is something you will have to decide.
 

Author Comment

by:Andrew Luedke
ID: 40628473
In terms of outliers, I'd like to grab the 1st and 99th percentiles to capture the extreme values. Does this help to refine the algorithm?
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 40628500
Do you mean the 3,000 highest and 3,000 lowest values (1% of 300,000 records each), or those higher than (lowest + 0.99*range)?
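
To make the two readings concrete, here is a minimal T-SQL sketch. It assumes the mainTable and pData names from the question and SQL Server syntax (TOP ... PERCENT); it only illustrates the difference between the two definitions and is not a recommendation for either.

```sql
-- Reading 1: count-based. The lowest and highest 1% of rows by pData
-- (about 3,000 rows each for 300,000 records).
SELECT TOP (1) PERCENT * FROM mainTable ORDER BY pData ASC;
SELECT TOP (1) PERCENT * FROM mainTable ORDER BY pData DESC;

-- Reading 2: range-based. Rows whose value lies in the top 1% of the
-- value range, i.e. above lowest + 0.99 * (highest - lowest).
SELECT m.*
FROM mainTable AS m
CROSS JOIN (SELECT MIN(pData) AS lo, MAX(pData) AS hi
            FROM mainTable) AS r
WHERE m.pData > r.lo + 0.99 * (r.hi - r.lo);
```

The count-based reading always returns roughly the same number of rows, while the range-based reading can return very few or very many rows depending on how skewed the data is.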
 

Author Comment

by:Andrew Luedke
ID: 40628550
Apologies for the lack of clarity here.  The numbers could vary.

The formula should determine the general distribution. Meaning that if you have a range of numbers, the formula should determine the thresholds and capture the 1% of values below the normal range and the 1% of values above it. This way, you can dynamically capture the extreme highs and lows of a set.

For example, let's say we have 200,000 numbers with the following characteristics:

Min: -100
Max: 1.09
Avg: .2
STDev: .5

How do we go about capturing those outliers within the set?
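
One possible reading of the summary statistics above, offered only as a sketch: flag values more than k standard deviations from the mean. This is a different technique from percentile cut-offs and assumes the data is roughly bell-shaped; the multiplier @k = 3 is an arbitrary assumption, and mainTable/pData are the names from the question. With Avg .2 and STDev .5 the bounds would be .2 - 1.5 = -1.3 and .2 + 1.5 = 1.7, so the Min of -100 would be flagged and the Max of 1.09 would not.

```sql
-- Hypothetical threshold multiplier; 3 is a common convention, not a rule.
DECLARE @k float = 3.0;

SELECT m.*
FROM mainTable AS m
CROSS JOIN (SELECT AVG(CAST(pData AS float)) AS avgP,  -- mean of the column
                   STDEV(pData)              AS sdP    -- sample standard deviation
            FROM mainTable) AS s
WHERE m.pData < s.avgP - @k * s.sdP   -- far below the mean
   OR m.pData > s.avgP + @k * s.sdP;  -- far above the mean
```

Unlike a percentile cut-off, this does not return a fixed share of the rows; a tightly clustered column may yield no rows at all.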
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 40628725
Do you want everything done in SQL or can you perform some calculations outside of the SQL to determine the limits which you then feed back into an SQL select query?
 

Author Comment

by:Andrew Luedke
ID: 40628735
Given the size of the database, it would be best to try and accomplish everything via SQL.
 
LVL 49

Accepted Solution

by:PortletPaul (earned 2000 total points)
ID: 40629613
Without so much as a table name or field name to go by, all I can do is suggest NTILE(),
e.g.

SELECT
   *
  , NTILE(10) OVER (PARTITION BY [Subject] ORDER BY Marks DESC) AS [TileNo]
FROM Students

You could perhaps also "do this in both directions": use NTILE() twice, ordering ASC in one and DESC in the other, then filter for the outliers that have 1 in either of those columns. Also note that you can alter the number of "tiles"; in my example I used 10.
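
Applied to the names given in the question (mainTable, pData), the two-direction idea might look like the sketch below. The choice of 100 tiles is an assumption made to match the 1% / 99% cut-offs the asker mentioned; the column aliases TileAsc and TileDesc are made up for illustration.

```sql
WITH tiled AS (
    SELECT m.*,
           NTILE(100) OVER (ORDER BY pData ASC)  AS TileAsc,   -- 1 = lowest 1% of rows
           NTILE(100) OVER (ORDER BY pData DESC) AS TileDesc   -- 1 = highest 1% of rows
    FROM mainTable AS m
)
SELECT *
FROM tiled
WHERE TileAsc = 1    -- extreme lows
   OR TileDesc = 1;  -- extreme highs
```

With 300,000 rows each tile holds 3,000 rows. NTILE() assigns rows to tiles by position, so rows with identical pData values at a tile boundary can land in different tiles.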

for ranking functions in SQL 2008 see: https://msdn.microsoft.com/en-us/library/ms189798(v=sql.100).aspx

---
If you had (or have) SQL Server 2012, you could use PERCENT_RANK().
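
A minimal sketch of the PERCENT_RANK() variant, assuming SQL Server 2012 or later and the mainTable/pData names from the question. The 0.01 and 0.99 cut-offs are taken from the asker's 1% / 99% description, and PctRank is an alias made up for illustration.

```sql
WITH ranked AS (
    SELECT m.*,
           -- (rank - 1) / (row count - 1): 0 for the smallest value, 1 for the largest
           PERCENT_RANK() OVER (ORDER BY pData) AS PctRank
    FROM mainTable AS m
)
SELECT *
FROM ranked
WHERE PctRank <= 0.01   -- extreme lows
   OR PctRank >= 0.99;  -- extreme highs
```

Because tied values share the same rank here, rows with equal pData are always kept or dropped together, which is not guaranteed with NTILE().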
 
LVL 51

Expert Comment

by:Vitor Montalvão
ID: 40639280
Andrew, do you still have the issue, or is it already solved?