Solved

Exclude erroneous values using Standard Deviation

Posted on 2006-10-25
Last Modified: 2008-03-04
I have a query running from within Excel based on a view in SQL Server. It returns up to 60,000 records (but rarely more than a few hundred) into the Excel spreadsheet. The data is source data for a chart. The GAM, MAM, and TM columns hold the values, and these become separate series in the chart. The problem occurs when an erroneous number ends up in the data. For example, if the normal range of values is between 100 and 500 and, for some reason, a value of 12,000 is returned, it skews the chart and the analysis.

I have been asked to focus on the central 90% of values; the outlying 10% are to be excluded.

Question:
======
1. How do I do this? Is it a standard deviation job?
2. What SQL do I need to get the data I want?

Background INFO:
===========
----SQL Server View (FreightDataView)--------
CREATE VIEW dbo.FreightDataView
AS
SELECT     InvoiceYearMonth, OrigCountry AS Orig, DestCountry AS Dest, Weight, ProdCode, (CASE AccountType WHEN 'G' THEN NetRev ELSE 0 END) AS GAM,
                      (CASE AccountType WHEN 'MAM' THEN NetRev ELSE 0 END) AS MAM, (CASE AccountType WHEN 'SME' THEN NetRev ELSE 0 END) AS TM
FROM         FreightData
WHERE     (NOT (weight = 0))
----------------------------------------------------

----Excel Query------------
select top 60000 Orig, Dest, ProdCode, Weight, GAM, MAM, TM
from FreightDataView
where (blah blah blah...)
----------------------------

Cheers,

LoveToSpod
Question by:LoveToSpod
4 Comments
 
LVL 92

Expert Comment

by:Patrick Matthews
ID: 17802791
Hi LoveToSpod,
> I have been asked to focus in on 90% of the central values, and the outlying 10%
> of values are to be excluded.

So what does that mean: chop off 5% at each tail of the distribution?

Regards,

Patrick
 
LVL 11

Accepted Solution

by:
regbes earned 500 total points
ID: 17802839
Hi LoveToSpod,
Try one of these:

CREATE VIEW dbo.FreightDataView
AS
SELECT     InvoiceYearMonth, OrigCountry AS Orig, DestCountry AS Dest, Weight, ProdCode, (CASE AccountType WHEN 'G' THEN NetRev ELSE 0 END) AS GAM,
                      (CASE AccountType WHEN 'MAM' THEN NetRev ELSE 0 END) AS MAM, (CASE AccountType WHEN 'SME' THEN NetRev ELSE 0 END) AS TM
FROM         FreightData
WHERE     (NOT (weight = 0))
and NetRev < 1000

or

CREATE VIEW dbo.FreightDataView
AS
SELECT     InvoiceYearMonth, OrigCountry AS Orig, DestCountry AS Dest, Weight, ProdCode, (CASE AccountType WHEN 'G' THEN NetRev ELSE 0 END) AS GAM,
                      (CASE AccountType WHEN 'MAM' THEN NetRev ELSE 0 END) AS MAM, (CASE AccountType WHEN 'SME' THEN NetRev ELSE 0 END) AS TM
FROM         FreightData
WHERE     (NOT (weight = 0))

-- keep rows below the mean plus one standard deviation (T-SQL's function is STDEV, not stddev)
and NetRev < (select Avg(NetRev) + STDEV(NetRev) from FreightData WHERE (NOT (weight = 0)))


HTH

R.
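
A cut-off built from the mean and standard deviation, which is what the second view is aiming for, can be sketched outside the database to see how it behaves. The snippet below is only an illustration in Python: the function name `within_k_sigma`, the multiplier `k`, and the sample figures are assumptions, not part of the original view. It uses the sample standard deviation, the same kind that T-SQL's `STDEV` returns.

```python
from statistics import mean, stdev


def within_k_sigma(values, k=2.0):
    """Keep values that lie no more than k sample standard deviations from the mean.

    Hypothetical helper for illustration; k is an assumed tuning knob.
    """
    if len(values) < 2:
        # stdev needs at least two data points; nothing sensible to filter
        return list(values)
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mu) <= k * sigma]


# Made-up NetRev figures: nine ordinary values and one erroneous 12000
net_rev = [120, 250, 310, 180, 400, 275, 150, 330, 220, 12000]
kept = within_k_sigma(net_rev, k=2.0)
print(kept)
```

With this sample, `k=2.0` drops the 12000 value but `k=3.0` keeps it, because the outlier itself inflates both the mean and the standard deviation. That sensitivity is one reason a rank-based (percentile) trim may suit the "central 90%" requirement better.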
 

Author Comment

by:LoveToSpod
ID: 17803004
Hi matthewspatrick

We could lop off the top and/or bottom 5% of values, but looking at it more closely, I would like to add user controls that configure how much is lopped off either side of the data, so the '5%' becomes a variable.

Cheers, LTS
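
A trim with a separate, user-supplied percentage for each tail could look roughly like the sketch below. It is written in Python purely to show the logic; `trim_tails`, `lower_pct`, and `upper_pct` are hypothetical names, and the sample data is invented. It trims by rank, so a single extreme value cannot shift the cut-offs the way it shifts a mean or standard deviation.

```python
def trim_tails(values, lower_pct=5.0, upper_pct=5.0):
    """Drop the lowest lower_pct% and the highest upper_pct% of values by rank.

    Returns the surviving values in ascending order (original row order is lost).
    Hypothetical helper; the two percentages stand in for the user controls.
    """
    if lower_pct < 0 or upper_pct < 0 or lower_pct + upper_pct >= 100:
        raise ValueError("percentages must be non-negative and sum to less than 100")
    ordered = sorted(values)
    n = len(ordered)
    drop_low = int(n * lower_pct / 100)    # rows to discard from the bottom
    drop_high = int(n * upper_pct / 100)   # rows to discard from the top
    return ordered[drop_low:n - drop_high]


# 19 ordinary values (100, 120, ..., 460) plus one erroneous 12000
sample = list(range(100, 480, 20)) + [12000]
trimmed = trim_tails(sample, lower_pct=5.0, upper_pct=5.0)
print(min(trimmed), max(trimmed), len(trimmed))
```

Setting `lower_pct=0` trims only the top tail, which may be closer to what is wanted if the bad values are always too large.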
 

Author Comment

by:LoveToSpod
ID: 17980454
Hi

I simply added a manual range into the SQL. This allows the user to eliminate any 'outside' data that skew the analysis unnecessarily.

Thx
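
For anyone wanting to try the manual-range idea, here is a minimal sketch of passing user-chosen bounds into the query as parameters. It uses Python's built-in `sqlite3` module with an in-memory table as a stand-in for SQL Server; the table `FreightSample`, the function `fetch_in_range`, and the rows are all made up for illustration, and the actual SQL used above was not posted.

```python
import sqlite3


def fetch_in_range(conn, low, high):
    """Return rows whose NetRev lies within [low, high].

    The bounds are bound as query parameters rather than pasted into the SQL text.
    """
    sql = (
        "SELECT Orig, Dest, NetRev FROM FreightSample "
        "WHERE NetRev BETWEEN ? AND ? ORDER BY NetRev"
    )
    return conn.execute(sql, (low, high)).fetchall()


# Hypothetical stand-in data: two ordinary rows, one too small, one erroneous
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FreightSample (Orig TEXT, Dest TEXT, NetRev REAL)")
conn.executemany(
    "INSERT INTO FreightSample VALUES (?, ?, ?)",
    [("GB", "US", 120.0), ("FR", "DE", 450.0), ("US", "JP", 12000.0), ("DE", "GB", 40.0)],
)
rows = fetch_in_range(conn, 100, 500)
print(rows)
```

The same `BETWEEN low AND high` condition should carry over to the Excel query's WHERE clause, with the two bounds taken from cells the user can edit.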