Creating Percentile Aggregate Function In SQL Server 2005 Express

I am looking to create summary information on numerical data sets within SQL Server 2005 Express. These need to include additional aggregate statistics such as percentiles (flexible percentiles preferably, i.e. passing the nth value of the percentile statistic as an argument). I am not sure of the best approach but have been considering creating an Aggregate Function.

I am surprised that, in looking into doing this, I have not found very much information on the web.

I am considering using a standard method such as that used in Excel to produce a percentile, which would require ordering the aggregate group subset and determining the relevant rank of the percentile argument (simplistically, 50th percentile: 50/100 * n = rank).

One concern I have is that I believe aggregate functions may split and merge subsets, which would not work for ranking functions?

If anyone has experience of implementing statistical functions in this way, or of alternative methods, I would be grateful to hear about it.


Raja Jegan R (SQL Server DBA & Architect, EE Solution Guide) commented:
Hope this percentile script helps:
CREATE TABLE test ( a int IDENTITY, b varchar(10), c int)
INSERT INTO test (b,c) VALUES ( 'A', 1);
INSERT INTO test (b,c) VALUES ( 'B', 3);
INSERT INTO test (b,c) VALUES ( 'C', 2);
INSERT INTO test (b,c) VALUES ( 'D', 5);
INSERT INTO test (b,c) VALUES ( 'E', 4);
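The percentile query itself appears to have been cut off in the posting. Purely as a hedged sketch (an assumed reconstruction, not the original script), a per-row percent rank over this test table could be written with RANK() and a windowed COUNT(), both available from SQL Server 2005 onwards:

```sql
-- Sketch only (assumed, not the original truncated script): percent rank
-- of each row of the test table above.
SELECT b, c,
       100.0 * (RANK() OVER (ORDER BY c) - 1)
             / (COUNT(*) OVER () - 1) AS percentile
FROM   test;
-- With the five sample rows this yields 0, 25, 50, 75 and 100 for c = 1..5.
```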


8080_Diver commented:
I have done this in Teradata, DB2, and SQL Server, and there are variations between them as to how to do it.  (Teradata has the best options, by the way. ;-)
Essentially, if I understand your stated approach, it is going to be the appropriate one.
First you need to determine the way you want to break the data (deciles, quartiles, percentiles, or something else).  Having done that, you have to define a way of identifying the whatever-ile that a row falls into.  
For example, you could order the rows by SalesThisMonth, use a Rank() function to add a temporary sequence number, and then use integer division of the rank (e.g. (Rank() - 1) / N + 1, where N is the number of rows divided by 4 for quartiles) for each row to break the set into the desired quartile subsets.
The good news is that SQL Server 2005/2008 has a Rank() function, and you can get the number of rows fairly easily.  So all you have to do is create some subqueries to get this done. ;-)
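For quartile-style bucketing specifically, SQL Server 2005 also ships NTILE(), which avoids the manual rank arithmetic. A minimal sketch, assuming a hypothetical table Sales(SalesThisMonth money):

```sql
-- Sketch assuming a hypothetical table Sales(SalesThisMonth money):
-- NTILE(4) assigns each row to one of four roughly equal buckets,
-- i.e. the quartile subsets described above.
SELECT SalesThisMonth,
       NTILE(4) OVER (ORDER BY SalesThisMonth) AS quartile
FROM   Sales;
```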
Now, as to the aggregation of the data based upon the subsets, that is where you have another decision to make.  You will probably want to partition the data before you do the aggregations if you want the calculations to be done by whatever-ile.  In other words, partition first if you want the aggregations to be within, for instance, the quartiles; but if you want the aggregations done across the full dataset, then you need to do some aggregations before you create the subsets.
Hope that helps.
I would recommend putting this sort of thing in one or more stored procedures and parameterizing it so that you can pull quartiles, deciles, percentiles, or whatever.
The basic approach provided by rrjegan17 handles the ranking process; however, if you need to get, for instance, the percentage of sales for each row, then you will need to get the total sales (SUM(Sales)), divide each row's Sales by that total, and do the RANK() OVER (ORDER BY that calculation).  I would probably put that in some sort of temp table and then work with that resultset in another step to provide the final partitioning, because you may also need to provide a running total column (in either ascending or descending order) in order to get the appropriate partitioning (e.g. the top 10% of the rows will probably represent more or less than the top 10% of the sales).

watermanagement (Author) commented:
Thanks rrjegan17,

The script gives the percentile of each number in the series, whereas the requirement is to return the value at a specific percentile passed as an argument.

E.g. for values 0.37, 0.01, 0.28, 0.20, the summary query would produce something along the lines of:

Grouping    Min    Median   Average   Max    5%ile   95%ile
July-2009   0.01   0.24     0.22      0.37   0.04    0.36

The 50th percentile (median) falls between ranked values 2 and 3, 0.20 and 0.28, to give a returned value of 0.24.

So I could put any chosen group against it such as this:
SELECT GroupFields, Min(values), Percentile(values, 0.5), Avg(values), Max(values), Percentile(values, 0.05), Percentile(values, 0.95)
FROM SomeTable
GROUP BY GroupFields
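Absent such a built-in Percentile aggregate, a plain T-SQL sketch of the same per-group calculation is possible in 2005. This is hedged: the table and column names Readings(GroupFields, val) are hypothetical, and the interpolation follows the Excel-style definition:

```sql
-- Sketch only: interpolated percentile @p for each group, assuming a
-- hypothetical table Readings(GroupFields varchar(20), val float).
DECLARE @p float;
SET @p = 0.5;  -- the flexible percentile argument, e.g. 0.5 for the median

WITH ranked AS (
    SELECT GroupFields, val,
           ROW_NUMBER() OVER (PARTITION BY GroupFields ORDER BY val) AS rn,
           COUNT(*)     OVER (PARTITION BY GroupFields)              AS n
    FROM   Readings
)
SELECT lo.GroupFields,
       -- position = p*(n-1)+1; lo is the row at FLOOR(position), hi the next
       lo.val + (@p * (lo.n - 1) + 1 - lo.rn) * (hi.val - lo.val) AS percentile
FROM   ranked lo
JOIN   ranked hi
  ON   hi.GroupFields = lo.GroupFields
 AND   hi.rn = CASE WHEN lo.rn = lo.n THEN lo.rn ELSE lo.rn + 1 END
WHERE  lo.rn = FLOOR(@p * (lo.n - 1) + 1);
```

For a group holding the four sample values 0.01, 0.20, 0.28, 0.37 and @p = 0.5, the position is 2.5, so it interpolates halfway between 0.20 and 0.28, returning the 0.24 given above.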

watermanagement (Author) commented:
To 8080_Diver

Thanks for your response.  I have been through the possibility of creating multiple spu's (reviewing Joe Celko's methodologies) but thought that this way is convoluted and restrictive to a specific table setup. A more attractive approach would be to create an Aggregate Function in the same vein as the built-in Min, Max and Avg functions; but, not having much experience of creating these, I am not aware whether it is possible or practical. What do you think?
8080_Diver commented:
First, let me ask a question regarding your response to rrjegan17.  When you indicate the 5th percentile to be 0.04, are you saying that the value in the row that is 5% of the way up from the bottom is 0.04, or that the average value of the rows between 4.5% and 5.5% of the way up from the bottom is 0.04, or that the average of the first 5% of the data is 0.04?  (As you probably realize, it makes a difference, and "the 5th percentile" can be defined in varying ways. ;-)
One other question, I am not intimately familiar with Joe's methodologies and, therefore, am not familiar with the abbreviation "spu".  (I do know of Joe and have read many of his papers/books and posts to another forum to which both he and I belong; however, I sometimes find him to be a bit pedantic and to use what I consider to be more academic terms while I live in the real world. ;-)
Depending on how you define the ranking function and how you define the meaning of the xth whatever-ile, you should be able to create a stored procedure that accomplishes the task; however, I think that a UDF would get pretty complicated and could have a seriously negative impact on the overall query performance.
The aggregation functions (e.g. SUM, AVG, etc.) are coded into the database for a reason.  What they do and what, for instance, a Percentile function would do involve activities that are best not worked out in SQL. ;-)
By the way, this is one of those "extensions" that Teradata has that makes Teradata perfect for data warehouses and reporting. ;-)
watermanagement (Author) commented:
I think copying this out of Wikipedia may give a more lucid explanation of what I need percentile to mean:
Many software packages, such as Microsoft Excel, use the following method recommended by NIST[4] to estimate the value, vp, of the pth percentile of an ascending ordered dataset containing N elements with values v1,v2,...,vN;

    n = (p/100) * (N - 1) + 1

n is then split into its integer component, k, and decimal component, d, such that n = k + d. v_p is then calculated as:

    v_p = v_1                          for n = 1
    v_p = v_N                          for n = N
    v_p = v_k + d(v_{k+1} - v_k)       for 1 < n < N
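Plugging the four sample values quoted earlier (0.01, 0.20, 0.28, 0.37 in ascending order) into that formula for p = 50 reproduces the 0.24 median from the summary table. A small T-SQL sketch of the arithmetic:

```sql
-- Worked example of the formula above for p = 50, N = 4, with the ordered
-- values 0.01, 0.20, 0.28, 0.37 (so v2 = 0.20 and v3 = 0.28).
DECLARE @p float, @N int, @n float, @k int, @d float;
SELECT @p = 50, @N = 4;
SET @n = (@p / 100.0) * (@N - 1) + 1;  -- n = 2.5
SET @k = FLOOR(@n);                    -- integer component, k = 2
SET @d = @n - @k;                      -- decimal component, d = 0.5
-- v_p = v_k + d * (v_{k+1} - v_k) = 0.20 + 0.5 * (0.28 - 0.20) = 0.24
SELECT 0.20 + @d * (0.28 - 0.20) AS v_p;
```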

"spu" is actually my, obviously confusing, abbreviation for stored procedures.

Aggregate Functions in 2005+ use CLR integration, and it is this that I was looking to use (also, optimisation may not be as big an issue as ease of use), but my concern is how the aggregated dataset is passed through the function (possible splitting and merging of the dataset?).

8080_Diver commented:
"optimisation may not be as big an issue as ease of use"
Initially, in a test environment, that may be true; however, you need to consider the size of the dataset that will have to be handled in a production environment, the potential number of users/applications requesting this sort of thing, the possible number of these per row of data output, and other similar load factors.  Sometimes ease of coding/use leads to SQL and/or UDFs (User Defined Functions) that do not scale well and that can result in slowdowns in retrieving the desired dataset.
The snippet you posted really didn't translate well into a posting. ;-)
However, the web site link answered my question.  You are saying that, for instance, the 73rd percentile would be the value of the row that is 73 percent (or as close as possible to 73 percent) of the way down the dataset, if the dataset is sorted in ascending order by percentile ranking.  In that case, simply appending a ranking based upon whatever you have chosen to the end of the data and then selecting the row that has a rank of 73% of the number of rows should do the trick.
In other words, whatever you rank the data over, sort the data by that (ascending) and select the Nth row, and you have your answer; or select the value from the row with the Nth rank value.  For example, if you have 1000 rows of data and you want the 67th percentile, then you could select the row that has a ranking of 670.
If you want multiple selections (e.g. the 2nd, 5th, 33rd, 50th, 67th, 95th, and 98th percentiles), you would need to do a subselect for each percentile, which is why I suggested using a stored proc and/or a temp table.  With a temp table, the ranking process happens once and then you have a simple set of subselects.
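A hedged sketch of that temp-table idea, using nearest-rank selection (no interpolation) and a hypothetical source table Readings(val float):

```sql
-- Sketch: rank the data once into a temp table, then read off several
-- percentiles by row number (nearest rank, rounding up).  The source
-- table Readings(val float) is hypothetical.
SELECT val,
       ROW_NUMBER() OVER (ORDER BY val) AS rn,
       COUNT(*)     OVER ()             AS n
INTO   #ranked
FROM   Readings;

SELECT
    (SELECT val FROM #ranked WHERE rn = CEILING(0.05 * n)) AS pct_05,
    (SELECT val FROM #ranked WHERE rn = CEILING(0.50 * n)) AS pct_50,
    (SELECT val FROM #ranked WHERE rn = CEILING(0.95 * n)) AS pct_95;

DROP TABLE #ranked;
```

Each extra percentile is just one more subselect against the already ranked temp table, so the expensive sort happens only once.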
watermanagement (Author) commented:
The datasets/databases in question are discrete projects with limited users, which is why I say "optimisation may not be as big an issue as ease of use".  Ease of use would be the ability for me to transfer the method of calculation across tables and fields without having to recode complex stored procedures for each slight change in table design.  I agree with your caution, but, as I explain, with smaller multiple databases/datasets the ability to be flexible outweighs the need for scalability in this case.

The ranking does take you part of the way, but the Nth percentile will only sometimes correspond to an actual row, as in your example. If the group has fewer than 100 rows, for example, the 67th percentile will fall between two rows and, as the web link demonstrates, a linear interpolation between the two is required.  As you point out, achieving the reporting flexibility I am after may not be very profitable in SQL, so I may opt for a combination of SQL Server and MS Office automation to pull out the discrete subsets and produce the reporting that way, particularly as Word and Excel are the general reporting tools used in the overall process/analysis.

Thanks for your insight; maybe it explains why there is not much information on the web in general about doing it this way.

8080_Diver commented:
Essentially, to find the Nth percentile, you can use a FLOOR technique.  In your example of having fewer than 100 rows but wanting the 67th percentile, the calculation would be something approximating INT(N * 0.67).  If you have 80 rows, for example, the 67th percentile would actually be row number 53.6, but INT(53.6) would give row 53.
As for why there is not much out there on the net regarding this topic: there are probably some rather esoteric sites that discuss it.  However, in some cases the reason also involves the choice of DBMS.  As I said earlier, Teradata has some very powerful "extensions" that facilitate this sort of thing.