Creating Percentile Aggregate Function In SQL Server 2005 Express
I am looking to create summary information on numerical data sets within SQL Server 2005 Express. These need to include additional aggregate statistics such as Percentiles (flexible percentiles preferentially i.e add the nth value of the percentile statistic as an argument). I am not sure as to the best approach but have been considering creating an Aggregate Function.
I am surprised that looking into doing this i have not found very much information on the web.
I am considering using a standard method such as that used in Excel to produce a percentile which would require ordering the aggregate group subset and determining the relevant rank of the percentile argument (simplistically 50th Percentile: 50/100 * n = rank) .
One concern i have is that i believe aggregate functions may split and merge subsets which would not work for ranking functions?
If anyone has experience of implementing statistical functions in this way or alternative methods i would be grateful in hearing about it.
CREATE TABLE test ( a int IDENTITY, b varchar(10), c int)INSERT INTO test (b,c) VALUES ( 'A', 1);INSERT INTO test (b,c) VALUES ( 'B', 3);INSERT INTO test (b,c) VALUES ( 'C', 2);INSERT INTO test (b,c) VALUES ( 'D', 5);INSERT INTO test (b,c) VALUES ( 'E', 4);SELECT A,B,C,(RANK() OVER (ORDER BY C ASC) * 100) / COUNT(*) OVER () PERCENTILEFROM TEST
I have done this in Teradata, DB2, and SQL Server and there are variations between them as to how to do it. (Teradata has the best options, by the way. ;-)
Essentially, if I hunderstand your stated approach, it is going to be the appropriate one.
First you need to determine the way you want to break the data (deciles, quartiles, percentiles, or something else). Having done that, you have to define a way of identifying the whatever-ile that a row falls into.
For example, you could order the the rows by SalesThisMonth, use a Rank() function to add a temporary sequence number, and then use (Rank() modulo N) [where N is the number of rows divided by 4 for quartiles) for each row to break the set into the desired quartile subsets.
The Good News is that SQL Server 2005/2008 have a Rank() function and you can get the number of rows fairly easily. So, all you have to do is create some subqueries to get this done. ;-)
Now, as to the aggregation of the data based upon the subsets, that is where you have another decision to make. You will probably want to partition the data before you do the aggregations if you want the calculations to be done by whatever-ile. In other words, if you wwant the aggregations to be within, for instance, the quartiles but, if you want to have the aggregations done accross the full dataset, then you need to do some aggregations before you create the subsets.
Hope that helps.
I would recommend putting this sort of thin in one or more stored procedures and parameterizing it so that you can pull quartiles, deciles, percentiles, or whatever.
The basic approach provided by rrjegan17 handles the ranking process; however, if you need to get, for instance, percentage of Sales for each row, then you will need to get the total sales (SUM(Sales)] and divide that into each rows Sales, and do the RANK() OVER (ORDER BY that calcualation). I would probably put that in some sort of temp table and then work with that resultset in another step to provide the final partitioning because you may also need to provide a running total column (either in ascending or descending order) in order to get the apprpriate partitioning (e.g. the top 10% of the rows will probably represent more or less than the top 10% of the Sales.
0
Want to increase your earning potential in 2018? Pad your resume with app building experience. Learn how with this hands-on course.
The script gives the percentile of each number in the series where as the requirement is to return the value of a specific (argument) percentile.
E.g. for values 0.37, 0.01, 0.28, 0.20 The summary query would look something along the lines:
Grouping Min Median Average Max 5%ile 95%ile
July-2009 0.01 0.24 0.22 0.37 0.04 0.36
The 50% percentile (median) falls between ranked values 2 and 3, 0.22 and 0.28 to give a returned value of 0.24.
So I could put any chosen group against it such as this:
SELECT GroupFields, Min(values), Percentile(values, 0.5) Avg(values), Max(values), Percentile(values, 0.05), Percentile(values, 0.95)
GROUP BY GroupFields
0
watermanagementAuthor Commented:
To 8080_Diver
Thanks for your response. I have been through the possibility of creating multiple spu's (reviewing Joe Celko's methodologies) but thought that this way is convoluted and restrictive to a specific table setup. A more attractive approach would be to create an Aggregate Function in the same context as the inherent Min, Max and Avg functions - but not having much experience in creating these am not aware if it is possible or practical. What do you think?
First let me ask a question regarding your response to rrjegan17. When you indicate the 5th Percentile to be 0.04, are you saying that the value in the row at that is 5% of the way up from the bottom is 0.04 or that the average value of the rows between 4.5% and 5.5% of the way up from the bottom is 0.04 or that the average of the first 5% of the data is 0.04? (As you probably realize, it makes a difference and "the 5th percentile" can be defined in varying ways. ;-)
One other question, I am not intimately familiar with Joe's methodologies and, therefore, am not familiar with the abbreviation "spu". (I do know of Joe and have read many of his papers/books and posts to another forum to which both he and I belong; however, I sometimes find him to be a bit pedantic and to use what I consider to be more academic terms while I live in the real world. ;-)
Depending on how you define the ranking function and how you define the meaning of the xth whatever-ile, you should be able to create a stored procedure that accomplishes the task; however, I think that a UDF would get pretty complicated and could have a seriously negative impact on the over all query performance.
The aggregation functions (e.g. SUM, AVG, etc.) are coded into the database for a reason. What they do and what, for instance, a Percentile function would do involve activities that are best not worked out in SQL. ;-)
By the way, this is one of those "extensions" that Teradata has that makes Teradata perfect for data warehouses and reporting. ;-)
0
watermanagementAuthor Commented:
I think copying this out of wikipedia may give a more lucid explanation of what i need percentile to mean:
Many software packages, such as Microsoft Excel, use the following method recommended by NIST[4] to estimate the value, vp, of the pth percentile of an ascending ordered dataset containing N elements with values v1,v2,...,vN;
n = \frac{p}{100}\,({N}-1)+1
n is then split into its integer component, k and decimal component, d, such that n = k + d. vp is then calculated as:
"spu" is actually my, obviously confusing, abbreviation for stored procedures.
The Aggregation Function in 2005+ use CLR integration and it is this that I was looking to use (also optimisation may not be as bigger issue as ease of use) but my concern is how the aggregated dataset is passed through the function (possible splitting and merging of the dataset?)
optimisation may not be as bigger issue as ease of use
Initially, in a test environment, that may be true; however, you need to consider the size of the dataset that will have to be handled in a production environment, the potential number of users/applications requesting this sort of thing, the possible number of these per row of data output, and other similar load factors. Sometimes, ease of coding/use leads to SQL and/or UDF's (User Defined Functions) that do not scale well and that can result in sow downs in retrieving the desired dataset.
The snippet you posted really didn't translate well into a posting. ;-)
However, the web site link answered my question. You are saying that, for instance, the 73 percentile would be the value of the row that is 73 percent (or the closest possible to 73 percent) of the way down the dataset if the dataset is sorted in ascending order by percentile ranking. In that case, simply appending a ranking based upon whatever you have chosen to the end of the data and then selecting the row that is has a rank of 73% of the number of rows should do the trick.
In other words, whatever you rank the data over, sort the data bay that (ascending) and select the Nth row and you have your answer or select the value from the row with the Nth rank value. For example if you have 1000 rows of data and you want the 67th percentile, then you could select the row that has a ranking of 670.
If you want multiple selections (e.g. 2nd, 5th, 33rd, 50th, 67th, 95th, and 98th percentile), you would need to do a subselect for each percentile, which is why I suggested using a stored proc and/or a temp table. With a temp table, the ranking precoess happens once and then you have a simple set of subselects.
0
watermanagementAuthor Commented:
The datasets/databases in question are discrete projects and have limited users, which is why i say "optimisation may not be as bigger issue as ease of use". Ease of use would be the ability for myself to transfer the method of calculation across tables, fields without having to recode complex stored procedures for each slight change in table design. I agree with your caution but as i explain with smaller multiple databases/datasets the ability to be flexible outweighs the need for scalability in this case.
The ranking does take you part the way but the Nth percentile will only sometimes be equivalent to an actual row as you express in your example. If the group is less than 100, for example, the 67 percentile will fall between two rows and as the web link demonstrates, a linear interpolation between the two is required. As you point out, to give the reporting flexibility I am after from the data may not be very profitable in SQL and so i may opt for a combination SQL Server and MS Office automation to pull out the descrete subsets and produce the reporting that way, particularly as Word and Excel are the general reporting tools used in the overall process/analysis.
Thanks for your insight and maybe it explains how come there's not much information on the web in general to do it this way.
Essentially, to find the Nth percentile, ou can use a FLOOR technique. In your example of having only less than 100 rows but wanting the 67th percentile, the calculation would be something approximating INT(N * 0.67). If you have 80 rows, for example, the 67th percentile would actually be row number 53.6 but INT(56.3) would give row 53.
As for why there is not much out there on the net regarding this topic. There are probably some rather esoteric sites that might dicuss the topic. However, in some cases, the reason also involves the choice of DBMS. As I said earlier, Teradata has some very powerful "extensions" that facilitate this sort of thing.
0
Featured Post
ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.
One of a set of tools we're offering as a way to say thank you for being a part of the community.
Open in new window