Solved

Best Implementation of Excel's PERCENTRANK function in SQL Server 2008

Posted on 2008-09-30
23
3,842 Views
Last Modified: 2012-05-05
Hi,

I've done a bunch of web searching on this question, and among others, found the solution below. I want to covert the approach described below into a more versatile UDF or stored procedure for calculating percentile ranks in lots of different tables and views in a database quickly and easily:

http://www.sqlteam.com/article/computing-percentiles-in-sql-server

Basically, in this approach you define a UDF for linear interpolation called LERP; syntax in this application would be LERP(given_value,min_value,max_value,output_min,output_high).

The function is defined as:
_______________________________________

declare @pp float
set @pp = .5

declare @k int, @d float, @ax float, @bx float
declare @values table (i int identity(1,1), x float)

insert @values select Score from @TestScores order by Score

select @k=floor(kf), @d=kf-floor(kf)
from (select 1+@pp*(count(*)-1) as kf from @values) as x1

select @ax=x from @values where i=@k
select @bx=x from @values where i=@k+1

select @pp as factor,
   dbo.LERP(@d, 0.0, 1.0, @ax, @bx) as percentile}
_______________________________________

They then apply the function over this generated data set:
_______________________________________

Declare @TestScores table (StudentID int identity(1,1), Score int)
insert @TestScores (Score) Values (20)
insert @TestScores (Score) Values (03)
insert @TestScores (Score) Values (40)
insert @TestScores (Score) Values (45)
insert @TestScores (Score) Values (50)
insert @TestScores (Score) Values (20)
insert @TestScores (Score) Values (90)
insert @TestScores (Score) Values (20)
insert @TestScores (Score) Values (11)
insert @TestScores (Score) Values (30)
while @@ROWCOUNT > 0
insert @TestScores select t.Score from @TestScores t
cross join
(select max (StudentID) MaxRowNum from @TestScores) x
where
t.StudentID <= 100 - x.MaxRowNum
_______________________________________

In my application, I have a large list of people and scores. I basically want what I have in excel-- a relatively easy way to apply this function automatically (perhaps using dynamic SQL?) to a given SELECT statement. For example, suppose I had this data in a table called Results:

Person      Score
Bill                 45
Fred                 112
Jim                 23
Rick                 200
Bob                 76
Andy            23
Roger         39

Ideally I would like to be able to do something like the following:

SELECT r.Person, r.Score, PERCENTILERANK(r.Score) AS 'Percentile Rank'
FROM Results r
WHERE r.Person not like 'R%';

This would produce:

Person           Score            Percent
Bill                    45            0.500
Fred                      112            1.000
Jim                      23            0.000
Bob                      76            0.750
Andy                      23            0.000

As I said, this would be ideal. If this is simply not technically feasible in SQL Server 2008, then I think the next best thing would be a stored procedure that could be very quickly modified each time to compute the percentile ranking for a given table, perhaps populating a new column using INTO or something. I have read just about everything on the web about this, and I can't help but think there is a better way out there for this incredibly common and useful task. I keep thinking there is a way to use NTILE and PARTITION to do what I want, but I can't figure it out. I am bewildered why this functionality is not embedded in T-SQL in 2008.

I suspect there is also a clever way to do this using a data cube in Analysis Services, but Analysis Services is a big complicated piece of software and I'd rather stick to vanilla T-SQL.

I leave it to the wizards out there to show the way. Thanks for your help!




 

0
Comment
Question by:Dicklestein
  • 9
  • 9
  • 5
23 Comments
 
LVL 142

Accepted Solution

by:
Guy Hengel [angelIII / a3] earned 300 total points
ID: 22603516
please check this out:

declare @t table (person varchar(10), score int )
set nocount on 
insert into @t values('Bill         ',        45)
insert into @t values('Fred          ',       112)
insert into @t values('Jim          ',       23)
insert into @t values('Rick          ',       200)
insert into @t values('Bob         ',        76)
insert into @t values('Andy      ',      23)
insert into @t values('Roger     ',    39) 
;with data as (
select t.*
, dense_rank() over (order by score asc) dr
from @t t
where person not like 'R%'
)
select data.*
     , cast(cast(dr as decimal(10,2)) / ( select max(dr) from data ) as decimal(10,2)) percentile_rank
  from data

Open in new window

0
 

Author Comment

by:Dicklestein
ID: 22606603
AngelIII,

Thanks very much for your prompt reply. Your solution is very nice and fast. But is there any way to wrap up the whole thing into a UDF that I can then use APPLY on? What I'm really looking for is a more modularized solution that is easy to re-use, and which can be easily put into a view. Thanks
0
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 22607867
eventually it would be possible with a .Net CLR function...
however, performance being a key factor, I prefer to rewrite/copy/paste some nice sql code over having to fiddle with .net code (although I like writing .net code, actually) :)
0
Optimizing Cloud Backup for Low Bandwidth

With cloud storage prices going down a growing number of SMBs start to use it for backup storage. Unfortunately, business data volume rarely fits the average Internet speed. This article provides an overview of main Internet speed challenges and reveals backup best practices.

 

Author Comment

by:Dicklestein
ID: 22610255
OK, that makes sense. If you don't mind (not sure what the protocol is for this-- please correct me if I'm wrong), I'd like to keep the question open in hopes that someone will post a such a CLR function or other means of accomplishing my goal.

Also, AngelIII, I'm interested in talking to you about some hourly work, and could not locate your email address. You can reach me at this one-time use email:
<email removed to avoid spamming>

Thanks, and I'll share any points with you and anyone who contributes more.
0
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 22612099
>Also, AngelIII, I'm interested in talking to you about some hourly work, and could not locate your email address. You can reach me at this one-time use email:

please check my profile, it is listed there.
angel eyes
0
 
LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 200 total points
ID: 22621666
The dense_rank is the best way as AngelIII has already shown (is he ever wrong ?),

but not sure why you want a CLR or even a UDF. Simply use a view :

create view vw_percentiles as
with data
as
( select t.*, dense_rank() over (order by score asc) dr
     from ee_Scores t
     where person not like 'R%'
)
select data.*
     , cast(cast(dr as decimal(10,2)) / ( select max(dr) from data ) as decimal(10,2)) percentile_rank
  from data
;


then you can do a:

select * from  mytable inner join vw_percentiles on mytable.person = vw_percentiles.person

or is there someother reason you want to use the cross apply ?

in which case could then use a function, but really depends on "why a function" and "why not a view"...

create function uGetPercentiles()
returns @tbl table (person varchar(10), score int, dr int, percentile decimal(10,5))
as
begin

insert @tbl(person,score,dr,percentile)
select * from vw_percentiles

return

end


All crisp clean SQL, no CLR, possibly no SP, just a humble (and often overlooked) view...
0
 

Author Comment

by:Dicklestein
ID: 22621685
It's really more of a matter of ease of use. I'd like to be able to just call it like you would an excel function, but it appears that SQL Server just doesn't work like that that. With the solution you guys give, you would have to re-type all of that stuff every time you ever want to calculate a percentage rank (which is for me constantly).

Anyway, since it can't be done easily the way I'm talking about, I will close the question and split the points. In fairness, AngelIII answered first with substantially the same response as mark, so he will get more of the points.

Thanks again for the help.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22622650
Thanks for the points, and just a tip, if you are unhappy with the answer, or doesn't seem to be what you want, or feel a need to query us, then please let us know... Only too happy to spend time to resolve any hanging issues...

Not sure why it has to be retyped every time. Once it is part of a view (for example), it is dynamically available - the view is already part of the database, and cannot think of anything easier than select * from vw_percentiles - certainly do not have to re-create. Or, was there something else that you think might have to be regenerated ?

Even though you have awarded points, would not mind finishing off the discussion...
0
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 22622712
mark,

 the "problem" is that if you have several columns, not only score, and you want to apply the same ranking method, you have to copy/paste the function....

  unless you want to implement an .net clr aggregation function :)
0
 

Author Comment

by:Dicklestein
ID: 22622734
Well, I understand that you could easily select from a view of percentile ranks calculated for a given quantity (say, Score in my example above). That's just a matter of creating a view as you describe in your post.

That's not what I'm talking about-- I want to be able to apply the percent rank function to dozens of different quantities, and I don't want to have to create a separate view for each of them, or even to have one view with a bunch of identical SQL statements, each of which re-specifies the way to calculate a percent rank using the DENSE RANK function, etc. etc.

For example, if you go back to my original sample data, suppose that instead of one set of scores, we had 20 of them-- score1 is a score on a written part of a test, score2 is the oral part, score3 is for the extra credit section, etc. So every student has a set of 20 scores. I want an easy way to calculate, in one view, the percentile rank for each student for each set of scores. That's why I think a CLR function would be super useful-- you could just write something like:

SELECT r.Person
, PERCENTILERANK(r.Score1) AS ' Score1Rank'
, PERCENTILERANK(r.Score2) AS ' Score2Rank'
, PERCENTILERANK(r.Score3) AS ' Score3Rank'
FROM Results r

From what I could find online, this functionality is built directly into Oracle. I think it's really a glaring omission from SQL Server.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22622767
Aaahhh, missed that, and then again, still cannot see that requirement above. But yes quite right (yet again - do you ever get tired), if there were several different columns, or in fact a want to nominate which "measure" should be used to rank then it would add an extra layer of complexity... Dynamic SQL would be handy,. but alas not in a function or view - a stored procedure however could go part way (if not all the way).

Cheers,
mark
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22622860
Hi Dicklestein, small timing problem, I was typing the same time you were, my comments are as a response to AngelIII's post.

The functionality is partially available, and instead of opening up the function so that some of the facilities for stored procedures are available, Microsoft has decided that the ability to incorporate CLR is far more powerful. And they are correct - if you are a programmer, with all the tools.

There is nothing to stop you from having multiple statements as you have already observed (though have not checked) :

with data
as
( select t.*, dense_rank() over (order by score_1 asc) dr1,
                   dense_rank() over (order by score_2 asc) dr2,
                   dense_rank() over (order by score_3 asc) dr3
     from ee_Scores t
     where person not like 'R%'
)
select data.*
     , cast(cast(dr1 as decimal(10,2)) / ( select max(dr1) from data ) as decimal(10,2)) percentile_rank_1
     , cast(cast(dr2 as decimal(10,2)) / ( select max(dr2) from data ) as decimal(10,2)) percentile_rank_2
     , cast(cast(dr3 as decimal(10,2)) / ( select max(dr3) from data ) as decimal(10,2)) percentile_rank_3
  from data
;

However a function as you suggest would be considerably more elegant.

I will leave it to you guys to sort out - I am sure e-mails have already been exchanged. And thanks for taking the time to explain.

0
 

Author Comment

by:Dicklestein
ID: 22622928
As I read up more on this topic, it doesn't even sound that hard. See these links:

http://msdn.microsoft.com/en-us/library/ms131063(SQL.90).aspx

http://technet.microsoft.com/en-us/library/ms182741.aspx

http://searchsqlserver.techtarget.com/generic/0,295582,sid87_gci1251415,00.html

Perhaps I should cross post this on the .NET section of EE? Really, this seems like 2 minutes of work for a .NET wizard, and would be very nice for the SQL Server community to have.
0
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 22622981
If my comment implied somewhere that it was "hard" to do the clr.net function, I must apologize.
I justed wanted to make sure the border(s) you have to cross :)

and you are of course free to ask in the .net zone to get a .net developer to do it... now, although I do .net coding, I prefer, for the moment, to keep it out of my (prod) sql servers :)
0
 

Author Comment

by:Dicklestein
ID: 22622991
I realized that I could very easily use excel to dynamically generate the SQL Statements in this form:

with data1 as (select  w.*, dense_rank() over (order by t.Test1 asc) dr FROM  TestScores t) select  cast(cast(dr as decimal(10,3)) / ( select max(dr) from data1 ) as decimal(10,3)) Test1 from data1 order by 'Test1' desc

with data2 as (select  w.*, dense_rank() over (order by t.Test2 asc) dr FROM  TestScores t) select  cast(cast(dr as decimal(10,3)) / ( select max(dr) from data1 ) as decimal(10,3)) Test2 from data2 order by 'Test2' desc

with data3 as (select  w.*, dense_rank() over (order by t.Test3 asc) dr FROM  TestScores t) select  cast(cast(dr as decimal(10,3)) / ( select max(dr) from data1 ) as decimal(10,3)) Test3 from data3 order by 'Test3' desc

Each of these works fine on its own, but is there a way to "glue" all of these together using INNER JOINS or Unions or something, so we have all of the PercentRanks for each Test, and the leftmost column would be the name of the Person.

0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22622995
Yep, that is a good idea...

There is a lot of good info in the MS arena (as much as we all bag them out), it is pretty powerful stuff... Simply add EE and you have solutions (not just technology)!

Cheers,
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22623035
I gotta get faster e-mail. My last comment (22622995) was referring to Dicklestein's .net comments...

In reference to ID:22622991 (ie "glue") then if you are going to do that, then you can do it in one query see 22622860 above.

0
 

Author Comment

by:Dicklestein
ID: 22623069
Yes, but my way has the advantage that the atomistic statements all fit on one line and thus can be trivially replicated with easy string manipulations in excel. See the attached file to see what I mean.
PercentileRank-SQL-Generate.xls
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22623355
Crikey !!

When AngelIII first suggest multi-columnar difficulties, I thought - yeah 3... Not quite 56...

If you are really happy with Excell, then what about if... we simply create a procedure that can respond to your Excel ? Might be fun anyway, even if not answering your question (Told you you closed this question off far too early...). Possibly more an example that MS SQL might be at least flexibile if not as powerful as Oracle...

It is possible to open your excel spreadsheet as a linked server, attaching to that worksheet (as if it were a table) create a dynamic SQL statement from the big column, and then execute it....

Have it as a stored procedure, nominating which test you want to run, or all. Results can be appended to a table for subsequent review (or display as output).




-- create a link to that spreadsheet so we can import
 
EXEC sp_addlinkedserver MyExcel,
     'Jet 4.0',
     'Microsoft.Jet.OLEDB.4.0',
     'c:\ee\PercentileRank-SQL-Generate.xls', 
      NULL,
     'Excel 5.0;'
GO
 
--Set up login mappings (just ADMIN - jet wants something).
 
EXEC sp_addlinkedsrvlogin MyExcel, FALSE, NULL, Admin, NULL
GO
 
--List the tables in the linked server (these are the worksheet names) - not needed, but interesting
 
EXEC sp_tables_ex MyExcel
GO
 
-- I did take the liberty of "fixing" the column names... SEQ, COMMENT, TEST, QUERY
 
select * from myexcel...percentrank$ where TEST like 'TEST%'
 
-- we can also write back / update that spreadsheet from here just like a table.
 
 
-- now run through the blighter and execute each command...
 
declare @sql varchar(2000)
 
DECLARE c CURSOR FAST_FORWARD READ_ONLY FOR select query from myexcel...percentrank$ where TEST like 'TEST%'
OPEN c
 
FETCH NEXT FROM c INTO @sql
WHILE @@FETCH_STATUS = 0
BEGIN
      print @sql  -- or insert
      exec (@sql) -- really an executesql, just lazy
      FETCH NEXT FROM c INTO @sql
END
CLOSE c
DEALLOCATE c
 
-- now clean it up...
 
sp_dropserver 'MyExcel', 'droplogins';
GO

Open in new window

0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22623407
If you don't need to update, and just use as a datasource,

you could also do :

SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0', 'Excel 8.0;Database=c:\ee\PercentileRank-SQL-Generate.xls;HDR=YES', 'SELECT * FROM [percentrank$]')
GO

select * from OPENDATASOURCE('Microsoft.Jet.OLEDB.4.0', 'data source="c:\ee\PercentileRank-SQL-Generate.xls";Extended properties=Excel 5.0;')...[percentrank$]
GO
0
 

Author Comment

by:Dicklestein
ID: 22626039
Mark,

I'm not sure I really follow how your code works, but will try to study it more. If I don't actually need to interface with excel, and can just copy and paste from excel to a SQL query pane, is there a nice way you know to glue together those queries as I suggested?

Also, I posted a question in the .NET zone about creating the percentile rank CLR User Defined Aggregate function we discussed:

http://www.experts-exchange.com/Programming/Languages/.NET/Visual_CSharp/Q_23782290.html
0
 

Author Comment

by:Dicklestein
ID: 22626495
Mark, never mind my last question. I figured it out myself using your approach.

I created the attached excel template which is very nice for generating this SQL in a couple minutes, even for dozens of quantities. I suppose having a collection of similar excel templates could really speed up development.
Automate-Percent-Rank-SQL-Query.xls
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22626655
Looks good....
0

Featured Post

3 Use Cases for Connected Systems

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, testing some more, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

How to leverage one TLS certificate to encrypt Microsoft SQL traffic and Remote Desktop Services, versus creating multiple tickets for the same server.
This article describes how to use the timestamp of existing data in a database to allow Tableau to calculate the prior work day instead of relying on case statements or if statements to calculate the days of the week.
Along with being a a promotional video for my three-day Annielytics Dashboard Seminor, this Micro Tutorial is an intro to Google Analytics API data.
Finds all prime numbers in a range requested and places them in a public primes() array. I've demostrated a template size of 30 (2 * 3 * 5) but larger templates can be built such 210  (2 * 3 * 5 * 7) or 2310  (2 * 3 * 5 * 7 * 11). The larger templa…

776 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question