Solved

# Percentile calculations in MS Access

Posted on 2007-07-26
1,365 Views
I am working with calculating percentiles in MS Access 2003. I have been able to set up a module that calculates any percentile from any field of data in any table.

For example:

percentile("water","temp",".25") will calculate the 25th percentile of temperature data located in a table called "water".

I use this module in an expression within a query to get the result.

My problem arises when I attempt to group the results by, let's say, the time of day the water temperature was measured. My attempt at this has been to add another field to "Group By" in the query. I add the "time" field and choose "Group By". The expression continues to calculate  the 25th percentile of ALL of the temperature data and does NOT group by the "time" field. The query returns the SAME percentile value (the value that represents the entire field of data) for each time specified through the "Group By".

I have also tried using "Criteria" to select out a subset of my data to perform the calculation, but this does not work either, the expression still returns the value based on the full set of data.

I am thinking that I can make this happen by modifying the SQL (copied here):

SELECT percentile("water","temp",".25") AS Expr1, water.time
FROM water
GROUP BY water.time;

Any suggestions of what can be done would be great!
0
Question by:dougf1r

LVL 92

Expert Comment

Try this (fingers crossed):

SELECT DISTINCT w1.[time],
(SELECT TOP 1 w3.temp
FROM
(SELECT TOP 25 PERCENT w2.temp
FROM water w2
WHERE w2.[time] = w1.[time]
ORDER BY w2.temp ASC) AS w3
ORDER BY w3.temp DESC) AS 25th_Perc
0

LVL 92

Expert Comment

Sorry, was incomplete :)

SELECT DISTINCT w1.[time],
(SELECT TOP 1 w3.temp
FROM
(SELECT TOP 25 PERCENT w2.temp
FROM water w2
WHERE w2.[time] = w1.[time]
ORDER BY w2.temp ASC) AS w3
ORDER BY w3.temp DESC) AS 25th_Perc
FROM w1.water
ORDER BY w1.[time]
0

LVL 58

Expert Comment

The query solution works only with recordsets having 1 record or any multiple of four plus one records. The 25th percentile isn't necessarily the value of one single record. For example, the 25th percentile of the values {0, 10} is 2.5.

However, the criteria can be incorporated into the percentile function. Imagine this function (with a different ordering of parameters, sorry!):

Function Percentile( _
Percent As Double, _
Expression As String, _
Source As String, _
Optional Criteria = Null)

Dim strSQL As String
Dim dblPosition As Double

Percentile = Null
strSQL = "SELECT " & Expression _
& " FROM " & Source _
& " WHERE " & Expression & " Is Not Null" _
& " AND " + Criteria _
& " ORDER BY " & Expression
With CurrentDb.OpenRecordset(strSQL, dbOpenSnapshot)
If .RecordCount = 0 Then Exit Function
If Percent <= 0 Then Percentile = .Fields(0): Exit Function
.MoveLast
If Percent >= 1 Then Percentile = .Fields(0): Exit Function
dblPosition = (.RecordCount - 1) * Percent
.AbsolutePosition = Int(dblPosition)
Percentile = .Fields(0)
If .AbsolutePosition < dblPosition Then
Percentile = Percentile * (dblPosition - .AbsolutePosition)
.MoveNext
Percentile = Percentile _
+ .Fields(0) * (.AbsolutePosition - dblPosition)
End If
End With

End Function

You could then write your query as:

SELECT
Percentile(0.25, "temp", "water", "time=" & water.time) AS As P25,
water.time
FROM water
GROUP BY water.time;

Note that the exact syntax depends on the data type of time, as you need to build a valid SQL string as criteria:

numeric:      "time=" & water.time
string:         "time='" & water.time & "'"
date/time:   "time=#" & Format(time, "hh:nn:ss") & "#"
or:               "time=" & CDbl(water.time)

Cheers!
(Â°vÂ°)
0

LVL 1

Author Comment

Thanks for the multiple potential solutions.

matthewspatrick: I tried your SQL method first.... after choosing to run the query I get a message box requiring me to "Enter Parameter Value" for m1.time before it will give me any results.

If I enter "0" it returns 1 row of data with 2 fields. Field 1 (titled Expr 1) has a value of 0 and Field 2 (titled 25th_Perc) is null.

If I enter "1" it returns a similar matirx, however, the 25th_Perc field contains the 25th percentile of the full "temp" data field (there is no break down by "time")

If I enter "2" it gives me a message that says "At most one record can be returned by this subquery"

Suggestions?

0

LVL 1

Author Comment

harfang:

Question #1: In the SELECT statement of the query, should it only have one AS before the P25? I ask because when I first ran the query I got "The SELECT statement includes a reserved word or an argument name that is misspelled or missing, or the punctuation is incorrect". After removing one of the AS's it seemed to work better, but led to another error message.....

Question #2: The next message I got was "Undefined function 'Percentile' in expression". I created the function (Percentile) by copying in the code you provided. Do I need to add anything for the query to recognize it?

......my time field is numeric (double)
0

LVL 92

Expert Comment

dougf1r,

I made up a dataset and tried it in Access, and could not get it to work.  I guess Jet just really
hated my syntax.

However, the following "worked" in SQL Server, and gave me back exactly the results I
expected:

SELECT DISTINCT w1.[time],
(SELECT TOP 1 w3.temp
FROM
(SELECT TOP 25 PERCENT w2.temp
FROM water w2
WHERE w2.[time] = w1.[time]
ORDER BY w2.temp ASC) AS w3
ORDER BY w3.temp DESC) AS [25th_Perc]
FROM water w1
ORDER BY w1.[time]

Exactly the same logic as my last post, just cleaned up a typo or two.

Now, I say "worked" because Markus has a legitimate point about what I am doing.
For small data sets my logic will give a potentially misleading answer, as I make no
attempt to interpolate when the sample population in each segment is not an even
multiple of four.

Regards,

Patrick
0

LVL 92

Expert Comment

dougf1r,

BTW, I ran Markus's suggestion in Access, and here is how our approaches stack up on my
fake data set (4000 records total):

time       Markus        Patrick
12:00 AM       23,243.50        23,213.00
1:00 AM       30,091.00        30,091.00
2:00 AM       28,608.50        28,481.00
3:00 AM       21,854.75        21,750.00
4:00 AM       21,987.00        21,987.00
5:00 AM       26,392.00        26,188.00
6:00 AM       24,855.00        24,855.00
7:00 AM       24,702.00        24,620.00
8:00 AM       28,751.50        28,591.00
9:00 AM       27,380.00        27,047.00

Again, the differences come about because Markus interpolated, and I did not.

Note to Markus:  your SQL needs to be:

SELECT
Percentile(0.25, "temp", "water", "time=#" & water.time & "#") AS P25,
water.time
FROM water
GROUP BY water.time;

:)

Regards,

Patrick
0

LVL 58

Expert Comment

Yes, typo in the query. This covers question #1 -- (^vÂ°)

About question #2, "Undefined function 'Percentile' in expression", this could be simply because you now have two functions with the same name, yours and mine. This is never good and in this case VB can't decide which to use.

If that isn't the problem, did you paste the function in a normal module, with Option Explicit at the top, and did it compile properly? (Debug | Compile <project name>)

Finally, you can also incorporate the criteria into your existing function, and just use my code as example on how to achieve that.

(Â°vÂ°)
0

LVL 1

Author Comment

harfang,

I only have the one function loaded, so no chance for double naming.

I had "Option Compare Database" at the top, i have replaced it with "Option Explicit", but it still does not run. It compiles properly.

I am relatively new to VB and had a colleague assist with my original function. I have studied the function you provided to see how I could restructure my own function, but have not had success.

Any other suggestions as to why the function may be 'undefined'?

dougf1r
0

LVL 58

Expert Comment

What is the name of your module? A module cannot use the same name of a function (well, it can, but that makes it almost impossible to call the function).

(Â°vÂ°)
0

LVL 1

Author Comment

Markus,

It works now, after renaming the module.

On a continuation to this question... how would grouping by a second field (say, the "day" of the year) be incorporated into the function?

Patrick: I still could not get the SQL to run properly, it was giving me the same "Enter Parameter Value" message box as before.

dougf1r
0

LVL 58

Accepted Solution

You can create more complex criteria by chaining them with "And". For example:

SELECT
Percentile(0.25, "temp", "water",
"time=" & water.time & " and DOY=" & water.DOY
) AS As P25,
water.time
FROM water
GROUP BY water.DOY, water.time;

Again, the exact syntax depends on the field type of DOY, used here only as example of an additional field.

I'm glad the function works now!
(Â°vÂ°)
0

LVL 1

Author Comment

Me too!

Thanks for your help with this.

- dougf1r
0

LVL 92

Expert Comment

dougf1r said:
>>Patrick: I still could not get the SQL to run properly, it was giving me the same "Enter
>>Parameter Value" message box as before.

As I mentioned before, Jet just must not like the syntax.  The same syntax worked in
SQL Server.  Oh well :)

BTW, I did get this to work in Access:

SELECT w3.[time], Max(w3.temp) AS xxx
FROM
(SELECT w1.[time],  w1.temp, (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time] AND w2.temp <= w1.temp) /  (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time]) AS Perc25
FROM water w1
WHERE (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time] AND w2.temp <= w1.temp) /  (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time]) <= 0.25
ORDER BY  w1.[time],  w1.temp) AS w3
GROUP BY w3.[time]
ORDER BY w3.[time];

All in all, I think Markus's suggestion is probably the best.  Glad to see things worked out, and
thanks for giving me a chance to stretch those SQL-related brain muscles :)
0

LVL 58

Expert Comment

> stretch those SQL-related brain muscles

LOL. You can say that again! I liked your first version (the one Jet chokes on) much better, naturally. While reading the question I was already building something similar mentally!

(Â°vÂ°)
0

LVL 1

Author Comment

Markus,

The SQL you gave for expanding the criteria did not work for me, so i modified the syntax a bit (in regards to the AND statement)....

SELECT Percentile(0.25,"temp","water","time=" & water.time and "DOY=" & water.DOY) AS P25, water.time
FROM water
GROUP BY water.time, water.flow_DOY;

However, I still can't get it to run. I tried a few other changes, and still no go.

Do you see anything wrong with the syntax I have here?

dougf1r
0

LVL 58

Expert Comment

Yes. You need to use your field name throughout. I used DOY just as an example. You seem to use water.flow_DOY, and water.DOY. That will not work. (And I doubt your field name just happens to be similar to my silly acronym for "day of year".)

Second, you moved the "and" from inside the quotes to outside. That has a very different meaning and will not work at all. Put the "and" back inside the quotes, just as I showed.

Also, remember the bit about the data type. The syntax is different whether you new field is numeric, text, or date.

(Â°vÂ°)
0

LVL 1

Author Comment

harfang,

I hear you about keeping the proper field name throughout. I was just trying to keep the acronym's you were using in there to avoid confusion (but I forgot to rename that last piece properly).

I moved the "and" back inside the quotes (and removed one of the AS statements) and i continue to get errorrs. (I also added water.DOY after the AS statement to include that field)

The data type is numeric, so i am keeping everything as "time="

So my current SQL now reads:

SELECT
Percentile(0.25, "temp", "water",
"time=" & water.time & " and DOY=" & water.DOY) AS P25, water.time, water.DOY
FROM water
GROUP BY water.DOY, water.time;

...but I am still having errors. it first gives me an "Enter Parameters Values" message box, then Run-Time error 3061.

dougf1r

0

LVL 1

Author Comment

harfang,

IT WORKS NOW!

I think it was the space between the quotation and "and" in the SELECT statement...

Thanks again!

- dougf1r
0

LVL 58

Expert Comment

Welcome, and grats on finding that missing space!
(Â°vÂ°)
0

LVL 1

Author Comment

harfang,

I am checking the percentiles that are calculated via the module in Access against calculations in Excel.

It seems that in some instances the two methods give exactly the same result, however, other instances give somewhat significantly different results. For example:

For this data range:

0.022
0.001
0.038
0.052
0.088
0.063
0.126
0.062
0.135
0.103
0.171
0.06
0.127
0.094
0.057
0.133
0.129

Access and Excel give these results for the 5th, 25th, 50th, 75th, and 95th percentiles:

PCT      Access        Excel
P05      0.0052          0.0178
P25      0.057           0.057
P50      0.088           0.088
P75      0.127           0.127
P95      0.1638          0.1422

It looks  like the module matches well with Excel for 25 <= percentile <= 75, but does not match well with Excel for the extreme cases.

Is there something in the VB code for the Percentile function that would be causing this?

- Dougf1r
0

LVL 58

Expert Comment

Since you have 17 values, the quartiles are respectively the 1st, 5th, 9th, 12th, and 17th records. That is trivial and doesn't need interpolation.

However, my interpolations are wrong. Just the reverse of what they should be, actually! Replace the final If statement with this:

If .AbsolutePosition < dblPosition Then
Percentile = Percentile * (1 - dblPosition + .AbsolutePosition)
.MoveNext
Percentile = Percentile _
+ .Fields(0) * (1 - .AbsolutePosition + dblPosition)
End If

Then the function will agree with Excel.

I thought I had tested it, but apparently not with the right numbers. Sorry about that, and I'm glad you didn't take the function at face value!

(Â°vÂ°)
0

LVL 1

Author Comment

Works great!
0

## Featured Post

Confronted with some SQL you don't know can be a daunting task. It can be even more daunting if that SQL carries some of the old secret codes used in the Ye Olde query syntax, such as: (+) Â  Â  as used in Oracle; Â  Â  *= Â  Â  =* Â  Â as used in Sybase â€¦
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even withinâ€¦
It is a freely distributed piece of software for such tasks as photo retouching, image composition and image authoring. It works on many operating systems, in many languages.
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet sâ€¦