Link to home
Start Free TrialLog in
Avatar of dougf1r
dougf1r

asked on

Percentile calculations in MS Access

I am working with calculating percentiles in MS Access 2003. I have been able to set up a module that calculates any percentile from any field of data in any table.

For example:

percentile("water","temp",".25") will calculate the 25th percentile of temperature data located in a table called "water".

I use this module in an expression within a query to get the result.

My problem arises when I attempt to group the results by, let's say, the time of day the water temperature was measured. My attempt at this has been to add another field to "Group By" in the query. I add the "time" field and choose "Group By". The expression continues to calculate  the 25th percentile of ALL of the temperature data and does NOT group by the "time" field. The query returns the SAME percentile value (the value that represents the entire field of data) for each time specified through the "Group By".

I have also tried using "Criteria" to select out a subset of my data to perform the calculation, but this does not work either, the expression still returns the value based on the full set of data.

I am thinking that I can make this happen by modifying the SQL (copied here):

SELECT percentile("water","temp",".25") AS Expr1, water.time
FROM water
GROUP BY water.time;

Any suggestions of what can be done would be great!
Avatar of Patrick Matthews
Patrick Matthews
Flag of United States of America image

Try this (fingers crossed):


SELECT DISTINCT w1.[time],
    (SELECT TOP 1 w3.temp
    FROM
        (SELECT TOP 25 PERCENT w2.temp
        FROM water w2
        WHERE w2.[time] = w1.[time]
        ORDER BY w2.temp ASC) AS w3
    ORDER BY w3.temp DESC) AS 25th_Perc
Sorry, was incomplete :)


SELECT DISTINCT w1.[time],
    (SELECT TOP 1 w3.temp
    FROM
        (SELECT TOP 25 PERCENT w2.temp
        FROM water w2
        WHERE w2.[time] = w1.[time]
        ORDER BY w2.temp ASC) AS w3
    ORDER BY w3.temp DESC) AS 25th_Perc
FROM w1.water
ORDER BY w1.[time]
The query solution works only with recordsets having 1 record or any multiple of four plus one records. The 25th percentile isn't necessarily the value of one single record. For example, the 25th percentile of the values {0, 10} is 2.5.

However, the criteria can be incorporated into the percentile function. Imagine this function (with a different ordering of parameters, sorry!):

Function Percentile( _
    Percent As Double, _
    Expression As String, _
    Source As String, _
    Optional Criteria = Null)
   
    Dim strSQL As String
    Dim dblPosition As Double
   
    Percentile = Null
    strSQL = "SELECT " & Expression _
        & " FROM " & Source _
        & " WHERE " & Expression & " Is Not Null" _
        & " AND " + Criteria _
        & " ORDER BY " & Expression
    With CurrentDb.OpenRecordset(strSQL, dbOpenSnapshot)
        If .RecordCount = 0 Then Exit Function
        If Percent <= 0 Then Percentile = .Fields(0): Exit Function
        .MoveLast
        If Percent >= 1 Then Percentile = .Fields(0): Exit Function
        dblPosition = (.RecordCount - 1) * Percent
        .AbsolutePosition = Int(dblPosition)
        Percentile = .Fields(0)
        If .AbsolutePosition < dblPosition Then
            Percentile = Percentile * (dblPosition - .AbsolutePosition)
            .MoveNext
            Percentile = Percentile _
                + .Fields(0) * (.AbsolutePosition - dblPosition)
        End If
    End With
   
End Function

You could then write your query as:

SELECT
  Percentile(0.25, "temp", "water", "time=" & water.time) AS As P25,
  water.time
FROM water
GROUP BY water.time;

Note that the exact syntax depends on the data type of time, as you need to build a valid SQL string as criteria:

numeric:      "time=" & water.time
string:         "time='" & water.time & "'"
date/time:   "time=#" & Format(time, "hh:nn:ss") & "#"
or:               "time=" & CDbl(water.time)

Cheers!
(°v°)
Avatar of dougf1r
dougf1r

ASKER

Thanks for the multiple potential solutions.

matthewspatrick: I tried your SQL method first.... after choosing to run the query I get a message box requiring me to "Enter Parameter Value" for m1.time before it will give me any results.

If I enter "0" it returns 1 row of data with 2 fields. Field 1 (titled Expr 1) has a value of 0 and Field 2 (titled 25th_Perc) is null.

If I enter "1" it returns a similar matirx, however, the 25th_Perc field contains the 25th percentile of the full "temp" data field (there is no break down by "time")

If I enter "2" it gives me a message that says "At most one record can be returned by this subquery"

Suggestions?



Avatar of dougf1r

ASKER

harfang:

Question #1: In the SELECT statement of the query, should it only have one AS before the P25? I ask because when I first ran the query I got "The SELECT statement includes a reserved word or an argument name that is misspelled or missing, or the punctuation is incorrect". After removing one of the AS's it seemed to work better, but led to another error message.....

Question #2: The next message I got was "Undefined function 'Percentile' in expression". I created the function (Percentile) by copying in the code you provided. Do I need to add anything for the query to recognize it?

......my time field is numeric (double)
dougf1r,

I made up a dataset and tried it in Access, and could not get it to work.  I guess Jet just really
hated my syntax.

However, the following "worked" in SQL Server, and gave me back exactly the results I
expected:

SELECT DISTINCT w1.[time],
    (SELECT TOP 1 w3.temp
    FROM
        (SELECT TOP 25 PERCENT w2.temp
        FROM water w2
        WHERE w2.[time] = w1.[time]
        ORDER BY w2.temp ASC) AS w3
    ORDER BY w3.temp DESC) AS [25th_Perc]
FROM water w1
ORDER BY w1.[time]

Exactly the same logic as my last post, just cleaned up a typo or two.

Now, I say "worked" because Markus has a legitimate point about what I am doing.  
For small data sets my logic will give a potentially misleading answer, as I make no
attempt to interpolate when the sample population in each segment is not an even
multiple of four.

Regards,

Patrick
dougf1r,

BTW, I ran Markus's suggestion in Access, and here is how our approaches stack up on my
fake data set (4000 records total):

time       Markus        Patrick       
12:00 AM       23,243.50        23,213.00       
1:00 AM       30,091.00        30,091.00       
2:00 AM       28,608.50        28,481.00       
3:00 AM       21,854.75        21,750.00       
4:00 AM       21,987.00        21,987.00       
5:00 AM       26,392.00        26,188.00       
6:00 AM       24,855.00        24,855.00       
7:00 AM       24,702.00        24,620.00       
8:00 AM       28,751.50        28,591.00       
9:00 AM       27,380.00        27,047.00       

Again, the differences come about because Markus interpolated, and I did not.

Note to Markus:  your SQL needs to be:

SELECT
  Percentile(0.25, "temp", "water", "time=#" & water.time & "#") AS P25,
  water.time
FROM water
GROUP BY water.time;

:)

Regards,

Patrick
Yes, typo in the query. This covers question #1 -- (^v°)

About question #2, "Undefined function 'Percentile' in expression", this could be simply because you now have two functions with the same name, yours and mine. This is never good and in this case VB can't decide which to use.

If that isn't the problem, did you paste the function in a normal module, with Option Explicit at the top, and did it compile properly? (Debug | Compile <project name>)

Finally, you can also incorporate the criteria into your existing function, and just use my code as example on how to achieve that.

(°v°)
Avatar of dougf1r

ASKER

harfang,

I only have the one function loaded, so no chance for double naming.

I had "Option Compare Database" at the top, i have replaced it with "Option Explicit", but it still does not run. It compiles properly.

I am relatively new to VB and had a colleague assist with my original function. I have studied the function you provided to see how I could restructure my own function, but have not had success.

Any other suggestions as to why the function may be 'undefined'?

dougf1r
What is the name of your module? A module cannot use the same name of a function (well, it can, but that makes it almost impossible to call the function).

(°v°)
Avatar of dougf1r

ASKER

Markus,

It works now, after renaming the module.

On a continuation to this question... how would grouping by a second field (say, the "day" of the year) be incorporated into the function?

Patrick: I still could not get the SQL to run properly, it was giving me the same "Enter Parameter Value" message box as before.

dougf1r
ASKER CERTIFIED SOLUTION
Avatar of Markus Fischer
Markus Fischer
Flag of Switzerland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of dougf1r

ASKER

Me too!

Thanks for your help with this.

- dougf1r
dougf1r said:
>>Patrick: I still could not get the SQL to run properly, it was giving me the same "Enter
>>Parameter Value" message box as before.

As I mentioned before, Jet just must not like the syntax.  The same syntax worked in
SQL Server.  Oh well :)

BTW, I did get this to work in Access:

SELECT w3.[time], Max(w3.temp) AS xxx
FROM
    (SELECT w1.[time],  w1.temp, (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time] AND w2.temp <= w1.temp) /  (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time]) AS Perc25
    FROM water w1
    WHERE (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time] AND w2.temp <= w1.temp) /  (SELECT Count(*) FROM water w2 WHERE w2.[time] = w1.[time]) <= 0.25
    ORDER BY  w1.[time],  w1.temp) AS w3
GROUP BY w3.[time]
ORDER BY w3.[time];

All in all, I think Markus's suggestion is probably the best.  Glad to see things worked out, and
thanks for giving me a chance to stretch those SQL-related brain muscles :)
> stretch those SQL-related brain muscles

LOL. You can say that again! I liked your first version (the one Jet chokes on) much better, naturally. While reading the question I was already building something similar mentally!

(°v°)
Avatar of dougf1r

ASKER

Markus,

The SQL you gave for expanding the criteria did not work for me, so i modified the syntax a bit (in regards to the AND statement)....

SELECT Percentile(0.25,"temp","water","time=" & water.time and "DOY=" & water.DOY) AS P25, water.time
FROM water
GROUP BY water.time, water.flow_DOY;

However, I still can't get it to run. I tried a few other changes, and still no go.

Do you see anything wrong with the syntax I have here?

dougf1r
Yes. You need to use your field name throughout. I used DOY just as an example. You seem to use water.flow_DOY, and water.DOY. That will not work. (And I doubt your field name just happens to be similar to my silly acronym for "day of year".)

Second, you moved the "and" from inside the quotes to outside. That has a very different meaning and will not work at all. Put the "and" back inside the quotes, just as I showed.

Also, remember the bit about the data type. The syntax is different whether you new field is numeric, text, or date.

(°v°)
Avatar of dougf1r

ASKER

harfang,

I hear you about keeping the proper field name throughout. I was just trying to keep the acronym's you were using in there to avoid confusion (but I forgot to rename that last piece properly).

I moved the "and" back inside the quotes (and removed one of the AS statements) and i continue to get errorrs. (I also added water.DOY after the AS statement to include that field)

The data type is numeric, so i am keeping everything as "time="

So my current SQL now reads:

SELECT
  Percentile(0.25, "temp", "water",
   "time=" & water.time & " and DOY=" & water.DOY) AS P25, water.time, water.DOY
FROM water
GROUP BY water.DOY, water.time;

...but I am still having errors. it first gives me an "Enter Parameters Values" message box, then Run-Time error 3061.

dougf1r

Avatar of dougf1r

ASKER

harfang,

IT WORKS NOW!

I think it was the space between the quotation and "and" in the SELECT statement...

Thanks again!

- dougf1r
Welcome, and grats on finding that missing space!
(°v°)
Avatar of dougf1r

ASKER

harfang,

One follow up question...

I am checking the percentiles that are calculated via the module in Access against calculations in Excel.

It seems that in some instances the two methods give exactly the same result, however, other instances give somewhat significantly different results. For example:

For this data range:

0.022
0.001
0.038
0.052
0.088
0.063
0.126
0.062
0.135
0.103
0.171
0.06
0.127
0.094
0.057
0.133
0.129

Access and Excel give these results for the 5th, 25th, 50th, 75th, and 95th percentiles:

PCT      Access        Excel
P05      0.0052          0.0178
P25      0.057           0.057
P50      0.088           0.088
P75      0.127           0.127
P95      0.1638          0.1422

It looks  like the module matches well with Excel for 25 <= percentile <= 75, but does not match well with Excel for the extreme cases.

Is there something in the VB code for the Percentile function that would be causing this?

- Dougf1r
Since you have 17 values, the quartiles are respectively the 1st, 5th, 9th, 12th, and 17th records. That is trivial and doesn't need interpolation.

However, my interpolations are wrong. Just the reverse of what they should be, actually! Replace the final If statement with this:

        If .AbsolutePosition < dblPosition Then
            Percentile = Percentile * (1 - dblPosition + .AbsolutePosition)
            .MoveNext
            Percentile = Percentile _
                + .Fields(0) * (1 - .AbsolutePosition + dblPosition)
        End If

Then the function will agree with Excel.

I thought I had tested it, but apparently not with the right numbers. Sorry about that, and I'm glad you didn't take the function at face value!

(°v°)
Avatar of dougf1r

ASKER

Works great!