Link to home
Start Free TrialLog in
Avatar of ThomasFoege
ThomasFoege

asked on

TOP n records pr group?

Hello EE!

I've tried following a guide at
http://allenbrowne.com/subquery-01.html#TopN

in order to give me the TOP n records by group, but I cannot seem to get it to work.

I got a database containing categories (RVG), buyers (LNR), and amounts (KP)

What I'm interested in getting is the sum of the amounts (KP) of the top 5 buyers pr category. So far I've got this:


SELECT RAWDATA.RVG, Sum(RAWDATA.KP), RAWDATA.LNR
FROM RAWDATA
WHERE RAWDATA.LNR IN
   (SELECT TOP 5 LNR                            
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Sum(Dupe.KP) DESC, Dupe.LNR DESC)
ORDER BY RAWDATA.RVG, Sum(RAWDATA.KP), RAWDATA.LNR;

Open in new window



I'm getting the error that RVG is not part of the aggregate function?

Secondly, is it possible to also do a sub query that returns the total spend for that category?

So that I would get:

RVG, LNR, Sum of KP (top 5), Sum of KP (All)?

Thanks in advance!
Avatar of mbizup
mbizup
Flag of Kazakhstan image

You need to use a GROUP BY clause:

SELECT RAWDATA.RVG, Sum(RAWDATA.KP), RAWDATA.LNR
FROM RAWDATA
WHERE RAWDATA.LNR IN
   (SELECT TOP 5 LNR                            
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Sum(Dupe.KP) DESC, Dupe.LNR DESC)
GROUP BY RAWDATA.RVG,  RAWDATA.LNR;
Avatar of ThomasFoege
ThomasFoege

ASKER

Hey! Thnks for the reply

Doing that i get LNR is not part of the aggregate function?
That will give you a query that sums amount per category and buyer.  However if you need to sum all of the buyers' amounts per group (ie: total amount per group_, you would need to drop the RAWDATA.LNR from your outer query:

SELECT RAWDATA.RVG, Sum(RAWDATA.KP)
FROM RAWDATA
WHERE RAWDATA.LNR IN
   (SELECT TOP 5 LNR                            
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Sum(Dupe.KP) DESC, Dupe.LNR DESC)
GROUP BY RAWDATA.RVG;
I'm trying to get the sum of the amounts of the top 5 buyers pr category :)
Actually... The SUM should not be in the inner query.  Try this:


SELECT RAWDATA.RVG, Sum(RAWDATA.KP)
FROM RAWDATA
WHERE RAWDATA.LNR IN
   (SELECT TOP 5 LNR                            
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Dupe.KP DESC, Dupe.LNR DESC
GROUP BY RAWDATA.RVG;
Missed the close parenthesis on the order by.  It should be this:

SELECT RAWDATA.RVG, Sum(RAWDATA.KP)
FROM RAWDATA
WHERE RAWDATA.LNR IN
   (SELECT TOP 5 LNR                            
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Dupe.KP DESC, Dupe.LNR DESC)
GROUP BY RAWDATA.RVG;
It's running, calculating is taking forever ;)
Theres about 20 different RVG and maybe 60.000 lines across 50 different LNR. Does it sound probable that it's still running query after 10 minutes?
I think you are also confusing Amount (KP) with Customers (LNR).

For a sum of the top 5 Per group based on amount (KP):

SELECT RAWDATA.RVG, Sum(RAWDATA.KP)
FROM RAWDATA
WHERE RAWDATA.KP IN
   (SELECT TOP 5 KP                          
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Dupe.KP DESC)
GROUP BY RAWDATA.RVG;
<< Does it sound probable that it's still running query after 10 minutes? >>

It *seems* unlikely.  I think the query is still off (see above).  Try my last post and see if that helps.
RVG is a Category, such as Books
LNR is a buyer such as Person A
KP is the amount columns


LNR can buy books maybe 20 or 100 times in the dataset


What I'm hoping to get is an output like this:
Category, Top5 Spend, All Spend
Books, 10$, 50$
Food, 5$, 5$

Then i know that of the top 5 buyers, they buy 10$ out of 50$ and in food all spend is within top 5

I might have confused them in my query
SELECT RAWDATA.RVG, Sum(RAWDATA.KP)
FROM RAWDATA
WHERE RAWDATA.KP IN
   (SELECT TOP 5 KP                          
   FROM RAWDATA AS Dupe                              
   WHERE Dupe.RVG = RAWDATA.RVG        
   ORDER BY Dupe.KP DESC)
GROUP BY RAWDATA.RVG; 

Open in new window


Keeps running forever to :/

I really hope this is possible to do within a single "query" so I do not have to create pivots etc.
Give that last query a shot on a sample set of data and see if it gives you the expected results.

If that doesn't work, it might help to post a sample database (no sensitive information, please) and specify what your expected results are.
ASKER CERTIFIED SOLUTION
Avatar of mbizup
mbizup
Flag of Kazakhstan image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
It did work with a smaller dataset, however does that query not only return the spend of the top 5 buyers divided out pr category?

What I'm hoping to get is for each category, what is the spend within that category of the top 5 suppliers within that category. So there will be a top 5 for each category (RVG)

Or did i miss something
Okay I may have misunderstood your requirements-

It sounds like you might need to run a couple of queries... and this will likely be a very time consuming query...

1.  You need to first obtain a SUM (total) of all of the consumers' spending in each group to give you these results (one row per customer per group).

2.  Based on those results, you need to obtain a top 5 "Total" amounts per category

Does that sound right?

Can you post a sample database with just the RAWDATA table containing enough rows of data to illustrate your needs?  (and also post what your expected results are)
Time consuming is fine, currently its way more time consuming doing alot of pivots and vlookups ;)

Thats correct! Some consumers may be in several categories, and could in theory be in top 5 in more categories, but their spend is only counted toward that top 5 spot within the category

An example of data is included. It's made up, but represents exactly the same kind of data i got

The desired output is as follows:

RVG - Spend within RVG made up of the Top 5 LNR within that RVG - Total spend within that RVG


So if the dataset was only
ID, RVG, LNR, Amount
1, Books, Person A, 10
2, Books, Person B, 100
3, Books, Person C, 30
4, Books, Person D, 50
5, Books, Person E, 200
6, Books, Person F, 200

7, Food, Person G, 1000

The output would be:
RVG - Spend in Top 5 - All Spend
Books - 580 - 590
Food  1000 - 1000


Thank you again for using your time to help me!

RAWDATA-Example.xlsx
Hey, i played around with your code

SELECT RAWDATA2.RAA_VARE_GRUPPE, Sum(RAWDATA2.KOEBSPRIS) AS SumOfKOEBSPRIS
FROM RAWDATA2
WHERE (((RAWDATA2.LEVERANDOERNR) In (SELECT TOP 5 LEVERANDOERNR                          
   FROM RAWDATA2 AS Dupe                              
   WHERE Dupe.RAA_VARE_GRUPPE = RAWDATA2.RAA_VARE_GRUPPE
   GROUP BY LEVERANDOERNR        
   ORDER BY Sum(Dupe.KOEBSPRIS) DESC)))
GROUP BY RAWDATA2.RAA_VARE_GRUPPE;

Open in new window


seems to work! awesome!
Hi - Just got back...

Are you all set then?
It's very very slow though. Would it make sense to make a tempoary table or something like that to lower processing time?
<< Would it make sense to make a tempoary table or something like that to lower processing time >>

I have used a two-step approach like that effectively to improve performance on some large queries.

I believe th IN clause with the SUM in it might be what is causing you grief....

You *might* be able to improve speed by creating a query that sums all of the amounts per customer, per group, and turning that into a Make Table query.  The run a SELECT 5 by group on the resulting table - which would be a much smaller dataset...
Worked perfectly!