gfody
asked on
top N (aggregate) for each group by
I've seen the question how to limit results on an inner group by (similar to mysql's limit n), however this problem is bit different please read on..
select group1, group2, count(*) cnt
from stuff
group by group1, group2
order by group1, group2, cnt desc
I want to limit the results to only the top 2 in group2 (according to the aggregate).
(group1,group2,...)
a,a
a,a
a,a
a,b
a,b
a,b
a,b
a,c
a,c
a,c
a,c
a,c
b,e
b,e
b,e
b,f
b,f
b,g
b,g
b,g
b,g
b,g
results should be:
a,c,5
a,b,4
b,g,5
b,e,3
select group1, group2, count(*) cnt
from stuff
group by group1, group2
order by group1, group2, cnt desc
I want to limit the results to only the top 2 in group2 (according to the aggregate).
(group1,group2,...)
a,a
a,a
a,a
a,b
a,b
a,b
a,b
a,c
a,c
a,c
a,c
a,c
b,e
b,e
b,e
b,f
b,f
b,g
b,g
b,g
b,g
b,g
results should be:
a,c,5
a,b,4
b,g,5
b,e,3
I am a little confused about your request, can you post a little more information.. Top 2 by number of occurences or something else?
ASKER
I'm making a report from a log table.. here's a better example
time, deploymentid, domain, actionid
2/2/2004 1:23am, 1, aol.com, 1
2/2/2004 4:37am, 2, aol.com, 1
2/2/2004 5:11am, 3, aol.com, 2
2/3/2004 12:03pm, 3, yahoo.com, 1
2/3/2004 2:43pm, 3, msn.com, 1
2/3/2004 2:50pm, 2, aol.com, 2
I like to give the option to group by time, deployment, and domain. for instance.. deployment by domain would be this query: (with some example aggregates I may use)
select
deploymentid,
domain,
sum(case actionid when in (1) then 1 else 0 end) a1
group by deploymentid,domain
order by a1 desc
the same report, grouped by domain and then deployment would be:
select
domain,
deploymentid,
sum(case actionid when in (1) then 1 else 0 end) a1
group by domain,deploymentid
order by a1 desc
in either case, the domain column contains upwards of 30,000 unique values.. so the report gets cluttered with tons of insignificant data. what the customer is really interested in is the top N domains (top depends on the aggregate).. in the example above the top 2 domains would be the domains with the most rows with where actionid=1
time, deploymentid, domain, actionid
2/2/2004 1:23am, 1, aol.com, 1
2/2/2004 4:37am, 2, aol.com, 1
2/2/2004 5:11am, 3, aol.com, 2
2/3/2004 12:03pm, 3, yahoo.com, 1
2/3/2004 2:43pm, 3, msn.com, 1
2/3/2004 2:50pm, 2, aol.com, 2
I like to give the option to group by time, deployment, and domain. for instance.. deployment by domain would be this query: (with some example aggregates I may use)
select
deploymentid,
domain,
sum(case actionid when in (1) then 1 else 0 end) a1
group by deploymentid,domain
order by a1 desc
the same report, grouped by domain and then deployment would be:
select
domain,
deploymentid,
sum(case actionid when in (1) then 1 else 0 end) a1
group by domain,deploymentid
order by a1 desc
in either case, the domain column contains upwards of 30,000 unique values.. so the report gets cluttered with tons of insignificant data. what the customer is really interested in is the top N domains (top depends on the aggregate).. in the example above the top 2 domains would be the domains with the most rows with where actionid=1
ASKER
oops, that is:
sum(case when actionid in (1) then 1 else 0 end)
sum(case when actionid in (1) then 1 else 0 end)
Paste the following in your query analyser with appropriate fieldnames and tablenames and execute:
select group1, group2, count(*) cnt
into temp2 from stuff
group by group1, group2
order by group1, group2,cnt asc
select a.group1, max(a.cnt) cnt into temp3
from temp2 a, temp2 b
where a.cnt < (Select max(b.cnt) from temp2 b)
group by a.group1
UNION
select a.group1, max(a.cnt) cnt
from temp2 a
inner join temp2 b
on a.group1 = b.group1
group by a.group1
select a.group1,a.group2,a.cnt from temp2 a inner join temp3 b
on a.group1 = b.group1 and a.cnt = b.cnt
drop table temp2
drop table temp3
select group1, group2, count(*) cnt
into temp2 from stuff
group by group1, group2
order by group1, group2,cnt asc
select a.group1, max(a.cnt) cnt into temp3
from temp2 a, temp2 b
where a.cnt < (Select max(b.cnt) from temp2 b)
group by a.group1
UNION
select a.group1, max(a.cnt) cnt
from temp2 a
inner join temp2 b
on a.group1 = b.group1
group by a.group1
select a.group1,a.group2,a.cnt from temp2 a inner join temp3 b
on a.group1 = b.group1 and a.cnt = b.cnt
drop table temp2
drop table temp3
ASKER
thanks for the reply iyerbhuvanac, your solution does return only the top 2, however I do need to limit to an arbitrary number of results per group. top 2 was just for my example.. the actual options are top 5, 20, 50, 100.
HERE is a proc which is going to help you. You just have to pass the parameter ie the top number of rows you want to retrieve.say,5 or 10 or 15 or 20,etc
create this in your database and execute with the required params.
This is the base version. I shall try to optimize this and post it as and when time permits
create procedure Getmax @param int
AS
declare
@count int
begin
set @count = 1
select group1, group2, count(*) cnt
into temp2 from stuff
group by group1, group2
order by group1, group2,cnt asc
while @count <= @param
begin
select a.group1, max(a.cnt) cnt into temp3
from temp2 a
inner join temp2 b
on a.group1 = b.group1
group by a.group1
select a.group1,a.group2,a.cnt cnt into temp5 from temp2 a inner join temp3 b
on a.group1 = b.group1 and a.cnt = b.cnt
select * from temp5
delete from temp2 where temp2.group1 in (select group1 from temp5)
and temp2.group2 in (select group2 from temp5)
and temp2.cnt in (select cnt from temp5)
set @count = @count + 1
drop table temp3
drop table temp5
end
drop table temp2
end
create this in your database and execute with the required params.
This is the base version. I shall try to optimize this and post it as and when time permits
create procedure Getmax @param int
AS
declare
@count int
begin
set @count = 1
select group1, group2, count(*) cnt
into temp2 from stuff
group by group1, group2
order by group1, group2,cnt asc
while @count <= @param
begin
select a.group1, max(a.cnt) cnt into temp3
from temp2 a
inner join temp2 b
on a.group1 = b.group1
group by a.group1
select a.group1,a.group2,a.cnt cnt into temp5 from temp2 a inner join temp3 b
on a.group1 = b.group1 and a.cnt = b.cnt
select * from temp5
delete from temp2 where temp2.group1 in (select group1 from temp5)
and temp2.group2 in (select group2 from temp5)
and temp2.cnt in (select cnt from temp5)
set @count = @count + 1
drop table temp3
drop table temp5
end
drop table temp2
end
ASKER
I appreciate your effort, but I really needed something less hackerish (read scalable)
I came up with this method using a temp table..
select g1,g2,c1,c2
group by g1,g2
into #temp
create clustered index on #temp (g1,c1 desc) -- important!
select g1,g2,c1,c2 from temp t
where g2 in (select top 10 g2 from #temp where g1=t.g1 order by c1 desc)
thanks for everyones help
I came up with this method using a temp table..
select g1,g2,c1,c2
group by g1,g2
into #temp
create clustered index on #temp (g1,c1 desc) -- important!
select g1,g2,c1,c2 from temp t
where g2 in (select top 10 g2 from #temp where g1=t.g1 order by c1 desc)
thanks for everyones help
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.