
Solved

Fast Group Summary Count.

Posted on 2009-02-19
18
Medium Priority
370 Views
Last Modified: 2012-06-21
Hi all,

We have a table in a database with between half a million and 2 million rows. The rows are constantly being updated, added and removed.
My problem is there's a particular column we need to have a live grouped count of. And when I say live, we need to know roughly every 5 seconds or so.

Currently we have a key on that column and run an SP to get a group summary count. But this runs every 5 seconds and I'm convinced there's another way of getting the results.

Is this really the best way to do it? Or can I incorporate triggers to make it easier for the server?
Or is there something else I can pursue?

Thanks in advance.
K.
0
Question by:Kinger247
18 Comments
 
LVL 13

Expert Comment

by:St3veMax
ID: 23679851
You could consider implementing a trigger to do a count on the table and insert the result into another table; however, this could have issues and prevent future inserts from running until the count has completed.

Is it just a straight count of the number of rows that you're after?
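
For illustration, a minimal sketch of the kind of trigger described here (all table and column names are assumptions; the question does not give the real schema):

-- Hypothetical schema for illustration only.
create table My_Transactions (id int identity primary key, column1 int);
create table Live_Counts (col_value int primary key, row_count int, counted_at datetime);
go

create trigger trg_recount_all on My_Transactions
for insert, update, delete
as
begin
    -- Recount the whole table on every change: simple, but the recount runs
    -- inside the modifying transaction, so it can hold locks and delay other writers.
    delete from Live_Counts;

    insert Live_Counts (col_value, row_count, counted_at)
    select column1, count(*), getdate()
    from My_Transactions
    group by column1;
end
go

Every writer pays for a full recount with this shape, which is exactly the blocking concern mentioned above.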
0
 
LVL 6

Expert Comment

by:FVER
ID: 23679871
You said:
"Currently we have a key on that column and run an SP to get a group summary count."

By "key" I assume you mean "index".
If your SP contains a query like:
select column_name, count(*) from table_name group by column_name
then I believe this is the best way to do it.

Even for 2 million rows, such a query using an index is very fast and has a low cost.

Using a trigger to trace inserts/updates/deletes on this column would degrade performance for those operations, which I assume are even more important than the count operation.
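
For reference, a minimal sketch of the kind of SP being discussed (table, column and procedure names are assumed, not taken from the question):

create procedure usp_GetGroupCounts
as
begin
    set nocount on;

    -- With a nonclustered index on column1, this can be satisfied from the
    -- index alone, so it stays cheap even at a couple of million rows.
    select column1, count(*) as row_count
    from My_Transactions
    group by column1;
end
go

-- Polled every 5 seconds by the calling application:
exec usp_GetGroupCounts;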
0
 
LVL 10

Author Comment

by:Kinger247
ID: 23679886
I need a group summary count on just one of the columns, but we need it to refresh in real time or as close to real time as possible.

I just think there must be some other way of tracking this group column count instead of performing a summary count every 5 seconds.
0
 
LVL 10

Author Comment

by:Kinger247
ID: 23679927
FVER, yes, index is what I mean (not hot on DBs as you can see :))

OK, what you're saying kinda makes me feel the current way we're doing it is less resource intensive than using a trigger or some other fancy method.
I just thought that, as our rows are being added/removed and amended so often, this would hit the index update each time, which may slow it down.
0
 
LVL 6

Expert Comment

by:FVER
ID: 23679969
An index update surely consumes fewer resources than a trigger execution.
However, this index may require a rebuild from time to time if its size grows too much. That's a common maintenance action.
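
As a hedged example of that maintenance step (index and table names assumed; ALTER INDEX is SQL Server 2005 and later):

-- Lighter-weight option, can run while the table is in use:
alter index idx_my_transactions_column1 on My_Transactions reorganize;

-- Heavier option, typically during a maintenance window:
alter index idx_my_transactions_column1 on My_Transactions rebuild;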
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 23680222
With that much activity, it will probably stay cached for longer and hit memory a bit, but in itself it will run fairly quickly. It can place some load on lock resources, so it would be best if you had a covering index on the search/select criteria and the item being aggregated - mind you, if the table is changing with updates to existing data held in an index, that can lead to index fragmentation and be counterproductive.

Normally (if there is such a thing), if data is primarily being added, as in an OLTP database, it is best to create a non-clustered "covering" index like:

create nonclustered index idx_my_covering_summary on my_table
(column1, column2)
include
(column3, value1)
;

Then when accessing it, the query would be something like:

select column1, column3, sum(value1) from my_table with (nolock) where column1 = <some_value> and column2 = <some_other_value>

The query engine will use only the index pages and not have to touch the row data. That will make it quicker all around, and will not put a lock on the row data.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 23680227
oh, forgot the group by above. sorry about that...
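
For completeness, the query from the previous comment with the GROUP BY added would presumably read:

select column1, column3, sum(value1)
from my_table with (nolock)
where column1 = <some_value> and column2 = <some_other_value>
group by column1, column3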
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 23680531
And a trigger is not such a bad thing in some respects - arguably better than the stored procedure because it is only affecting current rows...

Have a look below...


create table My_Summary (id int identity primary key clustered, column1 varchar(10), column2 varchar(10), value1 decimal(18,5), time_last_modified datetime default getdate())
Go
create nonclustered index idx_my_summary_value on my_summary (column1,column2);     -- do not include value1 this time because we KNOW it is highly volatile
Go
 
 
create table My_Transactions (id int identity primary key clustered, column1 varchar(10), column2 varchar(10), value1 decimal(18,5), time_last_modified datetime default getdate())
Go
 
create trigger trg_My_Transaction_Summary on My_Transactions
for insert, update, delete
as
begin
 
  if update(column1) or update(column2) or update(value1)
  begin
 
     update My_Summary set value1 = my_summary.value1 + isnull(I.value1,0) - isnull(D.value1,0), time_last_modified = getdate()
     from my_summary
     left outer join inserted i on i.column1 = my_summary.column1 and i.column2 = my_summary.column2 
     left outer join deleted d on d.column1 = my_summary.column1 and d.column2 = my_summary.column2 
 
     insert My_Summary (column1, column2, value1)
     select i.column1,i.column2,sum(isnull(I.value1,0)) 
     from inserted i 
     left outer join my_summary on i.column1 = my_summary.column1 and i.column2 = my_summary.column2 
     where my_summary.column1 is NULL
     group by i.column1, i.column2 
 
  end
 
end
Go
 
insert My_Transactions (column1, column2, value1) 
select 'Mark','EE',100.00  union all
select 'Mark','EE',120.00  union all
select 'Mark','EE',130.00  union all
select 'Mark','EE',140.00  union all
select 'Mark','WORK',1500.00  union all
select 'Mark','WORK',1200.00  union all
select 'Mark','WORK',1000.00  union all
select 'fred','DAYS',100.00  union all
select 'fred','DAYS',120.00  union all
select 'fred','DAYS',130.00  union all
select 'fred','DAYS',140.00  union all
select 'fred','JOBS',1500.00  union all
select 'fred','JOBS',1200.00  union all
select 'fred','JOBS',1000.00  
Go
 
select * from my_summary
Go
 
update My_transactions set column1 = 'Annie', value1= 10000 where id = 1
Go
 
select * from my_summary
Go


0
 
LVL 6

Expert Comment

by:FVER
ID: 23680793
Mark,

Unless I am confused, we are talking about counting rows, not summing values.
If a trigger were useful, it would look like the one below, but I'm sure performance would drop.
However, your advice to use "with (nolock)" in the select query is most valuable.
The SP could also use isolation level read uncommitted to boost performance when rollbacks are unlikely to occur.

create trigger trg_My_Transaction_Summary on My_Transactions
for insert, update, delete
as
begin 
     update My_Summary set value1 = my_summary.value1
                                  + isnull(cnt1,0) - isnull(cnt2,0)
       from my_summary left join (select column1, count(*) as cnt1
                                    from inserted group by column1
                                 ) i on i.column1 = my_summary.column1 
                       left join (select column1, count(*) as cnt2
                                     from deleted group by column1
                                  ) d on d.column1 = my_summary.column1
end
Go
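
A minimal sketch of the read-uncommitted variant mentioned above (procedure and table names assumed):

create procedure usp_GetGroupCounts_Dirty
as
begin
    set nocount on;

    -- Dirty reads: the count takes no shared locks, so it never blocks writers,
    -- at the cost of occasionally counting uncommitted rows.
    set transaction isolation level read uncommitted;

    select column1, count(*) as row_count
    from My_Transactions
    group by column1;
end
go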


0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 23681395
You're not confused - I thought it was a value aggregation of sorts, and did not pay close enough attention to "count of column". Be that as it may, with a few million rows, mostly historic I would imagine, I still stand by everything I have posted so far, with a small tweak in the trigger...

But it does make me think a bit more about what Column1 really is - what type of data, how often would it change (as distinct from the count), are the new instances of Column1 incrementally larger or could they be anything, and are they evenly distributed pieces of information?

Now, there is an inherent problem with incremental updates in a trigger, and that is that they could require an occasional reset.

But if the stored procedure is currently doing something like:

select column, count(*)
from table
group by column

then that is possibly best left as a plain query. I'm not exactly sure why it needs to be a stored procedure. SP query plans are captured at creation time, and if the data is that volatile, the cached plan might no longer be applicable.

There is already an index on that column, and I would use the query hint. There is greater lock contention at the data row level unless you use that covering index, but there is not too much that can be covered over and above the column itself... And then, with the trigger, it only has to update when the column count changes, so if there is a high incidence of updating columns other than the one we are interested in, the trigger might still be OK compared to a count on a few million rows.

Can we find out a bit more about that column?

Anyway, that small tweak to the trigger:


--drop table my_summary
--drop table my_transactions
 
-- depending on what column1 really is, might be tempted to use that as the primary key, and not worry about the other index.
create table My_Summary (id int identity primary key clustered, column1 varchar(10), value1 int, time_last_modified datetime default getdate())
Go
create nonclustered index idx_my_summary_value on my_summary (column1);   -- do not include value1 this time because we KNOW it is highly volatile
Go
-- if transactions already exist then do insert my_summary (column1, value1) select column1,count(*) from my_transactions group by column1
 
create table My_Transactions (id int identity primary key clustered, column1 varchar(10), column2 varchar(10), value1 decimal(18,5), time_last_modified datetime default getdate())
Go
 
create trigger trg_My_Transaction_Summary on My_Transactions
for insert, update, delete
as
begin
 
 
     update My_Summary set value1 = my_summary.value1 + isnull(icount,0) - isnull(dcount,0), time_last_modified = getdate()
     from my_summary
     left outer join (select column1,count(*) as icount from inserted group by column1) i on i.column1 = my_summary.column1  
     left outer join (select column1,count(*) as dcount from deleted group by column1) d on d.column1 = my_summary.column1 
 
     insert My_Summary (column1, value1)
     select i.column1,count(*) 
     from inserted i 
     left outer join my_summary on i.column1 = my_summary.column1 
     where my_summary.column1 is NULL
     group by i.column1
 
end
Go
 
insert My_Transactions (column1, column2, value1) 
select 'Mark','EE',100.00  union all
select 'Mark','EE',120.00  union all
select 'Mark','EE',130.00  union all
select 'Mark','EE',140.00  union all
select 'Mark','WORK',1500.00  union all
select 'Mark','WORK',1200.00  union all
select 'Mark','WORK',1000.00  union all
select 'fred','DAYS',100.00  union all
select 'fred','DAYS',120.00  union all
select 'fred','DAYS',130.00  union all
select 'fred','DAYS',140.00  union all
select 'fred','JOBS',1500.00  union all
select 'fred','JOBS',1200.00  union all
select 'fred','JOBS',1000.00  
Go
 
select * from my_summary
Go
 
update My_transactions set column1 = 'Annie', value1= 10000 where id = 1
Go
 
select * from my_summary
Go


0
 
LVL 10

Author Comment

by:Kinger247
ID: 23682639
Hi, the column is an int type. And currently the query is very similar to yours, like:

select column, count(*)
from table
group by column

Basically (as a bit of background), we have a table that's constantly being updated by about 500 users in the building.
But we need to know the number of records for each value in this column. Mostly the values in the column range from 1 to 15 and mean something when cross-checked against another descriptive lookup table.

So the query that currently gets run brings back (every 5 seconds) a summary list of: Type1: 300,000, Type2: 200,000, Type3: ... etc.
It's been a long time since I've done any DB stuff, but after seeing this process happen on our system I thought there must be a better way of getting this count information than running a query every 5 seconds.
The query is a stored proc called from a VB.NET service.
The service then pings the info via UDP to a central console viewed by an administrator.
If the counts for each record type look out of order (somehow - I don't understand that part of the process), the administrator loads more data or removes data.

That trigger does look interesting; I'm just concerned, now thinking about it, that something like that could get out of sync somehow? Or maybe not.

So correct me if I'm wrong: the overhead of performing a select statement every 5 seconds is minimal, but updating the index whenever the column data is amended etc. could have an overhead.
The trigger, though, looks like less overhead to me?
0
 
LVL 6

Expert Comment

by:FVER
ID: 23683379
As I said before, my point of view is that firing a trigger causes more overhead than updating the index.
Moreover, the my_summary table updated by the trigger would become a hotspot (the same rows updated at a high frequency), which should be avoided.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 23683967
Well, it is a balancing act, and one that can only effectively be judged against real-life data...

Indexes do help, but over-indexing can be detrimental. If the value of column1 is not changing all that much (as in a department or something like it), then the index on that column should be OK, and considering there are about 15 different values, it should fit quite nicely on a few 8K pages. However, with so few distinct values the query optimizer might be labouring over a few million rows - the index is not very selective, so it ends up with a lot of hashing to point to data rows (or to the key values of the clustered PK, which in turn point to the data rows), and consequently consumes quite a lot of index entries.

With about 500 users, there is a reasonable assumption that there is a fair bit of sharing of those 15 different values in that column - i.e. there is a high probability that whatever they are doing will result in a re-count (which is really inserts, deletes and changes to that column specifically). In that regard the trigger will most likely hit a few wait states for every detailed transaction, because of the very high probability that more than one user is doing something concurrently that affects the count for any one of the values in that column.

There is a plus, though: doing a query across all the rows could readily encounter lock contention just the same, whereas doing a count over 15 rows in a summary table will be extraordinarily quick to both update and retrieve, because it will be pretty much held in memory. I would probably remove the index from the big table, use the summary table, and have a net gain in pages and cache.

I would also use the column value as the primary key for the summary table - that will also save a bit. So long as there is NOTHING else the summary table is used for, there should be no deadlocks or too much overhead beyond a transient wait state...

As for re-sync - run a routine each night that truncates the summary table and re-populates it... In fact, to get rid of any insert in the trigger, we could place yet another trigger on the masterfile of those column values, so all the transaction trigger has to do is update the one value, and we can do more work on the inserted and deleted pseudo-tables to pre-calculate and summarise as much as possible before updating... Some of it will be dictated by how many rows could be in inserted/deleted in a normal transaction - the assumption here is about 1...
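
A sketch of that nightly re-sync, using the table names from the scripts in this thread:

-- Rebuild the summary from scratch, e.g. from a nightly job.
-- The transaction holds locks on My_Summary briefly, so the transaction
-- trigger's updates will queue behind it while it runs.
begin transaction;

    truncate table My_Summary;

    insert My_Summary (column1, value1)
    select column1, count(*)
    from My_Transactions
    group by column1;

commit transaction;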

And then let us not forget that the query :
select column, count(*)
from my_transactions with (nolock)
group by column

will work pretty quickly, but it does have to wait for any OLTP lock processing as well; it would work better with an index, does need to recalculate over a few million rows, and may be more detrimental to cache and buffers...

It will be a small challenge to compare...
create table My_Summary (column1 int  primary key clustered, value1 int)
Go
-- if transactions already exist, and/or when refreshing data, do a truncate table my_summary and then insert my_summary (column1, value1) select column1,count(*) from my_transactions group by column1
 
create table My_Transactions (id int identity primary key clustered, column1 int, column2 varchar(10), value1 decimal(18,5), time_last_modified datetime default getdate())
Go
 
create trigger trg_My_Transaction_Summary on My_Transactions
for insert, update, delete
as
begin
 
     update My_Summary set value1 = my_summary.value1 + vcount
     from my_summary
     INNER JOIN ( select column1, sum(vcount) as vcount from 
                     (select column1,1 as vcount from inserted         -- reasonable(?) to assume normal OLTP not too many rows in inserted / deleted
                      union all
                      select column1,-1 from deleted) s 
                      group by column1 having sum(vcount) <> 0
                 ) v on v.column1 = my_summary.column1 
 
/*   -- replace this with a trigger in the masterfile of column1 so that we only ever have an update in this trigger.
     insert My_Summary (column1, value1)
     select i.column1,count(*) 
     from inserted i 
     left outer join my_summary with (nolock) on i.column1 = my_summary.column1 
     where my_summary.column1 is NULL
     group by i.column1
*/ 
end
Go
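
And a sketch of the masterfile-side trigger alluded to in the commented-out section (the My_Column1_Master lookup table is an assumption; the thread only mentions a "descriptive lookup table"):

-- Assumed lookup ("masterfile") table of valid column1 values.
create table My_Column1_Master (column1 int primary key, description varchar(50));
go

create trigger trg_My_Column1_Master_Summary on My_Column1_Master
for insert
as
begin
    -- Seed a zero-count summary row whenever a new column1 value is defined,
    -- so the transaction trigger above only ever has to UPDATE.
    insert My_Summary (column1, value1)
    select i.column1, 0
    from inserted i
    left outer join My_Summary s on s.column1 = i.column1
    where s.column1 is null;
end
go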


0
 
LVL 6

Accepted Solution

by:
FVER earned 1000 total points
ID: 23689922
Mark,
"doing a query across all the rows could readily encounter lock contention"
=> that's not true if the isolation level is set to read uncommitted in the SP holding the count statement, am I right?
0
 
LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 1000 total points
ID: 23690690
You are right, and moreover the with (nolock) hint should further minimise it (really it means being able to read uncommitted rows), but it was the locks held by the transaction activity that I was thinking of at the time. So in that case you probably do not need isolation levels...

But, I am kind of attracted to the thought / idea of having the entire summary table held in 15 rows of two integers... Could almost hold that in one hand...

But with so few rows, there is a massively increased probability of concurrent activity affecting the same ones. Then again, the index pages of a few million rows being hashed amongst the fifteen values is not a pretty thought either (well, not exactly: if there is a clustered PK created beforehand, the index will use those key values to point to the data - even so, there is a lot of pointing happening for each value), so it might be very worthwhile killing that index altogether.

Interesting problem...  Might need to run some stats first...
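
One hedged way to run those stats and compare the two approaches side by side:

set statistics io on;
set statistics time on;

-- Cost of recounting the big table every 5 seconds:
select column1, count(*) as row_count
from My_Transactions with (nolock)
group by column1;

-- Cost of reading the trigger-maintained summary (roughly 15 rows):
select column1, value1
from My_Summary with (nolock);

set statistics time off;
set statistics io off;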
0
 
LVL 10

Author Closing Comment

by:Kinger247
ID: 31548693
Cheers guys for your help. I think, by the looks of it, it's going beyond what I can do in the timeframe I have.
I think I'm going to leave the process as it is for now.

thanks Again :)
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 23697670
Thanks, didn't mean to overcomplicate, but it is an interesting problem...

I think every 5 seconds to determine how to adjust is a bit of overkill. At one site I was working with, they needed "real time" information, and that ended up being every minute, displayed on a wide-screen monitor via a SharePoint dashboard in the common room (and on the boss's desk).

It's a bit different if you are trading stocks or the money market, where a second can mean the difference between success and failure, but if you are not in that space, look closely at that time interval: the more you can relax it, the more it lends itself to a plain query. Also, it might be worthwhile thinking about a time interval within the dataset itself and incorporating a datetime into the index, to effectively reduce the active set slightly, give slightly more uniqueness to that index, and possibly give better trended / more relevant information as to what might impact today.
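
One possible shape for that last idea, assuming the time_last_modified column from the earlier scripts and that "today" is the active set of interest:

-- Index keyed on the datetime so a recent window can be seeked, with the
-- counted column carried along in the leaf level.
create nonclustered index idx_my_transactions_recent
on My_Transactions (time_last_modified)
include (column1);
go

-- Count only today's rows (midnight-today via the usual dateadd/datediff idiom):
select column1, count(*) as row_count
from My_Transactions with (nolock)
where time_last_modified >= dateadd(day, datediff(day, 0, getdate()), 0)
group by column1;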

Thanks for your question, enjoyed it, and thanks FVER, enjoyed swapping thoughts...
0
 
LVL 6

Expert Comment

by:FVER
ID: 23709207
Thank you two, it was indeed an interesting debate.
0
