Create Day by Day Count of Clients

PatHartman
PatHartman used Ask the Experts™
on
I currently have a VBA code loop that does this for me (so technically I have a working solution) but I'm hoping that there is an SQL solution that might be faster.  I have about 1 million client records. They have a ServiceStartDate and a ServiceEndDate.  The ServiceEndDate will be null if the service is still active.  What I need is a recordset that contains 1 row for every day since around 1/1/09 when the company started showing the number of accounts active that day by UtilityID.  Nice to have would also be the number of accounts added that day and accounts dropped that day.  So input would be.
UtilityID, Acct #, ServiceStartDate, ServiceEndDate
1, abc, 1/1/9, 1/5/9
1, aaa, 1/3/9
1, aab, 1/3/9,


output would be
UtilityID, CalendarDate, Active, Added, Dropped
1, 1/1/9, 1,1,0
1, 1/2/9, 1, 0, 0
1, 1/3/9, 3, 2, 0
1, 1/4/9, 3, 0, 0
1, 1/5/9, 2,0, 1
1, 1/6/9, 2, 0, 0
....
So just taking UtilityID 1 for 3 accounts over 6 days, this is what I end up with.  Every day, I have a count of the number of active accounts.  Added and dropped if you can calculate it at the same time would be good but otherwise, it is pretty easy to calculate by summing the recordset by start and end dates so I can add that later.  It is counting by day that is the problem.

I have tried this by joining to a table of days.  There are about 3500 days in this time frame but a Cartesian Product of 1,000,000 x 3500 is huge and way too slow.  I am hoping there is some slick T-SQL syntax that can generate these intermediate records.  The Cartesian Product method works fine for smaller recordsets but fails for one of this size.  The current code loop does the process one day at a time.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Vitor MontalvãoIT Engineer
Distinguished Expert 2017

Commented:
Tables schemas would help and also their sample data.
Distinguished Expert 2017

Author

Commented:
Isn't that what I gave you?  There isn't anything except the 4 columns that are involved in this process and I can't give you a million client records.  Plus the tables are SQL Server.
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
So Pat,  instead of a Cartesian join, use a non-equi join

SELECT C.DateField
, SUM(1) as Active
, SUM(CASE WHEN C.DateField = C.ServiceStartDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.DateField = C.ServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
SELECT D.DateField, T.UtilityID, T.[Acct #], T.ServiceStartDate, T.ServiceEndDate
FROM yourTable as T
INNER JOIN yourDateTable as D
ON D.DateField >= T.ServiceStartDate
AND D.DateField <= T.ServiceEndDate) as C
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
SELECT C.DateField
, SUM(1) as Active
, SUM(CASE WHEN C.DateField = C.ServiceStartDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.DateField = C.ServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
SELECT D.DateField, T.UtilityID, T.[Acct #], T.ServiceStartDate, T.ServiceEndDate
FROM yourTable as T
INNER JOIN yourDateTable as D
ON D.DateField >= T.ServiceStartDate
AND D.DateField <= IsNull(T.ServiceEndDate, GetDate()) as C
Distinguished Expert 2017

Author

Commented:
Thanks Dale,
I should have mentioned that I tried something similar but only for the ServiceStartDate since the End date can be null.  So I left the End date in the where clause and it was still too slow.  I'm running this from Access so I'm not sure what would happen if I used Nz() for the end date so I could use it in the join.  Would Access be able to translate that to T-SQL or would it just ask for the million rows to be returned?  I spend so much time doing queries with Access that my RDBMS specific skills are somewhat limited.

I've also tried running the queries directly in SSMS and they just sit and churn.

I will try your variation tomorrow just in case.  But doesn't UtilityID have to be in the outer Select since I need to have that returned so I have a count by day, by utility?
Owner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010
Commented:
Pat,

Why are you running it from Access if the tables are in SQL Server?  Let the server do it and return the result set of several thousand records to you.  run it as a stored procedure or a pass-thru query with returns records set to true.

Yes, I missed the UtilityID in the result set, so it would have to be in the outer query as well, and I forgot the Group By clause:

IN SQL Server:
SELECT C.UtilityID
, C.DateField
, SUM(1) as Active
, SUM(CASE WHEN C.DateField = C.ServiceStartDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.DateField = C.ServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
SELECT D.DateField, T.UtilityID, T.[Acct #], T.ServiceStartDate, T.ServiceEndDate
FROM yourTable as T
INNER JOIN yourDateTable as D
ON D.DateField >= T.ServiceStartDate
AND D.DateField <= IsNull(T.ServiceEndDate, GetDate()) as C
GROUP BY C.UtilityID, C.DateField

In Access, you should be able to substitute:

IsNULL(T.ServiceEndDate, GetDate())

with:

NZ(T.ServiceEndDate, Date())

By putting the condition in the WHERE clause of your earlier query, you end up pulling all of the dates > ServiceStartDate, into the query before the WHERE gets processed.  By putting it in the JOIN clause, it should only return the records that meet the criteria.
Distinguished Expert 2017

Author

Commented:
I'm running it from Access because Access makes every effort to pass-through all queries.  So, unless I've done something to prevent the query from being passed-through, all I incur doing it this way is a small amount of overhead to have the ODBC driver convert the SQL syntax for me.  The recordset returned will be about 3500 rows which is not large in the greater scheme of things especially since we are starting with over a million accounts.   As I said, I also ran the queries directly in SSMS and they still wouldn't run.

I'll try this tomorrow.  It is quite possible, I did the non-equi-join incorrectly.  I'll let you know if it works.  I also didn't do it as a sub-select.  I'm not sure what the sub-select is doing for you in this case.  Why wouldn't a simple join with a group by suffice?
Scott PletcherSenior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
First try using a tally table, which is just a table of sequential, keyed integers.  Literally a table with just a single integer column.  Typically using it runs very efficiently to generate multiple rows.

I can't post the code to create a tally table because work filters block trying to post any CROSS APPLY(!), but such code is easily available via Google.  A lot of people name the integer column "N"; I hate single-letter column names, so I changed the name to "number", which is what I use in the code below.

After creating that table (once), here is the main query:

SELECT
    d.UtilityID,
    ca1.CalendarDate,
    SUM(1) AS Active,
    SUM(CASE WHEN ca1.CalendarDate = d.ServiceStartDate THEN 1 ELSE 0 END) AS Added,
    SUM(CASE WHEN ca1.CalendarDate = d.ServiceEndDate THEN 1 ELSE 0 END) AS Dropped
FROM #data d
INNER JOIN dbo.tally t ON t.number BETWEEN 0 AND DATEDIFF(DAY, ServiceStartDate, ISNULL(ServiceEndDate, GETDATE()))
CROSS APPLY (
    SELECT DATEADD(DAY, t.number, d.ServiceStartDate) AS CalendarDate
) AS ca1
GROUP BY d.UtilityID, ca1.CalendarDate
ORDER BY d.UtilityID, ca1.CalendarDate
Distinguished Expert 2017

Author

Commented:
At the moment, I'm working with Dale's suggestion and we are very close but not quite right.  I should be able to independently verify each of the numbers this query produces.

Adds for today = Select Count(*) where StartDate = today
Drops for today = Select Count(*) where DropDate = today
Active for today = Select Count(*) where StartDate <= today and (DropDate > today or DropDate Is Null)

The adds and drops match for any day I choose but the Active count is off.  I added criteria to get a sample for a small utility and Dale's query gives me 18,051 for today but the straight count gives me 18,128.  Summing all the adds and drops gives me 132,347 - 114,298 = 18,049.  Today's adds and drops are 19 and 41 respectively so that doesn't account for the difference between 18.051 and 18,049 or 18,128.

I'll try to work with a smaller sample to see if I can figure it out.

The only change I made to Dale's query was to add a missing paren before the "as C" and I changed the GetDate() to be GetDate() +1
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
Pat,

Don't forget that my numbers are grouped by UtilityID and CalendarDate,

If you add the UtilityID to your queries above and then add the Group By clause, it might help you track down the differences.

Dale
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
The other thing that might be causing an issue is GetDate()

GetDate() is like Now() in Access, and returns a date/time value.
Distinguished Expert 2017

Author

Commented:
My count queries are filtered the same way as the "big" query.

I don't think that GetDate() is causing the problem since the counts for any given day are accurate.  Something is happening in the summing and it also affects the counts when I run them individually.  I should be able to add all the adds and drops using

Select Utility, Count(*) As AddCount Where StartDate <= today

Select Utility, Count(*) as DropCount Where DropDate <= today

Netting those two numbers should give the active count for today but that number is different from this:

Select Utility, Count(*) as Active Where StartDate <= today And (DropDate is null or DropDate> today)

If the <= are "off", it should only be by the individual count for today and the count is off by more.
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
How about using:

Select Utility, Count(*) as Active (Where StartDate <= today) And (NZ(DropDate, Date() + 1)> today)
Distinguished Expert 2017

Author

Commented:
This query produces 18051, 19, 41 for today - active, start, end

SELECT C.calcFPNAForecastUtilityAbrev
, C.Date
, SUM(1) as Active
, SUM(CASE WHEN C.Date = C.RunDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.Date = C.MeterServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
      SELECT D.Date, T.calcFPNAForecastUtilityAbrev, T.RunDate, T.MeterServiceEndDate
      FROM accessdb.dbo.tblUniqueAccounts as T
            INNER JOIN concourse.dbo.SYS_Calendar as D
            ON D.Date >= T.RunDate
            AND D.Date <= IsNull(T.MeterServiceEndDate, GetDate() +1)) as C
where c.calcFPNAForecastUtilityAbrev = 'CT_CLP'
GROUP BY C.calcFPNAForecastUtilityAbrev, C.Date
order by c.date desc

I wrapped the above query in another query and summed the adds and drops (131,853, 113,843) and end up with 18,010 net.

Select
sum(d.startedthisday) as adds,
sum(d.endedthisday) as drops,
sum(d.startedthisday - d.endedthisday) as net
From
(SELECT C.calcFPNAForecastUtilityAbrev
, C.Date
, SUM(1) as Active
, SUM(CASE WHEN C.Date = C.RunDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.Date = C.MeterServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
      SELECT D.Date, T.calcFPNAForecastUtilityAbrev, T.RunDate, T.MeterServiceEndDate
      FROM accessdb.dbo.tblUniqueAccounts as T
            INNER JOIN concourse.dbo.SYS_Calendar as D
            ON D.Date >= T.RunDate
            AND D.Date <= IsNull(T.MeterServiceEndDate, GetDate() +1)) as C
where c.calcFPNAForecastUtilityAbrev = 'CT_CLP' and c.date <= getdate()
GROUP BY C.calcFPNAForecastUtilityAbrev, C.Date
) as d

Does anyone have any idea what is causing the difference?  Is my logic flawed?
Vitor MontalvãoIT Engineer
Distinguished Expert 2017

Commented:
Maybe the date filter (c.date <= getdate())?
Distinguished Expert 2017

Author

Commented:
I don't understand what you are trying to tell me Vitor.  c.date <=GetDate() restricts the rows to today and earlier.  I don't want future days.  It just confuses the process.
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
Pat,

It is now 2:34 pm at my house.  If the server is in the Eastern Time Zone, GetDate() + 1 will equal"

'2016-03-17 2:34:00'

if your ServiceEndDate is a DATE only field then doing a <= will also capture the 2016-03-17 dates in the query, as will a < against the GetDate() + 1, so you might need to simply use <= GetDate()

Dale
Distinguished Expert 2017

Author

Commented:
I added the +1 because I was trying to figure out why the query did not count "future" active records.  We add accounts when the client signs up with us and we specify a start date which is normally in the future.  Same with drop dates.  They are also added in the future.  The future might be tomorrow or it might be next month in the case of someone who is moving.  So currently we have start dates that go out to 3/30 and drop dates as far out as 4/12 but for the days from 3/17 to the last start/stop date, the active counts are incorrect and I can't figure out what they are counting.

You are not going to believe what I just found out -  They reload one of the tables I was using at 11:30 AM so when I was running my counting queries, it was loading so I was getting different counts from different queries.  You can't make this stuff up.  They told me that the files were refreshed over night between midnight and 6 AM (I'm working with day-old data).  Well, it turns out that most of the files are refreshed over night except for the ones that are refreshed during the middle of the day!!!!!!  So, I'm going to do my balancing on week old data so I don't have a moving target.

I'm calling it a day.  I'll pick this up tomorrow after I copy all the data to my local database where it won't get updated while I'm in the middle of validating the counts.  Thanks.  We're almost there.
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
LOL,  You just cannot make this stuff up!

Is this the same client where the contract IT support said "nothing changed"?
Distinguished Expert 2017

Author

Commented:
No.  These guys are generally on the ball.  This was a recent change that no one thought to mention to me.  My challenge with this client is data.  A service bureau does the bulk of their processing and what we get from them are downloaded "reports" which makes all the data as of yesterday (except for some that comes during the day as I just found out).  We have no clue what the underlying schema is (they won't tell us) and when they make us a "report", they don't seem to understand that they need to include all the "key" fields necessary to make a unique record so joining the various reports always results in losing records or duplicating records.  Every file has "duplicates" that I have to get rid of plus there are timing differences so that for example if a utility changes the meter number for a particular meter, not all the files get updated at the same time so on any given day, we "loose" about 200 accounts to just this discrepancy.  You go along for years having sane customers with solid data and then the worm turns and you end up with chaos and bad data.  I can't solve the real problems for either client which is very frustrating.
Distinguished Expert 2017

Author

Commented:
Thanks to all of you.  Here is the final solution.

SELECT
  C.calcFPNAForecastUtilityAbrev
, C.RateGrp
, C.ZoneName
, C.Date
, SUM(1) as Active
, SUM(CASE WHEN C.Date = C.RunDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.Date = C.MeterServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
      SELECT D.Date, T.calcFPNAForecastUtilityAbrev, T.RateGrp, T.ZoneName, T.RunDate, T.MeterServiceEndDate
            FROM accessdb.dbo.tblUniqueAccounts as T
                  INNER JOIN concourse.dbo.SYS_Calendar as D
            ON D.Date >= T.RunDate
                  AND D.Date <= IsNull(T.MeterServiceEndDate, GetDate() +1)) as C                  --- make GetDate an argument to calc future days

GROUP BY C.calcFPNAForecastUtilityAbrev, C.RateGrp, C.ZoneName, C.Date
order by C.calcFPNAForecastUtilityAbrev, c.date desc, c.RateGrp
Dale FyeOwner, Dev-Soln LLC
Most Valuable Expert 2014
Top Expert 2010

Commented:
So, how long is it taking this to run now?

Dale
Distinguished Expert 2017

Author

Commented:
I ran it originally with just the utility and the date and it took 15 seconds.  Once I added the other two fields, it went to a little over a minute.  That's a huge improvement over my code solution.  I still don't have a complete solution for the forecasting part.  I'll probably try to make something similar that goes forward.  This is a little more complicated since it involves incorporating our analysts forecasts for adds and drops.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial