Avatar of PatHartman
PatHartman
Flag for United States of America asked on

Create Day by Day Count of Clients

I currently have a VBA code loop that does this for me (so technically I have a working solution) but I'm hoping that there is an SQL solution that might be faster.  I have about 1 million client records. They have a ServiceStartDate and a ServiceEndDate.  The ServiceEndDate will be null if the service is still active.  What I need is a recordset that contains 1 row for every day since around 1/1/09 when the company started showing the number of accounts active that day by UtilityID.  Nice to have would also be the number of accounts added that day and accounts dropped that day.  So input would be.
UtilityID, Acct #, ServiceStartDate, ServiceEndDate
1, abc, 1/1/9, 1/5/9
1, aaa, 1/3/9
1, aab, 1/3/9,


output would be
UtilityID, CalendarDate, Active, Added, Dropped
1, 1/1/9, 1,1,0
1, 1/2/9, 1, 0, 0
1, 1/3/9, 3, 2, 0
1, 1/4/9, 3, 0, 0
1, 1/5/9, 2,0, 1
1, 1/6/9, 2, 0, 0
....
So just taking UtilityID 1 for 3 accounts over 6 days, this is what I end up with.  Every day, I have a count of the number of active accounts.  Added and dropped if you can calculate it at the same time would be good but otherwise, it is pretty easy to calculate by summing the recordset by start and end dates so I can add that later.  It is counting by day that is the problem.

I have tried this by joining to a table of days.  There are about 3500 days in this time frame but a Cartesian Product of 1,000,000 x 3500 is huge and way too slow.  I am hoping there is some slick T-SQL syntax that can generate these intermediate records.  The Cartesian Product method works fine for smaller recordsets but fails for one of this size.  The current code loop does the process one day at a time.
Microsoft AccessMicrosoft SQL Server

Avatar of undefined
Last Comment
PatHartman

8/22/2022 - Mon
Vitor Montalvão

Tables schemas would help and also their sample data.
PatHartman

ASKER
Isn't that what I gave you?  There isn't anything except the 4 columns that are involved in this process and I can't give you a million client records.  Plus the tables are SQL Server.
Dale Fye

So Pat,  instead of a Cartesian join, use a non-equi join

SELECT C.DateField
, SUM(1) as Active
, SUM(CASE WHEN C.DateField = C.ServiceStartDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.DateField = C.ServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
SELECT D.DateField, T.UtilityID, T.[Acct #], T.ServiceStartDate, T.ServiceEndDate
FROM yourTable as T
INNER JOIN yourDateTable as D
ON D.DateField >= T.ServiceStartDate
AND D.DateField <= T.ServiceEndDate) as C
I started with Experts Exchange in 2004 and it's been a mainstay of my professional computing life since. It helped me launch a career as a programmer / Oracle data analyst
William Peck
Dale Fye

SELECT C.DateField
, SUM(1) as Active
, SUM(CASE WHEN C.DateField = C.ServiceStartDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.DateField = C.ServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
SELECT D.DateField, T.UtilityID, T.[Acct #], T.ServiceStartDate, T.ServiceEndDate
FROM yourTable as T
INNER JOIN yourDateTable as D
ON D.DateField >= T.ServiceStartDate
AND D.DateField <= IsNull(T.ServiceEndDate, GetDate()) as C
PatHartman

ASKER
Thanks Dale,
I should have mentioned that I tried something similar but only for the ServiceStartDate since the End date can be null.  So I left the End date in the where clause and it was still too slow.  I'm running this from Access so I'm not sure what would happen if I used Nz() for the end date so I could use it in the join.  Would Access be able to translate that to T-SQL or would it just ask for the million rows to be returned?  I spend so much time doing queries with Access that my RDBMS specific skills are somewhat limited.

I've also tried running the queries directly in SSMS and they just sit and churn.

I will try your variation tomorrow just in case.  But doesn't UtilityID have to be in the outer Select since I need to have that returned so I have a count by day, by utility?
ASKER CERTIFIED SOLUTION
Dale Fye

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
PatHartman

ASKER
I'm running it from Access because Access makes every effort to pass-through all queries.  So, unless I've done something to prevent the query from being passed-through, all I incur doing it this way is a small amount of overhead to have the ODBC driver convert the SQL syntax for me.  The recordset returned will be about 3500 rows which is not large in the greater scheme of things especially since we are starting with over a million accounts.   As I said, I also ran the queries directly in SSMS and they still wouldn't run.

I'll try this tomorrow.  It is quite possible, I did the non-equi-join incorrectly.  I'll let you know if it works.  I also didn't do it as a sub-select.  I'm not sure what the sub-select is doing for you in this case.  Why wouldn't a simple join with a group by suffice?
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
Scott Pletcher

First try using a tally table, which is just a table of sequential, keyed integers.  Literally a table with just a single integer column.  Typically using it runs very efficiently to generate multiple rows.

I can't post the code to create a tally table because work filters block trying to post any CROSS APPLY(!), but such code is easily available via Google.  A lot of people name the integer column "N"; I hate single-letter column names, so I changed the name to "number", which is what I use in the code below.

After creating that table (once), here is the main query:

SELECT
    d.UtilityID,
    ca1.CalendarDate,
    SUM(1) AS Active,
    SUM(CASE WHEN ca1.CalendarDate = d.ServiceStartDate THEN 1 ELSE 0 END) AS Added,
    SUM(CASE WHEN ca1.CalendarDate = d.ServiceEndDate THEN 1 ELSE 0 END) AS Dropped
FROM #data d
INNER JOIN dbo.tally t ON t.number BETWEEN 0 AND DATEDIFF(DAY, ServiceStartDate, ISNULL(ServiceEndDate, GETDATE()))
CROSS APPLY (
    SELECT DATEADD(DAY, t.number, d.ServiceStartDate) AS CalendarDate
) AS ca1
GROUP BY d.UtilityID, ca1.CalendarDate
ORDER BY d.UtilityID, ca1.CalendarDate
PatHartman

ASKER
At the moment, I'm working with Dale's suggestion and we are very close but not quite right.  I should be able to independently verify each of the numbers this query produces.

Adds for today = Select Count(*) where StartDate = today
Drops for today = Select Count(*) where DropDate = today
Active for today = Select Count(*) where StartDate <= today and (DropDate > today or DropDate Is Null)

The adds and drops match for any day I choose but the Active count is off.  I added criteria to get a sample for a small utility and Dale's query gives me 18,051 for today but the straight count gives me 18,128.  Summing all the adds and drops gives me 132,347 - 114,298 = 18,049.  Today's adds and drops are 19 and 41 respectively so that doesn't account for the difference between 18.051 and 18,049 or 18,128.

I'll try to work with a smaller sample to see if I can figure it out.

The only change I made to Dale's query was to add a missing paren before the "as C" and I changed the GetDate() to be GetDate() +1
Dale Fye

Pat,

Don't forget that my numbers are grouped by UtilityID and CalendarDate,

If you add the UtilityID to your queries above and then add the Group By clause, it might help you track down the differences.

Dale
Your help has saved me hundreds of hours of internet surfing.
fblack61
Dale Fye

The other thing that might be causing an issue is GetDate()

GetDate() is like Now() in Access, and returns a date/time value.
PatHartman

ASKER
My count queries are filtered the same way as the "big" query.

I don't think that GetDate() is causing the problem since the counts for any given day are accurate.  Something is happening in the summing and it also affects the counts when I run them individually.  I should be able to add all the adds and drops using

Select Utility, Count(*) As AddCount Where StartDate <= today

Select Utility, Count(*) as DropCount Where DropDate <= today

Netting those two numbers should give the active count for today but that number is different from this:

Select Utility, Count(*) as Active Where StartDate <= today And (DropDate is null or DropDate> today)

If the <= are "off", it should only be by the individual count for today and the count is off by more.
Dale Fye

How about using:

Select Utility, Count(*) as Active (Where StartDate <= today) And (NZ(DropDate, Date() + 1)> today)
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
PatHartman

ASKER
This query produces 18051, 19, 41 for today - active, start, end

SELECT C.calcFPNAForecastUtilityAbrev
, C.Date
, SUM(1) as Active
, SUM(CASE WHEN C.Date = C.RunDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.Date = C.MeterServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
      SELECT D.Date, T.calcFPNAForecastUtilityAbrev, T.RunDate, T.MeterServiceEndDate
      FROM accessdb.dbo.tblUniqueAccounts as T
            INNER JOIN concourse.dbo.SYS_Calendar as D
            ON D.Date >= T.RunDate
            AND D.Date <= IsNull(T.MeterServiceEndDate, GetDate() +1)) as C
where c.calcFPNAForecastUtilityAbrev = 'CT_CLP'
GROUP BY C.calcFPNAForecastUtilityAbrev, C.Date
order by c.date desc

I wrapped the above query in another query and summed the adds and drops (131,853, 113,843) and end up with 18,010 net.

Select
sum(d.startedthisday) as adds,
sum(d.endedthisday) as drops,
sum(d.startedthisday - d.endedthisday) as net
From
(SELECT C.calcFPNAForecastUtilityAbrev
, C.Date
, SUM(1) as Active
, SUM(CASE WHEN C.Date = C.RunDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.Date = C.MeterServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
      SELECT D.Date, T.calcFPNAForecastUtilityAbrev, T.RunDate, T.MeterServiceEndDate
      FROM accessdb.dbo.tblUniqueAccounts as T
            INNER JOIN concourse.dbo.SYS_Calendar as D
            ON D.Date >= T.RunDate
            AND D.Date <= IsNull(T.MeterServiceEndDate, GetDate() +1)) as C
where c.calcFPNAForecastUtilityAbrev = 'CT_CLP' and c.date <= getdate()
GROUP BY C.calcFPNAForecastUtilityAbrev, C.Date
) as d

Does anyone have any idea what is causing the difference?  Is my logic flawed?
Vitor Montalvão

Maybe the date filter (c.date <= getdate())?
PatHartman

ASKER
I don't understand what you are trying to tell me Vitor.  c.date <=GetDate() restricts the rows to today and earlier.  I don't want future days.  It just confuses the process.
This is the best money I have ever spent. I cannot not tell you how many times these folks have saved my bacon. I learn so much from the contributors.
rwheeler23
Dale Fye

Pat,

It is now 2:34 pm at my house.  If the server is in the Eastern Time Zone, GetDate() + 1 will equal"

'2016-03-17 2:34:00'

if your ServiceEndDate is a DATE only field then doing a <= will also capture the 2016-03-17 dates in the query, as will a < against the GetDate() + 1, so you might need to simply use <= GetDate()

Dale
PatHartman

ASKER
I added the +1 because I was trying to figure out why the query did not count "future" active records.  We add accounts when the client signs up with us and we specify a start date which is normally in the future.  Same with drop dates.  They are also added in the future.  The future might be tomorrow or it might be next month in the case of someone who is moving.  So currently we have start dates that go out to 3/30 and drop dates as far out as 4/12 but for the days from 3/17 to the last start/stop date, the active counts are incorrect and I can't figure out what they are counting.

You are not going to believe what I just found out -  They reload one of the tables I was using at 11:30 AM so when I was running my counting queries, it was loading so I was getting different counts from different queries.  You can't make this stuff up.  They told me that the files were refreshed over night between midnight and 6 AM (I'm working with day-old data).  Well, it turns out that most of the files are refreshed over night except for the ones that are refreshed during the middle of the day!!!!!!  So, I'm going to do my balancing on week old data so I don't have a moving target.

I'm calling it a day.  I'll pick this up tomorrow after I copy all the data to my local database where it won't get updated while I'm in the middle of validating the counts.  Thanks.  We're almost there.
Dale Fye

LOL,  You just cannot make this stuff up!

Is this the same client where the contract IT support said "nothing changed"?
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
PatHartman

ASKER
No.  These guys are generally on the ball.  This was a recent change that no one thought to mention to me.  My challenge with this client is data.  A service bureau does the bulk of their processing and what we get from them are downloaded "reports" which makes all the data as of yesterday (except for some that comes during the day as I just found out).  We have no clue what the underlying schema is (they won't tell us) and when they make us a "report", they don't seem to understand that they need to include all the "key" fields necessary to make a unique record so joining the various reports always results in losing records or duplicating records.  Every file has "duplicates" that I have to get rid of plus there are timing differences so that for example if a utility changes the meter number for a particular meter, not all the files get updated at the same time so on any given day, we "loose" about 200 accounts to just this discrepancy.  You go along for years having sane customers with solid data and then the worm turns and you end up with chaos and bad data.  I can't solve the real problems for either client which is very frustrating.
PatHartman

ASKER
Thanks to all of you.  Here is the final solution.

SELECT
  C.calcFPNAForecastUtilityAbrev
, C.RateGrp
, C.ZoneName
, C.Date
, SUM(1) as Active
, SUM(CASE WHEN C.Date = C.RunDate THEN 1 ELSE 0 END) as StartedThisDay
, SUM(CASE WHEN C.Date = C.MeterServiceEndDate THEN 1 ELSE 0 END) as EndedThisDay
FROM (
      SELECT D.Date, T.calcFPNAForecastUtilityAbrev, T.RateGrp, T.ZoneName, T.RunDate, T.MeterServiceEndDate
            FROM accessdb.dbo.tblUniqueAccounts as T
                  INNER JOIN concourse.dbo.SYS_Calendar as D
            ON D.Date >= T.RunDate
                  AND D.Date <= IsNull(T.MeterServiceEndDate, GetDate() +1)) as C                  --- make GetDate an argument to calc future days

GROUP BY C.calcFPNAForecastUtilityAbrev, C.RateGrp, C.ZoneName, C.Date
order by C.calcFPNAForecastUtilityAbrev, c.date desc, c.RateGrp
Dale Fye

So, how long is it taking this to run now?

Dale
Experts Exchange has (a) saved my job multiple times, (b) saved me hours, days, and even weeks of work, and often (c) makes me look like a superhero! This place is MAGIC!
Walt Forbes
PatHartman

ASKER
I ran it originally with just the utility and the date and it took 15 seconds.  Once I added the other two fields, it went to a little over a minute.  That's a huge improvement over my code solution.  I still don't have a complete solution for the forecasting part.  I'll probably try to make something similar that goes forward.  This is a little more complicated since it involves incorporating our analysts forecasts for adds and drops.