Query to get customer counts per day

I am probably going to have to run this query in SQL Server but I can convert an Access solution to T-SQL so either will do.

I have a table of customers with their ServiceStartDate and ServiceDropDate.  Drop date will be null if the account is currently active.  In order to do proper forecasting, I need a count of all active accounts by day starting in Jan of 2009.  I have a calendar table that is populated with a record per day going from that time out to Dec of 2099.  The query I have uses a non-equi-join between the two tables and counts the results but it doesn't run in an acceptable amount of time (I've let it run for almost two hours before cancelling it).  There are 1.2 million customer records so the intermediate recordset is huge.  Here is the query as I have written it.  ACCESS_30DayMeterForecast is a view that does a very complicated calculation to obtain Zone.  SYS_Calendar is a table.

SELECT MF.Commodity, MF.UtilityID, MF.FinalZone, SYS_Calendar.Date, Count(*) AS MeterCount
FROM ACCESS_30DayMeterForecast AS MF INNER JOIN SYS_Calendar ON MF.ServiceStartDate <= SYS_Calendar.Date
WHERE (MF.DropDate Is Null Or MF.DropDate >[SYS_Calendar].[Date]) AND (SYS_Calendar.is_weekday = 1)
GROUP BY MF.Commodity, MF.UtilityID, MF.FinalZone, SYS_Calendar.Date;


 I'm looking for alternative solutions.

At the moment, there are no useful indexes on either table.  The customer table is reloaded every night and I could ask the DBA to add indexes if I am convinced that will help.

One idea I have is to use a cursor instead of trying to do this in a query.  The code would read each record in the customer table and increment buckets in an array.  The array would be two dimensional.  Year and day or I could just as easily make it one dimensional and simply use the integer part of the date to decide which buckets to update.  The inner loop starts from the customer's ServiceStartDate and increments the date, one day at a time until it gets to the drop date or the current date.  Then at the end of the process, the array will be written out and used in the forecasting.  Typically cursors are much slower than queries but this may not be true in this case due to the huge size of the intermediary table that the query engine would have to create prior to aggregating the data.
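A rough T-SQL sketch of that bucket idea, using a table variable keyed by date in place of an array (all table and column names here are taken from the description above; @Buckets is a placeholder):

DECLARE @Buckets TABLE (BucketDate date PRIMARY KEY, ActiveCount int NOT NULL);
DECLARE @start date, @drop date, @d date;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT ServiceStartDate, DropDate FROM ACCESS_30DayMeterForecast;
OPEN cur;
FETCH NEXT FROM cur INTO @start, @drop;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- inner loop: walk from the start date to the drop date (or today)
    SET @d = @start;
    WHILE @d < ISNULL(@drop, CAST(GETDATE() AS date))
    BEGIN
        UPDATE @Buckets SET ActiveCount = ActiveCount + 1 WHERE BucketDate = @d;
        IF @@ROWCOUNT = 0
            INSERT INTO @Buckets (BucketDate, ActiveCount) VALUES (@d, 1);
        SET @d = DATEADD(DAY, 1, @d);
    END;
    FETCH NEXT FROM cur INTO @start, @drop;
END;
CLOSE cur; DEALLOCATE cur;

SELECT BucketDate, ActiveCount FROM @Buckets ORDER BY BucketDate;

Note this is only a sketch of the described approach, not a recommendation: with 1.2 million customers the inner day-by-day loop still executes billions of UPDATE statements, so incrementing only the start-date and drop-date buckets and computing a running total afterwards would be far cheaper.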

John TsioumprisSoftware & Systems EngineerCommented:
Something is wrong with your table structure... no query should run for 2 hours; it either runs in a few seconds or minutes, or it won't run at all. If you have performance issues like that, maybe you should examine your criteria and apply the filtering before the actual query, perhaps by creating a filtered table and running against that as the source.
Of course, some sample data would be more than helpful...
Paul Cook-GilesSenior Application DeveloperCommented:
Could you add a CountOfActive field to your calendar table, and build a loop that counts the number of customers who have MF.ServiceStartDate <= SYS_Calendar.Date and (MF.DropDate Is Null or MF.DropDate >= SYS_Calendar.Date), and drops the count into the CountOfActive field for that SYS_Calendar.Date?
PatHartmanAuthor Commented:
John, what's wrong with the query is the two things I mentioned.
1. the base view has a complicated process that looks up Zone. **
2. the final view returns 1.2 million rows which now have to be matched to the calendar using a non-equi-join and that causes a huge intermediate table to be created which is summed for the final output.

Paul, that's interesting.  It might coerce the query engine into using an index if one existed on the ServiceStartDate (assuming you can index a view).  That might restrict the intermediate recordset to just the number of customers.  It's still large but not out of sight.

** there may be headway on this front.  The DBA may have actually acquiesced to my request from almost two years ago to add the calculated Zone to the base table after it is reloaded every night.  That way the stupid zone calculation only has to get done once each night when the table gets loaded, rather than every time any view that requires it runs.  The bulk of our data is processed by a third party and we reload our tables every night.  The DBA, rather than adding additional columns such as Zone (which is more complicated to calculate than you might imagine) insisted on leaving the tables "pure" (i.e. as they come from the vendor) and doing all the other stuff in each and every view.  In fact, Zone is calculated 3 ways for different users.  Why can't they agree on one method you might ask?  My answer is - don't get me started.

Scott PletcherSenior DBACommented:
Yeah, too much processing.  The way I see it, all you really need is:

1) the total number of active clients on the starting date (in this case, likely Jan 1, 2009)
2) the total number of clients that were activated or deactivated for each other day

I can't yet provide sample code for your specific data because there's no useable sample data available.

The daily totals could even be inserted into another table for easier processing.  Cluster that table by date (not by identity!).  Unless the data changes each day, all you need to add to that table is the current day's totals.

Then you can do a running total on the daily totals.  Even 10 years' worth of days is < 4,000 rows, and they'll already be stored sorted in date order.  That should be lightning quick.
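On SQL Server 2012 or later, that running total can be a single windowed SUM over the daily-deltas table (DailyDeltas and its columns are placeholder names for the table described above):

SELECT
    d.Commodity, d.UtilityID, d.FinalZone, d.[Date],
    SUM(d.Start_Count - d.Drop_Count)
        OVER (PARTITION BY d.Commodity, d.UtilityID, d.FinalZone
              ORDER BY d.[Date] ROWS UNBOUNDED PRECEDING) AS Active_Count
FROM DailyDeltas AS d
ORDER BY d.Commodity, d.UtilityID, d.FinalZone, d.[Date];

On earlier versions the same result can come from a correlated subquery summing all deltas up to each date; with only a few thousand rows per group, either form is cheap.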

[Btw, even if the DBA didn't want to "corrupt" the table structure from the vendor -- and there is some justification for that view -- he/she could still create a separate table with a 1-1 relationship to the original table and store computed values there.]
PatHartmanAuthor Commented:
Logically, it isn't a difficult task.  It is only a problem due to the size of the recordset.  Running a query that counts the number of clients added on a given day is very fast.  If it weren't for that pesky ability to drop, we'd be done.  Once you consider both adds and drops, there will be some overlap, but not much, so we're back up to very close to a million rows.

I agree about the "corrupt" but he and I have a different opinion on what that means.  I think that as long as he does nothing to change existing data or alter row counts, adding additional columns shouldn't cause any problem with integrity and it sure as hell would lighten the load on the network and the server.  We have 25 analysts slicing and dicing all day long using the views he created to incorporate the additional calculated fields we need that don't come from the third party.  Another silly field that has to be calculated in every single view for every single table is UtilityID.  Utilities change names over time and name is what the vendor stores.  That's fine except they don't alter history so when UI changed to Eversource recently, all the closed accounts stayed as UI but the active ones got changed to Eversource.  So, we have a conversion table that takes all variations (including some misspellings) and converts them to an ID so we're all working with the same set of 38 utilities rather than some of us using 56 (including the various name spellings).  You'd think a utility could spell its own name correctly on EDI transactions but I guess that is asking a lot..
Paul Cook-GilesSenior Application DeveloperCommented:
Another thought:  create a temp table containing only the data you actually need from the view, and slap whatever index you want on it.  Then use that in your query.  HTH.
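A minimal sketch of that temp-table idea, assuming the column names from the original query (the #MF name and index choice are illustrative):

-- Materialize only the columns the count needs; the expensive Zone
-- calculation in the view runs exactly once, here.
SELECT Commodity, UtilityID, FinalZone, ServiceStartDate, DropDate
INTO #MF
FROM ACCESS_30DayMeterForecast;

CREATE CLUSTERED INDEX IX_MF_Start ON #MF (ServiceStartDate, DropDate);

-- The original count query now reads the narrow, indexed temp table.
SELECT MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date], COUNT(*) AS MeterCount
FROM #MF AS MF
INNER JOIN SYS_Calendar AS c ON MF.ServiceStartDate <= c.[Date]
WHERE (MF.DropDate IS NULL OR MF.DropDate > c.[Date])
  AND c.is_weekday = 1
GROUP BY MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date];

The intermediate set is still huge, but the per-row cost drops because the view's calculations are no longer repeated for every joined row.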
Scott PletcherSenior DBACommented:
Can you provide some sample data?  I'm almost certain that you don't need to join back across all the days again to get the right totals.
Scott PletcherSenior DBACommented:
Maybe something like this?!:

CREATE TABLE ACCESS_Starts_Drops (
    Commodity int,
    UtilityID int,
    FinalZone int,
    Date date,
    Start_Count int,
    Drop_Count int,
    PRIMARY KEY ( Commodity, UtilityID, FinalZone, Date ) WITH ( FILLFACTOR = 99 )
)

DECLARE @start_date date
DECLARE @end_date date

SET @start_date = '20090101'
SET @end_date = GETDATE()

/*you need to add code here to load totals for @start_date *only* to ACCESS_Starts_Drops*/
SELECT <starting_active_total_for_@start_date>

/*load starts/drops totals for other days*/
;WITH
cteTally10 AS (
    SELECT * FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS numbers(number)
),
cteTally100 AS (
    SELECT 0 AS number
    FROM cteTally10 c1
    CROSS JOIN cteTally10 c2
),
cteTally10K AS (
    SELECT ROW_NUMBER() OVER(ORDER BY c1.number) AS number
    FROM cteTally100 c1
    CROSS JOIN cteTally100 c2
)
INSERT INTO ACCESS_Starts_Drops (
    Commodity, UtilityID, FinalZone, Date, Start_Count, Drop_Count
)
SELECT
    MF.Commodity, MF.UtilityID, MF.FinalZone, Date,
    SUM(CASE WHEN MF.ServiceStartDate = Date THEN 1 ELSE 0 END) AS Start_Count,
    SUM(CASE WHEN MF.DropDate = Date THEN 1 ELSE 0 END) AS Drop_Count
FROM cteTally10K days
CROSS APPLY (
    SELECT CAST(DATEADD(DAY, days.number, @start_date) AS date) AS Date
) AS assign_alias_names1
INNER JOIN ACCESS_30DayMeterForecast AS MF ON MF.ServiceStartDate = Date OR MF.DropDate = Date
WHERE Date <= @end_date
GROUP BY MF.Commodity, MF.UtilityID, MF.FinalZone, Date

/*compute final totals by date*/
SELECT Commodity, UtilityID, FinalZone, Date,
    ISNULL((SELECT SUM(Start_Count) - SUM(Drop_Count) FROM ACCESS_Starts_Drops ASD2
            WHERE ASD2.Commodity = ASD.Commodity AND ASD2.UtilityID = ASD.UtilityID AND
                  ASD2.FinalZone = ASD.FinalZone AND ASD2.Date <= ASD.Date), 0) AS Active_Count
FROM ACCESS_Starts_Drops AS ASD
ORDER BY Commodity, UtilityID, FinalZone, Date
PortletPaulEE Topic AdvisorCommented:
>>"The customer table is reloaded every night "
you re-load 1.2 million records every night?????

>>"and I could ask the DBA to add indexes "
s/he probably doesn't want indexes on that table to make the reloading faster
tackle the reasons for re-loading that table

>>"if I am convinced that (indexes) will help."
You do know that indexes make massive improvements to query efficiency - right?

1.2 million * 10 years * 365 days = 4,380,000,000 rows (thereabouts, and you aren't using the full 10 years - yet)

I don't see how you avoid that - but once you have the history, you don't need to recalculate it.
You could do it one day/week/month/quarter/year at a time, just updating a table.
Perhaps add an integer column to that calendar table for this?

Your calendar table is indexed? (how)

Please provide sample data (way less than 1.2million!! way way less!)
Both tables: calendar & customer
PatHartmanAuthor Commented:
you re-load 1.2 million records every night?????

As I said, we don't maintain the data.  A third party does and it is waaaaay more than 1.2 million rows.  This is only a single table and there are dozens.  The ones that deal with invoices are much bigger since every one gets 12 of those per year.  

I'm working on a sample.
PatHartmanAuthor Commented:
OK, here's the sample.  I imported 279 records before I got tired of obfuscating them.  The qCountByDay does the non-equi-join and creates one record per day and counts them.  The earliest start date is 12/9/2009 and the latest start date is 12/19/16.  I changed the query so it wouldn't go past today.  That will help some.  However, those 279 records for the specified period result in 26,356 records.  The actual data has more variation in Utility and Zone so will contain more.

Be careful.  When I imported the data, Access helpfully changed the 1's in the calendar to -1's so if you swap back and forth between Access and SQL Server, you'll need to change the query since True doesn't work in SQL Server.

I didn't include any code to multiply the data but you can do it easily with an append query that just copies what is there and appends it back. So running it a few times will get you a lot of rows.  Just take the PK off the Access "table".  For volume testing, it doesn't matter if the account numbers are duplicated since they are summarized away anyway.

Have fun.
John TsioumprisSoftware & Systems EngineerCommented:
I think I got a little better performance by tweaking your query like this:
SELECT MF.Commodity, MF.UtilityID, MF.FinalZone, SYS_Calendar.Date, Count(*) AS MeterCount
FROM ACCESS_30DayMeterForecast AS MF, SYS_Calendar WHERE MF.ServiceStartDate <= SYS_Calendar.Date
AND (MF.DropDate Is Null Or MF.DropDate >[SYS_Calendar].[Date]) AND (SYS_Calendar.is_weekday = True) AND SYS_Calendar.Date <= Date()
GROUP BY MF.Commodity, MF.UtilityID, MF.FinalZone, SYS_Calendar.Date
ORDER BY MF.Commodity, MF.UtilityID, MF.FinalZone, SYS_Calendar.Date;
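For reference, a T-SQL translation of that Access query (True becomes 1, Date() becomes GETDATE(); it is still the same non-equi-join, so the intermediate-set size problem remains):

SELECT MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date], COUNT(*) AS MeterCount
FROM ACCESS_30DayMeterForecast AS MF
INNER JOIN SYS_Calendar AS c ON MF.ServiceStartDate <= c.[Date]
WHERE (MF.DropDate IS NULL OR MF.DropDate > c.[Date])
  AND c.is_weekday = 1
  AND c.[Date] <= CAST(GETDATE() AS date)
GROUP BY MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date]
ORDER BY MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date];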


PortletPaulEE Topic AdvisorCommented:
I haven't used Access for years and never really that much, so I'll just suggest something and leave it at that.
I don't get why tables would be un-indexed, or re-loaded.
Good Luck

add an integer column to your calendar table (I used "ActiveClientCount" in the example)
set the variables to a date range
start small

eventually all you will need to do is "yesterday", once per day

declare @start_date datetime
declare @end_date   datetime

set @start_date = '20090101'
set @end_date   = '20091231'

UPDATE c
SET c.ActiveClientCount = d.n
FROM sys_calendar AS c
INNER JOIN (
      SELECT sc.[Date]
           , COUNT(*) n
      FROM ACCESS_30DayMeterForecast AS a
      INNER JOIN sys_calendar AS sc ON sc.[date] between @start_date and @end_date
                                   AND sc.[Date] BETWEEN a.ServiceStartDate AND ISNULL(a.DropDate, @end_date)
      GROUP BY sc.[Date]
      ) d
      ON c.[date] = d.[date]


PatHartmanAuthor Commented:
I don't get why tables would be un-indexed, or re-loaded.
THIS IS A DATA WAREHOUSE, in case you couldn't work that out from my earlier descriptions.  There are no relationships because no one can update the data.  There are few indexes because most of the users are in Excel and their criteria are either active or all.  Indexing the status won't help at all since there are too few distinct values.

The problem with attempting to keep a running total is data changes - OUT OF OUR CONTROL -  and there would be no way to account for that in a running update procedure.

A third-party manages all the data and interaction with utilities.  All communications between the third party and the utilities happen via EDI transactions.  We have a separate web app that collects new applications and provides basic data to existing customers.  Our previous interaction with the third party was to use their "reports" extracts.  That meant that all day, every day, huge recordsets were being pulled down over the internet.  Doing the pull overnight and reloading SQL server tables daily has eliminated that bottleneck.  There is no way the vendor would ever grant us access to the real database and if they were crazy enough to do that, it would still be slower than molasses because everything would still get pulled over the internet.  Cloud/shmoud.  Anyone keeping their data remotely has a problem getting at it.
PortletPaulEE Topic AdvisorCommented:
Thanks. I guess our experiences with data warehouses are different. I think of them as large star schemas of fact and dimension tables, and I have not experienced using them via Access.

Many years ago, when I worked for a company providing a client-server product, they used a reporting approach that populated Access dbs with data and produced reports from them. But Access was so limited in capacity back then that this approach was abandoned and replaced by BI products drawing straight from the main db. More recently that was abandoned in turn and replaced by an ETL tool (Pentaho), so the transactional db was transformed into a separate star schema, with a BI tool (Jaspersoft) used for end-user reporting as well as stock reports.

So I guess my suggestion may not be of much use.
PatHartmanAuthor Commented:
The data is not being stored in an Access database.  The data is being stored in SQL Server.  My original comment was regarding the SQL syntax.  I will most likely end up with a pass-through query which would have to be T-SQL but I could translate a suggestion made using Access syntax.

Our data situation is unique, but the company is still small people-wise, so until they get large enough to justify a large IT staff that can write real applications and manage their business process, most of the analysts use Excel, and I am teaching the power users how to leverage Access to do certain types of pre-processing so they don't have to run VLOOKUPs that take 3 hours when they could be writing a simple query with a join.  Excel may be able to open a sheet that has a million rows, but it sure isn't quick about it.  Now that the users have a SQL Server source for their data, they no longer all need to download the same "reports" every day.  Instead, the download happens automatically overnight to a local SQL Server database.  That in itself provides a performance improvement.
PortletPaulEE Topic AdvisorCommented:
>>"The data is being stored in SQL Server. "

Then I presume the calendar data is a table that does not get re-loaded and it can (and should) have indexes on it - the date column in particular which could be the clustered index for that table.

Add indexes (to the tables that are re-loaded) after the re-load to help populate your initial backlog of calculations, again the dates will be important as they get referenced so many times. Remove the indexes after your calculations if they might interfere with the existing re-load process.
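Something like the following might sketch that indexing advice (CustomerBase is a placeholder for the real base table behind the ACCESS_30DayMeterForecast view, since views themselves can't be indexed without schemabinding):

-- Clustered index on the calendar's date column: range predicates
-- (BETWEEN, <=) become seeks instead of scans.  Skip this if the table
-- already has a clustered primary key on [Date].
CREATE UNIQUE CLUSTERED INDEX IX_SYS_Calendar_Date
    ON SYS_Calendar ([Date]);

-- On the nightly-reloaded customer base table: create after the reload,
-- and drop before it if it slows the load (column names assumed).
CREATE INDEX IX_Customer_Dates
    ON dbo.CustomerBase (ServiceStartDate, DropDate)
    INCLUDE (Commodity, UtilityID, FinalZone);

The INCLUDE columns let the count query be answered entirely from the index, without touching the base rows.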

Retain the active client result in a permanent column of the calendar table.
I think you would only need to update that count column on an incremental basis once you have the history.

small working example: http://sqlfiddle.com/#!3/4b921/1/1
I had hoped to use the execution plans there but they don't seem to be working sadly.

larger sample, here:  http://rextester.com/ZND78179
PatHartmanAuthor Commented:
Thanks.  I haven't been able to try the last suggestion yet.  I'll try it Wednesday or Thursday.
PortletPaulEE Topic AdvisorCommented:
Have you had an opportunity to test any suggestions?
PatHartmanAuthor Commented:
The problem is simply that the intermediate recordset is too large.  3600 rows from the Calendar table times 1.2 million from the customer table gives us over 4 BILLION rows.  I figured out how to generate the data one year at a time and then sum it.  I haven't automated it yet but that is where I am going.  I am selecting rows from the calendar between the start date of the range and today.  I am selecting data from the customer table where the service start date is between the start date of the range and the end date and the drop date of the account is either null or greater than the calendar date.  I ran all ten years and each year took between 3 and 15 minutes.  It probably depended on what else the server was doing at the time.  It's a little awkward now because the table I am appending to is in my test database but the source data is in the production database.  I'll automate it when I can get everything into the same database.
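The year-at-a-time chunking described above might look something like this (DailyMeterCounts is a placeholder name for the results table; DATEFROMPARTS assumes SQL Server 2012+):

-- Process one calendar year per pass to keep the intermediate set bounded.
DECLARE @yr int = 2009;
WHILE @yr <= YEAR(GETDATE())
BEGIN
    INSERT INTO DailyMeterCounts (Commodity, UtilityID, FinalZone, [Date], MeterCount)
    SELECT MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date], COUNT(*)
    FROM ACCESS_30DayMeterForecast AS MF
    INNER JOIN SYS_Calendar AS c ON MF.ServiceStartDate <= c.[Date]
    WHERE c.[Date] >= DATEFROMPARTS(@yr, 1, 1)
      AND c.[Date] <  DATEFROMPARTS(@yr + 1, 1, 1)
      AND c.[Date] <= CAST(GETDATE() AS date)
      AND (MF.DropDate IS NULL OR MF.DropDate > c.[Date])
      AND c.is_weekday = 1
    GROUP BY MF.Commodity, MF.UtilityID, MF.FinalZone, c.[Date];
    SET @yr += 1;
END;

Each pass joins the 1.2 million customers against at most ~365 calendar rows instead of all 3,600, which matches the 3-to-15-minutes-per-year timing reported above.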

The DBA has also finally agreed to store the calculated zone each day when the table is loaded so the view doesn't have to recalculate it each time it runs and that makes the view significantly faster so there's lots of happy campers in the office.

I'm going to close the thread.  Thank you all for trying to help.

PatHartmanAuthor Commented:
None of the suggestions solved the problem.  In the end, I had to break the process into pieces.