slightwv (䄆 Netminder)

asked on

Who can improve this SQL?

Edited:

Even though the end result will be Postgres, I would entertain SQL from other leading database products.  So, I've added additional Topic Areas.


A more database independent testcase setup will be provided in a separate comment.

end Edit....


Postgres 14.2.


Looking for alternate methods of writing the SQL.  I have working SQL but keep thinking there is a better way and I've gone down one rabbit hole and cannot see different ways to solve the problem.


I have a list of calendar days (yes, I call it months.... deal with it) that employees are supposed to work, by year and month.


I have a table of vacation days/months.

I need a count by quarters of number of employees that took no vacation days for that quarter.

Test setup only has 60 rows and any SQL will run just fine.


My real-world tables are 40+ million calendar days and a couple hundred thousand vacations... 


drop table if exists calendar_days;
drop table if exists vacation_days;

create table calendar_days(day_id int, emp_id int, data_year int, data_month int);
create table vacation_days(day_id int, had_fun int);
insert into calendar_days select row_number() over(), emp_ids, 2019, months from generate_series(1,5) emp_ids cross join generate_series(1,12) months ;
insert into vacation_days select day_id, 1 from calendar_days where emp_id in (1,2) and data_month=1;

insert into vacation_days select day_id, 1 from calendar_days where emp_id=4 and data_month=12;



SQL I have:

with get_worked_days as (
   select
      day_id,
      emp_id,
      data_year,
      data_month,
      case when v.had_fun is null then 0 else 1 end missed_work
   from calendar_days c
      left join vacation_days v using(day_id)
),
sum_by_quarter as (
   select
      data_year,
      (data_month-1)/3+1 data_quarter,
      emp_id,
      sum(missed_work) vacation_days
   from get_worked_days
   group by
      data_year,
      (data_month-1)/3+1,
      emp_id
)
select
   data_quarter,
   count(*) quarter_count
from sum_by_quarter
where vacation_days=0
group by data_quarter
order by data_quarter
;


ste5an

These things are hard to predict. I always use generated test data to get at least the same magnitude of rows.

Without testing: can Postgres use indices on expressions like (data_month - 1) / 3 + 1?
If yes, then I would think it should be pretty optimal. Otherwise I would add a sub-query to sum over months first and then over the calculated quarters. Assuming that you have a clustered index over year, month, day or even a partitioned table.

And of course: What area of business is this?

In HR and finance such reports are normally split into monthly, quarterly and yearly reports which use precalculated data, because otherwise it is hard to get revision-safe reports.
So doing a daily aggregation into e.g. monthly rollups can solve possible runtime issues.
slightwv (䄆 Netminder)

ASKER

This was more about different methods of writing the SQL.  More of how to do a count of counts without the ability to nest aggregates.  I saw it as an interesting SQL puzzle that I'm not sure I ever encountered before.

Feel free to scale it up if you desire?  For this exercise, I didn't feel it necessary.

Yes, you can create expression indexes but again, not really necessary for this exercise.

The original SQL uses two aggregates.  I was able to rewrite it myself using a single aggregate but it is still pretty ugly SQL.  Didn't provide it because I wanted to see what else you all could come up with without seeing what I had.

Decided to throw it out to the EE Experts to see what their SQL minds can come up with outside of the box.
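[Editor's note: for readers following along, the "count of counts" constraint above comes from the fact that aggregate calls cannot be nested; a minimal sketch against the question's test case, restricted to Q1 for illustration:]

```sql
-- Nesting aggregates directly is rejected by Postgres:
--   select count(sum(...))  -->  ERROR: aggregate function calls cannot be nested
-- so a "count of counts" normally needs a derived table (or CTE):
select count(*) as quarter_count
from (
   select emp_id,
          sum(case when v.had_fun is null then 0 else 1 end) as vacation_days
   from calendar_days c
      left join vacation_days v using(day_id)
   where (data_month-1)/3+1 = 1        -- Q1 only, for illustration
   group by emp_id
) per_emp
where vacation_days = 0;
-- against the test data this should count the 3 employees (3, 4, 5)
-- with no Q1 vacation
```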
>>The original SQL uses two aggregates.
You can obviously replace the second sub-query with the final query and a HAVING clause and the COUNT(*) with a COUNT(DISTINCT) or SUM(CASE..END) depending on your calendar logic.

But after rereading your statement, especially the final one: here in Germany we have a regulatory vacation in each quarter, thus the entire construct would be a constant expression for count(): it is always zero. So then you wouldn't need to join the vacation table at all; just do a trivial quarter calculation on your default calendar and output zero.
Hi,
Are these tables partitioned? If not, you should consider partitioning the tables accordingly.
And enable all these parameters. The last one should/could speed up your aggregation in the query.

enable_parallel_append (boolean) 
        
Enables or disables the query planner's use of parallel-aware append plan types. The default is on.
        
enable_parallel_hash (boolean) 
        
Enables or disables the query planner's use of hash-join plan types with parallel hash. Has no effect if hash-join plans are not also enabled. The default is on.
        
enable_partition_pruning (boolean) 
        
Enables or disables the query planner's ability to eliminate a partitioned table's partitions from query plans. This also controls the planner's ability to generate query plans which allow the query executor to remove (ignore) partitions during query execution. The default is on. See Section 5.11.4 for details.
        
enable_partitionwise_join (boolean) 
        
Enables or disables the query planner's use of partitionwise join, which allows a join between partitioned tables to be performed by joining the matching partitions. Partitionwise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have one-to-one matching sets of child partitions. Because partitionwise join planning can use significantly more CPU time and memory during planning, the default is off.
        
enable_partitionwise_aggregate (boolean) 
        
Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on a partitioned tables performed separately for each partition. If the GROUP BY clause does not include the partition keys, only partial aggregation can be performed on a per-partition basis, and finalization must be performed later. Because partitionwise grouping or aggregation can use significantly more CPU time and memory during planning, the default is off.
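[Editor's note: these are plain planner GUCs and can be toggled per session rather than server-wide; a minimal sketch:]

```sql
-- session-level only; does not persist past the connection
set enable_partitionwise_join = on;
set enable_partitionwise_aggregate = on;

-- confirm the current setting
show enable_partitionwise_aggregate;
```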


Regards,
    Tomas Helgi
You all are thinking way too deep here.

My actual data and tables have nothing to do with vacations or working days.  I made up a test case, and vacations seemed like a good analogy that people would understand.

ste5an,
I'd be willing to look at whatever rewrite you have.

Tomas,
In the few years I've been using Postgres, partitioning hasn't ever sped up queries.

Again, this really isn't about new indexes or table designs.  I'm looking for different ways to think about the SQL approach itself.

with get_worked_days as (
   select
      emp_id,
      data_year,
      (data_month-1)/3+1 as data_quarter,
      sum(case when v.had_fun is null then 0 else 1 end) missed_work
   from calendar_days c
      left join vacation_days v using(day_id)
   /*where ...limit year/month as much as possible that matches your requirements */
   group by
      emp_id,
      data_year,
      (data_month-1)/3+1
   having
      sum(case when v.had_fun is null then 0 else 1 end) = 0
)
select
   data_year,
   data_quarter,
   count(*)
from get_worked_days
group by
   data_year,
   data_quarter
order by
   data_year,
   data_quarter
>>having  sum(case when v.had_fun is null then 0 else 1 end) = 0

Interesting moving the where up into the CTE but still does two aggregates and on my scaled up actual tables:  really doesn't change the plans.

Helpful points!
hmm, without testing:

SELECT  (data_month - 1) / 3 + 1 AS data_quarter,
        COUNT(*) AS quarter_count
FROM    calendar_days c
    LEFT JOIN vacation_days v USING(day_id)
GROUP BY (data_month - 1) / 3 + 1,
        ROLLUP(emp_id, data_year)
HAVING  SUM(CASE WHEN v.had_fun IS NULL THEN 0 ELSE 1 END) = 0

Open in new window

I gave a test case for a reason?

expecting 4 rows, got 36...
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=f1dc6a39ba079b801998fa4e8cd74ee5
>>Interesting moving the where up into the CTE but still does two aggregates and on my scaled up actual tables:  really doesn't change the plans.
You're using two non-hierarchical aggregates, thus the engine needs two aggregation steps in the plan.
I should change the GROUPing in the first query, overlooked that before:

with get_worked_days as (
   select
      emp_id,
      data_year,
      (data_month-1)/3+1 as data_quarter,
      sum(case when v.had_fun is null then 0 else 1 end) missed_work
   from calendar_days c
      left join vacation_days v using(day_id)
   /*where ...limit year/month as much as possible that matches your requirements */
   group by
      data_year,
      (data_month-1)/3+1,
      emp_id
   having
      sum(case when v.had_fun is null then 0 else 1 end) = 0
)
select
   data_year,
   data_quarter,
   count(*)
from get_worked_days
group by
   data_year,
   data_quarter
order by
   data_year,
   data_quarter 
>>I should change the GROUPing in the first query, overlooked that before:
Changing the order of the columns?  I might have missed something else.

Not sure I've seen the order of the columns in a group by changing anything as far as the execution.  But no change over the other one.

Will go ahead and give more helpful points since the other one was worth more than 50 points!
ASKER CERTIFIED SOLUTION
Sean Stuber

(The accepted solution is only visible to Experts Exchange members.)
There ya go!

That one is outside the box!

Helpful points for now.

Don't want to accept it now because others might think I've already found a solution and move on....
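[Editor's note: since the accepted post is members-only, here is one hedged guess at the kind of single-pass, single-aggregate-level shape that could solve it; this is an editor's sketch, not necessarily Sean's actual answer. It counts all distinct employees per quarter and subtracts those with at least one vacation row in that quarter, using Postgres's FILTER clause; data_year is omitted because the test data covers a single year:]

```sql
select (data_month-1)/3+1 as data_quarter,
       count(distinct emp_id)
         - count(distinct emp_id) filter (where v.had_fun is not null)
         as quarter_count
from calendar_days c
   left join vacation_days v using(day_id)
group by (data_month-1)/3+1
order by data_quarter;
-- against the test data this should yield 3, 5, 5, 4 for quarters 1-4
```

Note one behavioral difference: unlike the question's original SQL, this also emits a row (with count 0) for quarters where every employee took vacation.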
Since SQL is close to interchangeable between vendors these days and interest here seems to have dropped off, I've added other leading database products to see if anyone else has ideas.

Please don't just post variations of what has already been provided.  Not looking for simple porting to say Oracle.

I'm looking for different ways to write the SQL.

I've also added a more general data load for the calendar table.  Sorry Oracle folks...  I'll let you modify the insert.
insert into calendar_days values
(1,1,2019,1),
(2,1,2019,2),
(3,1,2019,3),
(4,1,2019,4),
(5,1,2019,5),
(6,1,2019,6),
(7,1,2019,7),
(8,1,2019,8),
(9,1,2019,9),
(10,1,2019,10),
(11,1,2019,11),
(12,1,2019,12),
(13,2,2019,1),
(14,2,2019,2),
(15,2,2019,3),
(16,2,2019,4),
(17,2,2019,5),
(18,2,2019,6),
(19,2,2019,7),
(20,2,2019,8),
(21,2,2019,9),
(22,2,2019,10),
(23,2,2019,11),
(24,2,2019,12),
(25,3,2019,1),
(26,3,2019,2),
(27,3,2019,3),
(28,3,2019,4),
(29,3,2019,5),
(30,3,2019,6),
(31,3,2019,7),
(32,3,2019,8),
(33,3,2019,9),
(34,3,2019,10),
(35,3,2019,11),
(36,3,2019,12),
(37,4,2019,1),
(38,4,2019,2),
(39,4,2019,3),
(40,4,2019,4),
(41,4,2019,5),
(42,4,2019,6),
(43,4,2019,7),
(44,4,2019,8),
(45,4,2019,9),
(46,4,2019,10),
(47,4,2019,11),
(48,4,2019,12),
(49,5,2019,1),
(50,5,2019,2),
(51,5,2019,3),
(52,5,2019,4),
(53,5,2019,5),
(54,5,2019,6),
(55,5,2019,7),
(56,5,2019,8),
(57,5,2019,9),
(58,5,2019,10),
(59,5,2019,11),
(60,5,2019,12);


It's been a day since I added the additional Topic Areas with no response.  Guess I'll go ahead and close it.

Sean wins with a completely different approach to solve the problem.

Sorry Scott, can't give you co-answerer points.  Your SQL worked but really didn't alter the plans nor was it really all that different from what I had already posted.
Did you put the appropriate WHERE clause that I commented about in the SQL?

That could make the biggest difference in performance by reducing the number of rows to be read.

It wasn't 100% clear from your posts what the specific code needed to be, but I did state it was needed:
...
USING(...)
/*where ...limit year/month as much as possible that matches your requirements */  --<<--

Did you add the WHERE clause?
>>Did you add the WHERE clause?

Yes and no.

First:
I've spent the last two years rewriting/tuning SQL where the previous developers believed that very thing:
I have a 500 line SQL with 20-30 CTEs:  Each CTE reduces the number of rows dealt with from one to the other.  So by the time it gets to the final query, it has the least number of rows.

False belief.

Not sure about SQL Server, but in Oracle and Postgres the optimizers will rewrite the SQL for you, so adding a WHERE in a CTE to limit rows normally doesn't do what people believe it does.  It is up to the optimizer to decide if it will materialize that CTE and prune rows before moving on to other CTEs.  From my experience, manually materializing a CTE in an attempt to do this decreases overall performance more than it helps.


Second:
Not really relevant to the question's intent.  Any WHERE I would add to your SQL, I would have also added to mine, so, it's still a wash, right?
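[Editor's footnote to the materialization point above: since Postgres 12 the optimizer's choice can be overridden explicitly with MATERIALIZED / NOT MATERIALIZED; a sketch reusing the question's first CTE:]

```sql
with get_worked_days as materialized (   -- force a materialization fence
   select c.day_id, c.emp_id, c.data_year, c.data_month,
          case when v.had_fun is null then 0 else 1 end as missed_work
   from calendar_days c
      left join vacation_days v using(day_id)
)
select emp_id, sum(missed_work) as vacation_days
from get_worked_days
group by emp_id;
-- "as not materialized (...)" does the opposite: it forces inlining
-- even when the CTE is referenced more than once
```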