Solved

Optimize Complex Query

Posted on 2006-11-05
8
380 Views
Last Modified: 2010-05-18
I'll try to simplify this to only the relevant information and hope that I don't inadvertently omit something important.  (-:

I have...
1.  A table of payroll deductions.  Let's say its fields are (amount DECIMAL, person_id CHAR(9), deduction_date DATE).  This table is fairly big (probably around a million rows and growing).
2.  A table of EXPECTED deductions -- common amounts that we'd expect to see.  Let's say its only field is (amount DECIMAL).  This table is tiny (probably no more than 40 rows ever).

I wish to produce a report that...
1.  Includes only UNUSUAL deductions (i.e. amounts that do not appear in expected_deductions)
2.  Includes only items with a particular date.
3.  Is ordered by the TOTAL of all deductions for a particular person on that date.

For example, I might have...
Person A   $790
Person A     $30
Person B   $800
Person C   $500

Notice that Person A's deductions appear first because the TOTAL of 790 and 30 is greater than Person B's 800.  They're still listed as separate lines, however, so just saying "throw in a GROUP BY" isn't gonna cut it.  (-:


I've tried a bunch of different queries, but the best execution time I've gotten so far (around 20 seconds) was for this:

SELECT * FROM deductions
INNER JOIN (SELECT SUM(amount) total, BUID FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id) tblX
    ON (tblX.person_id = deductions.person_id)
LEFT JOIN expected_deductions ed ON (ed.amount = DOE.amount)
WHERE deduction_date = '2006-11-03' AND ed.amount IS NULL
ORDER BY tblX.total DESC, amount DESC;


Any query that obtains the result I want, exploiting any MySQL 5.0 features (including perhaps a new index that I've not yet tried) will do it.  I'll settle, if necessary, for a solution that does some work in PHP too, but at this point I'm being stubborn and I'm determined to get the DBMS to tell me what I want in under 5 seconds.  (-:
0
Comment
Question by:VoteyDisciple
  • 5
  • 3
8 Comments
 
LVL 14

Expert Comment

by:racek
ID: 17877030
Do you have INDEX on (deduction_date, person_id) ? - I mean ONE index over two columns
0
 
LVL 14

Accepted Solution

by:
racek earned 500 total points
ID: 17877081
you query cannot works ... what is DOE?
I suppose that something like that with index on (deduction_date, person_id) can works

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT * FROM deductions d  
      JOIN         sums AS tblX                ON (tblX.person_id = d.person_id)
      LEFT JOIN expected_deductions ed ON (ed.amount = tblX.total)
WHERE d.deduction_date = '2006-11-03'
    AND ed.amount IS NULL
ORDER BY tblX.total DESC, d.amount DESC;
0
 
LVL 19

Author Comment

by:VoteyDisciple
ID: 17877235
Whoops, DOE is what I'm here calling 'deductions'.  For that matter, BUID is what I'm here calling person_id.

I do have a wide variety of indices on this table including both (person_id, deduction_date) and one that ought to be perfect for that derived query: (person_id, deduction_date, amount).

According to EXPLAIN, MySQL chooses the index on (deduction_date) by itself for this query.  With a USE INDEX hint, however, performance only improves by a fraction of a second.

(What's baffling to me is why EXPLAIN reports that it'll have to scan about 40,000 rows with just the deduction_date, yet will have to scan nearly 300,000 with the larger index.  All the information it needs should be contained in that index; perhaps it is not as useful as I would have guessed.)
0
NEW Veeam Agent for Microsoft Windows

Backup and recover physical and cloud-based servers and workstations, as well as endpoint devices that belong to remote users. Avoid downtime and data loss quickly and easily for Windows-based physical or public cloud-based workloads!

 
LVL 19

Author Comment

by:VoteyDisciple
ID: 17877258
I've also tried moving the derived query to a temporary table; I assume your thinking was to get the ability to index over the temporary table?  This does gain some performance, but still not a whole lot.

In total, I've still brought the performance of the entire query down to about 13 seconds from 20.
0
 
LVL 14

Expert Comment

by:racek
ID: 17877264

LEFT JOIN expected_deductions ed ON (ed.amount = DOE.amount)
WHERE deduction_date = '2006-11-03' AND ed.amount IS NULL
ORDER BY tblX.total DESC, amount DESC;

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT a.*  FROM
(SELECT * FROM deductions AS d  
                 JOIN  sums         AS tblX   ON  tblX.person_id = d.person_id
              WHERE  d.deduction_date = '2006-11-03' ) AS a
LEFT JOIN expected_deductions ed ON ed.amount = a.amount
WHERE ed.amount IS NULL
ORDER BY a.total DESC, a.amount DESC;
0
 
LVL 14

Expert Comment

by:racek
ID: 17877266
sorry

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT a.*  FROM
(SELECT * FROM deductions AS d  
                 JOIN  sums         AS tblX   ON  tblX.person_id = d.person_id
              WHERE  d.deduction_date = '2006-11-03' ) AS a
LEFT JOIN expected_deductions ed ON ed.amount = a.amount
WHERE ed.amount IS NULL
ORDER BY a.total DESC, a.amount DESC;
0
 
LVL 14

Expert Comment

by:racek
ID: 17877285
if optimizer does not uses INDEX on (deduction_date, person_id), try:

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT a.*  FROM
(SELECT *
    FROM   deductions AS d,
                sums         AS tblX  
    WHERE  tblX.person_id    = d.person_id
         AND d.deduction_date = '2006-11-03' ) AS a
LEFT JOIN expected_deductions ed ON ed.amount = a.amount
WHERE ed.amount IS NULL
ORDER BY a.total DESC, a.amount DESC;
0
 
LVL 19

Author Comment

by:VoteyDisciple
ID: 17877795
The best performance I've been able to get is with this combination:

CREATE TEMPORARY TABLE sums (person_id INT UNSIGNED PRIMARY KEY, total INT) AS (SELECT person_id, SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id);

SELECT *
FROM deductions
INNER JOIN sums ON (deductions.person_id = sums.person_id)
LEFT JOIN expected_deductions ed ON (deductions.amount = ed.amount)
WHERE check_date = '2006-11--03'  AND ed.amount IS NULL
ORDER BY sums.total DESC, deductions.amount DESC

Total time to run this combination (now exploiting the best available index at each step) is around 8.5 seconds.


Adding that derived table hurt much more than it helped in this case, but switching to a temporary table for the list of sums did gain some improvement.  


I think I'm gonna call this adequate.  Twenty seconds (counting PHP processing time) is still a long time to wait for a webpage to load, but I think the "real solution" will be for the client to come up with criteria for this report that narrow down the results a bit more.  (-:
0

Featured Post

Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Introduction In this installment of my SQL tidbits, I will be looking at parsing Extensible Markup Language (XML) directly passed as string parameters to MySQL 5.1.5 or higher. These would be instances where LOAD_FILE (http://dev.mysql.com/doc/refm…
When table data gets too large to manage or queries take too long to execute the solution is often to buy bigger hardware or assign more CPUs and memory resources to the machine to solve the problem. However, the best, cheapest and most effective so…
The Email Laundry PDF encryption service allows companies to send confidential encrypted  emails to anybody. The PDF document can also contain attachments that are embedded in the encrypted PDF. The password is randomly generated by The Email Laundr…
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…

679 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question