We help IT Professionals succeed at work.

# Optimize Complex Query

on
Medium Priority
397 Views
I'll try to simplify this to only the relevant information and hope that I don't inadvertently omit something important.  (-:

I have...
1.  A table of payroll deductions.  Let's say its fields are (amount DECIMAL, person_id CHAR(9), deduction_date DATE).  This table is fairly big (probably around a million rows and growing).
2.  A table of EXPECTED deductions -- common amounts that we'd expect to see.  Let's say its only field is (amount DECIMAL).  This table is tiny (probably no more than 40 rows ever).

I wish to produce a report that...
1.  Includes only UNUSUAL deductions (i.e. amounts that do not appear in expected_deductions)
2.  Includes only items with a particular date.
3.  Is ordered by the TOTAL of all deductions for a particular person on that date.

For example, I might have...
Person A   \$790
Person A     \$30
Person B   \$800
Person C   \$500

Notice that Person A's deductions appear first because the TOTAL of 790 and 30 is greater than Person B's 800.  They're still listed as separate lines, however, so just saying "throw in a GROUP BY" isn't gonna cut it.  (-:

I've tried a bunch of different queries, but the best execution time I've gotten so far (around 20 seconds) was for this:

SELECT * FROM deductions
INNER JOIN (SELECT SUM(amount) total, BUID FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id) tblX
ON (tblX.person_id = deductions.person_id)
LEFT JOIN expected_deductions ed ON (ed.amount = DOE.amount)
WHERE deduction_date = '2006-11-03' AND ed.amount IS NULL
ORDER BY tblX.total DESC, amount DESC;

Any query that obtains the result I want, exploiting any MySQL 5.0 features (including perhaps a new index that I've not yet tried) will do it.  I'll settle, if necessary, for a solution that does some work in PHP too, but at this point I'm being stubborn and I'm determined to get the DBMS to tell me what I want in under 5 seconds.  (-:
Comment
Watch Question

## View Solution Only

Commented:
Do you have INDEX on (deduction_date, person_id) ? - I mean ONE index over two columns
Commented:
you query cannot works ... what is DOE?
I suppose that something like that with index on (deduction_date, person_id) can works

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT * FROM deductions d
JOIN         sums AS tblX                ON (tblX.person_id = d.person_id)
LEFT JOIN expected_deductions ed ON (ed.amount = tblX.total)
WHERE d.deduction_date = '2006-11-03'
AND ed.amount IS NULL
ORDER BY tblX.total DESC, d.amount DESC;

Not the solution you were looking for? Getting a personalized solution is easy.

Commented:
Whoops, DOE is what I'm here calling 'deductions'.  For that matter, BUID is what I'm here calling person_id.

I do have a wide variety of indices on this table including both (person_id, deduction_date) and one that ought to be perfect for that derived query: (person_id, deduction_date, amount).

According to EXPLAIN, MySQL chooses the index on (deduction_date) by itself for this query.  With a USE INDEX hint, however, performance only improves by a fraction of a second.

(What's baffling to me is why EXPLAIN reports that it'll have to scan about 40,000 rows with just the deduction_date, yet will have to scan nearly 300,000 with the larger index.  All the information it needs should be contained in that index; perhaps it is not as useful as I would have guessed.)

Commented:
I've also tried moving the derived query to a temporary table; I assume your thinking was to get the ability to index over the temporary table?  This does gain some performance, but still not a whole lot.

In total, I've still brought the performance of the entire query down to about 13 seconds from 20.

Commented:

LEFT JOIN expected_deductions ed ON (ed.amount = DOE.amount)
WHERE deduction_date = '2006-11-03' AND ed.amount IS NULL
ORDER BY tblX.total DESC, amount DESC;

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT a.*  FROM
(SELECT * FROM deductions AS d
JOIN  sums         AS tblX   ON  tblX.person_id = d.person_id
WHERE  d.deduction_date = '2006-11-03' ) AS a
LEFT JOIN expected_deductions ed ON ed.amount = a.amount
WHERE ed.amount IS NULL
ORDER BY a.total DESC, a.amount DESC;

Commented:
sorry

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT a.*  FROM
(SELECT * FROM deductions AS d
JOIN  sums         AS tblX   ON  tblX.person_id = d.person_id
WHERE  d.deduction_date = '2006-11-03' ) AS a
LEFT JOIN expected_deductions ed ON ed.amount = a.amount
WHERE ed.amount IS NULL
ORDER BY a.total DESC, a.amount DESC;

Commented:
if optimizer does not uses INDEX on (deduction_date, person_id), try:

create temporary table sums (person_id int not null primary key, total int not null default 0);
insert into sums select person_id , SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id;
SELECT a.*  FROM
(SELECT *
FROM   deductions AS d,
sums         AS tblX
WHERE  tblX.person_id    = d.person_id
AND d.deduction_date = '2006-11-03' ) AS a
LEFT JOIN expected_deductions ed ON ed.amount = a.amount
WHERE ed.amount IS NULL
ORDER BY a.total DESC, a.amount DESC;

Commented:
The best performance I've been able to get is with this combination:

CREATE TEMPORARY TABLE sums (person_id INT UNSIGNED PRIMARY KEY, total INT) AS (SELECT person_id, SUM(amount) total FROM deductions WHERE payroll_date='2006-11-03' GROUP BY person_id);

SELECT *
FROM deductions
INNER JOIN sums ON (deductions.person_id = sums.person_id)
LEFT JOIN expected_deductions ed ON (deductions.amount = ed.amount)
WHERE check_date = '2006-11--03'  AND ed.amount IS NULL
ORDER BY sums.total DESC, deductions.amount DESC

Total time to run this combination (now exploiting the best available index at each step) is around 8.5 seconds.

Adding that derived table hurt much more than it helped in this case, but switching to a temporary table for the list of sums did gain some improvement.

I think I'm gonna call this adequate.  Twenty seconds (counting PHP processing time) is still a long time to wait for a webpage to load, but I think the "real solution" will be for the client to come up with criteria for this report that narrow down the results a bit more.  (-:
##### Thanks for using Experts Exchange.

• View three pieces of content (articles, solutions, posts, and videos)
• Ask the experts questions (counted toward content limit)
• Customize your dashboard and profile