Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 362
  • Last Modified:

Optimize MySQL Database

Sorry for the long winded explanation, but, here goes...

I receive data hourly from a third party which is a zip file containing multiple .csv files. I've written a script which extracts the .csv files, creates tables for each file and loads records from the .csv files. The PHP script recognizes when a new column is added and alters the table if necessary. It also recognizes specific columns and creates indexes on those columns. For each campaign being monitored, it will create eleven different tables most of which are benign dictionaries containing fewer than 20 records. Three of the tables are not so benign.

One table contains all the contacts involved in a campaign and specific static information for each record. This table may easily contain more than 100K records.

Another table contains all the events for the campaign. Similar to a running activity log. Some events could generate as many as 15-20 records. Each event has an event code which can be cross-referenced in the Event-dictionary table. Each event also contains a Contact_ID which references a record in the contacts table. Both these columns are indexed. This table easily contains more than 1 million records.

One of the events is an optout event.

I frequently run queries to determine the number of contacts who've performed a specific event. I need to remove from this count those who've opted out as well as those who've been deactivated. The deactivated contact ids are in a separate table which has only one column, Contact_ID.

My query selects columns from the contact table for all Contact_IDs in a sub-query of Contact_IDs who've performed a specific event, minus all the Contact_IDs who've performed the optout event, minus all the Contact_IDs in the disabled table.
SELECT ...cols... 
FROM contacts c, data d 
WHERE c.Contact_ID=d.Contact_ID 
AND d.Event=eventcode 
AND NOT Contact_ID IN (
	SELECT Contact_ID FROM data WHERE Event=optout
) 
AND NOT Contact_ID IN (
	SELECT Contact_ID FROM disabled
)
GROUP BY Contact_ID

Open in new window

Since contacts may trigger the event multiple times and since date is one of the columns I'm selecting, I have to use a group by clause. Distinct doesn't work since the date changes for each time they trigger the event.

This query takes so long to complete that PHP drops the connection.

So I decided to create a view of all the Contact_IDs who've opted out (the third not-so-benign table). But after creating these views for six campaigns, mysql was so slow that it took 10 minutes for phpmyadmin to display the list of tables for that database.

If I remove the clause that eliminates the optout records (lines 5-7), the query takes less than a second to execute. With that clause, it takes nearly 60 seconds.

I've removed the views of optout Contact_IDs. Please help me optimize the performance of this database or the individual query.
0
Kim Walker
Asked:
Kim Walker
  • 3
  • 2
1 Solution
 
Dan CraciunIT ConsultantCommented:
1. Do you have an index on data.Event?
2. If the majority of the Event values are optout, then it's faster to select those that are not.
3. Try using a JOIN. Maybe something like this (untested):

SELECT ...cols...
    FROM contacts c
    LEFT JOIN data d
    ON c.Contact_ID=d.Contact_ID
    RIGHT JOIN (SELECT Contact_ID FROM data WHERE Event !=optout) q
    ON c.Contact_ID = q.Contact_ID
    WHERE Contact_ID NOT IN (SELECT Contact_ID FROM disabled)
    AND d.Event=eventcode
    GROUP BY Contact_ID
0
 
Kim WalkerWeb Programmer/TechnicianAuthor Commented:
1. Yes, data.Event is indexed
2. Quite the opposite, optouts are probably only about 1% of the Event values
3. The problem with a join is that each contact could have hundreds of records in the data table and many different events. I'm looking for unique contacts who have one particular Event value in one record if they don't also have an optout Event value in a different record.
0
 
Dan CraciunIT ConsultantCommented:
SELECT ...cols...
FROM contacts c, data d
WHERE c.Contact_ID=d.Contact_ID
is the same as
SELECT ...cols...
    FROM contacts c
    LEFT JOIN data d
    ON c.Contact_ID=d.Contact_ID
Well, technically there are differences (cartesian product vs join), but the result is the same, the time to execute is the same and the memory usage is the same. At least from my tests. And I find joins easier to read.

NOT IN is about the slowest query I can think of. You're better off just doing a join and then filtering on NULL values.
So instead of
SELECT columns FROM table WHERE column NOT IN (SELECT column FROM another_table)
it's faster to do
SELECT columns FROM table t1
LEFT JOIN another_table t2 ON t1.column = t2.column
WHERE t2.column IS NULL
0
 
Kim WalkerWeb Programmer/TechnicianAuthor Commented:
Dan Craciun you are exactly right. I just had to figure out how to join two queries from the same table. This reduced my query time from 25+ seconds to <1 sec. Thanks.
0
 
Dan CraciunIT ConsultantCommented:
Glad I could help!
0

Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

  • 3
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now