asked on

Optimize MySQL Database

Sorry for the long winded explanation, but, here goes...

I receive data hourly from a third party which is a zip file containing multiple .csv files. I've written a script which extracts the .csv files, creates tables for each file and loads records from the .csv files. The PHP script recognizes when a new column is added and alters the table if necessary. It also recognizes specific columns and creates indexes on those columns. For each campaign being monitored, it will create eleven different tables most of which are benign dictionaries containing fewer than 20 records. Three of the tables are not so benign.

One table contains all the contacts involved in a campaign and specific static information for each record. This table may easily contain more than 100K records.

Another table contains all the events for the campaign. Similar to a running activity log. Some events could generate as many as 15-20 records. Each event has an event code which can be cross-referenced in the Event-dictionary table. Each event also contains a Contact_ID which references a record in the contacts table. Both these columns are indexed. This table easily contains more than 1 million records.

One of the events is an optout event.

I frequently run queries to determine the number of contacts who've performed a specific event. I need to remove from this count those who've opted out as well as those who've been deactivated. The deactivated contact ids are in a separate table which has only one column, Contact_ID.

My query selects columns from the contact table for all Contact_IDs in a sub-query of Contact_IDs who've performed a specific event, minus all the Contact_IDs who've performed the optout event, minus all the Contact_IDs in the disabled table.

SELECT ...cols... 
FROM contacts c, data d 
WHERE c.Contact_ID=d.Contact_ID 
AND d.Event=eventcode 
AND NOT Contact_ID IN (
	SELECT Contact_ID FROM data WHERE Event=optout
) 
AND NOT Contact_ID IN (
	SELECT Contact_ID FROM disabled
)
GROUP BY Contact_ID

Open in new window

Since contacts may trigger the event multiple times and since date is one of the columns I'm selecting, I have to use a group by clause. Distinct doesn't work since the date changes for each time they trigger the event.

This query takes so long to complete that PHP drops the connection.

So I decided to create a view of all the Contact_IDs who've opted out (the third not-so-benign table). But after creating these views for six campaigns, mysql was so slow that it took 10 minutes for phpmyadmin to display the list of tables for that database.

If I remove the clause that eliminates the optout records (lines 5-7), the query takes less than a second to execute. With that clause, it takes nearly 60 seconds.

I've removed the views of optout Contact_IDs. Please help me optimize the performance of this database or the individual query.

Dan Craciun

1. Do you have an index on data.Event?
2. If the majority of the Event values are optout, then it's faster to select those that are not.
3. Try using a JOIN. Maybe something like this (untested):

SELECT ...cols...
FROM contacts c
LEFT JOIN data d
ON c.Contact_ID=d.Contact_ID
RIGHT JOIN (SELECT Contact_ID FROM data WHERE Event !=optout) q
ON c.Contact_ID = q.Contact_ID
WHERE Contact_ID NOT IN (SELECT Contact_ID FROM disabled)
AND d.Event=eventcode
GROUP BY Contact_ID

Kim Walker

ASKER

1. Yes, data.Event is indexed
2. Quite the opposite, optouts are probably only about 1% of the Event values
3. The problem with a join is that each contact could have hundreds of records in the data table and many different events. I'm looking for unique contacts who have one particular Event value in one record if they don't also have an optout Event value in a different record.

ASKER CERTIFIED SOLUTION

Dan Craciun

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Kim Walker

ASKER

Dan Craciun you are exactly right. I just had to figure out how to join two queries from the same table. This reduced my query time from 25+ seconds to <1 sec. Thanks.

Dan Craciun

Glad I could help!