jwzk

asked on

Database Design

What type of database design do you have to use for something like Google Analytics, where there is a massive amount of data collected on a daily basis, with a ton of breakdowns per hit, while still supporting MySQL GROUP BY functions as well as manual searches of the data?

I basically need to come up with a normalization technique that will allow me to store aggregate information (like total hits and unique hits) for each user (say admin can view everything, while Katie can only view rows with Katie's ID), without needing to run a full table scan on the original data.

For example, I might want the total hits for all rows where the browser was Firefox and the screen size was 1024x768, year to date.
ASKER CERTIFIED SOLUTION
Beverley Portlock (United Kingdom of Great Britain and Northern Ireland)
Everything Brian said.

You can use EXPLAIN to see what MySQL is doing with your queries. It is almost always a good idea to check if a query runs longer than a few seconds.

You indicated that there was "a massive amount of data collection on a daily basis", but Google Analytics is collecting in real time. Arrival rates matter, and if you can tell us some more about the transaction load and arrival rates, we may be able to give more specific comments.

Also, looking at this: "total hits, unique hits" - you might want to have some pre-canned queries that you can run during off-hours.  Store the query results somewhere.  Then clients can see the most recent (yesterday) data without reference to the full tables.  A web page with a bunch of links or checkboxes is a good way to show clients the path to the pre-canned query results.
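A minimal sketch of that pre-canned approach, using SQLite as a stand-in for MySQL (the `hits` and `daily_summary` tables and their columns are assumptions for illustration): an off-hours job rolls raw hits up into a small summary table keyed by user, day, and breakdown, and all reporting queries hit the summary instead of scanning the raw data.

```python
import sqlite3

# Hypothetical schema: raw hits plus a pre-aggregated daily summary.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hits (user_id TEXT, day TEXT, browser TEXT, screen TEXT, visitor TEXT);
CREATE TABLE daily_summary (
    user_id TEXT, day TEXT, browser TEXT, screen TEXT,
    total_hits INTEGER, unique_hits INTEGER,
    PRIMARY KEY (user_id, day, browser, screen)
);
INSERT INTO hits VALUES
    ('katie', '2010-01-01', 'Firefox', '1024x768', 'v1'),
    ('katie', '2010-01-01', 'Firefox', '1024x768', 'v1'),
    ('katie', '2010-01-01', 'Firefox', '1024x768', 'v2');
""")

# The off-hours "pre-canned" job: aggregate raw hits into the summary table.
conn.execute("""
INSERT INTO daily_summary
SELECT user_id, day, browser, screen,
       COUNT(*) AS total_hits,
       COUNT(DISTINCT visitor) AS unique_hits
FROM hits
GROUP BY user_id, day, browser, screen
""")

# Reports then read the small summary table, never scanning the raw hits.
row = conn.execute("""
SELECT SUM(total_hits), SUM(unique_hits)
FROM daily_summary
WHERE user_id = 'katie' AND browser = 'Firefox' AND screen = '1024x768'
""").fetchone()
print(row)  # (3, 2)
```

The same idea carries over to MySQL directly; per-user access control then becomes a simple `WHERE user_id = ?` on the summary table.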
Ray is right - EXPLAIN is good, but it can take a little getting used to, so this

http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

is a good read. If I had 500 points for every time a MySQL performance problem was caused by bad indexing, I'd be a Guru at MySQL by now!
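To illustrate the indexing point, here is a small sketch using SQLite's EXPLAIN QUERY PLAN as an analogue of MySQL's EXPLAIN (the `hits` table and column names are assumptions): the same query goes from a full table scan to an index search after a single CREATE INDEX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (browser TEXT, screen TEXT, day TEXT)")

# Without an index, the planner reports a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM hits WHERE browser = 'Firefox'"
).fetchone()
print(plan_before[3])  # e.g. "SCAN hits"

# Adding a single index changes the plan to an index search.
conn.execute("CREATE INDEX idx_browser ON hits (browser)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM hits WHERE browser = 'Firefox'"
).fetchone()
print(plan_after[3])  # e.g. "SEARCH hits USING COVERING INDEX idx_browser (browser=?)"
```

In MySQL the equivalent check is `EXPLAIN SELECT ...`, where you would look at the `type` and `key` columns to see whether an index is being used.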
jwzk

ASKER

Thanks for your responses; they are very informative. I am aware of the EXPLAIN syntax, and it has helped tremendously in some cases, sometimes simply by adding a single index.

@Ray I can't really estimate the arrival rate, but the data will be inserted in real time, not per day. I would also need the reporting to be in real time, or very close to it. I would likely use Brian's approach for that ("Once or twice a day I would copy and empty the table and then process the copy into the system"), but run it as often as possible (it really depends on the number of records the system is getting).
SOLUTION
jwzk

ASKER

Thanks for your help guys.
One small point (these things always occur after you write stuff up) is that renaming a table is usually a more efficient operation than copying one, because the rename just changes an entry in the file system index (or database schema).

So, after processing the data, delete the "copy" table. Then, when you are ready, rename the existing table, create an empty table to replace the renamed one (both are cheap, fast operations), and process the renamed "copy".

You could even trigger the "copy" on completion of processing: rather than processing once per day, you "copy" and process the data, and at the end of the processing you "copy" the current table again and prepare a new empty one. The advantage of this is that the data gets processed through as fast as the system can handle it, without ever overloading the system.
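The rotation above can be sketched as follows, again using SQLite (`ALTER TABLE ... RENAME TO`) as a stand-in for MySQL's `RENAME TABLE`; the `hits` table and the batch step are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (browser TEXT)")
conn.execute("INSERT INTO hits VALUES ('Firefox'), ('Chrome')")

def rotate(conn):
    # The rename is a cheap metadata change, so collection is barely interrupted.
    conn.execute("ALTER TABLE hits RENAME TO hits_processing")
    conn.execute("CREATE TABLE hits (browser TEXT)")  # fresh table for new inserts

rotate(conn)

# New hits land in the empty live table while the batch job works on the renamed one.
batch = conn.execute("SELECT COUNT(*) FROM hits_processing").fetchone()[0]
live = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(batch, live)  # 2 0

# After processing, drop the renamed table; the next rotation repeats the cycle.
conn.execute("DROP TABLE hits_processing")
```

In MySQL the rename and create can even be combined atomically, e.g. `RENAME TABLE hits TO hits_processing, hits_new TO hits`, so inserts never see a missing table.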

"renaming a table" - excellent idea!