• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 281

Database Design

What type of database design would you use for something like Google Analytics, where there is a massive amount of data collected daily, with many breakdowns per hit, while still supporting MySQL GROUP BY/aggregate functions as well as manual searches of the data?

I basically need to come up with a normalization technique that will let me store aggregate information (like total hits and unique hits) for each user (say an admin can view everything, while Katie can only view rows tagged with Katie's ID), without needing to run a full table scan on the original data.

For example, I might want the total hits for all rows where the browser was Firefox, and also for rows where the screen size was 1024x768 - year to date.
Asked by jwzk
2 Solutions
 
Beverley Portlock commented:
The key is to have small records with an appropriate number of indexes. The number of times I come across badly performing MySQL databases because there are no indexes, too few indexes, or the wrong indexes is unbelievable. Indeed, it has got to the point that telling people to check their indexes is almost my version of fixing the problem by turning the computer off and on again.

So here it is - the BIG secret. If a field appears in a WHERE, JOIN or ORDER BY clause then make sure it has an index.

That's it. 95% of your performance problems are now solved. Seriously!
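A minimal sketch of the rule above. This uses Python's built-in sqlite3 module rather than MySQL so it runs anywhere, and the table and column names (`hits`, `browser`, `screen`, `hit_date`) are hypothetical - but the `CREATE INDEX` statements are the same idea in MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical hit-log table; the column names are illustrative only.
conn.execute("""CREATE TABLE hits (
                    id       INTEGER PRIMARY KEY,
                    browser  TEXT,
                    screen   TEXT,
                    hit_date TEXT)""")

# Applying the rule: these columns would appear in WHERE / ORDER BY
# clauses, so each one gets an index.
conn.execute("CREATE INDEX idx_hits_browser  ON hits (browser)")
conn.execute("CREATE INDEX idx_hits_screen   ON hits (screen)")
conn.execute("CREATE INDEX idx_hits_hit_date ON hits (hit_date)")

# Confirm the indexes exist.
indexes = sorted(row[1] for row in conn.execute("PRAGMA index_list('hits')"))
print(indexes)
```

In MySQL you would inspect the result with `SHOW INDEX FROM hits` instead of the SQLite `PRAGMA`.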

Now that we have that out of the way, we can move on to the design issues. For most tables, first and second normal form will be sufficient: have no repeating groups of data "horizontally" within a record, and make sure that every field in a record is fully dependent on the key. Transitive dependencies are not that common, and you probably don't need to worry about them too much.

If you have a table with a high number of INSERT operations, try to avoid having any indexes on it at all. If you need to analyse the table, copy it to another version and add the indexes in one go. A "bulk index" is often a very fast operation compared with adding lots and lots of records to a table that already has indexes.

So, in your case I would have a collection table with no indexes that can accumulate data rapidly with little overhead, but which is useless for analysis. Once or twice a day I would copy and empty the table, and then process the copy into the system. This is often why some systems (like Google, perhaps) run a few hours behind. Once the data is in the "analysis" part of the system, you can have as many indexes as you need.
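The collection-then-bulk-index flow can be sketched like this. Again this is SQLite standing in for MySQL, and the table names (`hits_raw`, `hits_analysis`) are made up for illustration; the point is that the raw table has no indexes, and indexes are only built on the analysis copy in one go:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Collection table: no indexes, so each INSERT is as cheap as possible.
conn.execute("CREATE TABLE hits_raw (browser TEXT, screen TEXT, hit_date TEXT)")
conn.executemany("INSERT INTO hits_raw VALUES (?, ?, ?)",
                 [("Firefox", "1024x768", "2010-01-15")] * 1000)

# Periodic sweep: copy the data into the analysis table, empty the
# collection table, then bulk-index the copy in one go.
conn.execute("CREATE TABLE hits_analysis AS SELECT * FROM hits_raw")
conn.execute("DELETE FROM hits_raw")
conn.execute("CREATE INDEX idx_analysis_browser ON hits_analysis (browser)")
conn.execute("CREATE INDEX idx_analysis_screen  ON hits_analysis (screen)")

count = conn.execute(
    "SELECT COUNT(*) FROM hits_analysis WHERE browser = 'Firefox'").fetchone()[0]
print(count)
```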
 
Ray Paseur commented:
Everything Brian said.

You can use EXPLAIN to see what MySQL is doing with your queries. It is almost always a good idea if a query runs longer than a few seconds.
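To show the idea without a MySQL server, here is the SQLite equivalent (`EXPLAIN QUERY PLAN`); MySQL's `EXPLAIN` output looks different, but the before/after contrast is the same - the plan switches from a full scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (browser TEXT, hit_date TEXT)")

# Without an index the planner reports a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM hits WHERE browser = 'Firefox'").fetchone()
print(plan[3])

conn.execute("CREATE INDEX idx_hits_browser ON hits (browser)")

# With the index the planner reports an index search instead.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM hits WHERE browser = 'Firefox'").fetchone()
print(plan[3])
```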

You indicated that there was a "a massive amount of data collection on a daily basis" but Google Analytics is collecting in real time.  Arrival rates matter, and if you can tell us some more about the transaction load and arrival rates we may be able to give more specific comments.

Also, looking at this: "total hits, unique hits" - you might want some pre-canned queries that you can run during off-hours. Store the query results somewhere; then clients can see the most recent (yesterday's) data without touching the full tables. A web page with a bunch of links or checkboxes is a good way to show clients the path to the pre-canned query results.
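A sketch of the pre-canned query idea, matching the asker's "total hits, unique hits" example. SQLite stands in for MySQL, and `daily_summary` / `visitor_id` are hypothetical names: the off-hours job materializes the aggregates into a small summary table that clients read instead of scanning the raw data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (browser TEXT, visitor_id INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("Firefox", 1), ("Firefox", 1), ("Chrome", 2)])

# Off-hours job: materialize the aggregates once.
conn.execute("""CREATE TABLE daily_summary AS
                SELECT browser,
                       COUNT(*)                   AS total_hits,
                       COUNT(DISTINCT visitor_id) AS unique_hits
                FROM hits
                GROUP BY browser""")

# Clients query the tiny summary table - no scan of the raw hits table.
result = conn.execute(
    "SELECT total_hits, unique_hits FROM daily_summary "
    "WHERE browser = 'Firefox'").fetchone()
print(result)
```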
 
Beverley Portlock commented:
Ray is right - EXPLAIN is good, but it can take a little getting used to, so this

http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

is a good read. If I had 500 points for every time a MySQL performance problem was caused by bad indexing, I'd be a Guru at MySQL by now!

 
jwzk (Author) commented:
Thanks for your responses, they are very informative. I am aware of the EXPLAIN syntax, and it helps tremendously in some cases, simply by adding a single index.

@Ray I can't really estimate the arrival rate, but the data will be inserted in real time, not per day. I would also need the reporting to be in real time, or very close to it. I would likely use Brian's approach for that ("Once or twice a day I would copy and empty the table and then process the copy into the system"), but it would run as often as possible (it really depends on the number of records the system is getting).
 
Ray Paseur commented:
"run as often as possible" - a good strategy might be a cron job that handles the sweep-and-delete process. I do some of this sort of thing in my apps and it works great, so I am not directly dependent on the number of records the system is getting - I just run the process regularly and let the record count take care of itself. If you did this every minute and ran your queries against the static databases, you might find that near-real-time queries gave pretty good results without any interference with the speed of the data-gathering process.
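For reference, an every-minute schedule like Ray describes is one crontab entry; the script path here is purely hypothetical:

```
# Run the sweep-and-process script every minute (path is hypothetical)
* * * * * /usr/bin/php /var/www/app/sweep.php >> /var/log/sweep.log 2>&1
```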

best regards, ~Ray
 
jwzk (Author) commented:
Thanks for your help guys.
 
Beverley Portlock commented:
One small point (these things always occur after you write stuff up): renaming a table is usually a more efficient operation than copying one, because the rename just changes an entry in the file system index (or database schema).

So, after processing the data, delete the "copied" table. Then, when you are ready, rename the live table and create an empty table to replace it (both are cheap, fast operations), and then process the renamed "copy".

You could even trigger the "copy" on completion of processing. Rather than processing once per day, you "copy" and process the data, and at the end of the processing you "copy" the current table again and prepare a new empty one. The advantage of this is that the data gets processed as fast as the system can handle it, but without ever overloading the system.
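The rename-swap can be sketched like this (SQLite again; table names hypothetical). Note that MySQL's `RENAME TABLE` can even rename several tables in one atomic statement, e.g. `RENAME TABLE hits TO hits_copy, hits_new TO hits;`, which avoids the brief window where no live table exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (browser TEXT)")
conn.execute("INSERT INTO hits VALUES ('Firefox')")

# Swap: rename the live table (a cheap metadata change), then
# recreate an empty table under the original name.
conn.execute("ALTER TABLE hits RENAME TO hits_copy")
conn.execute("CREATE TABLE hits (browser TEXT)")

# New inserts land in the fresh table while hits_copy is processed
# at leisure, with as many indexes as the analysis needs.
live = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
copy = conn.execute("SELECT COUNT(*) FROM hits_copy").fetchone()[0]
print(live, copy)
```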

 
Ray Paseur commented:
"renaming a table" - excellent idea!