freag34
asked on
General Database Strategy - Efficient storage of fitness data
Hello,
Very new to SQL, and would very much appreciate a couple of pointers on proper database strategy for my application.
I'm designing a MySQL database to store details on athletes and their workout data:
- Athlete info such as name, email, etc...
- Workout data
* Top level summary - cumulative time, average heart rate
* Detailed workout data includes sample time and instantaneous heart rate data (e.g. every 5 seconds or 10 seconds). Some workouts may include up to 4-5 different data types at this sampling rate, whereas other workouts will only have 1-2.
The database must be scalable and robust enough to store thousands of athletes, each of whom may have hundreds of workouts, each with hundreds of datapoints. One obviously required table stores athlete particulars. What I'm less sure about is how best and most efficiently to store the workout data.
For workout data, one table could hold a top-level summary of each workout: athleteID, cumulative time, average heart rate, and other summary data.
My biggest question is how/where I should store the large amount of detailed workout data (e.g. hundreds of datapoints of sample time and heart rate).
If I create a single table to store ALL workout data points for all athletes and all workouts, it would seem to be an unmanageable table with millions of rows. It would require a dedicated column for workoutID, which seems a bit wasteful. Even more wasteful might be the fact that many workouts will only include 1-2 data types (fields), but others will have 4-5 - so lots of empty cells.
If I create a table for each workout, I could have thousands of workout tables. This seems complicated to create and recall so many tables, each with a different name (perhaps each named with a unique workoutID). I suppose in this case I would also need a place to store a growing list of workoutIDs for each athlete - and so my athlete table would become inefficient if I kept that a single table, since certain athletes would have many more workoutID columns than others.
Thanks so much for any general strategy tips.
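For reference, the usual answer to this question is the first option: one row per sample in a single table, keyed by workout. Below is a minimal sketch in MySQL of what that looks like; the table and column names are illustrative, not from this thread, and the exact types depend on your data.

```sql
-- One row per athlete
CREATE TABLE athlete (
    athlete_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    last_name  VARCHAR(50)  NOT NULL,
    first_name VARCHAR(50)  NOT NULL,
    email      VARCHAR(100) NOT NULL
);

-- One row per workout: summary data lives here, not with the samples
CREATE TABLE workout (
    workout_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    athlete_id   INT UNSIGNED NOT NULL,
    total_time_s INT UNSIGNED NOT NULL,
    avg_hr       TINYINT UNSIGNED NULL,
    FOREIGN KEY (athlete_id) REFERENCES athlete (athlete_id),
    INDEX idx_workout_athlete (athlete_id)
);

-- One row per sample. The workout_id column is not wasted space: it is
-- the index key that turns "all samples for workout X" into a fast
-- range scan instead of a full-table scan.
CREATE TABLE workout_sample (
    workout_id    INT UNSIGNED NOT NULL,
    sample_time_s INT UNSIGNED NOT NULL,
    heart_rate    TINYINT UNSIGNED NULL,  -- NULL when this workout lacks the field
    speed_mps     FLOAT NULL,
    PRIMARY KEY (workout_id, sample_time_s),
    FOREIGN KEY (workout_id) REFERENCES workout (workout_id)
);
```

Millions of rows in the sample table is normal and manageable as long as queries go through the index; a table per workout, by contrast, is a well-known anti-pattern in relational design.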
ASKER
Thanks, some very good tips. A few follow-up questions so I understand the efficiency better:
1) Say I use one large table for tens of millions of workout datapoints. One field would be sessionID - all datapoints from one session have the same sessionID. How fast (very rough ballpark) will it be to extract one set of a couple of thousand workout datapoints using a sessionID query from a table of tens of millions of rows? Tens of milliseconds, hundreds of milliseconds, more?
2) Can my table grow to hundreds of millions of rows without concern? If that becomes a concern, I could fairly easily reduce the table size by a factor of 26 by having 26 tables, one for each first letter of the user's last name. Is that a reasonable idea or a newbie idea :)
With the previous advice in mind along with further research, I have come up with a proposed database. Can I please get some comments on the efficiency and reasonableness of the below:
USER
UserID PRIMARY_KEY
LastName
FirstName
SESSION
SessionID PRIMARY_KEY
UserID
SessionDate
Duration
NumLaps
LAP
LapID PRIMARY_KEY
SessionID
LapNumber
Duration
Distance
AveHR
AveSpeed
DATAPOINT
DataPointID PRIMARY_KEY
LapID
SessionID
SampleTime
HeartRate
Speed
etc…
Two Questions re: table above:
1) Under LAP, I propose having AveHR and AveSpeed to speed up obtaining this data. These numbers could be calculated from DATAPOINT queries, but I would like the user to have a near-instantaneous experience, and storing summary data such as averages in LAP should help. Is this a reasonable idea?
2) Under DATAPOINT I propose having both LapID and SessionID. Only LapID would be strictly necessary, since SessionID can be obtained via the LAP table; however, having both would make queries on DATAPOINT much easier, I believe - for example, obtaining all datapoints from one session would take a single query instead of many. Is this reasonable?
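As a rough sketch of what makes these queries fast: with a secondary index on SessionID, pulling a few thousand rows out of tens of millions is typically an index walk rather than a table scan, which on InnoDB usually lands in the millisecond range. The statements below use the column names from the proposal above; the index names and the example SessionID value are illustrative.

```sql
-- Secondary indexes on the proposed DATAPOINT table
CREATE INDEX idx_datapoint_session ON DATAPOINT (SessionID);
CREATE INDEX idx_datapoint_lap     ON DATAPOINT (LapID);

-- All datapoints for one session in a single query: this is why
-- carrying SessionID on DATAPOINT (a mild denormalization) is a
-- common and reasonable choice.
SELECT SampleTime, HeartRate, Speed
FROM DATAPOINT
WHERE SessionID = 12345
ORDER BY SampleTime;
```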
ASKER CERTIFIED SOLUTION
ASKER
thanks very much, great tips
If you are worried about space, look at efficient data sizing. For example, heart rate could be stored as an unsigned TINYINT, since your max value should not be more than 255 (I hope!). If you declare all columns as NOT NULL there is less overhead for empty values.
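The sizing advice above can be sketched as a right-sized sample table. Column names and unit choices here are illustrative; the byte widths come from MySQL's integer type ranges (TINYINT UNSIGNED holds 0-255, SMALLINT UNSIGNED 0-65,535, MEDIUMINT UNSIGNED 0-16,777,215).

```sql
-- About 10 bytes of column data per row, before index and row overhead
CREATE TABLE datapoint_sized (
    session_id    INT UNSIGNED       NOT NULL,  -- 4 bytes
    sample_time_s MEDIUMINT UNSIGNED NOT NULL,  -- 3 bytes, up to ~194 days of seconds
    heart_rate    TINYINT UNSIGNED   NOT NULL,  -- 1 byte, 0-255 bpm
    speed_cmps    SMALLINT UNSIGNED  NOT NULL,  -- 2 bytes, cm/s avoids FLOAT entirely
    PRIMARY KEY (session_id, sample_time_s)
);
```

Storing speed as an integer in cm/s rather than a FLOAT in m/s is one way to keep rows small and comparisons exact; convert to m/s in the application layer.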