davidcahan

asked on

Novel Solution needed for table that has 271 Million Rows of Data

Inherited a table that has 271 million rows of data (and growing each day).  My first approach to make this more sustainable is to convince the VP of IT that we need to change the nature of how we are storing the statistical data.  Unfortunately that could be a tough sell.  

For auditing purposes our customers need to know exactly when a file played (we are a digital signage company).  It isn't enough to know that file X played 10 times in a 30-minute window (or 1-hour, or 15-minute, or whatever time frame I come up with).  Admittedly, if we changed over to a count per time period, the number of rows would decrease drastically.

Each "player" has an ID.  One thought I came up with was making a table for each player, then using dynamic SQL to determine which table to insert the data into and which to select from.  Or I could create tables by account, and each table would hold that account's players.  But I have to say that just seems hackish.

I wonder though how Google manages such large data sets.  They must be doing something to break the data up into manageable pieces and then figuring out a way to get to that data based on the user that is logged in.  Maybe my idea isn't so bad.

Need practical advice.
Michael Dyer

Instead of creating a table for each player, have you considered breaking it up into a separate table for each file name, and then storing the playing events in the table for that specific file?  That way, you can use dynamic SQL to query the specific file's table.  The total number of records remains the same, but the query should run much faster.
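A rough sketch of what that dynamic-SQL routing could look like; the `PlayEvents_<fileID>` naming convention and the column names here are made up for illustration, not the poster's actual schema:

```sql
-- Hypothetical: route an insert to a per-file table via dynamic SQL.
DECLARE @fileID int = 1234,
        @sql    nvarchar(max);

SET @sql = N'INSERT INTO dbo.PlayEvents_' + CAST(@fileID AS nvarchar(10))
         + N' (um_id, played_at) VALUES (@um_id, @played_at);';

-- sp_executesql keeps the values parameterized even though
-- the table name itself has to be built as a string.
EXEC sp_executesql @sql,
     N'@um_id int, @played_at datetime',
     @um_id = 42, @played_at = '2010-06-01T10:15:00';
```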
sshah254

davidcahan


I don't know much about table partitioning.  Could you tell me more?
Do you actually have a problem at the moment, or are you just trying to head one off?

How skewed is the data?
What is the rate of new "file" creation?
What is the lifecycle of a file (e.g. played often in the first 6 months then hardly ever, or is it cyclical, or ...)?
Is there any archive process?

How often, and how, is the data accessed?
   e.g.  normally just accessed for monthly/quarterly stats
           in-depth analysis of a "file"

Just trying to get some ideas on what the "shape" of the data is; my initial thought is some sort of partitioning scheme for the data...

How do you store it currently?  On just the one instance, or is it distributed across multiple servers/instances?  Do you have separate production and analysis databases?

How much in gigabytes, what is the growth in gigabytes, and how many users?

.... just a thought

Try getting it into the VP's mind that you need to start planning for a database move anyway (2005 is old, 2008 R2 is out, etc.).  Now is probably the time to be looking at the architecture of the system again...

@Lowfatspread -- one man's problem is another man's annoyance, so it's hard to say whether I have a problem or I'm just heading one off; it's probably somewhere in between the two.

re: skewed data:  I'm not sure what that means

re: files -- files aren't really created (at least not in terms of the data I am tracking).  A "unit" has a show that plays on it.  A show is made up of files.  A file can play anywhere from 20 secs to 3 mins (the average would be about 30 secs).  When a show reaches the end, it loops back around to the beginning.  Each time a file plays, we log the time it played, the unit it played on, what its fileID is, and what zone on the screen it played in (units can have 1-3 zones that each have their own show playing).  If you think about a unit with multiple shows running on it 24/7, that's a lot of records each day.

re: data access -- data is accessed once a month, by me.  I'm currently pulling just the raw data out of the table; I'm not trying to group or count by some time increment (e.g. how often a file plays every 15 mins).  EVENTUALLY, I'd like to build a web-based report that allows our customers to run the report themselves.  I'm sure they would love in-depth analysis of a particular file.  That way, if they knew that for 2 weeks a pack of chewing gum played 10 times per 15 mins, then increased that to 20 times per 15 mins and saw a 20% increase in gum sales, they'd have evidence that their signage system is having a positive impact on sales.

Currently, a report for one month for one client who might have only one unit can be upwards of 65,000 rows.  Some of our customers claim they need the exact time a file played for compliance, but I'm guessing that's just my company's interpretation of the requirement.  I'm guessing that if I pressed the end users, a count of how often a file played every 15 or 30 mins would be enough.  And if each row were limited to a count per time period (so on insert, if I'm still in that time period, I do an update to the count instead), that would significantly lower the number of rows.
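A minimal sketch of that update-else-insert idea, assuming a hypothetical PlayCounts table keyed on the IDs plus the start of a 15-minute bucket (none of these names are from the poster's real schema):

```sql
-- Illustrative only: dbo.PlayCounts and its columns are assumptions.
DECLARE @cm_id int = 1, @lm_id int = 1, @um_id int = 42;
DECLARE @now datetime = GETDATE();
-- Round @now down to the start of its 15-minute window.
DECLARE @bucket datetime =
    DATEADD(minute, (DATEDIFF(minute, 0, @now) / 15) * 15, 0);

UPDATE dbo.PlayCounts
   SET playCount = playCount + 1
 WHERE cm_id = @cm_id AND lm_id = @lm_id AND um_id = @um_id
   AND bucketStart = @bucket;

-- No row yet for this bucket: create it with a count of 1.
IF @@ROWCOUNT = 0
    INSERT INTO dbo.PlayCounts (cm_id, lm_id, um_id, bucketStart, playCount)
    VALUES (@cm_id, @lm_id, @um_id, @bucket, 1);
```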

I'm archiving old records after 3 months.  3 months worth of data is still 271 million rows.  With indexes, the table eats up about 42GB.  Total DB size is about 57GB.   It's all on one instance.
Can you give us an example of your schema?

At my last employer I built a database schema for capturing continuous and daily build metrics for thousands of systems.  Not only did we capture the data for every build, we captured metrics for each individual build component (each component would be represented by a project in Visual Studio).  Usually about 80 components per build.  For each build component we had name, version, number of files, location of the files, average build time per file, total build time, longest build time, and number of successes, failures and warnings.  We stored a lot more than that though.  We had system details at the time of the build as well as overall build project data, et cetera.

As you can imagine we had millions and millions of rows.  After 5 months the table with component build metrics was over 50 million records.  There were other tables with in excess of 10 million records.

The entire database was 4.6 GB.  That is not a misprint.  Last I heard (I left the company) the database is now 11 months old and only about 12 GB in size.

The way I achieved such small size was by fully normalizing the database.  Another metric system that isn't normalized as well as mine reached 150 GB after 3 months and has about 80 million records.  

So if you can include your schema let's discuss how to minimize storage by fully normalizing the data.
id   int  pk
cm_id int  index 60% fill
lm_id  int  index 60% fill
um_id int  index 60% fill
date  datetime  index 60% fill
dateAdded datetime   index 60% fill
isFileDrop  bit   index 60%

I have to have the indexes on cm_id, lm_id, um_id and date because I use them to make sure I'm not adding a duplicate record, and I'm also using them in different combinations for selects, groups, wheres and joins.  I may be able to do without the indexes on DateAdded and isFileDrop.

And the PK I don't really need.  In fact, technically speaking, the combination of cm_id, lm_id, um_id and date makes each record unique.
Would it be a major problem if I simply removed the PK altogether?

Other suggestions obviously wanted as well.

Date is when a file played on a machine in the field.  DateAdded is when the record was added to the table.

I'll try removing the PK and the DateAdded and IsFileDrop Indexes and see what happens.
Also, is it "bad" to have no primary key at all?
Not necessarily.  In logging tables I rarely add a primary key, because the process of inserting records is faster when there is no indexing.
If you can find another unique column or expression, you are free to drop the primary key.
Make sure you set a constraint to ensure that combination is unique -- and be ready to face the consequences, because most likely it is not.
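If the natural key really is (cm_id, lm_id, um_id, date), it can be enforced with something like the following (the table name is a placeholder); the statement fails if duplicates already exist, which is a cheap way to find out whether the assumption holds:

```sql
-- dbo.StatLog is a placeholder name for the 271-million-row table.
ALTER TABLE dbo.StatLog
ADD CONSTRAINT UQ_StatLog_NaturalKey
    UNIQUE (cm_id, lm_id, um_id, [date]);
```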
Where did the table partitioning suggestion go? (way up top)

What people often do is partition their detail data off onto cheaper, slower storage.  So you have one year's data per HD, across a bunch of SQL data files.  Partitioning makes this transparent to the user: you don't have to work out which table to select from depending on the year, etc. -- it's all transparent.

Then you summarise this detail data into something useful (like an SSAS cube), which gives you useful summarised answers.  You could even build one cube per client and send them off to your clients.  A really powerful thing is to integrate your play data with their sales data within a cube and see the kinds of metrics you were talking about.

Then you can drill back to your detail data if you really have to know exactly when the event occurred.
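On SQL Server Enterprise Edition, date-range partitioning is set up with a partition function and a partition scheme, roughly like this (all names, boundary dates and the filegroup layout are placeholders; a real deployment would map older ranges to filegroups on the cheaper disks):

```sql
-- Monthly boundaries; RANGE RIGHT puts each boundary date
-- into the partition on its right.
CREATE PARTITION FUNCTION pfPlayDate (smalldatetime)
AS RANGE RIGHT FOR VALUES ('2010-04-01', '2010-05-01', '2010-06-01');

-- ALL TO ([PRIMARY]) is the simplest mapping; production would spread
-- the partitions across filegroups on different drives.
CREATE PARTITION SCHEME psPlayDate
AS PARTITION pfPlayDate ALL TO ([PRIMARY]);

-- Creating (or rebuilding) the table on the scheme partitions it by [date].
CREATE TABLE dbo.StatLogPartitioned
(
    cm_id  smallint      NOT NULL,
    lm_id  smallint      NOT NULL,
    um_id  int           NOT NULL,
    [date] smalldatetime NOT NULL
) ON psPlayDate ([date]);
```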
Here's what I did for now (and I have no idea who to give points to):

I altered 2 columns from int to smallint, and altered one column from datetime to smalldatetime.  I dropped the identity PK column (because it wasn't being used to join to anything) and instead made a clustered PK from:
date asc,
um_id asc,
cm_id asc,
lm_id asc

I dropped the indexes from isFileDrop and DateAdded.  Tested my common queries against it and they ran as well as they usually do.
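Put together, the reworked table might look something like this.  The table name and NOT NULL choices are assumptions, and the post doesn't say which two int columns became smallint (cm_id and lm_id are a guess); the column list and clustered key order come from the schema posted earlier in the thread:

```sql
CREATE TABLE dbo.StatLog
(
    cm_id      smallint      NOT NULL,  -- assumed: was int
    lm_id      smallint      NOT NULL,  -- assumed: was int
    um_id      int           NOT NULL,
    [date]     smalldatetime NOT NULL,  -- was datetime
    dateAdded  smalldatetime NOT NULL,
    isFileDrop bit           NOT NULL,
    -- Clustered natural key replaces the dropped identity PK.
    CONSTRAINT PK_StatLog PRIMARY KEY CLUSTERED
        ([date] ASC, um_id ASC, cm_id ASC, lm_id ASC)
);
```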

Once we get on Enterprise SQL Server, I'm definitely going to need to partition on Unit.  I'm guessing that is going to increase my performance.  

@nmcdermaid -- do you have any tutorial/informational links for SSAS cubes?  I don't know much about them, but from the sounds of it they could be very useful to my end users and to my DB architecture.