Solved

Novel Solution needed for table that has 271 Million Rows of Data

Posted on 2011-02-11
360 Views
Last Modified: 2012-05-11
I inherited a table that has 271 million rows of data (and growing each day).  My first thought for making this more sustainable is to convince the VP of IT that we need to change how we are storing the statistical data.  Unfortunately, that could be a tough sell.

For auditing purposes our customers need to know exactly when a file played (we are a digital signage company).  It isn't enough to know that file X played 10 times in a 30-minute window (or 1 hour, or 15 minutes, or whatever time frame I come up with).  Admittedly, if we changed over to a count per time period, the number of rows would decrease drastically.

Each "player" has an ID.  One thought I came up with was making a table for each player.  then use dynamic sql to determine which table to insert the data into and which to select from.  I could create tables by accounts and then each table would have that account's players in it. .  But I have to say that that just seems hackish.  

I wonder though how Google manages such large data sets.  They must be doing something to break the data up into manageable pieces and then figuring out a way to get to that data based on the user that is logged in.  Maybe my idea isn't so bad.

Need practical advice.
Question by:davidcahan
20 Comments
 
LVL 14

Expert Comment

by:Michael Dyer
Instead of creating a table for each player, have you considered breaking it up into a separate table for each file name and then storing the playing events in the table for that specific file?  That way you can use dynamic SQL to query the specific file's table.  The total number of records remains the same, but the query should run much faster.
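Roughly along these lines, as a sketch only (the per-file table name, its columns and the stored values are made up, not your actual schema):

    -- Hypothetical: one table per file, e.g. dbo.PlayEvents_1234 for fileID 1234
    DECLARE @fileID int, @sql nvarchar(max);
    SET @fileID = 1234;
    SET @sql = N'SELECT playDate, unitID, zone FROM dbo.'
             + QUOTENAME(N'PlayEvents_' + CAST(@fileID AS nvarchar(10)))
             + N' WHERE playDate >= @from AND playDate < @to;';
    EXEC sp_executesql @sql,
         N'@from datetime, @to datetime',
         @from = '20110101', @to = '20110201';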
 
LVL 9

Assisted Solution

by:sshah254
sshah254 earned 100 total points
Google does not store data the way MS SQL Server or Oracle do ... Google's technology is way different.  If you are willing to switch to Google's Bigtable, then that is one way out.  :-)

Have you thought about partitioning your tables?  It might be easier to do that than what you are thinking about.
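As a minimal sketch of what table partitioning looks like in SQL Server (the function, scheme and table names are placeholders, and partitioning requires Enterprise Edition):

    -- Partition by month on the play date
    CREATE PARTITION FUNCTION pf_PlayDate (datetime)
        AS RANGE RIGHT FOR VALUES ('20110101', '20110201', '20110301');

    CREATE PARTITION SCHEME ps_PlayDate
        AS PARTITION pf_PlayDate ALL TO ([PRIMARY]);

    -- The table (or its clustered index) is then built on the scheme instead of a filegroup:
    -- CREATE CLUSTERED INDEX CIX_PlayLog_date ON dbo.PlayLog ([date]) ON ps_PlayDate ([date]);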

Ss
 

Author Comment

by:davidcahan
I don't know much about table partitioning.  Could you tell me more?
 
LVL 50

Expert Comment

by:Lowfatspread
Do you actually have a problem at the moment, or are you just trying to head one off?


How skewed is the data?
What is the rate of new "file" creation?
What is the lifecycle of a file (e.g. played a lot in the first 6 months then hardly ever, or is it cyclical, or ...)?
Is there any archive process?

How often, and how, is the data accessed?
   e.g.  normally just accessed for monthly/quarterly stats
           in-depth analysis of a "file"

Just trying to get some idea of what the "shape" of the data is; my initial thought is some sort of partitioning scheme for the data.

How do you store it currently? Is it all on just one instance, or is it distributed across multiple servers/instances? Do you have separate production and analysis databases?

How much data in gigabytes, what is the growth in gigabytes, and how many users?
 
LVL 50

Expert Comment

by:Lowfatspread
.... just a thought

Try getting it into the VP's mind that you need to start planning for a database move anyway
(2005 is old, 2008 R2 is out, etc.).  Now is probably the time to be looking at the architecture
of the system again.
 
LVL 9

Expert Comment

by:sshah254
 

Author Comment

by:davidcahan
@Lowfatspread -- one man's problems are another man's annoyances, so it's hard to say whether I have a problem or am just heading one off; it's probably somewhere in between the two.

re: skewed data:  I'm not sure what that means

re: files -- files aren't really created (at least not in terms of the data I am tracking).  A "unit" has a show that plays on it.  A show is made up of files.  A file can play anywhere from 20 secs to 3 mins (the average would be about 30 secs).  When a show reaches the end, it loops back around to the beginning.  Each time a file plays, we log the time it played, the unit it played on, what its fileID is, and what zone on the screen it played in (units can have 1-3 zones that each have their own show playing).  If you think about a unit with multiple shows running on it 24/7, that's a lot of records each day.

re: data access -- data is accessed once a month, by me.  I'm currently pulling just the raw data out of the table; I'm not trying to group or count by some time increment (e.g. how often a file plays every 15 mins).  EVENTUALLY, I'd like to build a web-based report that our customers can run themselves.  I'm sure they would love in-depth analysis of a particular file.  That way, if they know that for 2 weeks a pack of chewing gum played 10 times per 15 mins, then they increased that to 20 times per 15 mins and saw a 20% increase in gum sales, they'd know their signage system is having a positive impact on sales.

Currently, a report for one month for one client who might only have one unit can be upwards of 65,000 rows.  Some of our customers claim they need the exact time a file played for compliance, but I'm guessing that's just my company's interpretation of the requirement.  I'm guessing that if I pressed the end users, a count of how often a file played every 15 or 30 mins would be enough.  And if each row were limited to a count per time period (so on insert, if I'm still in that time period, I do an update to the count instead), that would significantly lower the number of rows.
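Roughly what I have in mind, as a sketch only (the dbo.PlayCounts table and procedure name are made up):

    -- Bucket plays into 15-minute periods and keep one row per unit/file/zone/period
    CREATE PROCEDURE dbo.LogPlayCount
        @um_id int, @cm_id int, @lm_id int, @playTime datetime
    AS
    BEGIN
        DECLARE @bucket smalldatetime;
        SET @bucket = DATEADD(minute, (DATEDIFF(minute, 0, @playTime) / 15) * 15, 0);

        UPDATE dbo.PlayCounts
           SET playCount = playCount + 1
         WHERE um_id = @um_id AND cm_id = @cm_id AND lm_id = @lm_id
           AND periodStart = @bucket;

        IF @@ROWCOUNT = 0
            INSERT INTO dbo.PlayCounts (um_id, cm_id, lm_id, periodStart, playCount)
            VALUES (@um_id, @cm_id, @lm_id, @bucket, 1);
    END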

I'm archiving old records after 3 months, and 3 months' worth of data is still 271 million rows.  With indexes, the table eats up about 42 GB.  Total DB size is about 57 GB.  It's all on one instance.
 

Author Comment

by:davidcahan
anyone?
 
LVL 51

Expert Comment

by:tedbilly
Can you give us an example of your schema?

At my last employer I built a database schema for capturing continuous and daily build metrics for thousands of systems.  Not only did we capture the data for every build, we captured metrics for each individual build component (each component would be represented by a project in Visual Studio), usually about 80 components per build.  For each build component we had name, version, number of files, location of the files, average build time per file, total build time, longest build time, and the number of successes, failures and warnings.  We stored a lot more than that, though: we had system details at the time of the build as well as overall build project data, et cetera.

As you can imagine we had millions and millions of rows.  After 5 months the table with component build metrics was over 50 million records.  There were other tables with in excess of 10 million records.

The entire database was 4.6 GB.  That is not a misprint.  Last I heard (I left the company) the database is now 11 months old and only about 12 GB in size.

The way I achieved such small size was by fully normalizing the database.  Another metric system that isn't normalized as well as mine reached 150 GB after 3 months and has about 80 million records.  

So if you can include your schema, let's discuss how to minimize storage by fully normalizing the data.
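As a rough illustration of the idea (table and column names here are invented, not your schema): repeated strings and wide values get factored into small lookup tables, and the big metric table stores only narrow integer keys.

    CREATE TABLE dbo.Component (
        component_id smallint IDENTITY(1,1) PRIMARY KEY,
        name         nvarchar(100) NOT NULL UNIQUE,
        location     nvarchar(260) NOT NULL
    );

    CREATE TABLE dbo.ComponentBuild (
        build_id      int      NOT NULL,
        component_id  smallint NOT NULL REFERENCES dbo.Component (component_id),
        build_time_ms int      NOT NULL,
        successes     smallint NOT NULL,
        failures      smallint NOT NULL,
        warnings      smallint NOT NULL
    );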
 
LVL 61

Assisted Solution

by:gheist
gheist earned 200 total points
Fix the slowness:
Use contig from Sysinternals to defragment the data files and tidy up the directory tree.
The log size and next-chunk (autogrowth) sizes probably need to be increased to avoid future fragmentation.
Massive updates also suggest that you need to limit indexing to the minimum required.
Limit SQL Server memory usage to no more than half of RAM, because auto-sizing implies swapping out other applications, and with a huge dataset that happens really soon.

Can you post the approximate schema of the table? I could try to gauge whether 2008 would help.
Also evaluate alternatives - PostgreSQL is on a par with SQL 2005 feature- and performance-wise, and the saved licensing costs may pay for a new server.
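For example (a sketch; pick a value around half of the machine's physical RAM):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 8192;   -- e.g. on a 16 GB box
    RECONFIGURE;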
 

Author Comment

by:davidcahan
id          int        pk
cm_id       int        index, 60% fill
lm_id       int        index, 60% fill
um_id       int        index, 60% fill
date        datetime   index, 60% fill
dateAdded   datetime   index, 60% fill
isFileDrop  bit        index, 60% fill

I have to have the indexes on cm_id, lm_id, um_id and date because I use them to make sure I'm not adding a duplicate record, and I'm also using them in different combinations for selects, groups, wheres and joins.  I may be able to do without the indexes on dateAdded and isFileDrop.

And the PK I don't really need; in fact, technically speaking, the combination of cm_id, lm_id, um_id and date makes each record unique.
Would it be a major problem if I simply removed the PK altogether?

Other suggestions obviously wanted as well.

 
LVL 51

Accepted Solution

by:
tedbilly earned 200 total points
If the primary key is an unnecessary surrogate key then definitely remove it, and make the other columns (cm_id, lm_id, um_id and date) a compound clustered primary key if they represent a unique record.

What is the difference between date and dateAdded?  If granularity down to the minute is good enough, you could reduce them to smalldatetime types.
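Something along these lines, assuming a hypothetical table name dbo.PlayLog and a surrogate-key constraint named PK_PlayLog (adjust the names to your schema, drop any nonclustered index on a column before altering its type, and the columns must contain no NULLs):

    ALTER TABLE dbo.PlayLog DROP CONSTRAINT PK_PlayLog;   -- old surrogate PK
    ALTER TABLE dbo.PlayLog DROP COLUMN id;

    ALTER TABLE dbo.PlayLog ALTER COLUMN [date]    smalldatetime NOT NULL;
    ALTER TABLE dbo.PlayLog ALTER COLUMN dateAdded smalldatetime NOT NULL;

    ALTER TABLE dbo.PlayLog ADD CONSTRAINT PK_PlayLog
        PRIMARY KEY CLUSTERED ([date], um_id, cm_id, lm_id);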

Excessive indexing can do more harm than good.  At a company I recently joined, the team has a table with 20 million records and what they thought were appropriate indexes.  I removed all the indexes, and a query that ran in 91 seconds was reduced to 50 seconds with NO indexes or primary keys.

I then created a new index with three indexed columns and two included columns for the most commonly run query and the query duration went down to 2 seconds!
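For example, if your most common query filters on um_id plus a date range and returns the other IDs, a covering index (the name is just a placeholder, on the same hypothetical table as above) could look like:

    CREATE NONCLUSTERED INDEX IX_PlayLog_um_date
        ON dbo.PlayLog (um_id, [date])
        INCLUDE (cm_id, lm_id);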


 

Author Comment

by:davidcahan
Date is when a file played on a machine in the field.  DateAdded is when the record was added to the table.

I'll try removing the PK and the dateAdded and isFileDrop indexes and see what happens.
 

Author Comment

by:davidcahan
Also, is it "bad" to have no primary key at all?
 
LVL 51

Expert Comment

by:tedbilly
Not necessarily.  In logging tables I rarely add a primary key because the process of inserting records is faster when there is no indexing.
 
LVL 61

Expert Comment

by:gheist
If you can find another unique column or expression, you are free to kill the primary key.
Make sure you add a constraint to ensure it really is unique, and be ready to face the consequences, because most likely it is not.
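For example, using the hypothetical dbo.PlayLog name from above and the natural key you described:

    ALTER TABLE dbo.PlayLog
        ADD CONSTRAINT UQ_PlayLog_NaturalKey UNIQUE ([date], um_id, cm_id, lm_id);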
 
LVL 30

Expert Comment

by:nmcdermaid
Where did the table partitioning suggestion go? (way up top)

What people often do is partition their detail data off onto cheaper, slower storage, so you have one year's data per hard drive across a bunch of SQL data files.  Partitioning makes this transparent to the user: you don't have to work out which table to select from depending on the year, etc. - it's all transparent.

Then you summarise this detail data into something useful (like an SSAS cube), which gives you useful summarised answers.  You could even build one cube per client and send them off to your clients.  A really powerful thing is to integrate your play data with their sales data within a cube and see the kinds of metrics you were talking about.

Then you can drill back to your detail data if you really have to know exactly when the event occurred.
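As a rough sketch of mapping one year per drive (the database, filegroup and partition names here are all made up):

    ALTER DATABASE SignageStats ADD FILEGROUP FG2009;
    ALTER DATABASE SignageStats ADD FILEGROUP FG2010;
    ALTER DATABASE SignageStats ADD FILEGROUP FG2011;
    -- ...then add a data file on a separate drive to each filegroup with ALTER DATABASE ... ADD FILE

    CREATE PARTITION FUNCTION pf_PlayYear (datetime)
        AS RANGE RIGHT FOR VALUES ('20100101', '20110101');

    CREATE PARTITION SCHEME ps_PlayYear
        AS PARTITION pf_PlayYear TO (FG2009, FG2010, FG2011);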
 

Author Comment

by:davidcahan
Here's what I did for now (and I have no idea who to give points to):

I altered 2 columns from int to smallint and altered one column from datetime to smalldatetime.  I dropped the identity PK column (because it wasn't being used to join to anything) and instead made a clustered PK from:
 
date asc,
um_id asc,
cm_id asc,
lm_id asc

I dropped the indexes from isFileDrop and dateAdded.  Tested my common queries against it and they ran as well as they usually did.

Once we get on Enterprise SQL Server, I'm definitely going to need to partition on unit.  I'm guessing that is going to improve performance.

@nmcdermaid -- do you have any tutorial/informational links on SSAS cubes?  I don't know much about them, but from the sounds of it they could be very useful to my end users and to my DB architecture.
