Need Data Model for 200 million csv rows

Posted on 2007-11-16
Last Modified: 2008-03-06
Hi all, I'm working on an app to retrieve records from a monstrous data store. What I need is a data model for how to store this data, preferably in SQL Server 2005, which is really not my area of expertise.

I have a comma delimited text file that documents web page activity with rows that contain a timestamp and data regarding the action taken, user id...

Fields are timestamp (when the action occurred), application (application requested), user id, op type (command/show/null), instruction (command being executed or page requested)

So file looks like:
2007-01-01 12:00:00,1384438,238,C,sf
2007-01-02 12:01:00,1483838,456,S,default-navigation

There are 200 million rows of data in the file, this is from one year. This is going to be ongoing, so next year, the data will be 400 million rows total, and so on.

I need a way (that scales) to store all this data so I can efficiently request large and small scale inquiries, such as the following:

What's the average amount of time a user takes to complete application xyz?

Is there a normal path a user takes to view page xyz?

What do users typically do after performing instruction xyz?

On average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order?

Display the data chronologically.

From an application standpoint, I'm thinking I'll have a window that gives users options pertaining to the item selected (timestamp: average, cumulative, difference; op type: C/S/none), reading the data with a DataReader (forward-only/read-only "firehose" cursor).

I'm really open to any ideas here.

Thanks in advance.
Question by:kellyclu
LVL 23

Expert Comment

ID: 20304091
I suggest you first import the data using BULK INSERT (or DTS/SSIS) *as-is* as far as schema is concerned, then build views based on a design effort. In other words, I suggest separating the ETL process from any design effort.
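As a rough sketch of that first load step: the staging table name, column types, and file path below are placeholders I've assumed, not anything from the original question.

```sql
-- Staging table matching the raw CSV layout as-is (no design effort yet).
CREATE TABLE raw_activity (
    [timestamp]  datetime     NOT NULL,
    application  int          NOT NULL,
    userid       int          NOT NULL,
    optype       char(1)      NULL,         -- C = command, S = show, or NULL
    instruction  varchar(255) NULL
);

-- Minimally-logged bulk load; path and batch size are examples only.
BULK INSERT raw_activity
FROM 'C:\data\activity_2007.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,                    -- allows a minimally-logged load into a heap
    BATCHSIZE       = 1000000   -- commit in chunks on a 200M-row file
);
```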

Hope this helps...
LVL 23

Expert Comment

ID: 20304131
Yeah, and make sure your data model is normalized. Normalization = database design optimization.

Hope this helps...
LVL 12

Accepted Solution

kselvia earned 500 total points
ID: 20304141
Interesting problem.   I was waiting to see what others suggested.

I agree, initially load the data using BULK INSERT.

Beyond that, I think each question will need to be addressed separately.

Reporting tables generated to answer each question.

Results should be pre-generated with enough detail to answer questions, but summarized enough to make the response time fast.

All processing should be done in stored procedures.

Your example issues are varied, and some are more complicated than others. Off the top of my head:

1) What's the average amount of time a user takes to complete application xyz?

That is pretty straightforward and can be pre-calculated for all users and applications. I assume there is an event that indicates begin and end. Will there always be both, or can an end event be missing (say, the user just disconnects)?
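A hedged sketch of that pre-calculation, assuming a bulk-loaded staging table named `raw_activity` and treating a user's first and last event per application as begin and end (the real begin/end markers aren't specified in the question):

```sql
-- Average seconds from first to last event, per application.
-- This over-counts if a user runs the same application more than once;
-- with real begin/end events you'd group per session instead.
SELECT application,
       AVG(DATEDIFF(second, start_ts, end_ts)) AS avg_seconds
FROM (
    SELECT userid,
           application,
           MIN([timestamp]) AS start_ts,   -- assumed "begin" of the session
           MAX([timestamp]) AS end_ts      -- assumed "end" of the session
    FROM raw_activity
    GROUP BY userid, application
) s
GROUP BY application;
```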

2) Is there a normal path a user takes to view page xyz?

I am thinking you might want to construct an access path table. First you need an instruction table:
instruction (instructionid, instructionname)
Populate it with all possible instructions after your initial data load:

1, sf
2, default-navigation
3, login
4, search
5, logout

Then an access path table
accesspath (pathid, path varchar(8000))

Accesspath would contain '~' + instructionid + '~' + instructionid ... for each unique series of steps taken from application start to application end. This is not normalized data, but it may be more useful. I'm thinking off the top of my head, so this may not lead anywhere. (That's why I wanted to see what others thought.)
pathid, path
1, ~2~3~4~5
2, ~3~5

You already have unique userids.  Create a userhistory table

userhistory (id, userid, pathid , startdate, enddate)
1, 238, 1, 10-10-07 3:00am, 10-10-07 3:30am
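Pulling the three tables above together as DDL, this is one possible shape (the identity columns and the foreign key are my additions, not stated in the original sketch):

```sql
CREATE TABLE instruction (
    instructionid   int IDENTITY(1,1) PRIMARY KEY,
    instructionname varchar(255) NOT NULL UNIQUE
);

CREATE TABLE accesspath (
    pathid int IDENTITY(1,1) PRIMARY KEY,
    path   varchar(8000) NOT NULL       -- e.g. '~2~3~4~5'
);

CREATE TABLE userhistory (
    id        int IDENTITY(1,1) PRIMARY KEY,
    userid    int      NOT NULL,
    pathid    int      NOT NULL REFERENCES accesspath (pathid),
    startdate datetime NOT NULL,
    enddate   datetime NULL             -- NULL if the user just disconnected
);
```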

Then to get some stats - top 5% of paths taken to get to xyz

select top 5 percent
    count(1) as ct,
    replace(path + '!', dbo.get_path('x,y,z') + '!', '') as prefix  -- what comes before path xyz
from userhistory u
join accesspath a on a.pathid = u.pathid
where a.path like '%' + dbo.get_path('x,y,z')  -- get_path('x,y,z') would return, for example, ~3~4~5
group by replace(path + '!', dbo.get_path('x,y,z') + '!', '')
order by ct desc

Having a lot of paths constructed as ~-delimited numbers and doing string manipulation on them may not be the best approach, but set-based queries over hundreds of millions of rows are going to take a lot of time.

3) What do users typically do after performing instruction xyz?
4) On average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order?

Those questions can be answered with the data model above.
Don't worry about populating the tables from your 200-million-row input. That part is easy. Ask if you want help doing it.
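For a taste of how that population step could start, here is a sketch of filling the instruction lookup from the bulk-loaded staging table (`raw_activity` is my assumed name for it; building the access paths themselves would take more work):

```sql
-- One row per distinct instruction seen in the raw data; the IDENTITY
-- column on instruction assigns the small surrogate ids used in the paths.
INSERT INTO instruction (instructionname)
SELECT DISTINCT instruction
FROM raw_activity
WHERE instruction IS NOT NULL;
```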

Well that's my first idea.
Maybe someone else will offer some better suggestions.


Question has a verified solution.
