Need Data Model for 200 million csv rows

Posted on 2007-11-16
Medium Priority
Last Modified: 2008-03-06
Hi all, I'm working on an app to retrieve records from a monstrous data store.  What I need is a data model for how to store this data, preferablly in sql server 2005, which is really not my area of expertise.  

I have a comma delimited text file that documents web page activity with rows that contain a timestamp and data regarding the action taken, user id...

Fields are timestamp (when the action occurred), application (application requested), user id, op type (command/show/null), instruction (command being executed or page requested)

2007-01-01 12:00:00
2007-01-02 12:01:00




So file looks like:
2007-01-01 12:00:00,1384438,238,C,sf
2007-01-02 12:01:00,1483838,456,S,default-navigation

There are 200 million rows of data in the file, this is from one year. This is going to be ongoing, so next year, the data will be 400 million rows total, and so on.

I need a way (that scales) to store all this data so I can efficiently request large and small scale inquiries, such as the following:

What's the average amount of time a user takes to complete application xyz.

is there a normal path a user takes to view page xyz.

what do users typically do after performing instruction xyz.

on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

display the data chronologically

from an application standpoint, i'm thinking i'll have a window that gives users options pertaining to the item selected, (timestamp - avg, accumulative, difference; optype (c/s/none) by using a datareader (forward only/read only (firehose cursor)).

I'm really open to any ideas here.

Thanks in advanced.
Question by:kellyclu
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
LVL 23

Expert Comment

ID: 20304091
I suggest you first import data using bulk insert (or DTS/SSIS) *as it is* as far as schema is concerned then build views based on a design effort.  In other words, I suggest separating the ETL process from any design effort.

Hope htis helps...
LVL 23

Expert Comment

ID: 20304131
Yeah and make sure you data model is normalized.  normalization = database design optimization

Hope this helps...
LVL 12

Accepted Solution

kselvia earned 1500 total points
ID: 20304141
Interesting problem.   I was waiting to see what others suggested.

I agree, initially load the data using BULK INSERT.

Beyond that, I think each question will need to be addressed separately.

Reporting tables generated to answer each question.

Results should be pre-generated with enough detail to answer questions, but summarized enough to make the response time fast.

All processing should be done in stored procedures.

Your example issues are varried and some more complicated than others.  Off the top of my head;

1) What's the average amount of time a user takes to complete application xyz.

That is pretty straight forward and can be pre-calculated for all users and applications.  I assume there is an event that indicates begin and end.  Will there always be both or can an end event be missing (say the user just disconnects)

2) is there a normal path a user takes to view page xyz.

I am thinking you might want to construct an access path table.   First you need an instruction table;
instruction(instructionid , instructionname )
populate it with all possible instructions after your initial data load.

1, sf
2, default-navigation
3, login
4, search
5, logout

Then an access path table
accesspath (pathid, path varchar(8000))

Accesspath would contain '~' + instruction.id + '~' + instruction ... for each unique series of steps taken from application start to application end.  This is not normalized data but it may be more useful..  I'm thinking off-the-top of my head so this may not lead anywhere.  (That's why I wanted to see what others thought)
pathid, path
1, ~2~3~4~5
2, ~3~5

You already have unique userids.  Create a userhistory table

userhistory (id, userid, pathid , startdate, enddate)
1,238,1,10-10-07 3:00am, 10-10-07 330am

Then to get some stats - top 5% of paths taken to get to xyz

select top 5 percent
count(1) ct, replace ( path + '!', dbo.get_path('x,y,z') + '!', '' ) -- what comes before path xyz
from userhistory u
join accesspath a on a.pathid = u.pathid
where a.path like '%' + dbo.get_path('x,y,z')  -- get_path() this would return for example ~3~4~5
group by replace ( path + '!', dbo.get_path('x,y,z') + '!', '' )  
order by 1 desc

Having a lot of paths constructed as ~ delimited numbers and doing string manipulation on them may not be the best approach, but set based queries on hundreds of millions of rows is going to take a lot of time.

3) what do users typically do after performing instruction xyz.
4) on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

Those questions can be answered with the data model above.
Don't worry about populating the tables from your 200 million row input. That part is easy. Ask if you want help doing it.  

Well that's my first idea.
Maybe someone else will offer some better suggestions.


Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A Stored Procedure in Microsoft SQL Server is a powerful feature that it can be used to execute the Data Manipulation Language (DML) or Data Definition Language (DDL). Depending on business requirements, a single Stored Procedure can return differe…
This month, Experts Exchange sat down with resident SQL expert, Jim Horn, for an in-depth look into the makings of a successful career in SQL.
Familiarize people with the process of retrieving data from SQL Server using an Access pass-thru query. Microsoft Access is a very powerful client/server development tool. One of the ways that you can retrieve data from a SQL Server is by using a pa…
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.

650 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question