Need Data Model for 200 million csv rows

Posted on 2007-11-16
Medium Priority
Last Modified: 2008-03-06
Hi all, I'm working on an app to retrieve records from a monstrous data store.  What I need is a data model for how to store this data, preferablly in sql server 2005, which is really not my area of expertise.  

I have a comma delimited text file that documents web page activity with rows that contain a timestamp and data regarding the action taken, user id...

Fields are timestamp (when the action occurred), application (application requested), user id, op type (command/show/null), instruction (command being executed or page requested)

2007-01-01 12:00:00
2007-01-02 12:01:00




So file looks like:
2007-01-01 12:00:00,1384438,238,C,sf
2007-01-02 12:01:00,1483838,456,S,default-navigation

There are 200 million rows of data in the file, this is from one year. This is going to be ongoing, so next year, the data will be 400 million rows total, and so on.

I need a way (that scales) to store all this data so I can efficiently request large and small scale inquiries, such as the following:

What's the average amount of time a user takes to complete application xyz.

is there a normal path a user takes to view page xyz.

what do users typically do after performing instruction xyz.

on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

display the data chronologically

from an application standpoint, i'm thinking i'll have a window that gives users options pertaining to the item selected, (timestamp - avg, accumulative, difference; optype (c/s/none) by using a datareader (forward only/read only (firehose cursor)).

I'm really open to any ideas here.

Thanks in advanced.
Question by:kellyclu
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
LVL 23

Expert Comment

ID: 20304091
I suggest you first import data using bulk insert (or DTS/SSIS) *as it is* as far as schema is concerned then build views based on a design effort.  In other words, I suggest separating the ETL process from any design effort.

Hope htis helps...
LVL 23

Expert Comment

ID: 20304131
Yeah and make sure you data model is normalized.  normalization = database design optimization

Hope this helps...
LVL 12

Accepted Solution

kselvia earned 1500 total points
ID: 20304141
Interesting problem.   I was waiting to see what others suggested.

I agree, initially load the data using BULK INSERT.

Beyond that, I think each question will need to be addressed separately.

Reporting tables generated to answer each question.

Results should be pre-generated with enough detail to answer questions, but summarized enough to make the response time fast.

All processing should be done in stored procedures.

Your example issues are varried and some more complicated than others.  Off the top of my head;

1) What's the average amount of time a user takes to complete application xyz.

That is pretty straight forward and can be pre-calculated for all users and applications.  I assume there is an event that indicates begin and end.  Will there always be both or can an end event be missing (say the user just disconnects)

2) is there a normal path a user takes to view page xyz.

I am thinking you might want to construct an access path table.   First you need an instruction table;
instruction(instructionid , instructionname )
populate it with all possible instructions after your initial data load.

1, sf
2, default-navigation
3, login
4, search
5, logout

Then an access path table
accesspath (pathid, path varchar(8000))

Accesspath would contain '~' + instruction.id + '~' + instruction ... for each unique series of steps taken from application start to application end.  This is not normalized data but it may be more useful..  I'm thinking off-the-top of my head so this may not lead anywhere.  (That's why I wanted to see what others thought)
pathid, path
1, ~2~3~4~5
2, ~3~5

You already have unique userids.  Create a userhistory table

userhistory (id, userid, pathid , startdate, enddate)
1,238,1,10-10-07 3:00am, 10-10-07 330am

Then to get some stats - top 5% of paths taken to get to xyz

select top 5 percent
count(1) ct, replace ( path + '!', dbo.get_path('x,y,z') + '!', '' ) -- what comes before path xyz
from userhistory u
join accesspath a on a.pathid = u.pathid
where a.path like '%' + dbo.get_path('x,y,z')  -- get_path() this would return for example ~3~4~5
group by replace ( path + '!', dbo.get_path('x,y,z') + '!', '' )  
order by 1 desc

Having a lot of paths constructed as ~ delimited numbers and doing string manipulation on them may not be the best approach, but set based queries on hundreds of millions of rows is going to take a lot of time.

3) what do users typically do after performing instruction xyz.
4) on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

Those questions can be answered with the data model above.
Don't worry about populating the tables from your 200 million row input. That part is easy. Ask if you want help doing it.  

Well that's my first idea.
Maybe someone else will offer some better suggestions.


Featured Post

On Demand Webinar: Networking for the Cloud Era

Did you know SD-WANs can improve network connectivity? Check out this webinar to learn how an SD-WAN simplified, one-click tool can help you migrate and manage data in the cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
I have a large data set and a SSIS package. How can I load this file in multi threading?
Familiarize people with the process of utilizing SQL Server functions from within Microsoft Access. Microsoft Access is a very powerful client/server development tool. One of the SQL Server objects that you can interact with from within Microsoft Ac…
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

777 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question