Solved

Need Data Model for 200 million csv rows

Posted on 2007-11-16
3
518 Views
Last Modified: 2008-03-06
Hi all, I'm working on an app to retrieve records from a monstrous data store.  What I need is a data model for how to store this data, preferably in SQL Server 2005, which is really not my area of expertise.

I have a comma delimited text file that documents web page activity with rows that contain a timestamp and data regarding the action taken, user id...

Fields are timestamp (when the action occurred), application (application requested), user id, op type (command/show/null), instruction (command being executed or page requested)

timestamp            
2007-01-01 12:00:00
2007-01-02 12:01:00
...

application
1384438
1483838
...
 
user
238
456
...

optype
C
S
...

instruction
sf
default-navigation


So file looks like:
2007-01-01 12:00:00,1384438,238,C,sf
2007-01-02 12:01:00,1483838,456,S,default-navigation



There are 200 million rows of data in the file, this is from one year. This is going to be ongoing, so next year, the data will be 400 million rows total, and so on.

I need a way (that scales) to store all this data so I can efficiently run large- and small-scale queries, such as the following:

What's the average amount of time a user takes to complete application xyz?

Is there a normal path a user takes to view page xyz?

What do users typically do after performing instruction xyz?

On average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order?

Display the data chronologically.


From an application standpoint, I'm thinking I'll have a window that gives users options pertaining to the item selected (timestamp: average, cumulative, difference; optype: C/S/none), retrieved with a DataReader (forward-only/read-only, i.e. a firehose cursor).

I'm really open to any ideas here.

Thanks in advance.
Question by:kellyclu
3 Comments
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20304091
I suggest you first import data using bulk insert (or DTS/SSIS) *as it is* as far as schema is concerned then build views based on a design effort.  In other words, I suggest separating the ETL process from any design effort.
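As a sketch of that staged load (table and file names below are my own assumptions, not from the question):

```sql
-- Hypothetical staging table matching the raw file layout, loaded as-is.
CREATE TABLE dbo.activity_staging (
    event_ts    DATETIME     NOT NULL,
    application INT          NOT NULL,
    userid      INT          NOT NULL,
    optype      CHAR(1)      NULL,       -- C / S / NULL
    instruction VARCHAR(255) NULL
);

BULK INSERT dbo.activity_staging
FROM 'C:\data\activity_2007.csv'         -- path is illustrative
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,                             -- allows minimal logging under simple/bulk-logged recovery
    BATCHSIZE       = 1000000            -- commit in chunks so one bad batch doesn't roll back 200M rows
);
```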

Hope this helps...
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20304131
Yeah, and make sure your data model is normalized.  Normalization = database design optimization.

Hope this helps...
 
LVL 12

Accepted Solution

by:
kselvia earned 500 total points
ID: 20304141
Interesting problem.   I was waiting to see what others suggested.

I agree, initially load the data using BULK INSERT.

Beyond that, I think each question will need to be addressed separately.

Reporting tables generated to answer each question.

Results should be pre-generated with enough detail to answer questions, but summarized enough to make the response time fast.

All processing should be done in stored procedures.

Your example issues are varied, and some are more complicated than others.  Off the top of my head:

1) What's the average amount of time a user takes to complete application xyz.

That is pretty straightforward and can be pre-calculated for all users and applications.  I assume there is an event that indicates begin and end.  Will there always be both, or can an end event be missing (say, the user just disconnects)?
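A sketch of that pre-calculation against the staged data (the staging table name and the 'app-start'/'app-end' instruction names are assumptions; the question doesn't say how begin/end are marked):

```sql
-- Average completion time per application, assuming one start/end pair
-- per (application, userid). Sessions missing either event are skipped.
SELECT application,
       AVG(DATEDIFF(second, start_ts, end_ts)) AS avg_seconds
FROM (
    SELECT application, userid,
           MIN(CASE WHEN instruction = 'app-start' THEN event_ts END) AS start_ts,
           MAX(CASE WHEN instruction = 'app-end'   THEN event_ts END) AS end_ts
    FROM dbo.activity_staging
    GROUP BY application, userid
) s
WHERE start_ts IS NOT NULL
  AND end_ts   IS NOT NULL
GROUP BY application;
```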

2) is there a normal path a user takes to view page xyz.

I am thinking you might want to construct an access path table.   First you need an instruction table;
instruction(instructionid , instructionname )
populate it with all possible instructions after your initial data load.

i.e.
1, sf
2, default-navigation
3, login
4, search
5, logout
...

Then an access path table
accesspath (pathid, path varchar(8000))

Accesspath would contain '~' + instruction.instructionid + '~' + instruction ... for each unique series of steps taken from application start to application end.  This is not normalized data, but it may be more useful.  I'm thinking off the top of my head, so this may not lead anywhere.  (That's why I wanted to see what others thought.)
i.e.
pathid, path
1, ~2~3~4~5
2, ~3~5
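One way to declare and seed those two tables (column sizes are guesses):

```sql
CREATE TABLE dbo.instruction (
    instructionid   INT IDENTITY(1,1) PRIMARY KEY,
    instructionname VARCHAR(255) NOT NULL UNIQUE
);

-- Populate with all distinct instructions after the initial bulk load.
INSERT INTO dbo.instruction (instructionname)
SELECT DISTINCT instruction
FROM dbo.activity_staging
WHERE instruction IS NOT NULL;

CREATE TABLE dbo.accesspath (
    pathid INT IDENTITY(1,1) PRIMARY KEY,
    path   VARCHAR(8000) NOT NULL
);
```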

You already have unique userids.  Create a userhistory table

userhistory (id, userid, pathid , startdate, enddate)
1,238,1,10-10-07 3:00am, 10-10-07 330am
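Declared as a table, that might look like the following (the pathid index is my addition, since the stats queries below join and filter on it):

```sql
CREATE TABLE dbo.userhistory (
    id        INT IDENTITY(1,1) PRIMARY KEY,
    userid    INT NOT NULL,
    pathid    INT NOT NULL REFERENCES dbo.accesspath (pathid),
    startdate DATETIME NOT NULL,
    enddate   DATETIME NULL
);

-- Support the joins/filters on pathid used by the reporting queries.
CREATE INDEX ix_userhistory_pathid ON dbo.userhistory (pathid);
```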

Then to get some stats - top 5% of paths taken to get to xyz

select top 5 percent
    count(1) AS ct,
    replace(path + '!', dbo.get_path('x,y,z') + '!', '') AS prior_path  -- what comes before path xyz
from userhistory u
join accesspath a on a.pathid = u.pathid
where a.path like '%' + dbo.get_path('x,y,z')  -- get_path() would return, for example, ~3~4~5
group by replace(path + '!', dbo.get_path('x,y,z') + '!', '')
order by 1 desc


Having a lot of paths constructed as ~-delimited numbers and doing string manipulation on them may not be the best approach, but set-based queries on hundreds of millions of rows are going to take a lot of time.

3) what do users typically do after performing instruction xyz.
4) on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

Those questions can be answered with the data model above.
Don't worry about populating the tables from your 200 million row input. That part is easy. Ask if you want help doing it.  
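For what it's worth, one SQL Server 2005 way to build the ~-delimited path string per session uses FOR XML PATH concatenation.  This sketch assumes the staged rows have been tagged with a sessionid during the load (e.g. a new session per userid after some minutes of inactivity); that column does not exist in the raw file:

```sql
-- Build one '~'-delimited instruction-id string per session, ordered by time.
SELECT s.sessionid,
       (SELECT '~' + CAST(i.instructionid AS VARCHAR(10))
        FROM dbo.activity_staging a
        JOIN dbo.instruction i ON i.instructionname = a.instruction
        WHERE a.sessionid = s.sessionid
        ORDER BY a.event_ts
        FOR XML PATH('')) AS path
FROM (SELECT DISTINCT sessionid FROM dbo.activity_staging) s;
```

Each distinct string produced here would get one row in accesspath, with the (userid, pathid, start, end) recorded in userhistory.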

Well that's my first idea.
 
Maybe someone else will offer some better suggestions.

 
