Solved

Need Data Model for 200 million csv rows

Posted on 2007-11-16
3
525 Views
Last Modified: 2008-03-06
Hi all, I'm working on an app to retrieve records from a monstrous data store.  What I need is a data model for how to store this data, preferablly in sql server 2005, which is really not my area of expertise.  

I have a comma delimited text file that documents web page activity with rows that contain a timestamp and data regarding the action taken, user id...

Fields are timestamp (when the action occurred), application (application requested), user id, op type (command/show/null), instruction (command being executed or page requested)

timestamp            
2007-01-01 12:00:00
2007-01-02 12:01:00
...

application
1384438
1483838
...
 
user
238
456
...

optype
C
S
...

instruction
sf
default-navigation


So file looks like:
2007-01-01 12:00:00,1384438,238,C,sf
2007-01-02 12:01:00,1483838,456,S,default-navigation



There are 200 million rows of data in the file, this is from one year. This is going to be ongoing, so next year, the data will be 400 million rows total, and so on.

I need a way (that scales) to store all this data so I can efficiently request large and small scale inquiries, such as the following:

What's the average amount of time a user takes to complete application xyz.

is there a normal path a user takes to view page xyz.

what do users typically do after performing instruction xyz.

on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

display the data chronologically


from an application standpoint, i'm thinking i'll have a window that gives users options pertaining to the item selected, (timestamp - avg, accumulative, difference; optype (c/s/none) by using a datareader (forward only/read only (firehose cursor)).

I'm really open to any ideas here.

Thanks in advanced.
0
Comment
Question by:kellyclu
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20304091
I suggest you first import data using bulk insert (or DTS/SSIS) *as it is* as far as schema is concerned then build views based on a design effort.  In other words, I suggest separating the ETL process from any design effort.

Hope htis helps...
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20304131
Yeah and make sure you data model is normalized.  normalization = database design optimization

Hope this helps...
0
 
LVL 12

Accepted Solution

by:
kselvia earned 500 total points
ID: 20304141
Interesting problem.   I was waiting to see what others suggested.

I agree, initially load the data using BULK INSERT.

Beyond that, I think each question will need to be addressed separately.

Reporting tables generated to answer each question.

Results should be pre-generated with enough detail to answer questions, but summarized enough to make the response time fast.

All processing should be done in stored procedures.

Your example issues are varried and some more complicated than others.  Off the top of my head;

1) What's the average amount of time a user takes to complete application xyz.

That is pretty straight forward and can be pre-calculated for all users and applications.  I assume there is an event that indicates begin and end.  Will there always be both or can an end event be missing (say the user just disconnects)

2) is there a normal path a user takes to view page xyz.

I am thinking you might want to construct an access path table.   First you need an instruction table;
instruction(instructionid , instructionname )
populate it with all possible instructions after your initial data load.

i.e.
1, sf
2, default-navigation
3, login
4, search
5, logout
...

Then an access path table
accesspath (pathid, path varchar(8000))

Accesspath would contain '~' + instruction.id + '~' + instruction ... for each unique series of steps taken from application start to application end.  This is not normalized data but it may be more useful..  I'm thinking off-the-top of my head so this may not lead anywhere.  (That's why I wanted to see what others thought)
ie.
pathid, path
1, ~2~3~4~5
2, ~3~5

You already have unique userids.  Create a userhistory table

userhistory (id, userid, pathid , startdate, enddate)
1,238,1,10-10-07 3:00am, 10-10-07 330am

Then to get some stats - top 5% of paths taken to get to xyz

select top 5 percent
count(1) ct, replace ( path + '!', dbo.get_path('x,y,z') + '!', '' ) -- what comes before path xyz
from userhistory u
join accesspath a on a.pathid = u.pathid
where a.path like '%' + dbo.get_path('x,y,z')  -- get_path() this would return for example ~3~4~5
group by replace ( path + '!', dbo.get_path('x,y,z') + '!', '' )  
order by 1 desc


Having a lot of paths constructed as ~ delimited numbers and doing string manipulation on them may not be the best approach, but set based queries on hundreds of millions of rows is going to take a lot of time.

3) what do users typically do after performing instruction xyz.
4) on average, for all users that have submitted instruction xyz, what set of instructions did they execute before, and in what order.

Those questions can be answered with the data model above.
Don't worry about populating the tables from your 200 million row input. That part is easy. Ask if you want help doing it.  

Well that's my first idea.
 
Maybe someone else will offer some better suggestions.

 
0

Featured Post

DevOps Toolchain Recommendations

Read this Gartner Research Note and discover how your IT organization can automate and optimize DevOps processes using a toolchain architecture.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Why is this different from all of the other step by step guides?  Because I make a living as a DBA and not as a writer and I lived through this experience. Defining the name: When I talk to people they say different names on this subject stuff l…
In part one, we reviewed the prerequisites required for installing SQL Server vNext. In this part we will explore how to install Microsoft's SQL Server on Ubuntu 16.04.
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question