Solved

Table normalization for a custom Google Analytics-like system

Posted on 2014-09-10
84 Views
Last Modified: 2016-02-24
Hello,

I've been asked to create a Google Analytics-like application (just a simple one, of course) for our organization. As you might expect, the first thing that crossed my mind was to take the IIS log files and use LogParser to extract the information. That is my plan, and I would appreciate some guidance on the kinds of tables I would need.

Currently I am thinking of a set of tables as follows:

0) All the information is stored in a single table, RawInfo.

This raw table gets the following columns from the log file:

Id (incremental), date, time, c-ip, cs-username, s-sitename, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, sc-status,
sc-substatus, sc-win32-status, sc-bytes, cs-bytes, time-taken, cs-version, cs-host, cs-UserAgent, cs-cookie, cs-Referer
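To make the column list concrete, here is a hypothetical T-SQL sketch of the RawInfo staging table. The data types are assumptions based on typical W3C/IIS log fields and will likely need adjusting:

```sql
-- Hypothetical sketch only; types and lengths are guesses, not the
-- asker's actual schema.
CREATE TABLE dbo.RawInfo (
    Id                int IDENTITY(1,1) NOT NULL,
    [date]            date          NOT NULL,
    [time]            time(3)       NOT NULL,
    [c-ip]            varchar(45)   NOT NULL,  -- long enough for IPv6
    [cs-username]     varchar(255)  NULL,
    [s-sitename]      varchar(255)  NULL,
    [s-ip]            varchar(45)   NULL,
    [s-port]          int           NULL,
    [cs-method]       varchar(10)   NULL,
    [cs-uri-stem]     varchar(2048) NULL,
    [cs-uri-query]    varchar(2048) NULL,
    [sc-status]       int           NULL,
    [sc-substatus]    int           NULL,
    [sc-win32-status] int           NULL,
    [sc-bytes]        bigint        NULL,
    [cs-bytes]        bigint        NULL,
    [time-taken]      int           NULL,
    [cs-version]      varchar(20)   NULL,
    [cs-host]         varchar(255)  NULL,
    [cs-UserAgent]    varchar(1024) NULL,
    [cs-cookie]       varchar(2048) NULL,
    [cs-Referer]      varchar(2048) NULL
);
```

The hyphenated log-field names require bracket quoting in T-SQL; renaming them to underscore-style identifiers would avoid that.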

1) From RawInfo we split the information into 8 tables based on our needs.
The tables are:

Visitor table:
id, date, time, c-ip, cs-username, s-sitename, method-id, status, win32status, time-taken, cs-version, cs-host

Method table:
method-id, method-name e.g. GET, POST


Uri-Stem table:
id, date, time, c-ip, cs-uri-stem

Uri-Query table:
id, date, time, c-ip, cs-uri-query

HttpRef table:
status-code, description
e.g. 200 = OK, 404 = Not Found, etc.

UserAgent table:
id, date, time, c-ip, cs-UserAgent

Cookies table:
id, date, time, c-ip, cs-cookie

Referer table:
id, date, time, c-ip, cs-Referer
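The two small lookup tables in the list above could be sketched like this in T-SQL; the names follow the proposal, but the types and sample rows are assumptions:

```sql
-- Hypothetical lookup-table sketches for the proposed schema.
CREATE TABLE dbo.Method (
    [method-id]   tinyint     NOT NULL PRIMARY KEY,
    [method-name] varchar(10) NOT NULL   -- e.g. 'GET', 'POST'
);

CREATE TABLE dbo.HttpRef (
    [status-code] int          NOT NULL PRIMARY KEY,
    [description] varchar(100) NOT NULL  -- e.g. 200 = 'OK'
);

INSERT INTO dbo.HttpRef ([status-code], [description])
VALUES (200, 'OK'),
       (301, 'Moved Permanently'),
       (404, 'Not Found'),
       (500, 'Internal Server Error');
```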


2) I'm thinking of using a composite primary key of (id, date, time, c-ip).

Could some experts discuss/comment on this approach?

Thanks.
Ak
Question by:akohan
16 Comments
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40315883
I'd make the clustering key datetime, since I think that would be much more useful for limiting searches (I wouldn't split date and time here; that just creates more work for you).

If you want a dummy identity PK, that's OK too, but it's not really necessary (assuming you're careful never to reset the id value).
 
LVL 48

Expert Comment

by:PortletPaul
ID: 40315926
Why re-invent this?

Have you considered the multitude of open-source products available for web analytics? Then add commercial products as options too.
 

Author Comment

by:akohan
ID: 40315931
Hello Scott,

So you are saying to combine the date and time columns and store them as a DateTime data type?

SELECT MixDateAndTime = CAST([date] AS datetime) + CAST([time] AS datetime) FROM MyRowInfoTable



Yes, I will stick with that PK forever to locate the other information. My assumption (I hope I am right) is that each request can be considered unique by its date + time + client IP (== c-ip). Right? Or maybe I won't even need to add an extra incremental ID.

Any comments?
 

Author Comment

by:akohan
ID: 40315933
Hello Paul,

Yes, I proposed that, but they would rather go with a custom one  :(
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40318213
Again, the incremental id as a PK is OK, but you do NOT want that to be the clustering key.  Instead, cluster on the combined date and time (to confirm: combine date and time into a single column).  Easiest is to create the clustering key first -- then by default the PK will not be clustered.
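One way to follow this advice in T-SQL, assuming a combined datetime column named VisitDateTime and an identity column Id (both names are illustrative, not from the thread):

```sql
-- Hypothetical sketch: clustered index on the datetime column,
-- nonclustered PK on the identity column.
CREATE TABLE dbo.Visitor (
    Id            int IDENTITY(1,1) NOT NULL,
    VisitDateTime datetime    NOT NULL,
    [c-ip]        varchar(45) NOT NULL
    -- ... remaining columns ...
);

-- Create the clustering key first ...
CREATE CLUSTERED INDEX IX_Visitor_VisitDateTime
    ON dbo.Visitor (VisitDateTime);

-- ... so that the PK added afterwards defaults to nonclustered.
ALTER TABLE dbo.Visitor
    ADD CONSTRAINT PK_Visitor PRIMARY KEY (Id);
```

In SQL Server a PRIMARY KEY constraint is clustered by default only when no clustered index exists yet, which is why the ordering of these two statements matters.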
 

Author Comment

by:akohan
ID: 40318301
OK, but what if two requests hit a page at the same date and time?
Am I missing something here?
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40320531
I expect some will hit at the same time.  SQL will resolve the dups, or you can use an identity to prevent dups.  Either way, that clustering will still provide the quickest, most used method of limiting the rows to be read by queries.
 

Author Comment

by:akohan
ID: 40331184
Right, but I would still need the client's IP address to locate other information about a specific visitor. Again, the question is: how can I combine a DateTime with an IP value, which is varchar?
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331199
You don't need, nor want, to "mix them up".

Simply specify multiple key columns:
( date, time, ip )
 

Author Comment

by:akohan
ID: 40331245
So you mean making the following columns a composite key?
Id int
VisitorDate  varchar
visitorTime  varchar

But in your comment above you said: "I'd make the clustering key datetime."
So should I convert them, and then set client-IP (varchar) and VisitDay (DateTime) as the PK?
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331252
Date, time and id are fine:
( VisitorDate, visitorTime, Id )

The important thing is to put date and time first, not id.  I'm assuming that date & time are the date and time of access, and thus will automatically be incremental.
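This recommendation could be expressed as a single index definition; the sketch below assumes a dbo.Visitor table containing the three columns named in the comment:

```sql
-- Hypothetical sketch: date and time lead the clustered key; the
-- identity column comes last, solely to guarantee uniqueness.
CREATE UNIQUE CLUSTERED INDEX IX_Visitor_Date_Time_Id
    ON dbo.Visitor (VisitorDate, visitorTime, Id);
```

Because the leading columns are the access date and time, new rows land at the logical end of the index, avoiding page splits on insert.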
 

Author Comment

by:akohan
ID: 40331308
Yes, I did that: date and time first, then ID. Now how do I deal with the other tables as far as maintaining relationships?
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331311
You still relate by ID; you have no good alternative.  Thus, you will still need an index on ( ID, date, time) as well (typically without other columns, but you might need one or two INCLUDE columns depending on the specific queries you run).  As long as the query on the other table restricts by date and time, you still gain tremendous efficiencies from the proper key structure.
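The supporting index described here might look like the following; the table and column names are carried over from the sketches above, and the INCLUDE column is purely illustrative:

```sql
-- Hypothetical sketch of the nonclustered index for relating by ID;
-- add INCLUDE columns only if specific queries need them.
CREATE NONCLUSTERED INDEX IX_Visitor_Id_Date_Time
    ON dbo.Visitor (Id, VisitorDate, visitorTime)
    INCLUDE ([c-ip]);
```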
 

Author Comment

by:akohan
ID: 40331329
So let me ask you this question: what would be the role of the IP (client-IP) in this game?

Initially, I was thinking of having the IP values in a table (all unique IPs) and then relating it to the other tables; by other tables I mean again date, time, IP, etc., as a one-to-many relationship.

Can you tell me what would be wrong with that?

Thank you
 
LVL 69

Accepted Solution

by:
Scott Pletcher earned 500 total points
ID: 40331347
You can still do all that.

My thought was that you would typically work with a set of rows within a date range.  If you do, then clustering the table that way will drastically improve the efficiency of those queries, no matter how you need to match IP addresses or other data columns in the future.
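The query pattern being described could look roughly like this, using the hypothetical table and column names from the earlier sketches:

```sql
-- Sketch: restrict by date range first (served by the clustered
-- index on VisitDateTime), then match on IP or other columns.
SELECT v.VisitDateTime, v.[c-ip], u.[cs-UserAgent]
FROM dbo.Visitor v
JOIN dbo.UserAgent u
    ON u.Id = v.Id
WHERE v.VisitDateTime >= '2014-09-01'
  AND v.VisitDateTime <  '2014-09-08'
  AND v.[c-ip] = '203.0.113.7';
```

The date-range predicate lets SQL Server seek to a contiguous slice of the clustered index, so the IP filter only has to examine rows within that window.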
 

Author Closing Comment

by:akohan
ID: 40604134
OK thank you!
