Solved

Table normalization for a custom Google Analytic alike system

Posted on 2014-09-10
16
79 Views
Last Modified: 2016-02-24
Hello,

I'm asked to create a Google Analytics alike application (of course just a simple one) for our organization so as you may know first thing which crosses my mind was thinking of IIS log file and use LogParser to extract information so this is my plan and I will appreciate it if you give me some heads up as far as the kind of tables I would need to have.

Currently I am think of a set of tablets as following:

0) All the information are stored into a table as RawInfo

This raw table gets following column/info from log file:

Id (incremental), date, time, c-ip, cs-username, s-sitename, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, cs-status,
sc-substatus, sc-win32-status, sc-bytes, cs-bytes, time-taken, cs-version, cs-host, cs-UserAgent, cs-cookie, cs-Referer

1) From RawInfo we split the information in 8 tables base on the need we have.
Tables are:

Visitor table:
id, date, time, c-ip, cs-username, s-sitename, method-id, status, win32status, time-taken, cs-version, cs-host

Method table:
method-id, method-name e.g. GET/SET


Uri-Stem table:
id, date, time, c-ip, cs-uri-stem

Uri-Query table:
id, date, time, c-ip, cs-uri-query

HttpRef table:
status-code, description  
e.g. 200 = OK,  404 = page not found  and etc

UserAgent table:
id, date, time, c-ip,cs-UserAgent

Cookies table:
id, date, time, c-ip,cs-cookes

Referer table:
id, date, time, c-ip, cs-Referer


2) I'm thinking of using a combinational Primary key of  id, date, time, c-ip

May some experts discuss/comment on this approach ?

Thanks.
Ak
0
Comment
Question by:akohan
  • 8
  • 7
16 Comments
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40315883
I'd make the clustering key datetime, as I think that would be much more useful for limiting searches (I wouldn't split date and time here, that just causes you more work).  

If you want a dummy PK of identity, that's OK too, but it's not really necessary (assuming you're careful to never reset the id value).
0
 
LVL 48

Expert Comment

by:PortletPaul
ID: 40315926
why re-invent this?

have you considered the multitude of open-source products available for web analytics?
then add commercial products as options too.
0
 

Author Comment

by:akohan
ID: 40315931
Hello Scott,

So you are saying mixing date and time columns and storing it as DateTime data type.

SELECT MixDateAndTime = Date + Time FROM MyRowInfoTable

Open in new window


Yes, I will stick to that PK for ever to locate other information. My assumption (I hope I am right) is that each request can be considered unique by having its date+time+clientIP (== c-ip). right? or maybe I won't even need to add an extra ID (incremental).

Any comments?
0
Use Case: Protecting a Hybrid Cloud Infrastructure

Microsoft Azure is rapidly becoming the norm in dynamic IT environments. This document describes the challenges that organizations face when protecting data in a hybrid cloud IT environment and presents a use case to demonstrate how Acronis Backup protects all data.

 

Author Comment

by:akohan
ID: 40315933
Hello Paul,

Yes, I proposed that but they rather go with a custom one  :(
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40318213
Again, the incremental id as a PK is ok, but you do NOT want that be the clustering key.  Instead, cluster on the combined date and time (confirm: combine date and time into a single column).  Easiest is to create the clustering key first -- then by default the PK will not be clustered.
0
 

Author Comment

by:akohan
ID: 40318301
OK but what if two request hit a page at the same date and time?
am I missing something here?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40320531
I expect some will hit at the same time.  SQL will resolve the dups, or you can use an identity to prevent dups.  Either way, that clustering will still provide the quickest, most used method of limiting the rows to be read by queries.
0
 

Author Comment

by:akohan
ID: 40331184
Right but still I would need the IP address of clients to locate other information about a specific visitor.  again question is how can I mix up DateTime with an IP value which is varchar?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331199
You don't need, nor want, to "mix them up".

Simply specify multiple key columns:
( date, time, ip )
0
 

Author Comment

by:akohan
ID: 40331245
So you mean making following columns as multiple keys?
Id int
VisitorDate  varchar
visitorTime  varchar

But in your comment above you said" "I'd make the clustering key datetime"
So should I convert them and then set  client-IP (varchar) and  VisitDay (DateTime) as PK?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331252
Date, time and id is fine:
( VisitorDate, visitorTime, Id )

The important thing is to put date and time first, not id.  I'm assuming that date & time are the date and time of access, and thus will automatically be incremental.
0
 

Author Comment

by:akohan
ID: 40331308
Yes, I did  that so put date, time first then ID.  now how can I deal with other tables as far as maintaining a relation?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331311
You still relate by ID; you have no good alternative.  Thus, you will still need an index on ( ID, date, time) as well (typically without other columns, but you might need one or two INCLUDE columns depending on the specific queries you run).  As long as the query on the other table restricts by date and time, you still gain tremendous efficiencies from the proper key structure.
0
 

Author Comment

by:akohan
ID: 40331329
So let me ask you this question. What would be the role of IP or client-IP in this game?

Initially, I was thinking of having IP value in a table (all unique IPs) and then relate it to other tables. From other table I mean again Date,Time, IP, etc as one to many relationship.

Can you tell me what would be wrong with that ?

Thank you
0
 
LVL 69

Accepted Solution

by:
Scott Pletcher earned 500 total points
ID: 40331347
You can still do all that.

My thought was that you would typically work with a set of rows within a date range.  If you do, then clustering the table that way will drastically improve the efficiency of those queries, no matter how you need match IP address or other data columns in the future.
0
 

Author Closing Comment

by:akohan
ID: 40604134
OK thank you!
0

Featured Post

Netscaler Common Configuration How To guides

If you use NetScaler you will want to see these guides. The NetScaler How To Guides show administrators how to get NetScaler up and configured by providing instructions for common scenarios and some not so common ones.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

These days, all we hear about hacktivists took down so and so websites and retrieved thousands of user’s data. One of the techniques to get unauthorized access to database is by performing SQL injection. This article is quite lengthy which gives bas…
In this article we will learn how to fix  “Cannot install SQL Server 2014 Service Pack 2: Unable to install windows installer msi file” error ?
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

821 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question