Table normalization for a custom Google Analytics-like system

Posted on 2014-09-10
Last Modified: 2016-02-24

I've been asked to create a Google Analytics-like application (just a simple one, of course) for our organization. As you may guess, the first thing that crossed my mind was taking the IIS log files and using LogParser to extract the information. Below is my plan, and I would appreciate some guidance on the kinds of tables I would need.

Currently I am thinking of a set of tables as follows:

0) All the information is stored in one table, RawInfo.

This raw table gets the following columns from the log file:

Id (incremental), date, time, c-ip, cs-username, s-sitename, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, sc-status,
sc-substatus, sc-win32-status, sc-bytes, cs-bytes, time-taken, cs-version, cs-host, cs-UserAgent, cs-cookie, cs-Referer
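As a sketch, the raw staging table might be declared like this (table and column names are illustrative; the hyphenated IIS field names are mapped to underscore names, since hyphens are not valid in unquoted SQL Server identifiers):

```sql
-- Hypothetical sketch of the RawInfo staging table.
CREATE TABLE RawInfo (
    Id              int IDENTITY(1,1) NOT NULL,
    [date]          date          NOT NULL,
    [time]          time          NOT NULL,
    c_ip            varchar(45)   NOT NULL,  -- 45 chars also covers IPv6
    cs_username     varchar(255)  NULL,
    s_sitename      varchar(255)  NULL,
    s_ip            varchar(45)   NULL,
    s_port          int           NULL,
    cs_method       varchar(10)   NULL,
    cs_uri_stem     varchar(2048) NULL,
    cs_uri_query    varchar(2048) NULL,
    sc_status       int           NULL,
    sc_substatus    int           NULL,
    sc_win32_status int           NULL,
    sc_bytes        bigint        NULL,
    cs_bytes        bigint        NULL,
    time_taken      int           NULL,
    cs_version      varchar(20)   NULL,
    cs_host         varchar(255)  NULL,
    cs_UserAgent    varchar(2048) NULL,
    cs_Cookie       varchar(2048) NULL,
    cs_Referer      varchar(2048) NULL
);
```

The column widths are guesses; URI, user-agent, cookie, and referer fields can be long, so generous varchar sizes are safer for a staging table.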

1) From RawInfo we split the information into 8 tables based on our needs.
Tables are:

Visitor table:
id, date, time, c-ip, cs-username, s-sitename, method-id, status, win32status, time-taken, cs-version, cs-host

Method table:
method-id, method-name (e.g. GET, POST)

Uri-Stem table:
id, date, time, c-ip, cs-uri-stem

Uri-Query table:
id, date, time, c-ip, cs-uri-query

HttpRef table:
status-code, description
e.g. 200 = OK, 404 = Not Found, etc.

UserAgent table:
id, date, time, c-ip, cs-UserAgent

Cookies table:
id, date, time, c-ip, cs-cookie

Referer table:
id, date, time, c-ip, cs-Referer
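The two lookup tables (Method and HttpRef) can be small reference tables seeded once. A sketch, with illustrative names:

```sql
-- Sketch of the two small lookup tables (names illustrative).
CREATE TABLE Method (
    method_id   tinyint     NOT NULL PRIMARY KEY,
    method_name varchar(10) NOT NULL          -- e.g. 'GET', 'POST'
);

CREATE TABLE HttpRef (
    status_code int          NOT NULL PRIMARY KEY,
    description varchar(100) NOT NULL
);

INSERT INTO Method (method_id, method_name)
VALUES (1, 'GET'), (2, 'POST');

INSERT INTO HttpRef (status_code, description)
VALUES (200, 'OK'), (404, 'Not Found'), (500, 'Internal Server Error');
```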

2) I'm thinking of using a composite primary key of (id, date, time, c-ip).

Could some experts discuss/comment on this approach?

Question by:akohan
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40315883
I'd make the clustering key datetime, as I think that would be much more useful for limiting searches (I wouldn't split date and time here, that just causes you more work).  

If you want a dummy PK of identity, that's OK too, but it's not really necessary (assuming you're careful to never reset the id value).
LVL 49

Expert Comment

ID: 40315926
Why re-invent this?

Have you considered the multitude of open-source products available for web analytics?
Then add commercial products as options too.

Author Comment

ID: 40315931
Hello Scott,

So you are saying to merge the date and time columns and store them as a DateTime data type?

-- date and time columns must be cast to datetime before they can be added
SELECT MixDateAndTime = CAST([Date] AS datetime) + CAST([Time] AS datetime) FROM MyRawInfoTable


Yes, I will stick with that PK forever to locate other information. My assumption (I hope I am right) is that each request can be considered unique by its date + time + client IP (== c-ip). Right? Or maybe I won't even need to add an extra incremental ID.
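One way to sanity-check that uniqueness assumption before committing to the key is to look for collisions in the raw data (a sketch against the RawInfo table described above; column names are illustrative):

```sql
-- Count requests that share the same date, time, and client IP.
-- Any rows returned mean date+time+IP alone is NOT unique.
SELECT [date], [time], c_ip, COUNT(*) AS hits
FROM RawInfo
GROUP BY [date], [time], c_ip
HAVING COUNT(*) > 1;
```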

Any comments?


Author Comment

ID: 40315933
Hello Paul,

Yes, I proposed that, but they would rather go with a custom one  :(
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40318213
Again, the incremental id as a PK is OK, but you do NOT want that to be the clustering key.  Instead, cluster on the combined date and time (that is, combine date and time into a single column).  Easiest is to create the clustering key first -- then by default the PK will not be clustered.
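A sketch of that arrangement, assuming the date and time columns have been merged into a single visit_dt datetime column (table and index names are illustrative):

```sql
-- Sketch: cluster on the datetime, keep the identity PK nonclustered.
CREATE TABLE Visitor (
    Id       int IDENTITY(1,1) NOT NULL,
    visit_dt datetime    NOT NULL,
    c_ip     varchar(45) NOT NULL,
    -- ... other visit columns ...
    CONSTRAINT PK_Visitor PRIMARY KEY NONCLUSTERED (Id)
);

-- A non-unique clustered index is fine here; SQL Server adds a
-- hidden "uniquifier" to rows with duplicate visit_dt values.
CREATE CLUSTERED INDEX IX_Visitor_visit_dt ON Visitor (visit_dt);
```

Creating the clustered index explicitly (or declaring the PK as NONCLUSTERED, as above) keeps the identity PK from grabbing the clustered slot by default.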

Author Comment

ID: 40318301
OK, but what if two requests hit a page at the same date and time?
Am I missing something here?
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40320531
I expect some will hit at the same time.  SQL will resolve the dups, or you can use an identity to prevent dups.  Either way, that clustering will still provide the quickest, most used method of limiting the rows to be read by queries.

Author Comment

ID: 40331184
Right, but I would still need the client's IP address to locate other information about a specific visitor.  Again, the question is: how can I combine a DateTime with an IP value, which is varchar?
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331199
You don't need, nor want, to "mix them up".

Simply specify multiple key columns:
( date, time, ip )
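If the date and time columns are kept separate, the composite key Scott describes could be declared like this (a sketch; the Visitor table and index name are illustrative):

```sql
-- Sketch: a composite clustered index over date, time, and client IP.
-- No type "mixing" is needed; each column keeps its own data type.
CREATE CLUSTERED INDEX IX_Visitor_date_time_ip
    ON Visitor ([date], [time], c_ip);
```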

Author Comment

ID: 40331245
So you mean making the following columns a composite key?
Id int
VisitorDate  varchar
visitorTime  varchar

But in your comment above you said: "I'd make the clustering key datetime."
So should I convert them, and then set client-IP (varchar) and VisitDay (DateTime) as the PK?
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331252
Date, time and id is fine:
( VisitorDate, visitorTime, Id )

The important thing is to put date and time first, not id.  I'm assuming that date & time are the date and time of access, and thus will automatically be incremental.

Author Comment

ID: 40331308
Yes, I did that, so date and time come first, then ID.  Now how do I deal with the other tables as far as maintaining relationships?
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331311
You still relate by ID; you have no good alternative.  Thus, you will still need an index on ( ID, date, time) as well (typically without other columns, but you might need one or two INCLUDE columns depending on the specific queries you run).  As long as the query on the other table restricts by date and time, you still gain tremendous efficiencies from the proper key structure.
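A sketch of such a supporting index on one of the detail tables (the UriStem table name, column names, and the INCLUDE column are illustrative):

```sql
-- Sketch: index for relating detail rows back by Id, keyed so that
-- date/time predicates can still narrow the seek.  An INCLUDE column
-- can cover a specific query without widening the key.
CREATE NONCLUSTERED INDEX IX_UriStem_Id_date_time
    ON UriStem (Id, [date], [time])
    INCLUDE (cs_uri_stem);
```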

Author Comment

ID: 40331329
So let me ask you this question. What would be the role of IP or client-IP in this game?

Initially, I was thinking of having the IP values in a table (all unique IPs) and then relating it to the other tables. By the other tables I mean again date, time, IP, etc., as a one-to-many relationship.

Can you tell me what would be wrong with that?

Thank you
LVL 69

Accepted Solution

Scott Pletcher earned 500 total points
ID: 40331347
You can still do all that.

My thought was that you would typically work with a set of rows within a date range.  If you do, then clustering the table that way will drastically improve the efficiency of those queries, no matter how you need to match IP addresses or other data columns in the future.
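The two ideas can be combined: a unique-IP lookup table for the one-to-many relationship, plus the date-range clustering for query efficiency. A sketch, assuming the visit table carries an ip_id foreign key and a combined visit_dt datetime column (all names illustrative):

```sql
-- Sketch of the asker's idea: one row per unique client IP,
-- referenced by the visit table (one-to-many).
CREATE TABLE ClientIp (
    ip_id int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    c_ip  varchar(45) NOT NULL UNIQUE
);

-- Example query: visits for one IP within a date range.  The
-- date-range predicate lets the clustered index limit the scan.
SELECT v.*
FROM Visitor AS v
JOIN ClientIp AS i ON i.ip_id = v.ip_id
WHERE i.c_ip = '203.0.113.7'
  AND v.visit_dt >= '2014-09-01'
  AND v.visit_dt <  '2014-09-08';
```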

Author Closing Comment

ID: 40604134
OK thank you!
