Solved

Table normalization for a custom Google Analytic alike system

Posted on 2014-09-10
16
77 Views
Last Modified: 2016-02-24
Hello,

I'm asked to create a Google Analytics alike application (of course just a simple one) for our organization so as you may know first thing which crosses my mind was thinking of IIS log file and use LogParser to extract information so this is my plan and I will appreciate it if you give me some heads up as far as the kind of tables I would need to have.

Currently I am think of a set of tablets as following:

0) All the information are stored into a table as RawInfo

This raw table gets following column/info from log file:

Id (incremental), date, time, c-ip, cs-username, s-sitename, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, cs-status,
sc-substatus, sc-win32-status, sc-bytes, cs-bytes, time-taken, cs-version, cs-host, cs-UserAgent, cs-cookie, cs-Referer

1) From RawInfo we split the information in 8 tables base on the need we have.
Tables are:

Visitor table:
id, date, time, c-ip, cs-username, s-sitename, method-id, status, win32status, time-taken, cs-version, cs-host

Method table:
method-id, method-name e.g. GET/SET


Uri-Stem table:
id, date, time, c-ip, cs-uri-stem

Uri-Query table:
id, date, time, c-ip, cs-uri-query

HttpRef table:
status-code, description  
e.g. 200 = OK,  404 = page not found  and etc

UserAgent table:
id, date, time, c-ip,cs-UserAgent

Cookies table:
id, date, time, c-ip,cs-cookes

Referer table:
id, date, time, c-ip, cs-Referer


2) I'm thinking of using a combinational Primary key of  id, date, time, c-ip

May some experts discuss/comment on this approach ?

Thanks.
Ak
0
Comment
Question by:akohan
  • 8
  • 7
16 Comments
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40315883
I'd make the clustering key datetime, as I think that would be much more useful for limiting searches (I wouldn't split date and time here, that just causes you more work).  

If you want a dummy PK of identity, that's OK too, but it's not really necessary (assuming you're careful to never reset the id value).
0
 
LVL 48

Expert Comment

by:PortletPaul
ID: 40315926
why re-invent this?

have you considered the multitude of open-source products available for web analytics?
then add commercial products as options too.
0
 

Author Comment

by:akohan
ID: 40315931
Hello Scott,

So you are saying mixing date and time columns and storing it as DateTime data type.

SELECT MixDateAndTime = Date + Time FROM MyRowInfoTable

Open in new window


Yes, I will stick to that PK for ever to locate other information. My assumption (I hope I am right) is that each request can be considered unique by having its date+time+clientIP (== c-ip). right? or maybe I won't even need to add an extra ID (incremental).

Any comments?
0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 

Author Comment

by:akohan
ID: 40315933
Hello Paul,

Yes, I proposed that but they rather go with a custom one  :(
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40318213
Again, the incremental id as a PK is ok, but you do NOT want that be the clustering key.  Instead, cluster on the combined date and time (confirm: combine date and time into a single column).  Easiest is to create the clustering key first -- then by default the PK will not be clustered.
0
 

Author Comment

by:akohan
ID: 40318301
OK but what if two request hit a page at the same date and time?
am I missing something here?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40320531
I expect some will hit at the same time.  SQL will resolve the dups, or you can use an identity to prevent dups.  Either way, that clustering will still provide the quickest, most used method of limiting the rows to be read by queries.
0
 

Author Comment

by:akohan
ID: 40331184
Right but still I would need the IP address of clients to locate other information about a specific visitor.  again question is how can I mix up DateTime with an IP value which is varchar?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331199
You don't need, nor want, to "mix them up".

Simply specify multiple key columns:
( date, time, ip )
0
 

Author Comment

by:akohan
ID: 40331245
So you mean making following columns as multiple keys?
Id int
VisitorDate  varchar
visitorTime  varchar

But in your comment above you said" "I'd make the clustering key datetime"
So should I convert them and then set  client-IP (varchar) and  VisitDay (DateTime) as PK?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331252
Date, time and id is fine:
( VisitorDate, visitorTime, Id )

The important thing is to put date and time first, not id.  I'm assuming that date & time are the date and time of access, and thus will automatically be incremental.
0
 

Author Comment

by:akohan
ID: 40331308
Yes, I did  that so put date, time first then ID.  now how can I deal with other tables as far as maintaining a relation?
0
 
LVL 69

Expert Comment

by:Scott Pletcher
ID: 40331311
You still relate by ID; you have no good alternative.  Thus, you will still need an index on ( ID, date, time) as well (typically without other columns, but you might need one or two INCLUDE columns depending on the specific queries you run).  As long as the query on the other table restricts by date and time, you still gain tremendous efficiencies from the proper key structure.
0
 

Author Comment

by:akohan
ID: 40331329
So let me ask you this question. What would be the role of IP or client-IP in this game?

Initially, I was thinking of having IP value in a table (all unique IPs) and then relate it to other tables. From other table I mean again Date,Time, IP, etc as one to many relationship.

Can you tell me what would be wrong with that ?

Thank you
0
 
LVL 69

Accepted Solution

by:
Scott Pletcher earned 500 total points
ID: 40331347
You can still do all that.

My thought was that you would typically work with a set of rows within a date range.  If you do, then clustering the table that way will drastically improve the efficiency of those queries, no matter how you need match IP address or other data columns in the future.
0
 

Author Closing Comment

by:akohan
ID: 40604134
OK thank you!
0

Featured Post

VMware Disaster Recovery and Data Protection

In this expert guide, you’ll learn about the components of a Modern Data Center. You will use cases for the value-added capabilities of Veeam®, including combining backup and replication for VMware disaster recovery and using replication for data center migration.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Using examples as well as descriptions, and references to Books Online, show the documentation available for date manipulation functions and by using a select few of these functions, show how date based data can be manipulated with these functions.
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed

809 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question