Solved

Table normalization for a custom Google Analytic alike system

Posted on 2014-09-10
16
75 Views
Last Modified: 2016-02-24
Hello,

I'm asked to create a Google Analytics alike application (of course just a simple one) for our organization so as you may know first thing which crosses my mind was thinking of IIS log file and use LogParser to extract information so this is my plan and I will appreciate it if you give me some heads up as far as the kind of tables I would need to have.

Currently I am think of a set of tablets as following:

0) All the information are stored into a table as RawInfo

This raw table gets following column/info from log file:

Id (incremental), date, time, c-ip, cs-username, s-sitename, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, cs-status,
sc-substatus, sc-win32-status, sc-bytes, cs-bytes, time-taken, cs-version, cs-host, cs-UserAgent, cs-cookie, cs-Referer

1) From RawInfo we split the information in 8 tables base on the need we have.
Tables are:

Visitor table:
id, date, time, c-ip, cs-username, s-sitename, method-id, status, win32status, time-taken, cs-version, cs-host

Method table:
method-id, method-name e.g. GET/SET


Uri-Stem table:
id, date, time, c-ip, cs-uri-stem

Uri-Query table:
id, date, time, c-ip, cs-uri-query

HttpRef table:
status-code, description  
e.g. 200 = OK,  404 = page not found  and etc

UserAgent table:
id, date, time, c-ip,cs-UserAgent

Cookies table:
id, date, time, c-ip,cs-cookes

Referer table:
id, date, time, c-ip, cs-Referer


2) I'm thinking of using a combinational Primary key of  id, date, time, c-ip

May some experts discuss/comment on this approach ?

Thanks.
Ak
0
Comment
Question by:akohan
  • 8
  • 7
16 Comments
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40315883
I'd make the clustering key datetime, as I think that would be much more useful for limiting searches (I wouldn't split date and time here, that just causes you more work).  

If you want a dummy PK of identity, that's OK too, but it's not really necessary (assuming you're careful to never reset the id value).
0
 
LVL 48

Expert Comment

by:PortletPaul
ID: 40315926
why re-invent this?

have you considered the multitude of open-source products available for web analytics?
then add commercial products as options too.
0
 

Author Comment

by:akohan
ID: 40315931
Hello Scott,

So you are saying mixing date and time columns and storing it as DateTime data type.

SELECT MixDateAndTime = Date + Time FROM MyRowInfoTable

Open in new window


Yes, I will stick to that PK for ever to locate other information. My assumption (I hope I am right) is that each request can be considered unique by having its date+time+clientIP (== c-ip). right? or maybe I won't even need to add an extra ID (incremental).

Any comments?
0
 

Author Comment

by:akohan
ID: 40315933
Hello Paul,

Yes, I proposed that but they rather go with a custom one  :(
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40318213
Again, the incremental id as a PK is ok, but you do NOT want that be the clustering key.  Instead, cluster on the combined date and time (confirm: combine date and time into a single column).  Easiest is to create the clustering key first -- then by default the PK will not be clustered.
0
 

Author Comment

by:akohan
ID: 40318301
OK but what if two request hit a page at the same date and time?
am I missing something here?
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40320531
I expect some will hit at the same time.  SQL will resolve the dups, or you can use an identity to prevent dups.  Either way, that clustering will still provide the quickest, most used method of limiting the rows to be read by queries.
0
 

Author Comment

by:akohan
ID: 40331184
Right but still I would need the IP address of clients to locate other information about a specific visitor.  again question is how can I mix up DateTime with an IP value which is varchar?
0
Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40331199
You don't need, nor want, to "mix them up".

Simply specify multiple key columns:
( date, time, ip )
0
 

Author Comment

by:akohan
ID: 40331245
So you mean making following columns as multiple keys?
Id int
VisitorDate  varchar
visitorTime  varchar

But in your comment above you said" "I'd make the clustering key datetime"
So should I convert them and then set  client-IP (varchar) and  VisitDay (DateTime) as PK?
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40331252
Date, time and id is fine:
( VisitorDate, visitorTime, Id )

The important thing is to put date and time first, not id.  I'm assuming that date & time are the date and time of access, and thus will automatically be incremental.
0
 

Author Comment

by:akohan
ID: 40331308
Yes, I did  that so put date, time first then ID.  now how can I deal with other tables as far as maintaining a relation?
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40331311
You still relate by ID; you have no good alternative.  Thus, you will still need an index on ( ID, date, time) as well (typically without other columns, but you might need one or two INCLUDE columns depending on the specific queries you run).  As long as the query on the other table restricts by date and time, you still gain tremendous efficiencies from the proper key structure.
0
 

Author Comment

by:akohan
ID: 40331329
So let me ask you this question. What would be the role of IP or client-IP in this game?

Initially, I was thinking of having IP value in a table (all unique IPs) and then relate it to other tables. From other table I mean again Date,Time, IP, etc as one to many relationship.

Can you tell me what would be wrong with that ?

Thank you
0
 
LVL 69

Accepted Solution

by:
ScottPletcher earned 500 total points
ID: 40331347
You can still do all that.

My thought was that you would typically work with a set of rows within a date range.  If you do, then clustering the table that way will drastically improve the efficiency of those queries, no matter how you need match IP address or other data columns in the future.
0
 

Author Closing Comment

by:akohan
ID: 40604134
OK thank you!
0

Featured Post

Find Ransomware Secrets With All-Source Analysis

Ransomware has become a major concern for organizations; its prevalence has grown due to past successes achieved by threat actors. While each ransomware variant is different, we’ve seen some common tactics and trends used among the authors of the malware.

Join & Write a Comment

In this article we will get to know that how can we recover deleted data if it happens accidently. We really can recover deleted rows if we know the time when data is deleted by using the transaction log.
Ever wondered why sometimes your SQL Server is slow or unresponsive with connections spiking up but by the time you go in, all is well? The following article will show you how to install and configure a SQL job that will send you email alerts includ…
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now