uluttrell

Calculate Statistics From Log file

This is not a homework assignment.  

I have log files for multiple servers that are kept in a repository.  The log file's filename format is application.servername.timestampinYYYYMMDD.log.
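
A minimal sketch of how the filename layout described above could be split into its parts. The sub name parse_log_name is hypothetical, and the pattern assumes the application name contains no dots while the server name may (for example a fully qualified host name).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split "application.servername.YYYYMMDD.log" into (application, server, date).
# Hypothetical helper: returns an empty list when the name does not fit the layout.
sub parse_log_name {
    my ($name) = @_;
    # Assumption: the application part has no dots; the server part may have some.
    return $name =~ /^([^.]+)\.(.+)\.(\d{8})\.log$/;
}

my ($app, $host, $date) = parse_log_name('application.servername.20030817.log');
print "$app $host $date\n";
```

Grouping files by the server part of the name is one way to get the per-server totals without relying on the host field inside each line.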

Each of the log files in the directory has the following format:

YYYYMMDD Time-TimeZone    Servername Application PID;  MeasuredStatistic          (XX/YY)  Value for TimeStamp
20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217
20030817 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276
20030817 000216110-0400 servername application 4567;MeasuredStatisticThree(12/31) 57912
20030817 000216110-0400 servername application 4567;MeasuredStatisticFour(12/12) 72
20030817 000216110-0400 servername application 4567;MeasuredStatisticFive(12/13) 1451718

The log files roll over at midnight.  Each statistic is measured at a random interval.

I am trying to write a perl script that will do the following:
For each server:
* determine the number of times that a given statistic appears on a day for a server.  Use that number to calculate the average for the MeasuredStatistic for the day.
* sum the days to determine the totals and averages for the week for a server.
* determine the average for all statistics for all servers and export to a csv file to be used in an Excel spreadsheet.
* determine totals for all servers and export to a csv file to be used in an Excel spreadsheet.

I have written the following code, but it is not producing the desired results.
=====Begin code.pl
#! /usr/bin/perl

%module_count = ();
%module_sum = ();

while (<>) {
       chomp;
       next if (/^\s*$/);

       my ($date, $time, $host, $server, $pid, $metric, $value) = split(/\s/);

       $module_count{$module}++;
       $module_sum{$module} += $percent;
}

foreach $module (sort keys %module_count) {
       printf "%s %dx average is %d%%\n",
               $module,
               $module_count{$module},
               $module_sum{$module} / $module_count{$module};
=====End code.pl

How would I script this properly in perl?
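
For reference, the script above fails mainly because it counts with $module and $percent, which are never assigned, and because the foreach block is never closed. The following is only a sketch of one way to count and average each statistic, not the accepted solution. The sub names parse_line and summarize are hypothetical, and dropping the "(XX/YY)" suffix from the statistic name is an assumption, since the thread does not say what that suffix means.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one line of the form
#   YYYYMMDD time-zone host application pid;Metric(XX/YY) value
# Returns (date, host, metric, value), or an empty list for lines that do not fit.
sub parse_line {
    my ($line) = @_;
    my ($date, $time, $host, $app, $tag, $value) = split ' ', $line;
    return unless defined $value;
    my ($pid, $metric) = split /;/, $tag, 2;
    return unless defined $metric;
    $metric =~ s/\(.*\)\s*$//;    # assumption: "(XX/YY)" is not part of the name
    return ($date, $host, $metric, $value);
}

# Count and sum every metric across the given lines.
sub summarize {
    my (%count, %sum);
    for my $line (@_) {
        my ($date, $host, $metric, $value) = parse_line($line) or next;
        $count{$metric}++;
        $sum{$metric} += $value;
    }
    return (\%count, \%sum);
}

# Two sample lines copied from the question; in real use pass the lines read from <>.
my @sample = (
    '20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217',
    '20030817 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276',
);
my ($count, $sum) = summarize(@sample);
for my $metric (sort keys %$count) {
    printf "%s: %d sample(s), average %.2f\n",
        $metric, $count->{$metric}, $sum->{$metric} / $count->{$metric};
}
```

Keying the hashes on host and date as well as on the statistic name would give the per-server, per-day breakdown listed in the question.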

ASKER CERTIFIED SOLUTION
inq123

This solution is only available to members.
uluttrell

ASKER

Hi inq123,
When I attempt to run the code, I get the following errors for the following lines:
Global symbol "$data" requires explicit package name at
      $server_count{$server}->{weekstat}->{int($data-$sunday/7)}++; # records server's total for each week
Global symbol "$data" requires explicit package name at
      $all_count{weekstat}->{int($data-$sunday/7)}++; # same as above
Just came back from a vacation.

Yeah, it should be $date not $data.  Sorry for the typo.  use strict catches typos like this and makes debugging easier, so do put use strict in your programs.  It'll help a lot.
Thank you both for the comment and pointers.  I will follow up with how they work.
Hi, uluttrell, I just realized my code has a bug.  The week calculation should be done like this:

use Time::Local;
$sunday = timelocal(0,0,0,2,7,1998); # 1998/8/2 (a Sunday), 0 am; month is 0-based
$week = 7 * 24 * 3600;
my ($year, $month, $day) = $date =~ /(\d{4})(\d{2})(\d{2})/;
$now = timelocal(0,0,0,$day,$month-1,$year); # day of month is 1-based, so no $day-1
$server_count{$server}->{weekstat}->{int(($now-$sunday)/$week)}++;

The previous method looked cute but it's not working correctly.
Hi, inq123, Thanks for the correction.  Would you please post the final version?
This time I made sure everything works perfectly.  Now use this log file, save as test.log:

20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217
20030815 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276
20030817 000216110-0400 servername1 application 4567;MeasuredStatisticThree(12/31) 57912
20030816 000216110-0400 servername1 application 4567;MeasuredStatisticFour(12/12) 72
20030817 000216110-0400 servername application 4567;MeasuredStatisticFive(12/13) 1451718

Then save this script as test.pl:

#! /usr/bin/perl
use strict;
use warnings;
use Time::Local;

my $sunday = timelocal(0,0,0,3,7,2003); # 2003/8/3 (a Sunday), 0 am; month is 0-based
my $daylength = 24 * 3600;
my $week = 7 * $daylength;

my (%server_count, %all_count);

while (<>) {
     next if (/^\s*$/);
     chomp;

     # fields: date, time, host, application, "pid;metric(XX/YY)", value
     my ($date, $time, $host, $server, $tmp, $value) = split(' ');
     my ($pid, $metric) = split(/;/, $tmp);

     $server_count{$host}->{daystat}->{$date}++; # records each server's number of times per day
     my ($year, $month, $day) = $date =~ /(\d{4})(\d{2})(\d{2})/;
     my $now = timelocal(0,0,0,$day,$month-1,$year); # day of month is 1-based; $day-1 would fail on the 1st
     my $weekno = int(($now-$sunday)/$week); # whole weeks since the reference Sunday
     $server_count{$host}->{weekstat}->{$weekno}++; # records server's total for each week, convenient
     $all_count{daystat}->{$date}++; # records all servers' per day stat. not really needed, but convenient as we don't have to add servers up
     $all_count{weekstat}->{$weekno}++; # same as above
}

foreach my $host (sort keys %server_count)
{
  foreach my $weekno (sort { $a <=> $b } keys %{$server_count{$host}->{weekstat}})
  {
    my $start = localtime($sunday + $weekno * $week);
    my $end = localtime($sunday + ($weekno+1) * $week);
    print "for server $host, weekly count for $start - $end is $server_count{$host}->{weekstat}->{$weekno}\n";
  }
}

Now finally launch the script with perl test.pl < test.log, and you'll see everything works.
This log file would be an even better test: my old method would have worked on the log file above, but not on this one, because the month changes in it:

20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217
20030815 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276
20030817 000216110-0400 servername1 application 4567;MeasuredStatisticThree(12/31) 57912
20030816 000216110-0400 servername1 application 4567;MeasuredStatisticFour(12/12) 72
20030917 000216110-0400 servername application 4567;MeasuredStatisticFive(12/13) 1451718
Thanks so much inq123.  It works great and I can tweak it further.  I appreciate all of your help :)
Hi inq123,

I rethought the problem and decided that I could cat all the stat files for a week into a single file.  
Now, instead of the average for the servers, I would like the average for each of the measuredStatistics.  How would I do this in perl?

I will assign more points for this because this is a variation on my original submittal.
I could certainly write the program, but please explain the measuredStatistics, or give me an equation for calculating the average, as I do not quite understand the format and meaning of each number.  I also do not know what you want to average the measuredStats against: each day, week, or metric?  Please give me some specifics.
The measuredStatistic is a general term.  I apologize for being vague.  

The application on each server measures 8 distinct statistics.  These statistics are written to the server's log file at random intervals.  Each time a statistic appears in the file, its value is the total for the day up to that time, i.e. if the line reads
20030817 000216010-0400 servername application 4567;AccumulatedConnections(12/18) 5217
the total AccumulatedConnections for the day ending at 000216010 is 5217.

With the files rolling over at midnight, the latest time stamp for any one of the 8 statistics shows the total for that particular statistic for that day.  
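
One possible reading of the description above, sketched in Perl: for every server, day, and statistic, keep the value with the latest time field as that day's total, then average those daily totals per statistic and print them as CSV. This interpretation is an assumption, as are the sub names daily_totals and metric_averages. The sketch also assumes the zone offset does not change within a day, so the fixed-width time fields can be compared as strings. Apart from the first sample line, the input lines are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep, for every host/date/metric, the value carrying the latest time field.
sub daily_totals {
    my %latest;
    for my $line (@_) {
        my ($date, $time, $host, $app, $tag, $value) = split ' ', $line;
        next unless defined $value;
        my ($metric) = $tag =~ /;([^(]+)/ or next;   # name between ";" and "("
        my $prev = $latest{$host}{$date}{$metric};
        if (!$prev || $time ge $prev->{time}) {
            $latest{$host}{$date}{$metric} = { time => $time, value => $value };
        }
    }
    return \%latest;
}

# Average the daily totals per metric over all hosts and days.
sub metric_averages {
    my ($latest) = @_;
    my (%sum, %n, %avg);
    for my $host (keys %$latest) {
        for my $date (keys %{ $latest->{$host} }) {
            my $day = $latest->{$host}{$date};
            for my $metric (keys %$day) {
                $sum{$metric} += $day->{$metric}{value};
                $n{$metric}++;
            }
        }
    }
    $avg{$_} = $sum{$_} / $n{$_} for keys %sum;
    return \%avg;
}

my @sample = (
    '20030817 000216010-0400 servername application 4567;AccumulatedConnections(12/18) 5217', # from the thread
    '20030817 235901000-0400 servername application 4567;AccumulatedConnections(12/18) 9000', # hypothetical
    '20030818 235800000-0400 servername application 4567;AccumulatedConnections(12/18) 7000', # hypothetical
);
my $avg = metric_averages(daily_totals(@sample));
print "statistic,average\n";
printf "%s,%.2f\n", $_, $avg->{$_} for sort keys %$avg;
```

Redirecting the output to a file gives a CSV that Excel can open. If the statistic names could ever contain commas or quotes, a CSV module would be safer than plain printf.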

Does this help any?
That still does not explain what the average for measuredStatistic is.  I cannot fully understand how such a log would work.  I mean, if the stats were written at random intervals, and the files were changed (files rolled over? what does that mean?) at midnight, then how can you guarantee the last stat shows the total for that day? (Why did you say the last timestamp shows the total for that day? What does the timestamp have to do with any stat?)

I think I'm almost more confused now than before.  But if you simply give me an equation for calculating the average using the example log file you gave out, then I can write out the script for you.
Rolled over means that a new log file is created at midnight.  The new log file is server.log.  This file is a symlink to the current day's log file.

I'll keep trying some suggestions.

I did not mean to confuse you more.
I understand.  I just meant that I'm still at a loss as to how I should calculate the average.  Let me know if I can help you.