Calculate Statistics From Log File

uluttrell
This is not a homework assignment.  

I have log files for multiple servers that are kept in a repository.  The log file's filename format is application.servername.timestampinYYYYMMDD.log.

Each of the log files in the directory has the following format:

YYYYMMDD Time-TimeZone    Servername Application PID;  MeasuredStatistic          (XX/YY)  Value for TimeStamp
20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217
20030817 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276
20030817 000216110-0400 servername application 4567;MeasuredStatisticThree(12/31) 57912
20030817 000216110-0400 servername application 4567;MeasuredStatisticFour(12/12) 72
20030817 000216110-0400 servername application 4567;MeasuredStatisticFive(12/13) 1451718

The log files roll over at midnight.  Each statistic is measured at a random interval.
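
For reference, the line format shown above could be split into named fields with a single regex. This is only a sketch: the hash keys and the parse_line name are made up for illustration, and the meaning of the (XX/YY) pair is not stated here, so it is simply captured as two numbers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one log line into a hash of fields.
# Field names are illustrative; they do not come from the logging application.
sub parse_line {
    my ($line) = @_;
    my @fields = $line =~ m{
        ^ (\d{8}) \s+             # date as YYYYMMDD
        (\d+) ([+-]\d{4}) \s+     # time digits, then time zone offset
        (\S+) \s+                 # server name
        (\S+) \s+                 # application
        (\d+) ; \s*               # PID
        ([^\s(]+) \s*             # statistic name
        \( (\d+) / (\d+) \) \s+   # the (XX/YY) pair
        (\S+) \s* $               # value
    }x;
    return unless @fields;        # line did not match the expected layout

    my %rec;
    @rec{qw(date time tz host app pid stat xx yy value)} = @fields;
    return \%rec;
}

my $rec = parse_line(
    '20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217'
);
print "$rec->{host} $rec->{stat} = $rec->{value}\n" if $rec;
```

Lines that do not match return nothing, so a caller can skip them with "next unless $rec".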

I am trying to write a perl script that will do the following:
For each server:
* determine the number of times that a given statistic appears on a day for a server.  Use that number to calculate the average for the MeasuredStatistic for the day.
* sum the days to determine the totals and averages for the week for a server.
* determine the average for all statistics for all servers and export to a csv file to be used in an Excel spreadsheet.
* determine totals for all servers and export to a csv file to be used in an Excel spreadsheet.

I have written the following code, but it is not producing the desired results.
=====Begin code.pl
#! /usr/bin/perl

%module_count = ();
%module_sum = ();

while (<>) {
       chomp;
       next if (/^\s*$/);

       my ($date, $time, $host, $server, $pid, $metric, $value) = split(/\s/);

       $module_count{$module}++;
       $module_sum{$module} += $percent;
}

foreach $module (sort keys %module_count) {
       printf "%s %dx average is %d%%\n",
               $module,
               $module_count{$module},
               $module_sum{$module} / $module_count{$module};
=====End code.pl

How would I script this properly in perl?
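
A minimal sketch of what the script above seems to be reaching for: it increments $module_count{$module} and adds $percent, but neither variable is ever assigned, so nothing useful accumulates. The version below keys on host, date and statistic name and sums $value instead. Assumptions: the "(XX/YY)" suffix is stripped so that readings group by statistic name, the tally sub and the %stats layout are illustrative, and the sample values are invented for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Accumulate count and sum per host, date and statistic.
sub tally {
    my (@lines) = @_;
    my %stats;    # $stats{$host}{$date}{$metric} = { count => N, sum => S }
    for my $line (@lines) {
        next if $line =~ /^\s*$/;
        my ($date, $time, $host, $app, $tmp, $value) = split ' ', $line;
        next unless defined $value;
        my ($pid, $metric) = split /;/, $tmp, 2;
        next unless defined $metric;
        $metric =~ s/\(.*\)$//;    # assumption: drop "(XX/YY)" to group by name
        $stats{$host}{$date}{$metric}{count}++;
        $stats{$host}{$date}{$metric}{sum} += $value;
    }
    return \%stats;
}

# Invented sample lines standing in for real log input.
my @sample = (
    '20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 10',
    '20030817 010216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 30',
    '20030817 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 7',
);

# Print one CSV row per host, date and statistic.
my $stats = tally(@sample);
print "host,date,statistic,count,average\n";
for my $host (sort keys %$stats) {
    for my $date (sort keys %{ $stats->{$host} }) {
        for my $metric (sort keys %{ $stats->{$host}{$date} }) {
            my $s = $stats->{$host}{$date}{$metric};
            printf "%s,%s,%s,%d,%.2f\n",
                $host, $date, $metric, $s->{count}, $s->{sum} / $s->{count};
        }
    }
}
```

To run it against real files, tally(<>) in place of tally(@sample) would read every line from the files named on the command line.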

Commented:
I've included code below that should give you a very good idea of how to accomplish what you want.  I didn't do everything you asked for, mostly because I do not completely understand how some stats should be calculated (for example, the average for a server: is that the per-day average over a week, or the per-day average over all time?  It's ambiguous).  However, the code below should give you enough detail to compute the stats yourself.  I also commented out some code I felt was not needed, and added "use strict", as it's always a good idea to have it.

#! /usr/bin/perl
use strict;
my (%module_count, %module_sum, %server_count, %all_count);
my $sunday = 20030803; # used as a convenience, should be earlier than all date in file

while (<>) {
      next if (/^\s*$/);
      chomp;

      my ($date, $time, $host, $server, $tmp, $value) = split(/\s/);
      my ($pid, $metric) = split(/;/, $tmp);

      $server_count{$server}->{daystat}->{$date}++; # records each server's number of times per day
      $server_count{$server}->{weekstat}->{int($data-$sunday/7)}++; # records server's total for each week
      $all_count{daystat}->{$date}++; # records all servers' per day stat. not really needed, but convenient as we don't have to add servers up
      $all_count{weekstat}->{int($data-$sunday/7)}++; # same as above
#       $module_count{$module}++;
#       $module_sum{$module} += $percent;
}

foreach my $server (keys %server_count)
{
  foreach my $week (keys %{$server_count{$server}})
  {
    my $start = $sunday + 7 * ($week);
    print "for server $server, weekly count for $start - " . $start+7 . " is $server_count{$server}->{weekstat}->{$week}\n";
  }
}

# foreach $module (sort keys %module_count) {
#       printf "%s %dx average is %d%%\n",
#               $module,
#               $module_count{$module},
#               $module_sum{$module} / $module_count{$module};

Author

Commented:
Hi inq123,
When I attempt to run the code, I get the following errors for the following lines:
Global symbol "$data" requires explicit package name at
      $server_count{$server}->{weekstat}->{int($data-$sunday/7)}++; # records server's total for each week
Global symbol "$data" requires explicit package name at
      $all_count{weekstat}->{int($data-$sunday/7)}++; # same as above
Commented:
As far as I can tell, it should be $date. If that's the case, just change it and retry, and appreciate use strict :P

Commented:
Just came back from a vacation.

Yeah, it should be $date, not $data.  Sorry for the typo.  "use strict" catches typos like this and makes debugging easier, so do put use strict in your programs.  It'll help a lot.
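
A small sketch of how use strict catches this kind of typo. It compiles a throwaway snippet at run time with a string eval so that the compile error can be captured and printed; the snippet and the variable names in it are only illustrative. The exact wording of the message varies between Perl versions, so no output is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A throwaway snippet that repeats the $date/$data mix-up.
my $snippet = <<'END_SNIPPET';
use strict;
my $date = 20030817;
my $next = $data + 1;   # typo: $data was never declared
1;
END_SNIPPET

# String eval compiles the snippet; under strict the undeclared
# variable is a compile-time error, reported in $@.
my $ok    = eval $snippet;
my $error = $@;

print $ok ? "compiled fine\n" : "strict refused to compile it:\n$error";
```

Without use strict the snippet would compile, and $data would silently be treated as an empty global.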

Author

Commented:
Thank you both for the comment and pointers.  I will follow up with how they work.

Commented:
Hi, uluttrell, I just realized my code has a bug.  The week number should be computed from real timestamps instead of by subtracting YYYYMMDD values:

use Time::Local;
$sunday = timegm(0,0,0,2,7,1998); # 1998/8/2 (a Sunday), 0 am UTC; day is 1-based, month is 0-based
my $week = 7 * 24 * 3600;
my ($year, $month, $day) = $date =~ /(\d{4})(\d{2})(\d{2})/;
my $now = timegm(0,0,0,$day,$month-1,$year); # UTC avoids daylight saving shifts
$server_count{$server}->{weekstat}->{int(($now-$sunday)/$week)}++;

The previous method looked cute but it's not working correctly.
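
To illustrate why plain subtraction of YYYYMMDD values breaks: the two dates below are one day apart, yet their integer difference is 70. Converting each date to epoch seconds first gives the real gap. This is a sketch; the epoch_of helper is a made-up name, and it uses timegm from Time::Local so that daylight saving changes cannot skew the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert YYYYMMDD to epoch seconds at 00:00 UTC.
sub epoch_of {
    my ($date) = @_;
    my ($y, $m, $d) = $date =~ /^(\d{4})(\d{2})(\d{2})$/
        or die "bad date: $date";
    return timegm(0, 0, 0, $d, $m - 1, $y);   # month is 0-based, day is 1-based
}

my ($first, $second) = (20030831, 20030901);

my $naive = $second - $first;                               # dates treated as integers
my $real  = (epoch_of($second) - epoch_of($first)) / 86400; # true number of days apart

print "naive difference: $naive\n";   # prints 70
print "real difference:  $real\n";    # prints 1
```

Dividing the epoch difference by 7 * 86400 and truncating with int then gives a week number that stays correct across month and year boundaries.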

Author

Commented:
Hi, inq123, Thanks for the correction.  Would you please post the final version?

Commented:
This time I made sure everything works perfectly.  Save this log file as test.log:

20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217
20030815 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276
20030817 000216110-0400 servername1 application 4567;MeasuredStatisticThree(12/31) 57912
20030816 000216110-0400 servername1 application 4567;MeasuredStatisticFour(12/12) 72
20030817 000216110-0400 servername application 4567;MeasuredStatisticFive(12/13) 1451718

Then save this script as test.pl:

#! /usr/bin/perl
use strict;
use warnings;
use Time::Local;

# Reference Sunday; must be earlier than every date in the log.
# timegm takes a 1-based day and a 0-based month; UTC avoids daylight saving shifts.
my $sunday = timegm(0,0,0,3,7,2003); # 2003/8/3, 0 am UTC
my $daylength = 24 * 3600;
my $week = 7 * $daylength;

my (%server_count, %all_count);

while (<>) {
     next if (/^\s*$/);
     chomp;

     my ($date, $time, $host, $server, $tmp, $value) = split(/\s+/);
     my ($pid, $metric) = split(/;/, $tmp);

     $server_count{$host}->{daystat}->{$date}++; # records each server's number of times per day
     my ($year, $month, $day) = $date =~ /(\d{4})(\d{2})(\d{2})/;
     my $now = timegm(0,0,0,$day,$month-1,$year);
     my $weekno = int(($now-$sunday)/$week); # whole weeks since the reference Sunday
     $server_count{$host}->{weekstat}->{$weekno}++; # records server's total for each week, convenient
     $all_count{daystat}->{$date}++; # records all servers' per day stat. not really needed, but convenient as we don't have to add servers up
     $all_count{weekstat}->{$weekno}++; # same as above
}

foreach my $host (sort keys %server_count)
{
  foreach my $weekno (sort { $a <=> $b } keys %{$server_count{$host}->{weekstat}})
  {
    my $start = gmtime($sunday + $weekno * $week);
    my $end   = gmtime($sunday + ($weekno+1) * $week);
    print "for server $host, weekly count for $start - $end is $server_count{$host}->{weekstat}->{$weekno}\n";
  }
}

Finally, launch the script with perl test.pl < test.log and you'll see that everything works.

Commented:
This log file would be an even better test: my old method would have worked on the log file above, but not on this one, because the month changes in it:

20030817 000216010-0400 servername application 4567;MeasuredStatisticOne(12/18) 5217
20030815 000216110-0400 servername application 4567;MeasuredStatisticTwo(12/14) 419276
20030817 000216110-0400 servername1 application 4567;MeasuredStatisticThree(12/31) 57912
20030816 000216110-0400 servername1 application 4567;MeasuredStatisticFour(12/12) 72
20030917 000216110-0400 servername application 4567;MeasuredStatisticFive(12/13) 1451718

Author

Commented:
Thanks so much inq123.  It works great and I can tweak it further.  I appreciate all of your help :)

Author

Commented:
Hi inq123,

I rethought the problem and decided that I could cat all the stat files for a week into a single file.
Now, instead of the average for the servers, I would like the average for each of the measuredStatistics.  How would I do this in perl?

I will assign more points for this because this is a variation on my original submittal.

Commented:
I could certainly write the program, but please explain the measuredStatistics, or give me an equation for calculating the average, as I do not quite understand the format and meaning of each number.  I also do not know what you want to average the measuredStats over: each day, week, or metric?  Please give me some specifics.

Author

Commented:
The measuredStatistic is a general term.  I apologize for being vague.  

The application on each server measures 8 distinct statistics.  These statistics are written to the server's log file at random intervals.  Each time a statistic appears in the file, its value is the total for the day up to that time, i.e. if the line reads
20030817 000216010-0400 servername application 4567;AccumulatedConnections(12/18) 5217
the total AccumulatedConnections for the day ending at 000216010 is 5217.

With the files rolling over at midnight, the latest time stamp for any one of the 8 statistics shows the total for that particular statistic for that day.  

Does this help any?
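
One possible reading of the description above, sketched in Perl: since each reading is a running total for the day, the reading with the latest time stamp per host, date and statistic is that day's total, and a weekly average per statistic is the mean of those daily totals. This interpretation is an assumption, as are the sub names and the stripping of the "(XX/YY)" suffix. Only the first sample line comes from this thread; the other two are invented for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep, per host/statistic/date, the value carrying the latest time stamp.
sub daily_totals {
    my (@lines) = @_;
    my %last;   # $last{$host}{$stat}{$date} = { time => T, value => V }
    for my $line (@lines) {
        my ($date, $time, $host, $app, $tmp, $value) = split ' ', $line;
        next unless defined $value;
        my (undef, $stat) = split /;/, $tmp, 2;
        next unless defined $stat;
        $stat =~ s/\(.*\)$//;                    # assumption: group by name only
        (my $clock = $time) =~ s/[+-]\d{4}$//;   # drop the time zone offset
        my $slot = $last{$host}{$stat}{$date} ||= { time => '', value => 0 };
        # Fixed-width digit strings compare correctly as strings.
        @$slot{qw(time value)} = ($clock, $value) if $clock ge $slot->{time};
    }
    return \%last;
}

# Average the daily totals per host and statistic.
sub average_of_daily_totals {
    my ($last) = @_;
    my %avg;
    for my $host (keys %$last) {
        for my $stat (keys %{ $last->{$host} }) {
            my @totals = map { $_->{value} } values %{ $last->{$host}{$stat} };
            my $sum = 0;
            $sum += $_ for @totals;
            $avg{$host}{$stat} = $sum / @totals;
        }
    }
    return \%avg;
}

my @sample = (
    '20030817 000216010-0400 servername application 4567;AccumulatedConnections(12/18) 5217',
    '20030817 235900000-0400 servername application 4567;AccumulatedConnections(12/18) 6000', # invented
    '20030818 120000000-0400 servername application 4567;AccumulatedConnections(12/18) 4000', # invented
);

my $avg = average_of_daily_totals(daily_totals(@sample));
for my $host (sort keys %$avg) {
    printf "%s,%s,%.2f\n", $host, $_, $avg->{$host}{$_}
        for sort keys %{ $avg->{$host} };
}
```

With these sample lines the daily totals are 6000 and 4000, so the printed average is 5000.00. The time comparison assumes all lines in a file share the same time zone offset.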

Commented:
That still does not explain what the average for a measuredStatistic is.  I cannot fully understand how such a log would work.  If the stats are written at random intervals and the files roll over at midnight (what does "rolled over" mean?), how can you guarantee that the last entry shows the total for that day?  (Why does the latest time stamp show the total for that day?  What does the time stamp have to do with any stat?)

I think I'm almost more confused now than before.  But if you simply give me an equation for calculating the average using the example log file you gave, then I can write the script for you.

Author

Commented:
Rolled over means that a new log file is created at midnight.  The new log file is server.log.  This file is a symlink to the current day's log file.

I'll keep trying some suggestions.

I did not mean to confuse you more.

Commented:
I understand.  I just meant that I'm still at a loss as to how I should calculate the average.  Let me know if I can help you.
