Unix shell or Perl script that checks for high-CPU processes & renices them

I need a Perl or shell script that polls, say, the top 10 CPU-consuming
processes reported by "top" every 30 seconds. If a particular process's
CPU consumption stays above 70% for 8 consecutive polls, the script
should renice that process to a friendlier priority.

Preferably this script would run from crontab. Since cron can schedule a
job at most once per minute, would the script have to run twice per
minute, say by adding a "sleep 30"?

I have attached a sample script (a.txt), but it's not quite what I wanted.
sunhux Asked:
 
apresence Commented:
I wrote the attached script for you.  Should be what you want.  Just change the variables at the top of the script to tune it the way you'd like.  I put in the parameters you need based on your description.

The script monitors processes that match the criteria you provide, polling every 30 seconds (configurable).  When it finds one that has CPU utilization above the configured amount, it adds that process to a "watch list".  The next time through the loop:
- If the process is still taking up CPU above the limit, a counter is incremented.  If this happens for the configured amount of time, then the renice command is executed
- If the process no longer exists, it is removed from the watch list
- If the process is no longer taking up CPU above the limit, it is removed from the watch list (it will be added back if the process starts taking up too much CPU again)
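
The watch-list rules above can be sketched as a single state-update step. The snippet below is only an illustration in shell and awk, not the attached Perl script: the function name update_watch, the two-column file formats, and the action labels are assumptions made for the example. It counts the first sighting as poll 1, so the "renice" action appears on exactly the Nth consecutive high poll.

```shell
#!/bin/sh
# Hypothetical helper (not part of the attached script).
# update_watch MAX_CPU LOOPS STATE_FILE SAMPLE_FILE
#   STATE_FILE : lines of "pid count" from the previous poll
#   SAMPLE_FILE: lines of "pid cpu" from the current poll
# Prints "pid count action" for every pid at or above MAX_CPU.
# Pids that ended or dropped below MAX_CPU are not printed, which
# removes them from the watch list.
update_watch() {
  awk -v max="$1" -v loops="$2" '
    FILENAME == ARGV[1] { count[$1] = $2; next }   # load previous counters
    $2 + 0 >= max + 0 {                            # still (or newly) busy
      n = count[$1] + 1
      if (n < loops)       action = "watch"
      else if (n == loops) action = "renice"
      else                 action = "niced"
      print $1, n, action
    }
  ' "$3" "$4"
}
```

A caller would write the first two columns back to the state file after each poll and run renice only for lines whose action is "renice".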

It can pick up new processes as they start, or pick up processes already started before the script was started.

For a test, I created a build_cache.x shell script that took up some CPU and created a few background processes.  I also changed the polling interval and cpu threshold so I could get a result more quickly.

If you want some additional debugging, uncomment the #print lines.

Setting this up from a cron job would not allow us to track how long certain processes were running (unless we saved some data to a file or something, but if we're doing it every 30 seconds that'll take up some disk io).  I suggest you set a startup script that just starts my script in the background with nohup and forward the output to a file in /var/log somewhere.
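
A startup step along those lines might look like the sketch below. The function name start_monitor and the paths in the usage comment are assumptions; adjust them to wherever the script and its log actually live.

```shell
#!/bin/sh
# Hypothetical helper: start a long-running monitor detached from the
# terminal and append its stdout and stderr to a log file.
# Usage: start_monitor /path/to/procmon.pl /var/log/procmon.log
start_monitor() {
  nohup "$1" >> "$2" 2>&1 &
  echo "$!"    # PID of the background job, useful for stopping it later
}
```

Because the output is appended rather than overwritten, earlier log entries survive a restart of the monitor.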

Testing output:
root@beta:~/exex/test12 $ ./procmon.pl
Monitoring Configuration:
  Polling Interval                 : 5
  Number of Top Processes to Check : 10
  Process Pattern                  : build_cache\.x
  CPU Threshold (%)                : 3
  Time Threshold (minutes)         : 1 (That's 12 loops)
  Nice level                       : 19
Found 7 process(es)
Watching: 12161 93.6 root: 0.0   4460  1068 pts/0    RN   15:51   3:14 /bin/sh ./build_cache.x
Watching: 12158 92.9 root: 0.0   4456  1060 pts/0    RN   15:51   3:18 /bin/sh ./build_cache.x
Watching: 12154 85.8 root: 0.0   4460  1068 pts/0    RN   15:51   3:14 /bin/sh ./build_cache.x
Watching: 12215 5.1 root: 0.0   4456  1092 pts/0    S    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12376 5.0 root: 0.0   4456  1096 pts/0    R    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12524 4.6 root: 0.0   4460  1096 pts/0    R    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12292 4.6 root: 0.0   4456  1096 pts/0    S    15:54   0:00 /bin/sh ./build_cache.x
Found 7 process(es)
Watching 12161 with 91.8% cpu; Loops=1
Watching 12158 with 91.2% cpu; Loops=1
Watching 12154 with 84.4% cpu; Loops=1
Watching 12376 with 5.0% cpu; Loops=1
Watching 12215 with 4.8% cpu; Loops=1
Watching 12524 with 4.5% cpu; Loops=1
Watching 12292 with 4.4% cpu; Loops=1

...

Found 4 process(es)
Watching 12376 with 4.8% cpu; Loops=5
Watching 12215 with 4.8% cpu; Loops=5
Watching 12524 with 4.7% cpu; Loops=5
Watching 12292 with 4.6% cpu; Loops=5
No longer watching 12161 (Process ended)
No longer watching 12158 (Process ended)
No longer watching 12154 (Process ended)
Found 4 process(es)
Watching 12376 with 5.0% cpu; Loops=6
Watching 12215 with 5.0% cpu; Loops=6
Watching 12524 with 4.9% cpu; Loops=6
Watching 12292 with 4.8% cpu; Loops=6

...

Found 4 process(es)
Process 12376 cpu utilization has met threshold, nicing ...
12376: old priority 0, new priority 19
Process 12292 cpu utilization has met threshold, nicing ...
12292: old priority 0, new priority 19
Process 12215 cpu utilization has met threshold, nicing ...
12215: old priority 0, new priority 19
Process 12524 cpu utilization has met threshold, nicing ...
12524: old priority 0, new priority 19
Found 4 process(es)
Already niced 12376 with 5.2% cpu; Loops=13
Already niced 12292 with 5.2% cpu; Loops=13
Already niced 12215 with 5.2% cpu; Loops=13
Already niced 12524 with 5.1% cpu; Loops=13

...

<Edited by SouthMod to remove email>

#!/usr/bin/perl

$ps_bin = '/bin/ps';
$mailx_bin = '/usr/bin/mailx';
$mail_recip = 'myemail@mydomain.com';
$pid_pattern = 'build_cache\.x';
$cpu_max = 70;
$grace_period = 4;
$top_procs = 10;
$poll_delay = 30;
$nice_level = 19;

my %watch_list = ();
$grace_loops = $grace_period * (60 / $poll_delay);

print "Monitoring Configuration:\n";
print "  Polling Interval                 : $poll_delay\n";
print "  Number of Top Processes to Check : $top_procs\n";
print "  Process Pattern                  : $pid_pattern\n";
print "  CPU Threshold (%)                : $cpu_max\n";
print "  Time Threshold (minutes)         : $grace_period (That's $grace_loops loops)\n";
print "  Nice level                       : $nice_level\n";

while (1)
{
  # Get list of processes
  open(FH, "$ps_bin aux|") or die "Cannot run $ps_bin: $!\n";
  %pid_list = ();
  $header = <FH>; # Skip header line
  while (<FH>)
  {
    ($id,$pid,$cpu,$rest) = split(/\s+/, $_, 4);

    # Get rid of trailing spaces/newline
    $rest =~ s/\s+$//; 

    # Only interested in processes matching our pattern
    if ($rest =~ /($pid_pattern)/)
    {
      $pid_list{$pid} = {
        id => $id,
        cpu => $cpu,
        rest => $rest
      };
      #print "Proc: id=$id pid=$pid cpu=$cpu rest=$rest\n";
    }
  }
  close(FH);
  print "Found " . (scalar keys %pid_list) . " process(es)\n";

  # Mark all the pids we're watching for pruning
  foreach $pid (keys %watch_list)
  {
    $watch_list{$pid}{prune} = 1;
  }

  # Process list sorted by top CPU
  $num_checked = 0;
  foreach $pid (sort {$pid_list{$b}{cpu} <=> $pid_list{$a}{cpu}} keys %pid_list)
  {
    $pid_item = $pid_list{$pid};
    $watch_item = $watch_list{$pid};
    $cpu = $pid_item->{cpu};
    # Count each process examined so only the top $top_procs are considered
    if ($num_checked++ < $top_procs)
    {
      if ($cpu >= $cpu_max)
      {
        if (!defined $watch_item)
        {
          print "Watching: $pid $cpu " . $pid_item->{id} . ": " . $pid_item->{rest} . "\n";
          $watch_list{$pid} = {
            count => 1,
            prune => 0
          };
        }
        else
        {
          if ($watch_item->{count} >= $grace_loops)
          {
            if ($watch_item->{count} == $grace_loops)
            {
              print "Process $pid cpu utilization has met threshold, nicing ...\n";
              # Do your nice command here, ex:
              system("renice $nice_level -p $pid");
            }
            else
            {
              print "Already niced $pid with $cpu% cpu; Loops=" . $watch_item->{count} . "\n";
            }
          }
          else
          {
            print "Watching $pid with $cpu% cpu; Loops=" . $watch_item->{count} . "\n";
          }
       
          $watch_item->{count} = $watch_item->{count} + 1;

          # This process still has high cpu, so we want to keep watching it
          $watch_item->{prune} = 0;
        }
      }
    }
  }

  # Prune any watch_list items that didn't have high CPU
  foreach $pid (keys %watch_list)
  {
    if ($watch_list{$pid}{prune} == 1)
    {
      if (!defined $pid_list{$pid})
      {
        print "No longer watching $pid (Process ended)\n";
      }
      else
      {
        print "No longer watching $pid with cpu " . $pid_list{$pid}{cpu} . "% (cpu below threshold)\n";
      }
      delete $watch_list{$pid};
    }
  }

  sleep($poll_delay);
}

 
gheist Commented:
You could start by enabling process accounting (see "man accton" for system-specific details),
then look for the heaviest processes and try to identify trends in usage.
build_cache is a Drupal background service task; you should run it at low-usage times and with low priority.
SNMP monitoring tools like MRTG can probably help you find peak hours.
 
gheist Commented:
PS: Your system probably has a "process aging" facility that reduces PRI (not NICE) so that long-lived processes do not consume too much CPU.
PPS: I suggest letting it run and watching closely. A single runaway process is probably caused by a bottleneck somewhere else...
 
gheist Commented:
E.g. a missing index in the MySQL backend, so that build_cache scans a table that is now big many times over...
 
sunhux (Author) Commented:

Thanks chaps.


Hi APresence,

Excellent script.  May I request 2 enhancements to your script:

1)  renice only top CPU processes with names that have the sub-string
    "AWSERVICES" OR "javaw" in them only & not any other processes

2) append to a logfile the processes that have been reniced & the time/date
    they were reniced.  I thought of amending the "system ..." section as follows,
    but let me know if I've amended the code below correctly:
              {
              print "Process $pid cpu utilization has met threshold, nicing ...\n";
              # Do your nice command here, ex:
              system("renice $nice_level -p $pid");
              system("echo 'Process' $pid ' reniced at ' $date >> /var/tmp/renicedpid.log")
              }
 
sunhux (Author) Commented:

>  top CPU processes with names that have the sub-string    "AWSERVICES" OR "javaw"
Just to elaborate, I meant processes with above sub-string as given by "ps -ef" or "ps -aux"
 
apresence Commented:
Checking for specific processes, easy fix.  The pid_pattern is a regular expression, so it can handle alternates.  Using your requested process names:
$pid_pattern = '(AWSERVICES|javaw)';
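
To preview which processes a pattern like this would select, and in what order, a quick shell check can help. This is a sketch only: the function name top_matching is made up for the example, it assumes "ps aux"-style input with %CPU in the third column, and it sorts that column numerically (a plain string comparison would rank 9.5 above 85.0).

```shell
#!/bin/sh
# Hypothetical helper: read "ps aux"-style text on stdin, keep lines whose
# text matches an extended regular expression, and print the N busiest.
# Usage: ps aux | top_matching 'AWSERVICES|javaw' 10
top_matching() {
  awk -v pat="$1" 'NR > 1 && $0 ~ pat' |   # skip the header, filter by pattern
    LC_ALL=C sort -k3,3nr |                # highest %CPU first, numeric order
    head -n "$2"
}
```

Note that the pattern is matched against the whole line, so a user name containing one of the sub-strings would also match.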

Your log code will run, but note that $date is not set anywhere in the script as posted, so the timestamp would come out empty unless you define it.  Doing a system() call just to append to a file is also a bit overkill.  This is probably better:
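  # Append one line per reniced process; warn instead of dying if the log is unwritable
  if (open(my $log, '>>', '/var/tmp/renicedpid.log')) {
    print $log "Process $pid reniced at " . localtime() . "\n";
    close($log);
  } else { warn "Cannot append to log: $!\n"; }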
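
If the monitor is ever ported to a plain shell script (the question allowed either), the equivalent append is a one-liner. The function name log_renice is hypothetical; the log path in the usage comment is the one from the question.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped line per reniced process.
# Usage: log_renice 12376 /var/tmp/renicedpid.log
log_renice() {
  printf 'Process %s reniced at %s\n' "$1" "$(date)" >> "$2"
}
```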

If you have additional requests for modifications or features, I suggest you open a new question ;).
 
gheist Commented:
If you tell me which operating system you are using (output of "uname -a"),
I will help you tune process aging so that it ages processes soon enough to make the "control" script obsolete.
Deal?
 
apresence Commented:
@sunhux: You might want to give gheist a chance to elaborate on his idea.  I'm curious myself.  He needs the output of your "uname -a" command.
@gheist: Based on his sample script which references /usr/ucb and his user name, I'd say it's a safe bet he's running on Solaris.  Not sure what version, however.
@sunhux: Please assign points after we have an answer from gheist.  Thanks!
 
gheist Commented:
Solaris 9-10 (most probable): http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html
/usr/ucb is also present on HP-UX and AIX.
On HP-UX one would use rtsched instead of renice.
On AIX one would use "smitty wlm".
 
apresence Commented:
Please note that the page that gheist posted is not easily readable using IE8 (one word per line... very hard to read).  Looks good in Firefox, however.
 
sunhux (Author) Commented:

Sorry for the late response.

Basically I have 2 OSes: RHES 4.6 and HP-UX B11.11
 
sunhux (Author) Commented:
excellent
Question has a verified solution.
