Solved

Unix Shell script or Perl that checks for high CPU process & renice it

Posted on 2010-09-05
Last Modified: 2012-05-10
I need a Perl or shell script that polls, say, the top 10 CPU-consuming
processes reported by "top" every 30 seconds. If a particular process's
CPU consumption is above 70% in all of 8 consecutive polls, the script
should renice that process to a friendlier priority.

Below is one script but it's not quite what I wanted


Preferably this script runs from crontab. Since cron can only schedule
jobs at one-minute granularity, the script would have to run twice per
minute, say by inserting a "sleep 30"?

I've attached a sample, but it's not quite what I wanted:
a.txt
Question by:sunhux
13 Comments
 
LVL 62

Assisted Solution

by:gheist
gheist earned 100 total points
ID: 33610596
You could start by enabling process accounting (see "man accton" for system-specific details).
Then look for the heaviest processes and try to identify trends in their usage.
build_cache is a Drupal background service task; you should run it at low-usage times and with low priority.
SNMP monitoring tools such as MRTG can probably help you find the peak hours.
0
 
LVL 62

Expert Comment

by:gheist
ID: 33610604
PS: Your system probably has a "process aging" facility that reduces PRI (not NICE) so that long-lived processes do not consume too much CPU.
PPS: I suggest letting it run and watching closely. A single runaway process is probably caused by a bottleneck somewhere else...
0
 
LVL 62

Expert Comment

by:gheist
ID: 33610608
E.g. a missing index in the MySQL backend, so that build_cache repeatedly scans a table that has since grown large.
0
 
LVL 6

Accepted Solution

by:
apresence earned 400 total points
ID: 33613329
I wrote the attached script for you.  Should be what you want.  Just change the variables at the top of the script to tune it the way you'd like.  I put in the parameters you need based on your description.

The script monitors processes that match the criteria you provide, polling every 30 seconds (configurable).  When it finds one that has CPU utilization above the configured amount, it adds that process to a "watch list".  The next time through the loop:
- If the process is still taking up CPU above the limit, a counter is incremented.  If this happens for the configured amount of time, then the renice command is executed
- If the process no longer exists, it is removed from the watch list
- If the process is no longer taking up CPU above the limit, it is removed from the watch list (it will be added back if the process starts taking up too much CPU again)

It can pick up new processes as they start, or pick up processes already started before the script was started.
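
The watch-list logic described above can also be sketched in plain shell. This is only a rough sketch, not the attached Perl script: the function name update_watch and the state-file layout (one "pid count" pair per line) are made up for illustration. Because the counters live in a file rather than in memory, something like this could be driven from cron instead of a long-running loop.

```shell
#!/bin/sh
# update_watch STATE_FILE THRESHOLD LOOPS   (hypothetical helper, not part of procmon.pl)
# stdin : one "pid cpu" pair per line, i.e. one poll's worth of data
# stdout: pids whose consecutive above-threshold count has just reached LOOPS
# STATE_FILE is rewritten with "pid count" for pids still above THRESHOLD;
# pids that ended or dropped below the threshold fall off the list.
update_watch() {
  state=$1; threshold=$2; loops=$3
  touch "$state"
  awk -v thr="$threshold" -v loops="$loops" -v state="$state" '
    BEGIN {
      # Load the counters saved by the previous poll
      while ((getline line < state) > 0) { split(line, f, " "); count[f[1]] = f[2] }
      close(state)
    }
    $2 + 0 >= thr {
      seen[$1] = count[$1] + 1
      if (seen[$1] == loops) print $1
    }
    END {
      printf "" > state              # truncate, even if nothing is being watched
      for (p in seen) print p, seen[p] > state
    }'
}

# Possible use, assuming a ps that supports the POSIX -o option:
#   ps -eo pid=,pcpu= | update_watch /var/tmp/procmon.state 70 8 |
#     while read pid; do renice -n 19 -p "$pid"; done
```

Run from cron, this would poll once a minute; a second crontab entry prefixed with "sleep 30;" would give the 30-second polling the question asks about.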

For a test, I created a build_cache.x shell script that took up some CPU and created a few background processes.  I also changed the polling interval and cpu threshold so I could get a result more quickly.

If you want some additional debugging, uncomment the #print lines.

Setting this up as a cron job would not let us track how long a given process has been running hot (unless we saved some data to a file or similar, but doing that every 30 seconds would cost some disk I/O).  I suggest you set up a startup script that just starts my script in the background with nohup and redirects the output to a file somewhere in /var/log.
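
A startup wrapper along those lines might look like the sketch below. The install path and the log file name are assumptions (they are not from the script itself), so adjust them to your system:

```shell
#!/bin/sh
# Hypothetical locations; change them to wherever procmon.pl and its log should live.
PROCMON=${PROCMON:-/usr/local/bin/procmon.pl}
LOGFILE=${LOGFILE:-/var/log/procmon.log}

start_procmon() {
  # nohup keeps the monitor alive after the launching shell exits;
  # stdout and stderr are both appended to the log file.
  nohup "$PROCMON" >> "$LOGFILE" 2>&1 &
  echo $!   # print the PID of the background monitor
}
```

Calling start_procmon from a boot script (or by hand as root) starts the monitor and prints its PID, which can be saved if you want to stop it later with kill.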

Testing output:
root@beta:~/exex/test12 $ ./procmon.pl
Monitoring Configuration:
  Polling Interval                 : 5
  Number of Top Processes to Check : 10
  Process Pattern                  : build_cache\.x
  CPU Threshold (%)                : 3
  Time Threshold (minutes)         : 1 (That's 12 loops)
  Nice level                       : 19
Found 7 process(es)
Watching: 12161 93.6 root: 0.0   4460  1068 pts/0    RN   15:51   3:14 /bin/sh ./build_cache.x
Watching: 12158 92.9 root: 0.0   4456  1060 pts/0    RN   15:51   3:18 /bin/sh ./build_cache.x
Watching: 12154 85.8 root: 0.0   4460  1068 pts/0    RN   15:51   3:14 /bin/sh ./build_cache.x
Watching: 12215 5.1 root: 0.0   4456  1092 pts/0    S    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12376 5.0 root: 0.0   4456  1096 pts/0    R    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12524 4.6 root: 0.0   4460  1096 pts/0    R    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12292 4.6 root: 0.0   4456  1096 pts/0    S    15:54   0:00 /bin/sh ./build_cache.x
Found 7 process(es)
Watching 12161 with 91.8% cpu; Loops=1
Watching 12158 with 91.2% cpu; Loops=1
Watching 12154 with 84.4% cpu; Loops=1
Watching 12376 with 5.0% cpu; Loops=1
Watching 12215 with 4.8% cpu; Loops=1
Watching 12524 with 4.5% cpu; Loops=1
Watching 12292 with 4.4% cpu; Loops=1

...

Found 4 process(es)
Watching 12376 with 4.8% cpu; Loops=5
Watching 12215 with 4.8% cpu; Loops=5
Watching 12524 with 4.7% cpu; Loops=5
Watching 12292 with 4.6% cpu; Loops=5
No longer watching 12161 (Process ended)
No longer watching 12158 (Process ended)
No longer watching 12154 (Process ended)
Found 4 process(es)
Watching 12376 with 5.0% cpu; Loops=6
Watching 12215 with 5.0% cpu; Loops=6
Watching 12524 with 4.9% cpu; Loops=6
Watching 12292 with 4.8% cpu; Loops=6

...

Found 4 process(es)
Process 12376 cpu utilization has met threshold, nicing ...
12376: old priority 0, new priority 19
Process 12292 cpu utilization has met threshold, nicing ...
12292: old priority 0, new priority 19
Process 12215 cpu utilization has met threshold, nicing ...
12215: old priority 0, new priority 19
Process 12524 cpu utilization has met threshold, nicing ...
12524: old priority 0, new priority 19
Found 4 process(es)
Already niced 12376 with 5.2% cpu; Loops=13
Already niced 12292 with 5.2% cpu; Loops=13
Already niced 12215 with 5.2% cpu; Loops=13
Already niced 12524 with 5.1% cpu; Loops=13

...

<Edited by SouthMod to remove email>

#!/usr/bin/perl

$ps_bin = '/bin/ps';
$mailx_bin = '/usr/bin/mailx';
$mail_recip = 'myemail@mydomain.com';
$pid_pattern = 'build_cache\.x';
$cpu_max = 70;
$grace_period = 4;
$top_procs = 10;
$poll_delay = 30;
$nice_level = 19;

my %watch_list = ();
$grace_loops = $grace_period * (60 / $poll_delay);

print "Monitoring Configuration:\n";
print "  Polling Interval                 : $poll_delay\n";
print "  Number of Top Processes to Check : $top_procs\n";
print "  Process Pattern                  : $pid_pattern\n";
print "  CPU Threshold (%)                : $cpu_max\n";
print "  Time Threshold (minutes)         : $grace_period (That's $grace_loops loops)\n";
print "  Nice level                       : $nice_level\n";

while (1)
{
  # Get list of processes
  open(FH, "$ps_bin aux|") or die "Cannot run $ps_bin: $!\n";
  %pid_list = ();
  $header = <FH>; # Skip header line
  while (<FH>)
  {
    ($id,$pid,$cpu,$rest) = split(/\s+/, $_, 4);

    # Get rid of trailing spaces/newline
    $rest =~ s/\s+$//; 

    # Only interested in processes matching our pattern
    if ($rest =~ /($pid_pattern)/)
    {
      $pid_list{$pid} = {
        id => $id,
        cpu => $cpu,
        rest => $rest
      };
      #print "Proc: id=$id pid=$pid cpu=$cpu rest=$rest\n";
    }
  }
  close(FH);
  print "Found " . (scalar keys %pid_list) . " process(es)\n";

  # Mark all the pids we're watching for pruning
  foreach $pid (keys %watch_list)
  {
    $watch_list{$pid}{prune} = 1;
  }

  # Process list sorted by top CPU
  $num_checked = 0;
  # Compare numerically (<=>); a string compare (cmp) would rank 9.5 above 85.8
  foreach $pid (sort {$pid_list{$b}{cpu} <=> $pid_list{$a}{cpu}} keys %pid_list)
  {
    $pid_item = $pid_list{$pid};
    $watch_item = $watch_list{$pid};
    $cpu = $pid_item->{cpu};
    # Only look at the first $top_procs entries; the counter was never incremented before
    if ($num_checked++ < $top_procs)
    {
      if ($cpu >= $cpu_max)
      {
        if (!defined $watch_item)
        {
          print "Watching: $pid $cpu " . $pid_item->{id} . ": " . $pid_item->{rest} . "\n";
          $watch_list{$pid} = {
            count => 1,
            prune => 0
          };
        }
        else
        {
          if ($watch_item->{count} >= $grace_loops)
          {
            if ($watch_item->{count} == $grace_loops)
            {
              print "Process $pid cpu utilization has met threshold, nicing ...\n";
              # Do your nice command here, ex:
              system("renice $nice_level -p $pid");
            }
            else
            {
              print "Already niced $pid with $cpu% cpu; Loops=" . $watch_item->{count} . "\n";
            }
          }
          else
          {
            print "Watching $pid with $cpu% cpu; Loops=" . $watch_item->{count} . "\n";
          }
       
          $watch_item->{count} = $watch_item->{count} + 1;

          # This process still has high cpu, so we want to keep watching it
          $watch_item->{prune} = 0;
        }
      }
    }
  }

  # Prune any watch_list items that didn't have high CPU
  foreach $pid (keys %watch_list)
  {
    if ($watch_list{$pid}{prune} == 1)
    {
      if (!exists $pid_list{$pid})
      {
        print "No longer watching $pid (Process ended)\n";
      }
      else
      {
        print "No longer watching $pid with cpu " . $pid_list{$pid}{cpu} . "% (cpu below threshold)\n";
      }
      delete $watch_list{$pid};
    }
  }

  sleep($poll_delay);
}


0
 

Author Comment

by:sunhux
ID: 33614809

Thanks chaps.


Hi APresence,

Excellent script.  If I may request two enhancements to your script:

1)  renice only the top CPU processes whose names contain the sub-string
    "AWSERVICES" or "javaw", and no other processes

2) append to a logfile the processes that have been reniced and the time/date
    they were reniced.  I thought of amending the "system ..." section as follows,
    but let me know if I've amended the code below correctly:
              {
              print "Process $pid cpu utilization has met threshold, nicing ...\n";
              # Do your nice command here, ex:
              system("renice $nice_level -p $pid");
              system("echo 'Process' $pid ' reniced at ' $date >> /var/tmp/renicedpid.log")
              }
0
 

Author Comment

by:sunhux
ID: 33614817

>  top CPU processes with names that have the sub-string    "AWSERVICES" OR "javaw"
Just to elaborate, I meant processes with above sub-string as given by "ps -ef" or "ps -aux"
0
 
LVL 6

Assisted Solution

by:apresence
apresence earned 400 total points
ID: 33615403
Checking for specific processes, easy fix.  The pid_pattern is a regular expression, so it can handle alternates.  Using your requested process names:
$pid_pattern = '(AWSERVICES|javaw)';
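
To see what the alternation will match before editing the script, the same pattern can be tried against ps output with grep -E. A small sketch (filter_procs is just an illustrative name, not something the script defines):

```shell
#!/bin/sh
# Same alternation as the Perl $pid_pattern, used here as an extended regex.
pid_pattern='(AWSERVICES|javaw)'

# Reads process listing lines on stdin and keeps only those matching the pattern.
filter_procs() {
  grep -E "$pid_pattern"
}

# Possible use:
#   ps -ef | filter_procs
```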

Your log code will run, but $date is not set anywhere in the script, so no timestamp would be written, and doing a system() call just to append to a file is a bit overkill.  This is probably better:
  open (LOGFILE, '>>', '/var/tmp/renicedpid.log') or warn "Cannot open log: $!\n";
  print LOGFILE "Process $pid reniced at " . localtime() . "\n";
  close (LOGFILE);

If you have additional requests for modifications or features, I suggest you open a new question ;).
0
 
LVL 62

Expert Comment

by:gheist
ID: 33615564
If you tell me which operating system you are using (uname -a),
I will help you tune process aging so that it ages processes soon enough to make the "control" script obsolete.
Deal?
0
 
LVL 6

Expert Comment

by:apresence
ID: 33624398
@sunhux: You might want to give gheist a chance to elaborate on his idea.  I'm curious myself.  He needs the output of your "uname -a" command.
@gheist: Based on his sample script which references /usr/ucb and his user name, I'd say it's a safe bet he's running on Solaris.  Not sure what version, however.
@sunhux: Please assign points after we have an answer from gheist.  Thanks!
0
 
LVL 62

Assisted Solution

by:gheist
gheist earned 100 total points
ID: 33624896
Solaris 9-10 (most probable): http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html
/usr/ucb is also present on HP-UX and AIX.
On HP-UX one would use rtsched instead of renice.
On AIX one would use "smitty wlm".
0
 
LVL 6

Expert Comment

by:apresence
ID: 33624909
Please note that the page gheist posted is not easily readable in IE8 (one word per line... very hard to read).  It looks good in Firefox, however.
0
 

Author Comment

by:sunhux
ID: 33637726

Sorry for the late response.

Basically I have two OSes: RHES 4.6 and HP-UX B11.11
0
 

Author Closing Comment

by:sunhux
ID: 34506721
excellent
0
