Solved

Unix shell or Perl script that checks for high-CPU processes and renices them

Posted on 2010-09-05
Last Modified: 2012-05-10
I need a Perl or shell script that polls, say, the top 10 CPU-consuming
processes reported by "top" every 30 seconds. If a particular process's
CPU consumption is above 70% in all of 8 consecutive polls, the script
should renice that process to a friendlier priority.

Preferably this script runs from crontab. Since cron can only run a job
once per minute at most, the script would have to poll twice per run,
say with a "sleep 30" in between?

I've attached a sample script (a.txt), but it's not quite what I wanted.
Question by:sunhux
 
LVL 61

Assisted Solution

by:gheist
gheist earned 100 total points
You can probably start by enabling process accounting (see "man accton" for system-specific details).
Then look for the heaviest processes and try to identify trends in their usage.
build_cache is a Drupal background service task; you should run it at low-usage times and with low priority.
SNMP monitoring tools such as MRTG can probably help you find the peak hours.
 
LVL 61

Expert Comment

by:gheist
PS: Your system probably has a "process aging" facility that reduces PRI (not NICE) so that long-lived processes do not consume too much CPU.
PPS: I suggest letting it run and watching closely. A single runaway process is probably caused by a bottleneck somewhere else...
 
LVL 61

Expert Comment

by:gheist
E.g. a missing index in the MySQL backend, so that build_cache repeatedly scans a table that has since grown large...
 
LVL 6

Accepted Solution

by:apresence
apresence earned 400 total points
I wrote the attached script for you.  Should be what you want.  Just change the variables at the top of the script to tune it the way you'd like.  I put in the parameters you need based on your description.

The script monitors processes that match the criteria you provide, polling every 30 seconds (configurable).  When it finds one that has CPU utilization above the configured amount, it adds that process to a "watch list".  The next time through the loop:
- If the process is still taking up CPU above the limit, a counter is incremented.  If this happens for the configured amount of time, then the renice command is executed
- If the process no longer exists, it is removed from the watch list
- If the process is no longer taking up CPU above the limit, it is removed from the watch list (it will be added back if the process starts taking up too much CPU again)
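The watch-list rules above can be sketched as one small function. This is a simplified illustration, not the exact code of the attached script: the name update_watch_list and its arguments are made up for the example, and it counts consecutive polls above the threshold directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply one poll's worth of CPU readings to the watch list.
#   $watch        : hashref, pid => number of consecutive polls above the threshold
#   $sample       : hashref, pid => %CPU seen in this poll (running processes only)
#   $cpu_max      : CPU threshold in percent
#   $loops_needed : consecutive polls required before renicing
# Returns the pids that reached $loops_needed on this poll.
sub update_watch_list {
    my ($watch, $sample, $cpu_max, $loops_needed) = @_;
    my @to_renice;

    # Drop processes that ended or fell back below the threshold.
    for my $pid (keys %$watch) {
        delete $watch->{$pid}
            unless exists $sample->{$pid} && $sample->{$pid} >= $cpu_max;
    }

    # Count every process that is above the threshold right now.
    for my $pid (sort { $a <=> $b } keys %$sample) {
        next unless $sample->{$pid} >= $cpu_max;
        $watch->{$pid}++;
        push @to_renice, $pid if $watch->{$pid} == $loops_needed;
    }
    return @to_renice;
}

# Example: pid 100 stays hot for three polls, pid 200 cools down after one.
my %watch;
my @polls = (
    { 100 => 95.0, 200 => 80.0 },
    { 100 => 91.2, 200 => 12.5 },
    { 100 => 88.7 },
);
for my $sample (@polls) {
    my @hit = update_watch_list(\%watch, $sample, 70, 3);
    print "renice candidates: @hit\n" if @hit;
}
```

With the question's numbers (30-second polls, 8 polls in a row), $loops_needed would be 8, which is the same 4 minutes the script expresses as $grace_period * (60 / $poll_delay).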

It can pick up new processes as they start, or pick up processes already started before the script was started.

For a test, I created a build_cache.x shell script that took up some CPU and created a few background processes.  I also changed the polling interval and cpu threshold so I could get a result more quickly.

If you want some additional debugging, uncomment the #print lines.

Setting this up as a cron job would not let us track how long a process has been running hot, unless we saved some data to a file between runs; doing that every 30 seconds would cost some disk I/O.  I suggest you create a startup script that starts my script in the background with nohup and redirects its output to a file somewhere under /var/log.
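If you did want the cron route anyway, the "save some data to a file" idea could look roughly like this. It is only a sketch: load_counts and save_counts are hypothetical helpers, and the one-line-per-pid file format is an assumption made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical state-file helpers for a cron-driven variant.
# File format (assumed): one "pid count" pair per line.

sub load_counts {
    my ($file) = @_;
    my %counts;
    open(my $fh, '<', $file) or return {};   # first run: no state yet
    while (my $line = <$fh>) {
        my ($pid, $count) = $line =~ /^(\d+)\s+(\d+)$/ or next;
        $counts{$pid} = $count;
    }
    close($fh);
    return \%counts;
}

sub save_counts {
    my ($file, $counts) = @_;
    open(my $fh, '>', $file) or die "Cannot write $file: $!\n";
    print {$fh} "$_ $counts->{$_}\n" for sort { $a <=> $b } keys %$counts;
    close($fh) or die "Cannot close $file: $!\n";
}

# Demo with a temporary file instead of a real state file.
my (undef, $state_file) = tempfile();
save_counts($state_file, { 4242 => 3 });
my $counts = load_counts($state_file);
$counts->{4242}++;
print "pid 4242 seen hot $counts->{4242} times\n";
unlink $state_file;
```

Keep in mind that pids get reused, so a cron-based version would also want to store something like the process start time alongside the counter. The long-running script avoids that problem because it prunes a pid as soon as the process disappears.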

Testing output:
root@beta:~/exex/test12 $ ./procmon.pl
Monitoring Configuration:
  Polling Interval                 : 5
  Number of Top Processes to Check : 10
  Process Pattern                  : build_cache\.x
  CPU Threshold (%)                : 3
  Time Threshold (minutes)         : 1 (That's 12 loops)
  Nice level                       : 19
Found 7 process(es)
Watching: 12161 93.6 root: 0.0   4460  1068 pts/0    RN   15:51   3:14 /bin/sh ./build_cache.x
Watching: 12158 92.9 root: 0.0   4456  1060 pts/0    RN   15:51   3:18 /bin/sh ./build_cache.x
Watching: 12154 85.8 root: 0.0   4460  1068 pts/0    RN   15:51   3:14 /bin/sh ./build_cache.x
Watching: 12215 5.1 root: 0.0   4456  1092 pts/0    S    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12376 5.0 root: 0.0   4456  1096 pts/0    R    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12524 4.6 root: 0.0   4460  1096 pts/0    R    15:54   0:00 /bin/sh ./build_cache.x
Watching: 12292 4.6 root: 0.0   4456  1096 pts/0    S    15:54   0:00 /bin/sh ./build_cache.x
Found 7 process(es)
Watching 12161 with 91.8% cpu; Loops=1
Watching 12158 with 91.2% cpu; Loops=1
Watching 12154 with 84.4% cpu; Loops=1
Watching 12376 with 5.0% cpu; Loops=1
Watching 12215 with 4.8% cpu; Loops=1
Watching 12524 with 4.5% cpu; Loops=1
Watching 12292 with 4.4% cpu; Loops=1

...

Found 4 process(es)
Watching 12376 with 4.8% cpu; Loops=5
Watching 12215 with 4.8% cpu; Loops=5
Watching 12524 with 4.7% cpu; Loops=5
Watching 12292 with 4.6% cpu; Loops=5
No longer watching 12161 (Process ended)
No longer watching 12158 (Process ended)
No longer watching 12154 (Process ended)
Found 4 process(es)
Watching 12376 with 5.0% cpu; Loops=6
Watching 12215 with 5.0% cpu; Loops=6
Watching 12524 with 4.9% cpu; Loops=6
Watching 12292 with 4.8% cpu; Loops=6

...

Found 4 process(es)
Process 12376 cpu utilization has met threshold, nicing ...
12376: old priority 0, new priority 19
Process 12292 cpu utilization has met threshold, nicing ...
12292: old priority 0, new priority 19
Process 12215 cpu utilization has met threshold, nicing ...
12215: old priority 0, new priority 19
Process 12524 cpu utilization has met threshold, nicing ...
12524: old priority 0, new priority 19
Found 4 process(es)
Already niced 12376 with 5.2% cpu; Loops=13
Already niced 12292 with 5.2% cpu; Loops=13
Already niced 12215 with 5.2% cpu; Loops=13
Already niced 12524 with 5.1% cpu; Loops=13

...

<Edited by SouthMod to remove email>

#!/usr/bin/perl

$ps_bin = '/bin/ps';
$mailx_bin = '/usr/bin/mailx';
$mail_recip = 'myemail@mydomain.com';
$pid_pattern = 'build_cache\.x';
$cpu_max = 70;
$grace_period = 4;
$top_procs = 10;
$poll_delay = 30;
$nice_level = 19;

my %watch_list = ();
$grace_loops = $grace_period * (60 / $poll_delay);

print "Monitoring Configuration:\n";
print "  Polling Interval                 : $poll_delay\n";
print "  Number of Top Processes to Check : $top_procs\n";
print "  Process Pattern                  : $pid_pattern\n";
print "  CPU Threshold (%)                : $cpu_max\n";
print "  Time Threshold (minutes)         : $grace_period (That's $grace_loops loops)\n";
print "  Nice level                       : $nice_level\n";

while (1)
{
  # Get list of processes
  open(FH, "$ps_bin aux|") or die "Cannot run $ps_bin: $!\n";
  %pid_list = ();
  $header = <FH>; # Skip header line
  while (<FH>)
  {
    ($id,$pid,$cpu,$rest) = split(/\s+/, $_, 4);

    # Get rid of trailing spaces/newline
    $rest =~ s/\s+$//; 

    # Only interested in processes matching our pattern
    if ($rest =~ /($pid_pattern)/)
    {
      $pid_list{$pid} = {
        id => $id,
        cpu => $cpu,
        rest => $rest
      };
      #print "Proc: id=$id pid=$pid cpu=$cpu rest=$rest\n";
    }
  }
  close(FH);
  print "Found " . (scalar keys %pid_list) . " process(es)\n";

  # Mark all the pids we're watching for pruning
  foreach $pid (keys %watch_list)
  {
    $watch_list{$pid}{prune} = 1;
  }

  # Process list sorted by top CPU
  $num_checked = 0;
  foreach $pid (sort {$pid_list{$b}{cpu} <=> $pid_list{$a}{cpu}} keys %pid_list)
  {
    $pid_item = $pid_list{$pid};
    $watch_item = $watch_list{$pid};
    $cpu = $pid_item->{cpu};
    # Count each candidate so that only the top $top_procs are considered
    if ($num_checked++ < $top_procs)
    {
      if ($cpu >= $cpu_max)
      {
        if (!defined $watch_item)
        {
          print "Watching: $pid $cpu " . $pid_item->{id} . ": " . $pid_item->{rest} . "\n";
          $watch_list{$pid} = {
            count => 1,
            prune => 0
          };
        }
        else
        {
          if ($watch_item->{count} >= $grace_loops)
          {
            if ($watch_item->{count} == $grace_loops)
            {
              print "Process $pid cpu utilization has met threshold, nicing ...\n";
              # Do your nice command here, ex:
              system("renice $nice_level -p $pid");
            }
            else
            {
              print "Already niced $pid with $cpu% cpu; Loops=" . $watch_item->{count} . "\n";
            }
          }
          else
          {
            print "Watching $pid with $cpu% cpu; Loops=" . $watch_item->{count} . "\n";
          }
       
          $watch_item->{count} = $watch_item->{count} + 1;

          # This process still has high cpu, so we want to keep watching it
          $watch_item->{prune} = 0;
        }
      }
    }
  }

  # Prune any watch_list items that didn't have high CPU
  foreach $pid (keys %watch_list)
  {
    if ($watch_list{$pid}{prune} == 1)
    {
      if (!defined $pid_list{$pid})
      {
        print "No longer watching $pid (Process ended)\n";
      }
      else
      {
        print "No longer watching $pid with cpu " . $pid_list{$pid}{cpu} . "% (cpu below threshold)\n";
      }
      delete $watch_list{$pid};
    }
  }

  sleep($poll_delay);
}
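As an alternative to shelling out to renice, Perl's built-in setpriority and getpriority can do the same job from inside the script. A minimal sketch, assuming the WHICH value for "a single process" is 0 (as PRIO_PROCESS is defined on Linux); renice_pid is a hypothetical helper name, not part of the script above.

```perl
use strict;
use warnings;

# Assumed value; matches PRIO_PROCESS in <sys/resource.h> on Linux.
use constant PRIO_PROCESS => 0;

# Hypothetical replacement for the system("renice ...") call.
# Returns the new nice value on success, undef on failure ($! holds the reason).
sub renice_pid {
    my ($pid, $nice_level) = @_;
    return unless setpriority(PRIO_PROCESS, $pid, $nice_level);
    return getpriority(PRIO_PROCESS, $pid);
}

# Demo on the current process: raising our own nice value needs no privileges.
my $new = renice_pid($$, 19);
print defined $new ? "now running at nice $new\n" : "renice failed: $!\n";
```

As with the renice command, changing another user's process or lowering a nice value generally requires root, so the monitor would still have to run with sufficient privileges.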

 

Author Comment

by:sunhux

Thanks chaps.


Hi APresence,

Excellent script.  If I may, I'd like to request two enhancements to your script:

1)  Renice only those top CPU processes whose names contain the substring
    "AWSERVICES" or "javaw", and no other processes.

2)  Append to a log file the processes that have been reniced and the date/time
    they were reniced.  I thought of amending the "system ..." section as follows;
    please let me know whether I've amended the code correctly:
              {
              print "Process $pid cpu utilization has met threshold, nicing ...\n";
              # Do your nice command here, ex:
              system("renice $nice_level -p $pid");
              system("echo 'Process' $pid ' reniced at ' $date >> /var/tmp/renicedpid.log")
              }
 

Author Comment

by:sunhux

>  top CPU processes with names that have the sub-string    "AWSERVICES" OR "javaw"
Just to elaborate: I mean processes whose command line, as shown by "ps -ef" or "ps -aux", contains one of the above substrings.
 
LVL 6

Assisted Solution

by:apresence
apresence earned 400 total points
Restricting the check to specific processes is an easy fix.  $pid_pattern is a regular expression, so it can handle alternation.  Using your requested process names:
$pid_pattern = '(AWSERVICES|javaw)';
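
A quick way to check what such a pattern will and won't pick up before putting it into the monitor. The command lines below are made up purely for illustration.

```perl
use strict;
use warnings;

my $pid_pattern = '(AWSERVICES|javaw)';

# Made-up command lines, only for illustration.
my @commands = (
    '/opt/app/bin/AWSERVICES -start',
    '/usr/java/bin/javaw -jar app.jar',
    '/usr/sbin/sshd',
);

# Keep only the command lines that contain one of the alternatives.
my @matched = grep { /$pid_pattern/ } @commands;
print "$_\n" for @matched;
```

The match is case-sensitive, and in the monitor script the pattern is applied to everything after the %CPU column of the ps output rather than to the command name alone, so pick substrings that cannot show up in the other columns.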

Your log code will run, but $date is never set anywhere in the script, so no timestamp would be written.  Spawning a shell via system() just to append to a file is also a bit overkill.  This is probably better:
  if (open(my $log, '>>', '/var/tmp/renicedpid.log')) {
    print $log "Process $pid reniced at " . localtime() . "\n";
    close($log);
  }

If you have additional requests for modifications or features, I suggest you open a new question ;).
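
For completeness, the logging idea wrapped up as a small function with a sortable timestamp. This is a sketch only: log_renice is a hypothetical helper name, the line format is an assumption, and the default path is the one from the question.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: append one line per reniced process to a log file.
# Returns 1 on success, 0 if the log file could not be opened.
sub log_renice {
    my ($pid, $logfile) = @_;
    $logfile = '/var/tmp/renicedpid.log' unless defined $logfile;

    my $stamp = strftime('%Y-%m-%d %H:%M:%S', localtime);
    open(my $log, '>>', $logfile) or do {
        warn "Cannot append to $logfile: $!\n";
        return 0;
    };
    print {$log} "$stamp Process $pid reniced\n";
    close($log);
    return 1;
}
```

In the monitor script it would be called as log_renice($pid) right after the renice call, so a failure to write the log produces a warning but does not stop the monitoring loop.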
 
LVL 61

Expert Comment

by:gheist
If you tell me which operating system you are using (the output of "uname -a"),
I will help you tune process aging so that it ages processes soon enough to make the "control" script obsolete.
Deal?
 
LVL 6

Expert Comment

by:apresence
@sunhux: You might want to give gheist a chance to elaborate on his idea.  I'm curious myself.  He needs the output of your "uname -a" command.
@gheist: Based on his sample script which references /usr/ucb and his user name, I'd say it's a safe bet he's running on Solaris.  Not sure what version, however.
@sunhux: Please assign points after we have an answer from gheist.  Thanks!
 
LVL 61

Assisted Solution

by:gheist
gheist earned 100 total points
Solaris 9-10 is most probable; see http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html
/usr/ucb is also present on HP-UX and AIX.
On HP-UX one would use rtsched instead of renice.
On AIX one would use "smitty wlm".
 
LVL 6

Expert Comment

by:apresence
Please note that the page gheist posted is not easily readable in IE8 (it renders one word per line, which is very hard to read).  It looks fine in Firefox, however.
 

Author Comment

by:sunhux
Sorry for the late response.

I have two operating systems: RHES 4.6 and HP-UX B11.11.
 

Author Closing Comment

by:sunhux
excellent
