Runaway Apache processes

The Red Hat Linux server that hosts my website has recently started crashing with 100% memory usage.

When I look at top, I see the following:

top - 19:12:23 up 51 min,  1 user,  load average: 145.93, 144.28, 120.78
Tasks: 248 total,   1 running, 247 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.8% us,  1.2% sy,  0.0% ni, 35.7% id, 61.4% wa,  0.0% hi,  0.0% si
Mem:   2066044k total,  1984556k used,    81488k free,    14988k buffers
Swap:   522072k total,   522072k used,        0k free,    30108k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2644 mysql     15   0  558m 110m 2272 S  1.0  5.5   1:44.00 mysqld
 4904 apache    15   0 46276  28m 2916 D  0.0  1.4   0:07.84 httpd
 4804 apache    15   0 46316  28m 2916 D  0.0  1.4   0:07.74 httpd
 4797 apache    15   0 46380  27m 2916 D  0.0  1.4   0:07.84 httpd
 4773 apache    18   0 46384  27m 2916 D  0.0  1.4   0:07.30 httpd
 4803 apache    15   0 46380  27m 2916 D  0.3  1.4   0:07.75 httpd
 4684 apache    15   0 46360  27m 2908 D  0.0  1.4   0:07.73 httpd
 4796 apache    15   0 46344  27m 2916 D  0.0  1.4   0:07.86 httpd
 4650 apache    15   0 46412  27m 2916 D  0.0  1.4   0:07.92 httpd
 4732 apache    15   0 46252  27m 2908 D  0.0  1.4   0:07.78 httpd
 4774 apache    15   0 46416  27m 2916 D  0.0  1.4   0:08.13 httpd
 4742 apache    18   0 46276  27m 2808 D  0.0  1.4   0:07.30 httpd
 4870 apache    15   0 46308  27m 2916 D  0.0  1.4   0:07.85 httpd
 4679 apache    18   0 46168  27m 2416 D  0.0  1.3   0:07.19 httpd
 4721 apache    18   0 46168  26m 2416 D  0.0  1.3   0:07.28 httpd
 4794 apache    18   0 46168  26m 2416 D  0.0  1.3   0:07.35 httpd
 4802 apache    18   0 46168  26m 2416 D  0.0  1.3   0:07.30 httpd
 4943 apache    15   0 46288  26m 2908 D  0.0  1.3   0:07.88 httpd
 4666 apache    15   0 46368  26m 2876 D  0.0  1.3   0:08.60 httpd
 3578 apache    15   0 46368  26m 2916 D  0.0  1.3   0:08.75 httpd
 4508 apache    15   0 46324  26m 2916 D  0.0  1.3   0:08.07 httpd
 4772 apache    18   0 46168  26m 2416 D  0.3  1.3   0:07.28 httpd
 4501 apache    15   0 46248  26m 2908 D  0.0  1.3   0:07.73 httpd
 3439 apache    15   0 46328  26m 2916 D  0.0  1.3   0:08.60 httpd
 4511 apache    15   0 46376  26m 2916 D  0.0  1.3   0:07.80 httpd
 4510 apache    15   0 46356  26m 2916 D  0.0  1.3   0:08.29 httpd
 4505 apache    15   0 46408  26m 2916 D  0.0  1.3   0:08.05 httpd
 2812 apache    15   0 46340  25m 2916 D  0.0  1.3   0:08.61 httpd
 5002 apache    18   0 46168  25m 2416 D  0.0  1.3   0:07.39 httpd
 4731 apache    15   0 46332  25m 2916 D  0.0  1.3   0:08.15 httpd
 4548 apache    15   0 46372  25m 2908 D  0.0  1.2   0:07.65 httpd
 5197 apache    15   0 46256  25m 2916 D  0.0  1.2   0:07.84 httpd
 4869 apache    15   0 46372  24m 2916 D  0.0  1.2   0:08.57 httpd
 4425 apache    15   0 46340  24m 2916 D  0.0  1.2   0:07.85 httpd
 4341 apache    15   0 46364  24m 2908 D  0.0  1.2   0:08.23 httpd
 4866 apache    15   0 46340  24m 2916 D  0.0  1.2   0:08.69 httpd
 4418 apache    15   0 46336  24m 2908 D  0.0  1.2   0:07.74 httpd
 4509 apache    15   0 46220  24m 2908 D  0.0  1.2   0:07.69 httpd

As you can see, there are lots of httpd processes in state "D". Does D mean dead?

I've read that this is related to "runaway" processes, but what can I do to debug this and identify the root cause, so that I can resolve the issue?

At peak times for the website, it takes less than an hour for the server to become totally bogged down like this, and the only fix is to reboot the server.
dealclickcoukAsked:
 
ygouthamCommented:
if it is that kind of a command you want to run then try using something like

top -b -n 1 | grep httpd

this would give you the same top listing, but only the httpd processes, and you can count the columns to see which ones you want to pick up. a workaround, if that satisfies the need
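
To turn that listing into numbers, a minimal Python sketch (not part of the original thread) could tally the httpd rows by state and resident memory. The names `res_to_kib` and `summarize_httpd` are illustrative, the sample rows are copied from the top output in the question, and the code assumes the column layout shown there, where RES is in KiB unless suffixed with m or g.

```python
from collections import Counter

def res_to_kib(field):
    """Convert a top RES field such as '28m' or '2916' to KiB."""
    units = {"m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return int(float(field[:-1]) * units[suffix])
    return int(field)

def summarize_httpd(top_output):
    """Count httpd rows per state and sum their resident memory."""
    states = Counter()
    total_kib = 0
    for line in top_output.splitlines():
        cols = line.split()
        # Expect: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(cols) < 12 or cols[11] != "httpd":
            continue
        states[cols[7]] += 1
        total_kib += res_to_kib(cols[5])
    return states, total_kib

sample = """\
 2644 mysql     15   0  558m 110m 2272 S  1.0  5.5   1:44.00 mysqld
 4904 apache    15   0 46276  28m 2916 D  0.0  1.4   0:07.84 httpd
 4804 apache    15   0 46316  28m 2916 D  0.0  1.4   0:07.74 httpd
 4797 apache    15   0 46380  27m 2916 D  0.0  1.4   0:07.84 httpd
"""

states, total_kib = summarize_httpd(sample)
print(dict(states), total_kib // 1024, "MiB resident")
```

RES includes pages shared between the httpd children, so the sum overstates the real footprint somewhat; it is still a quick way to see how many children exist and what state they are in.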
 
DonConsolioCommented:
"D" means uninterruptible sleep, which usually means the process is waiting for I/O.

It might be either a hardware problem (check /var/log/messages for disk problems
or similar troubles) or a hanging CGI script.

 
dealclickcoukAuthor Commented:
Thx for the tip, I've looked through the msg log, and around the time when I saw lots of D processes I see lots of these types of msgs:

Aug  5 13:00:07 localhost kernel:
Aug  5 13:00:07 localhost kernel: Free pages:       13796kB (512kB HighMem)
Aug  5 13:00:08 localhost kernel: Active:360561 inactive:138632 dirty:0 writeback:0 unstable:0 free:3449 slab:4983 mapped:500559 pagetables:4517
Aug  5 13:00:08 localhost kernel: DMA free:12556kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:628 all_unreclaimable? yes
Aug  5 13:00:08 localhost kernel: protections[]: 0 0 0
Aug  5 13:00:08 localhost kernel: Normal free:728kB min:928kB low:1856kB high:2784kB active:523784kB inactive:330912kB present:901120kB pages_scanned:12222012 all_unreclaimable? yes
Aug  5 13:00:08 localhost kernel: protections[]: 0 0 0
Aug  5 13:00:09 localhost kernel: HighMem free:512kB min:512kB low:1024kB high:1536kB active:918460kB inactive:223616kB present:1170368kB pages_scanned:10911002 all_unreclaimable? no
Aug  5 13:00:09 localhost kernel: protections[]: 0 0 0
Aug  5 13:00:09 localhost kernel: DMA: 5*4kB 5*8kB 3*16kB 5*32kB 4*64kB 2*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 1*4096kB = 12556kB
Aug  5 13:00:09 localhost kernel: Normal: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 728kB
Aug  5 13:00:10 localhost kernel: HighMem: 0*4kB 8*8kB 4*16kB 4*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 512kB
Aug  5 13:00:10 localhost kernel: Swap cache: add 132469, delete 132468, find 865/1078, race 0+6
Aug  5 13:00:10 localhost kernel: 0 bounce buffer pages
Aug  5 13:00:10 localhost kernel: Free swap:            0kB
Aug  5 13:00:10 localhost kernel: 521968 pages of RAM
Aug  5 13:00:11 localhost kernel: 292592 pages of HIGHMEM
Aug  5 13:00:11 localhost kernel: 5555 reserved pages
Aug  5 13:00:11 localhost kernel: 100263 pages shared
Aug  5 13:00:11 localhost kernel: 1 pages swap cached
Aug  5 13:00:12 localhost kernel: Out of Memory: Killed process 5057 (httpd).
Aug  5 13:00:12 localhost kernel: oom-killer: gfp_mask=0xd0
Aug  5 13:00:12 localhost kernel: Mem-info:
Aug  5 13:00:12 localhost kernel: DMA per-cpu:
Aug  5 13:00:12 localhost kernel: cpu 0 hot: low 2, high 6, batch 1
Aug  5 13:00:13 localhost kernel: cpu 0 cold: low 0, high 2, batch 1
Aug  5 13:00:13 localhost kernel: cpu 1 hot: low 2, high 6, batch 1
Aug  5 13:00:13 localhost kernel: cpu 1 cold: low 0, high 2, batch 1
Aug  5 13:00:13 localhost kernel: Normal per-cpu:
Aug  5 13:00:14 localhost kernel: cpu 0 hot: low 32, high 96, batch 16
Aug  5 13:00:14 localhost kernel: cpu 0 cold: low 0, high 32, batch 16
Aug  5 13:00:14 localhost kernel: cpu 1 hot: low 32, high 96, batch 16
Aug  5 13:00:14 localhost kernel: cpu 1 cold: low 0, high 32, batch 16
Aug  5 13:00:15 localhost kernel: HighMem per-cpu:
Aug  5 13:00:15 localhost kernel: cpu 0 hot: low 32, high 96, batch 16
Aug  5 13:00:15 localhost kernel: cpu 0 cold: low 0, high 32, batch 16
Aug  5 13:00:15 localhost kernel: cpu 1 hot: low 32, high 96, batch 16
Aug  5 13:00:16 localhost kernel: cpu 1 cold: low 0, high 32, batch 16
Aug  5 13:00:16 localhost kernel:

I'm not sure if this is a problem or normal, but it did stick out.

As this server is used just as a webserver, is there any way to limit the % of memory and CPU that each process uses, and how long it runs before automatic termination? Even if a process is in D state, if it has been that way for more than 30 seconds then the end user has most probably got bored and left or hit refresh, so there is really no point waiting for I/O regardless.
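
The numbers already posted allow a rough capacity estimate, sketched below in Python. It is not from the thread: it assumes the Apache prefork model, where each request is served by its own httpd child, and takes its figures from the top output above (2066044k total RAM, roughly 27 MB resident per httpd, about 110 MB for mysqld). The 256 MB reserve for the kernel and other daemons is a guess, and `max_children` is a hypothetical helper.

```python
def max_children(total_kib, per_child_kib, other_kib, reserve_kib):
    """Return how many httpd children fit in RAM without touching swap."""
    available = total_kib - other_kib - reserve_kib
    if available <= 0 or per_child_kib <= 0:
        return 0
    return available // per_child_kib

total_kib = 2066044            # "Mem: 2066044k total" from top
per_child_kib = 27 * 1024      # typical httpd RES in the listing
mysqld_kib = 110 * 1024        # mysqld RES in the listing
reserve_kib = 256 * 1024       # assumed headroom for kernel and other daemons

limit = max_children(total_kib, per_child_kib, mysqld_kib, reserve_kib)
print("httpd children that fit in RAM:", limit)
```

If the result is well below the number of httpd processes actually running, capping the child count (the MaxClients directive under prefork in Apache 2.0/2.2) is one way to keep the box out of swap. Whether that applies here depends on the MPM in use, which the thread does not state.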
 
dealclickcoukAuthor Commented:

Also, looking a little further back in the log, I saw lots and lots of this type of entry:

Aug  5 08:54:02 localhost sshd(pam_unix)[1031]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root
Aug  5 08:54:31 localhost sshd(pam_unix)[1097]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root
Aug  5 08:54:32 localhost sshd(pam_unix)[1100]: check pass; user unknown
Aug  5 08:54:32 localhost sshd(pam_unix)[1100]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:34 localhost sshd(pam_unix)[1105]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root
Aug  5 08:54:35 localhost sshd(pam_unix)[1108]: check pass; user unknown
Aug  5 08:54:35 localhost sshd(pam_unix)[1108]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:37 localhost sshd(pam_unix)[1112]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root
Aug  5 08:54:38 localhost sshd(pam_unix)[1115]: check pass; user unknown
Aug  5 08:54:38 localhost sshd(pam_unix)[1115]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:39 localhost sshd(pam_unix)[1119]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root
Aug  5 08:54:41 localhost sshd(pam_unix)[1121]: check pass; user unknown
Aug  5 08:54:41 localhost sshd(pam_unix)[1121]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:42 localhost sshd(pam_unix)[1123]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root
Aug  5 08:54:43 localhost sshd(pam_unix)[1125]: check pass; user unknown
Aug  5 08:54:43 localhost sshd(pam_unix)[1125]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:45 localhost sshd(pam_unix)[1127]: check pass; user unknown
Aug  5 08:54:45 localhost sshd(pam_unix)[1127]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:46 localhost sshd(pam_unix)[1131]: check pass; user unknown
Aug  5 08:54:46 localhost sshd(pam_unix)[1131]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:48 localhost sshd(pam_unix)[1133]: check pass; user unknown
Aug  5 08:54:48 localhost sshd(pam_unix)[1133]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:49 localhost sshd(pam_unix)[1136]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=admin
Aug  5 08:54:50 localhost sshd(pam_unix)[1141]: check pass; user unknown
Aug  5 08:54:50 localhost sshd(pam_unix)[1141]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:52 localhost sshd(pam_unix)[1144]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=admin
Aug  5 08:54:53 localhost sshd(pam_unix)[1147]: check pass; user unknown
Aug  5 08:54:53 localhost sshd(pam_unix)[1147]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:54:54 localhost sshd(pam_unix)[1152]: check pass; user unknown
ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:55:01 localhost crond(pam_unix)[1166]: session opened for user root by (uid=0)
Aug  5 08:55:01 localhost crond(pam_unix)[1167]: session opened for user root by (uid=0)
Aug  5 08:55:01 localhost crond(pam_unix)[1167]: session closed for user root
Aug  5 08:55:01 localhost crond(pam_unix)[1166]: session closed for user root
Aug  5 08:55:02 localhost sshd(pam_unix)[1183]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=admin
Aug  5 08:55:03 localhost sshd(pam_unix)[1188]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=mysql
Aug  5 08:55:04 localhost sshd(pam_unix)[1192]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=admin
Aug  5 08:55:05 localhost sshd(pam_unix)[1195]: check pass; user unknown
Aug  5 08:55:05 localhost sshd(pam_unix)[1195]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug  5 08:55:07 localhost sshd(pam_unix)[1197]: check pass; user unknown

Is this some kind of hack/attack, i.e. so many "authentication failure" and "user unknown" entries?
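
One way to size up entries like these is to count failures per remote host and per attempted account. The Python sketch below is illustrative only: `count_failures` is a hypothetical helper, and the regular expression assumes the `sshd(pam_unix)` line format shown above.

```python
import re
from collections import Counter

# Matches the pam_unix failure lines shown above; "user=" is absent when
# the attempted account does not exist on the server.
FAILURE = re.compile(
    r"sshd\(pam_unix\)\[\d+\]: authentication failure;.*?"
    r"rhost=(?P<rhost>\S+)(?:\s+user=(?P<user>\S+))?"
)

def count_failures(lines):
    """Return (failures per remote host, failures per attempted user)."""
    hosts, users = Counter(), Counter()
    for line in lines:
        match = FAILURE.search(line)
        if not match:
            continue
        hosts[match.group("rhost")] += 1
        users[match.group("user") or "(unknown user)"] += 1
    return hosts, users

sample = [
    "Aug  5 08:54:02 localhost sshd(pam_unix)[1031]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=root",
    "Aug  5 08:54:32 localhost sshd(pam_unix)[1100]: check pass; user unknown",
    "Aug  5 08:54:32 localhost sshd(pam_unix)[1100]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com",
    "Aug  5 08:54:49 localhost sshd(pam_unix)[1136]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com  user=admin",
]

hosts, users = count_failures(sample)
for host, n in hosts.most_common():
    print(n, host)
```

A single host cycling through root, admin, mysql and nonexistent accounts every second or two is consistent with automated password guessing, though the counts alone do not prove intent.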
 
ygouthamCommented:
i have my own custom script that i use to kill runaway processes.

might require some clean up for your purposes

#######
<?php

$pid = getmypid();
$sid = session_id();

// mcheck variable is for limiting the process to a certain amount of memory in MB
$mcheck = 50;

print "Your PID is $pid ---- Session id = $sid";
print "<pre>";
print "<h1>All HTTP Processes list</h1><br><h2>Memory Limited to $mcheck MB only</h2>";
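// '[h]ttpd' keeps grep from matching its own command line
exec("ps lax | grep '[h]ttpd'", $output, $return);
$ctr=0;
// comma-joined PID, RSS (KB) and WCHAN columns, filled in by the loop below
$pa = $ma = $st = "";

//run the 'ps lax | grep httpd' command and take the output to parse line by line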

foreach($output as $file) {
      print "$file<br>";
      while($test=strstr($file, "  ") ){$file = str_replace("  ", " ", $file);}
      $p = explode(" ", $file);
      $pa =  $pa.$p[2].",";
      $ma =  $ma.$p[7].",";
      $st = $st.$p[8].",";
      }
      $pa = explode(",", $pa);
      $ma = explode(",", $ma);
      $max = count($pa);
      $st = explode(",", $st);

      //run the 'ps -lax | grep httpd' command and take the output to parse line by line and identify the
               //process running in excess of 50 mb to be killed
      // below block of code is only for GOUTHAM. Not useable outside!!!
      foreach($ma as $val2){
      $mem = ($val2 / 1024) - $mcheck;
      if($val2 > ($mcheck * 1024)){
            $expid = $pa[$ctr] ;
            $p = mysql_query("select username from login where pid = $expid order by pidtime desc limit 1");
            while ($c_row = mysql_fetch_row($p) ){foreach($c_row as $field8) {$user=$field8; }}
            print "<br>".$pa[$ctr]." pid running in excess of $mem mb of memory from user $user";
            exec("kill -9 ".$pa[$ctr], $misc, $ret2);
            }
      $ctr++;
      }

      //below code is for killing any process that returns a "interr" status while running 'ps lax | grep httpd' command
      
      $ctr=0;
      foreach($st as $val3){
      if($val3 == "interr"){
            $expid = $pa[$ctr];
            exec("kill -9 ".$pa[$ctr], $misc, $ret2);
            }
      $ctr++;   // advance so $pa[$ctr] stays aligned with the current $st entry
      }
      
            //for($i=0;$i<$max;$i++){print " ".$st[$i]." status  - ".$pa[$i]."  .<br>";  }

print "</pre>";

print "<p><p>Content Auto Refreshes in 1 minute</p></p>";
?>

####### END OF CODE ##########
 
ygouthamCommented:
it sure looks like someone is trying to get into your server through SSH. change the file

/etc/hosts.deny

and add a line at the end

SSHD:  ALL EXCEPT your.ip.address.here, your_other.ip.address.here, so.on_and.so.forth

that means that anyone trying to connect to your server through ssh from any other address would immediately be denied service.
 
dealclickcoukAuthor Commented:

ygoutham, thx for the script, I will give it a try. But am I right in assuming that it will not kill the "D" processes? As DonConsolio says above, "D" means uninterruptible sleep, so presumably they can't be killed.
 
ygouthamCommented:
mine takes the third variable which is the memory used. you can however use the 10th variable which shows currently active or dead and act accordingly.

check for the D status and modify the script accordingly and you should be done.
 
dealclickcoukAuthor Commented:

ygoutham, thx again, but what I meant was: is it actually possible to kill these D processes, given that they are uninterruptible?
 
ygouthamCommented:
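kill -9 kills a running process without a problem. a process in D state cannot act on any signal, even SIGKILL, until it leaves the uninterruptible sleep, so the kill only takes effect once its I/O wait ends.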
 
dealclickcoukAuthor Commented:

ygoutham, just going through the code: is the part with the SQL query something internal to your setup?

i.e. if I remove these two lines, will it still function:

        $p = mysql_query("select username from login where pid = $expid order by pidtime desc limit 1");
        while ($c_row = mysql_fetch_row($p) ){foreach($c_row as $field8) {$user=$field8; }}


 
ygouthamCommented:
yes.  i was trying to check from my login table as to who is using too much of system resources and therefore the comment before the block.  

in fact all references to mysql can be safely removed.
 
ygouthamCommented:
in fact you can just populate the $pa array with pid numbers (it is my pid array, and pardon my quixotic ways of naming variables), and you can pick up the 10th variable in your $st array (with $st = $st.$p[9].","; )

that would give an array of pid numbers with corresponding live or D statuses and proceed from there.
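
The same column-picking idea, sketched in Python rather than PHP for brevity (this is not the script from the thread). It assumes the `ps lax` layout from procps, where the fields are F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND, so PID is field 3, RSS field 8 and STAT field 10. The helper name `parse_ps_lax` and the sample rows are invented for illustration.

```python
def parse_ps_lax(text):
    """Yield (pid, rss_kib, stat) for each httpd row of `ps lax` output."""
    for line in text.splitlines():
        # COMMAND may contain spaces, so split off only the first 12 fields.
        cols = line.split(None, 12)
        if len(cols) < 13 or not cols[2].isdigit():
            continue  # header or malformed row
        if "httpd" not in cols[12]:
            continue
        yield int(cols[2]), int(cols[7]), cols[9]

# Invented sample in the procps `ps lax` layout.
sample = """\
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
5    48  4904  2600  15   0  46276 28672 sync_p D    ?          0:07 /usr/sbin/httpd
5    48  4804  2600  15   0  46316 28672 -      S    ?          0:07 /usr/sbin/httpd
4    27  2644     1  15   0 571392 112640 -     Sl   ?          1:44 /usr/libexec/mysqld
"""

rows = list(parse_ps_lax(sample))
stuck = [pid for pid, _rss, stat in rows if stat.startswith("D")]
print("httpd in uninterruptible sleep:", stuck)
```

Splitting on runs of whitespace avoids the repeated double-space replacement the PHP version needs, and checking `stat.startswith("D")` also catches variants such as `D<` or `D+`.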
 
dealclickcoukAuthor Commented:
ygoutham: many thx for this, it looks like a winner. Could this be adjusted to kill any process using too much CPU as well?
 
ygouthamCommented:
you need to specify what process to kill.  if you write something and end up killing "init" the entire system crashes!!!
 
dealclickcoukAuthor Commented:

OK, but using your existing script , ie only killing httpd child process, how would u identify the amount of CPU being used.  

The issue is when I look at the process list in top I see things like this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12551 mysql     15   0  557m 162m 3204 S 51.5  8.1   7:08.89 mysqld
15617 apache    21   0 47068  36m 7384 R 99.9  1.8   0:07.95 httpd

As you can see, PID 15617 is using 99.9% of the CPU, which can't be good, and sometimes I see several of these processes, which in turn kill the server for other processes.
 
ygouthamCommented:
you can do a multiple check: take the pid, see if it has a D status, and check whether the CPU usage and run time are beyond a particular number, and only then give the kill command. when you combine all three it makes sense. if you pick up only one aspect to proceed, then, as you put it, it becomes a dangerous tool to run.
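
Combining the checks might look like the following Python sketch. Everything in it is illustrative rather than taken from the thread: the `Proc` record, the `should_kill` helper, the thresholds (90% CPU, 30 seconds) and the elapsed times in the sample data are assumptions to tune. PIDs 12551 and 15617 and their CPU figures come from the top excerpt above.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    pid: int
    command: str
    state: str        # first letter of the ps STAT column, e.g. "D", "R", "S"
    cpu_percent: float
    elapsed_s: int    # seconds since the process started

def should_kill(proc, max_cpu=90.0, min_age_s=30, only_command="httpd"):
    """Flag a process only when several conditions hold at once."""
    if proc.command != only_command:
        return False              # never touch init, mysqld, sshd, ...
    if proc.elapsed_s < min_age_s:
        return False              # give short-lived spikes time to finish
    return proc.state == "D" or proc.cpu_percent >= max_cpu

procs = [
    Proc(1, "init", "S", 0.0, 3060),
    Proc(12551, "mysqld", "S", 51.5, 2900),
    Proc(15617, "httpd", "R", 99.9, 45),
    Proc(15700, "httpd", "R", 99.9, 5),
    Proc(4904, "httpd", "D", 0.0, 600),
]

for proc in procs:
    if should_kill(proc):
        print("candidate:", proc.pid, proc.state, proc.cpu_percent)
```

The sketch only reports candidates. Elapsed time here measures the life of the httpd child, not of the request it is currently serving, so it is a coarse signal, and a process in D state will not respond to a signal until its I/O wait ends.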
 
dealclickcoukAuthor Commented:
That's exactly what I want to do, but I'm not sure which parameter in your script would be the CPU usage...
 
DonConsolioCommented:
You could try to impose limits on apache and/or php

php.ini: max_execution_time=xxx

apache: RLimitCPU, RLimitMEM, RLimitNPROC (http://httpd.apache.org/docs/2.2/mod/core.html#rlimitcpu)

 
ygouthamCommented:
unfortunately it does not have the cpu usage time.  probably you need to include only the memory usage and the status "D" with the time run to see if it exceeds a particular time limit...  is that an option or you want to specifically check on CPU usage???
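
For reference, `ps` can report CPU directly when asked for specific columns: `ps -eo pid,pcpu,etime,stat,comm` prints the PID, %CPU, elapsed time, state and command name. In procps, `pcpu` is CPU time divided by the time the process has been running, not the instantaneous figure that top shows. A Python sketch of parsing that output follows; `parse_etime` and `parse_ps` are hypothetical helper names and the sample rows are invented.

```python
def parse_etime(field):
    """Convert ps etime ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in field.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)        # pad missing hours
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def parse_ps(text):
    """Yield (pid, cpu_percent, elapsed_s, stat, comm) per data row."""
    for line in text.splitlines()[1:]:      # skip the header row
        cols = line.split(None, 4)
        if len(cols) == 5:
            pid, pcpu, etime, stat, comm = cols
            yield int(pid), float(pcpu), parse_etime(etime), stat, comm

# Invented sample of `ps -eo pid,pcpu,etime,stat,comm` output.
sample = """\
  PID %CPU     ELAPSED STAT COMMAND
12551 51.5       48:20 Sl   mysqld
15617 99.9       00:45 R    httpd
 4904  0.3  1-02:03:04 D    httpd
"""

for pid, cpu, elapsed, stat, comm in parse_ps(sample):
    if comm == "httpd" and cpu > 90.0:
        print(pid, "is using", cpu, "% CPU after", elapsed, "s")
```

The `etime` format is used instead of a seconds-only column because older procps releases may not offer one; the parser handles the optional day and hour parts.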
Question has a verified solution.
