runaway apache processes

The Red Hat Linux server that hosts my website has recently started crashing with 100% memory usage.

When I look at top I see the following:

top - 19:12:23 up 51 min, 1 user, load average: 145.93, 144.28, 120.78
Tasks: 248 total, 1 running, 247 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.8% us, 1.2% sy, 0.0% ni, 35.7% id, 61.4% wa, 0.0% hi, 0.0% si
Mem: 2066044k total, 1984556k used, 81488k free, 14988k buffers
Swap: 522072k total, 522072k used, 0k free, 30108k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2644 mysql 15 0 558m 110m 2272 S 1.0 5.5 1:44.00 mysqld
4904 apache 15 0 46276 28m 2916 D 0.0 1.4 0:07.84 httpd
4804 apache 15 0 46316 28m 2916 D 0.0 1.4 0:07.74 httpd
4797 apache 15 0 46380 27m 2916 D 0.0 1.4 0:07.84 httpd
4773 apache 18 0 46384 27m 2916 D 0.0 1.4 0:07.30 httpd
4803 apache 15 0 46380 27m 2916 D 0.3 1.4 0:07.75 httpd
4684 apache 15 0 46360 27m 2908 D 0.0 1.4 0:07.73 httpd
4796 apache 15 0 46344 27m 2916 D 0.0 1.4 0:07.86 httpd
4650 apache 15 0 46412 27m 2916 D 0.0 1.4 0:07.92 httpd
4732 apache 15 0 46252 27m 2908 D 0.0 1.4 0:07.78 httpd
4774 apache 15 0 46416 27m 2916 D 0.0 1.4 0:08.13 httpd
4742 apache 18 0 46276 27m 2808 D 0.0 1.4 0:07.30 httpd
4870 apache 15 0 46308 27m 2916 D 0.0 1.4 0:07.85 httpd
4679 apache 18 0 46168 27m 2416 D 0.0 1.3 0:07.19 httpd
4721 apache 18 0 46168 26m 2416 D 0.0 1.3 0:07.28 httpd
4794 apache 18 0 46168 26m 2416 D 0.0 1.3 0:07.35 httpd
4802 apache 18 0 46168 26m 2416 D 0.0 1.3 0:07.30 httpd
4943 apache 15 0 46288 26m 2908 D 0.0 1.3 0:07.88 httpd
4666 apache 15 0 46368 26m 2876 D 0.0 1.3 0:08.60 httpd
3578 apache 15 0 46368 26m 2916 D 0.0 1.3 0:08.75 httpd
4508 apache 15 0 46324 26m 2916 D 0.0 1.3 0:08.07 httpd
4772 apache 18 0 46168 26m 2416 D 0.3 1.3 0:07.28 httpd
4501 apache 15 0 46248 26m 2908 D 0.0 1.3 0:07.73 httpd
3439 apache 15 0 46328 26m 2916 D 0.0 1.3 0:08.60 httpd
4511 apache 15 0 46376 26m 2916 D 0.0 1.3 0:07.80 httpd
4510 apache 15 0 46356 26m 2916 D 0.0 1.3 0:08.29 httpd
4505 apache 15 0 46408 26m 2916 D 0.0 1.3 0:08.05 httpd
2812 apache 15 0 46340 25m 2916 D 0.0 1.3 0:08.61 httpd
5002 apache 18 0 46168 25m 2416 D 0.0 1.3 0:07.39 httpd
4731 apache 15 0 46332 25m 2916 D 0.0 1.3 0:08.15 httpd
4548 apache 15 0 46372 25m 2908 D 0.0 1.2 0:07.65 httpd
5197 apache 15 0 46256 25m 2916 D 0.0 1.2 0:07.84 httpd
4869 apache 15 0 46372 24m 2916 D 0.0 1.2 0:08.57 httpd
4425 apache 15 0 46340 24m 2916 D 0.0 1.2 0:07.85 httpd
4341 apache 15 0 46364 24m 2908 D 0.0 1.2 0:08.23 httpd
4866 apache 15 0 46340 24m 2916 D 0.0 1.2 0:08.69 httpd
4418 apache 15 0 46336 24m 2908 D 0.0 1.2 0:07.74 httpd
4509 apache 15 0 46220 24m 2908 D 0.0 1.2 0:07.69 httpd

As you can see, there are lots of httpd processes in state "D". Does D mean dead?

I've read that this is related to "runaway" processes, but what can I do to debug this and identify the root cause, so that I can resolve the issue?

At peak website times, it takes less than an hour for the server to become totally bogged down like this, with the only resolution being to reboot the server.

DonConsolio

"D" means uninterruptible sleep. Usually this means the process is waiting for IO.

It might be either a hardware problem (check /var/log/messages for disk problems or similar troubles) or a hanging CGI script.

dealclickcouk

ASKER

Thx for the tip. I've looked through the messages log, and around the time when I saw lots of D processes I see lots of messages like these:

Aug 5 13:00:07 localhost kernel:
Aug 5 13:00:07 localhost kernel: Free pages: 13796kB (512kB HighMem)
Aug 5 13:00:08 localhost kernel: Active:360561 inactive:138632 dirty:0 writeback:0 unstable:0 free:3449 slab:4983 mapped:500559 pagetables:4517
Aug 5 13:00:08 localhost kernel: DMA free:12556kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:628 all_unreclaimable? yes
Aug 5 13:00:08 localhost kernel: protections[]: 0 0 0
Aug 5 13:00:08 localhost kernel: Normal free:728kB min:928kB low:1856kB high:2784kB active:523784kB inactive:330912kB present:901120kB pages_scanned:12222012 all_unreclaimable? yes
Aug 5 13:00:08 localhost kernel: protections[]: 0 0 0
Aug 5 13:00:09 localhost kernel: HighMem free:512kB min:512kB low:1024kB high:1536kB active:918460kB inactive:223616kB present:1170368kB pages_scanned:10911002 all_unreclaimable? no
Aug 5 13:00:09 localhost kernel: protections[]: 0 0 0
Aug 5 13:00:09 localhost kernel: DMA: 5*4kB 5*8kB 3*16kB 5*32kB 4*64kB 2*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 1*4096kB = 12556kB
Aug 5 13:00:09 localhost kernel: Normal: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 728kB
Aug 5 13:00:10 localhost kernel: HighMem: 0*4kB 8*8kB 4*16kB 4*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 512kB
Aug 5 13:00:10 localhost kernel: Swap cache: add 132469, delete 132468, find 865/1078, race 0+6
Aug 5 13:00:10 localhost kernel: 0 bounce buffer pages
Aug 5 13:00:10 localhost kernel: Free swap: 0kB
Aug 5 13:00:10 localhost kernel: 521968 pages of RAM
Aug 5 13:00:11 localhost kernel: 292592 pages of HIGHMEM
Aug 5 13:00:11 localhost kernel: 5555 reserved pages
Aug 5 13:00:11 localhost kernel: 100263 pages shared
Aug 5 13:00:11 localhost kernel: 1 pages swap cached
Aug 5 13:00:12 localhost kernel: Out of Memory: Killed process 5057 (httpd).
Aug 5 13:00:12 localhost kernel: oom-killer: gfp_mask=0xd0
Aug 5 13:00:12 localhost kernel: Mem-info:
Aug 5 13:00:12 localhost kernel: DMA per-cpu:
Aug 5 13:00:12 localhost kernel: cpu 0 hot: low 2, high 6, batch 1
Aug 5 13:00:13 localhost kernel: cpu 0 cold: low 0, high 2, batch 1
Aug 5 13:00:13 localhost kernel: cpu 1 hot: low 2, high 6, batch 1
Aug 5 13:00:13 localhost kernel: cpu 1 cold: low 0, high 2, batch 1
Aug 5 13:00:13 localhost kernel: Normal per-cpu:
Aug 5 13:00:14 localhost kernel: cpu 0 hot: low 32, high 96, batch 16
Aug 5 13:00:14 localhost kernel: cpu 0 cold: low 0, high 32, batch 16
Aug 5 13:00:14 localhost kernel: cpu 1 hot: low 32, high 96, batch 16
Aug 5 13:00:14 localhost kernel: cpu 1 cold: low 0, high 32, batch 16
Aug 5 13:00:15 localhost kernel: HighMem per-cpu:
Aug 5 13:00:15 localhost kernel: cpu 0 hot: low 32, high 96, batch 16
Aug 5 13:00:15 localhost kernel: cpu 0 cold: low 0, high 32, batch 16
Aug 5 13:00:15 localhost kernel: cpu 1 hot: low 32, high 96, batch 16
Aug 5 13:00:16 localhost kernel: cpu 1 cold: low 0, high 32, batch 16
Aug 5 13:00:16 localhost kernel:

I'm not sure if this is a problem, or normal, but it did stick out.

As this server is used only as a webserver, is there any way to limit the % of memory and CPU that each process uses, and how long it may run before it is automatically terminated? I.e. even if a process is in D state, if it has been that way for more than 30 seconds then the end user will most probably have got bored and left or hit refresh, so there is really no point waiting for IO regardless.

dealclickcouk

ASKER

Also, looking a little bit further back in the log, I saw lots and lots of this type of entry:

Aug 5 08:54:02 localhost sshd(pam_unix)[1031]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=root
Aug 5 08:54:31 localhost sshd(pam_unix)[1097]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=root
Aug 5 08:54:32 localhost sshd(pam_unix)[1100]: check pass; user unknown
Aug 5 08:54:32 localhost sshd(pam_unix)[1100]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:34 localhost sshd(pam_unix)[1105]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=root
Aug 5 08:54:35 localhost sshd(pam_unix)[1108]: check pass; user unknown
Aug 5 08:54:35 localhost sshd(pam_unix)[1108]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:37 localhost sshd(pam_unix)[1112]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=root
Aug 5 08:54:38 localhost sshd(pam_unix)[1115]: check pass; user unknown
Aug 5 08:54:38 localhost sshd(pam_unix)[1115]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:39 localhost sshd(pam_unix)[1119]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=root
Aug 5 08:54:41 localhost sshd(pam_unix)[1121]: check pass; user unknown
Aug 5 08:54:41 localhost sshd(pam_unix)[1121]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:42 localhost sshd(pam_unix)[1123]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=root
Aug 5 08:54:43 localhost sshd(pam_unix)[1125]: check pass; user unknown
Aug 5 08:54:43 localhost sshd(pam_unix)[1125]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:45 localhost sshd(pam_unix)[1127]: check pass; user unknown
Aug 5 08:54:45 localhost sshd(pam_unix)[1127]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:46 localhost sshd(pam_unix)[1131]: check pass; user unknown
Aug 5 08:54:46 localhost sshd(pam_unix)[1131]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:48 localhost sshd(pam_unix)[1133]: check pass; user unknown
Aug 5 08:54:48 localhost sshd(pam_unix)[1133]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:49 localhost sshd(pam_unix)[1136]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=admin
Aug 5 08:54:50 localhost sshd(pam_unix)[1141]: check pass; user unknown
Aug 5 08:54:50 localhost sshd(pam_unix)[1141]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:52 localhost sshd(pam_unix)[1144]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=admin
Aug 5 08:54:53 localhost sshd(pam_unix)[1147]: check pass; user unknown
Aug 5 08:54:53 localhost sshd(pam_unix)[1147]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:54:54 localhost sshd(pam_unix)[1152]: check pass; user unknown
ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:55:01 localhost crond(pam_unix)[1166]: session opened for user root by (uid=0)
Aug 5 08:55:01 localhost crond(pam_unix)[1167]: session opened for user root by (uid=0)
Aug 5 08:55:01 localhost crond(pam_unix)[1167]: session closed for user root
Aug 5 08:55:01 localhost crond(pam_unix)[1166]: session closed for user root
Aug 5 08:55:02 localhost sshd(pam_unix)[1183]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=admin
Aug 5 08:55:03 localhost sshd(pam_unix)[1188]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=mysql
Aug 5 08:55:04 localhost sshd(pam_unix)[1192]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com user=admin
Aug 5 08:55:05 localhost sshd(pam_unix)[1195]: check pass; user unknown
Aug 5 08:55:05 localhost sshd(pam_unix)[1195]: authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=hs-622.dedicated.hostalia.com
Aug 5 08:55:07 localhost sshd(pam_unix)[1197]: check pass; user unknown

Is this some kind of hack/attack, i.e. so many "authentication failure" and "user unknown" entries?
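
One way to judge how widespread the attempts are is to count the failures per remote host. This is only a sketch: the function name `ssh_failures_by_host` is hypothetical, and it assumes the log lines have the same shape as the excerpt above.

```shell
# Count sshd authentication failures per remote host.
# Reads syslog lines on stdin; prints "count host", busiest host first.
ssh_failures_by_host() {
    grep 'sshd(pam_unix).*authentication failure' |
        sed -n 's/.*rhost=\([^ ]*\).*/\1/p' |
        sort | uniq -c | sort -rn
}

# Typical use:
#   ssh_failures_by_host < /var/log/messages
```

A single host with hundreds of failures against names like root, admin and mysql is the usual signature of an automated password-guessing run.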

ygoutham

i have my own custom script that i use to kill runaway processes.

might require some clean up for your purposes

#######
<?php

$pid = getmypid();
$sid = session_id();

// mcheck variable is for limiting the process to a certain amount of memory in MB
$mcheck = 50;

print "Your PID is $pid ---- Session id = $sid";
print "<pre>";
print "<h1>All HTTP Processes list</h1><br><h2>Memory Limited to $mcheck MB only</h2>";
exec("ps lax | grep httpd", $output, $return);
$ctr=0;

//run the 'ps -lax | grep httpd' command and take the output to parse line by line

$pa = $ma = $st = "";   // comma-separated lists of the PID, memory (KB) and status fields
foreach($output as $file) {
      print "$file<br>";
      // split on runs of whitespace so the field positions are stable
      $p = preg_split('/\s+/', trim($file));
      $pa = $pa.$p[2].",";
      $ma = $ma.$p[7].",";
      $st = $st.$p[8].",";
      }
      $pa = explode(",", $pa);
      $ma = explode(",", $ma);
      $max = count($pa);
      $st = explode(",", $st);

      //run the 'ps -lax | grep httpd' command and take the output to parse line by line and identify the
//process running in excess of 50 mb to be killed
      // below block of code is only for GOUTHAM. Not useable outside!!!
      foreach($ma as $val2){
      $mem = ($val2 / 1024) - $mcheck;
      if($val2 > ($mcheck * 1024)){
            $expid = $pa[$ctr] ;
            $p = mysql_query("select username from login where pid = $expid order by pidtime desc limit 1");
            while ($c_row = mysql_fetch_row($p) ){foreach($c_row as $field8) {$user=$field8; }}
            print "<br>".$pa[$ctr]." pid running in excess of $mem mb of memory from user $user";
            exec("kill -9 ".$pa[$ctr], $misc, $ret2);
            }
      $ctr++;
      }

      //below code is for killing any process that returns a "interr" status while running 'ps -lax | grep httpd' command

      $ctr=0;
      foreach($st as $val3){
      if($val3 == "interr"){
            $expid = $pa[$ctr];
            exec("kill -9 ".$pa[$ctr], $misc, $ret2);
            }
      $ctr++;   // keep the index in step with $st, otherwise $pa[0] is killed on every match
      }

            //for($i=0;$i<$max;$i++){print " ".$st[$i]." status - ".$pa[$i]." .<br>"; }

print "</pre>";

print "<p><p>Content Auto Refreshes in 1 minute</p></p>";
?>

####### END OF CODE ##########

ygoutham

it sure looks like someone is trying to get into your server through SSH. edit the file

/etc/hosts.deny

and add a line at the end

sshd: ALL EXCEPT your.ip.address.here, your_other.ip.address.here, so.on_and.so.forth

that means that anyone trying to connect to your server through ssh from an address not on that list would immediately be denied service.

dealclickcouk

ASKER

ygoutham, thx for the script, I will give it a try. But am I right in assuming that this will not kill the "D" processes? As DonConsolio says above, "D" means uninterruptible sleep, so presumably they can't be killed.

ygoutham

mine takes the 8th variable ($p[7]), which is the memory used. you can however use the 10th variable ($p[9]), which shows the process status (running, sleeping, D and so on), and act accordingly.

check for the D status and modify the script accordingly and you should be done.

dealclickcouk

ASKER

ygoutham, thx again, but what I meant was: is it actually possible to kill these D processes, given that they are uninterruptible?

ygoutham

kill -9 kills a running or a dead process. should not be a problem

dealclickcouk

ASKER

ygoutham, just going through the code: is the part with the SQL query something internal to your setup?

i.e. if I remove these two lines, will it still function?

$p = mysql_query("select username from login where pid = $expid order by pidtime desc limit 1");
while ($c_row = mysql_fetch_row($p) ){foreach($c_row as $field8) {$user=$field8; }}

ygoutham

yes. i was trying to check from my login table as to who is using too much of system resources and therefore the comment before the block.

in fact all references to mysql can be safely removed.

ygoutham

in fact you can just populate the $pa array with pid numbers (that is my pid array, and pardon my quixotic ways of naming variables), and you can pick up the 10th variable in your $st array (with $st = $st.$p[9].","; )

that would give an array of pid numbers with corresponding live or D statuses and proceed from there.
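
The same selection can be sketched in shell, assuming the usual procps `ps lax` layout (PID in field 3, STAT in field 10, command starting at field 13); the function name `httpd_d_pids` is made up for illustration. Bear in mind that a process in D state does not act on any signal, kill -9 included, until it leaves the uninterruptible sleep, so a list like this is more useful for diagnosis than for killing.

```shell
# Print the PIDs of httpd processes whose STAT column starts with D.
# Reads `ps lax` output on stdin.
httpd_d_pids() {
    awk '$13 ~ /httpd/ && $10 ~ /^D/ { print $3 }'
}

# Typical use:
#   ps lax | httpd_d_pids
```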

dealclickcouk

ASKER

ygoutham: many thx for this, it looks like a winner. Could this be adjusted to kill any process using too much CPU as well?

ygoutham

you need to specify what process to kill. if you write something and end up killing "init" the entire system crashes!!!

dealclickcouk

ASKER

OK, but using your existing script, i.e. only killing httpd child processes, how would you identify the amount of CPU being used?

The issue is when I look at the process list in top I see things like this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12551 mysql 15 0 557m 162m 3204 S 51.5 8.1 7:08.89 mysqld
15617 apache 21 0 47068 36m 7384 R 99.9 1.8 0:07.95 httpd

As you see here, PID 15617 is using 99.9% of the CPU, which can't be good, and sometimes I see several of these processes, which in turn starve the other processes on the server.

ygoutham

he, you can do a multiple check by taking the pid and seeing if it has a D status and if the CPU usage and time run is beyond a particular number and then proceed to give the kill command. when you combine all the three then it makes sense. if you pick up only one aspect to proceed, then as you put it, it becomes a dangerous tool to be run.

dealclickcouk

ASKER

That's exactly what I want to do, but I'm not sure which parameter in your script would be the CPU usage...

DonConsolio

You could try to impose limits on apache and/or php

php.ini: max_execution_time=xxx

apache: RLimitCPU, RLimitMEM, RLimitNPROC (http://httpd.apache.org/docs/2.2/mod/core.html#rlimitcpu)
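
A related limit that the thread does not mention is a cap on the number of Apache children (MaxClients in the prefork MPM), so that the workers cannot outgrow physical RAM. The arithmetic below is a rough sketch, not a tuning recommendation. The total RAM and the per-process size come from the top output earlier in the thread; the 500 MB reserved for mysqld and the OS is an assumption, and `max_workers` is a hypothetical helper.

```shell
# max_workers TOTAL_KB RESERVED_KB PER_PROC_KB
# Prints how many worker processes fit in RAM after setting RESERVED_KB aside.
max_workers() {
    echo $(( ($1 - $2) / $3 ))
}

# 2066044 kB total RAM (from top), 512000 kB assumed reserved for mysqld
# and the OS, 28672 kB (28 MB) resident per httpd process (from top).
max_workers 2066044 512000 28672   # prints 54
```

RES includes pages shared between the children, so the real ceiling is probably somewhat higher than this figure.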

ygoutham

unfortunately it does not have the cpu usage time. probably you need to include only the memory usage and the status "D" with the time run to see if it exceeds a particular time limit... is that an option or you want to specifically check on CPU usage???
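
For the CPU check, ps can report a percentage directly if it is asked for the pcpu column. The sketch below assumes procps ps; the function name `httpd_cpu_hogs` and the threshold of 90 are made up for illustration. Note that the %CPU shown by ps is CPU time divided by the lifetime of the process, so it does not always match the instantaneous figure in top.

```shell
# Print the PIDs of httpd processes whose %CPU is above the given limit.
# Reads `ps -eo pid,stat,pcpu,time,comm` output on stdin.
httpd_cpu_hogs() {
    awk -v limit="$1" 'NR > 1 && $5 == "httpd" && $3 + 0 > limit + 0 { print $1 }'
}

# Typical use:
#   ps -eo pid,stat,pcpu,time,comm | httpd_cpu_hogs 90
```

Combining this with the status and memory checks, as suggested above, avoids killing a worker that is only briefly busy.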

ASKER CERTIFIED SOLUTION

ygoutham
