MaxClient httpd/apache processes consistently reached and then connections refused

Posted on 2004-09-17
Last Modified: 2008-01-09
We have a RedHat Linux 7.3 server running apache 1.3.27, with MaxClients=256. Occasionally we have seen the number of httpd processes go to 256, stay there for a few minutes, then drop back to well under 100. During those few minutes some connections are not accepted. The server hosts about 80-100 sites across about 500 virtual domains (most of which are unused). About 5 of the sites do 90% of the traffic.

We recompiled httpd to allow setting MaxClients higher, and set it to 512 (I know it is dangerous to do so, but we have 1GB RAM and thus took the risk). MaxRequestsPerChild = 200. KeepAlive is On, MaxKeepAliveRequests = 200, KeepAliveTimeout = 15.

The MySQL connection limit was increased from 400 to 600 when we increased MaxClients to 512.

After all this we still see the same behavior. We run a script that counts httpd processes every 15 seconds; here is one such episode. Note the sudden jump right after 10:17:05, the climb to MaxClients processes for a few minutes, and then the drop back down.

42       09/17/04 10:15:34
42       09/17/04 10:15:49
40       09/17/04 10:16:04
36       09/17/04 10:16:19
36       09/17/04 10:16:34
36       09/17/04 10:16:50
36       09/17/04 10:17:05
178      09/17/04 10:17:37
242      09/17/04 10:17:53
243      09/17/04 10:18:11
243      09/17/04 10:18:27
272      09/17/04 10:18:43
397      09/17/04 10:19:14
405      09/17/04 10:19:34
513      09/17/04 10:20:07
514      09/17/04 10:20:23
514      09/17/04 10:20:41
514      09/17/04 10:20:58
(... repeats every 15 seconds)
514      09/17/04 10:24:49
486      09/17/04 10:25:48
72       09/17/04 10:26:09
73       09/17/04 10:26:24
72       09/17/04 10:26:39
65       09/17/04 10:26:54
56       09/17/04 10:27:10

(Note, when it says 514 it's actually 512, we just use "ps aux | fgrep httpd | wc -l" which results in 2 extra counts).
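The two extra counts presumably come from the fgrep process matching itself plus the parent httpd process, which does not serve clients. A minimal sketch of a count that excludes both, assuming standard `ps aux` column layout and a parent that runs as root (the function name and the `parent_user` parameter are hypothetical, not part of the original script):

```python
def count_httpd_children(ps_aux_output, parent_user="root"):
    """Count httpd worker processes in the text output of `ps aux`.

    Skips the header line, any process whose executable is not httpd
    (such as the fgrep used to filter), and the parent httpd, which is
    assumed here to be the only httpd owned by `parent_user`.
    """
    count = 0
    for line in ps_aux_output.splitlines():
        # ps aux has 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue
        user, command = fields[0], fields[10]
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable != "httpd":
            continue
        if user == parent_user:
            continue
        count += 1
    return count
```

Feeding this the captured output of `ps aux` every 15 seconds should give numbers that can be compared directly against MaxClients.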

We have looked at traffic logs on all sites and can't see anything that would cause this type of traffic, IF it is traffic that's the problem. But, most of our sites are PHP+MySQL based, so maybe something is going on with some web site's programming that would cause this behavior. Anyway, we see max of around 5000-10,000 hits (from all combined server logs) in a 10-minute interval. (I will verify this number again shortly). That doesn't seem like too much but, maybe it is?

Other details:
* Seems to happen about 2-3 times a day.
* Does seem to happen during times when you would expect more use, i.e., doesn't happen in early a.m. hours.
* None of the httpd processes, when maxed out, are in any sort of spin loop. They all have 0 to 5 seconds of CPU time.
* System load average stays under 2.
* When this happens we have looked carefully at top, memory, swap etc., and all seem fine. System is not thrashing or going to disk for pages while these MaxClients processes are working. It's just... "stuck" for a while, then spontaneously resolves and apache processes die off rapidly.

Whatever help you can provide to help us track through this problem would be appreciated. I will award points if I feel significant help is rendered even if it doesn't result in a solution per-se.



Question by:andrewfx

Expert Comment

ID: 12090308
Have you done an analysis of the actual traffic logs? When you say 5000-10,000 hits, do you mean requests, or unique IP addresses? More importantly, since this is a regular occurrence, do all these requests come from the same IP address or subnet? Or is there a spike in activity on a particular vhost? Perhaps someone is periodically mirroring a site, or worse, benchmarking the server. It might even be a DDoS attack.

It does seem unlikely that a single individual could generate this much traffic on their own due to bandwidth limitations, but is it possible that there is a rogue machine on your local network? The bandwidth of your LAN would probably be enough to saturate you with requests in the way you describe.

Obviously, all this is just wild speculation, but I think a good look at your traffic logs would reveal more.

Author Comment

ID: 12101949
Thanks for the response. For some reason I did not get any email from EE like I normally do, otherwise I'd have responded earlier.

Wild speculation is just fine!

Yes, I have now analyzed logs around the incident above. Here is the per-minute number of HITS recorded in ALL vhost logfiles. (Defining HIT = one line in the logfile. So, we count number of lines in all vhost logfiles that have the corresponding date-time-minute, then add 'em all up across all vhosts). So this is not unique users or anything. Just hits. Yeah if it was unique users then that's a lot of traffic.

Time   Hits
-----  ----
10:05 489
10:06 383
10:07 441
10:08 551
10:09 445
10:10 600
10:11 542
10:12 911
10:13 720
10:14 907
10:15 786
10:16 587
10:17 596
10:18 881
10:19 455
10:20 638
10:21 133
10:22 0
10:23 0
10:24 0
10:25 0
10:26 1347
10:27 468
10:28 550
10:29 463
10:30 457
10:31 421

As in my first posting, the limit of MaxClients httpd was hit around 10:20. Soon after that, everything is STUCK; no more hits are recorded.

Around 10:25 is when (in this case) I did an apachectl restart to get it unstuck.
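For anyone wanting to reproduce the per-minute tally, here is a minimal sketch. It assumes the vhost logs use the standard Common Log Format timestamp, e.g. `[17/Sep/2004:10:17:05 -0400]`; the function name is hypothetical, and the lines from all vhost logfiles would be passed in together.

```python
import re
from collections import Counter

# Matches a Common Log Format timestamp and captures the date and HH:MM.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def hits_per_minute(lines, day):
    """Count log lines per HH:MM for one day, e.g. day="17/Sep/2004".

    A hit is one logfile line, as defined in the post above. Lines
    without a recognizable timestamp are ignored.
    """
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match and match.group(1) == day:
            counts[match.group(2)] += 1
    return counts
```

Because the result is a `Counter`, minutes with no logged requests simply read as 0, which makes gaps like the one at 10:22-10:25 easy to spot.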


(a) Does the traffic leading up to the incident seem like a lot? To me it doesn't. The server is an AMD 1.7 GHz w/ 1 GB RAM and SCSI disks. Like I said, swap is not being touched during all this. But, maybe 900+ hits in a minute is likely to touch off some problem? There's just very little info out there on this type of stuff, it seems.

But, we didn't get any such incident over the weekend, when traffic is typically light. The number of httpd processes exceeded 100 for a few minutes, then was usually below 30. We shall see what happens this week.

(b) If this IS purely a traffic issue, it should still be recording some large number of hits during 10:21-10:25, right? (Even if some accesses are being denied).

(c) Could this be a DoS issue? (But it seems dumb for someone to DoS us for so little time and so rarely). But is it possible to detect when httpd is in a DoS state? I.e. if I get to MaxClients again, how can I tell from netstat or something else if we're being DoS'ed?

(d) Any dangers to triggering an apachectl restart from a script if this situation is detected?
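On (d), one way to reduce the risk of restarting on a momentary spike is to require several consecutive maxed-out samples before acting. The following is only a sketch of that decision logic, assuming the 15-second process counts are already being collected; the function name, parameters, and the default of four samples (about one minute) are hypothetical choices, and the actual `apachectl restart` call is left to the caller.

```python
def should_restart(samples, max_clients, required_consecutive=4):
    """Return True if the most recent samples are all at MaxClients.

    `samples` is a list of httpd process counts, oldest first, taken
    at a fixed interval (every 15 seconds in this thread).
    """
    if len(samples) < required_consecutive:
        return False
    recent = samples[-required_consecutive:]
    return all(count >= max_clients for count in recent)
```

With counts like those in the first post, a climb through 176, 240, 241 would not trigger anything, while a minute spent pinned at 512 would.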

A new week, more traffic, we'll see what happens.

Thanks for the answer!


Expert Comment

ID: 12102261
I don't think the number of hits you are experiencing is particularly unusual, given the number of sites you are hosting. Interestingly, the number of hits doesn't seem to correlate with the number of processes. For instance, at 10:17, when the number of processes is increasing, the number of hits is actually lower than it was two or three minutes earlier. What is really odd is that you get *no* hits while the processes are maxed out. One would expect the existing processes to handle at least one request each at a time, and only requests above that limit to be rejected.

I offer the theory that all these processes are therefore hung, and so cannot service any clients, while Apache is unable to create new processes to service new clients. This may well be a bug in Apache, or possibly a CGI program which enters an infinite loop. I think httpd periodically kills off unresponsive processes, which explains why the problem eventually fixes itself.

Looking at the apache changelog, I see one issue (fixed in v1.3.28 I think) which might be causing this:

  *) Update timeout algorithm in free_proc_chain. If a subprocess
     did not exit immediately, the thread would sleep for 3 seconds
     before checking the subprocess exit status again. In a very
     common case when the subprocess was an HTTP server CGI script,
     the CGI script actually exited a fraction of a second into the 3
     second sleep, which effectively limited the server to serving one
     CGI request every 3 seconds across a persistent connection.
     PRs 6961, 8664 [Bill Stoddard]

Also, fixed in v1.3.29:

  *) Prevent creation of subprocess Zombies when using CGI wrappers
     such as suExec and cgiwrap. PR 21737. [Numerous]

Most interesting is

  *) SECURITY: CAN-2004-0174 (
     Fix starvation issue on listening sockets where a short-lived
     connection on a rarely-accessed listening socket will cause a
     child to hold the accept mutex and block out new connections until
     another connection arrives on that rarely-accessed listening socket.
     Enabled for some platforms known to have the issue (accept()
     blocking after select() returns readable).  Define
     NONBLOCK_WHEN_MULTI_LISTEN if needed for your platform and not
     already defined.  [Jeff Trawick, Brad Nicholes, Joe Orton]

More info on this last issue is available here:
Although this bug is not *known* to affect Linux systems, it just might, and it would probably cause the problems you are having.

I suggest two possible remedies:
1. Upgrade Apache to the latest 1.x version (1.3.31) - obviously this isn't as easy as it sounds, since you'll need some downtime, but since there are several security advisories which affect your current version, it's something you will have to do sometime anyway.

2. Lower MaxRequestsPerChild from 200 down to maybe 25 or 50. This will mean each process will terminate sooner, making it less likely to leak memory or other resources. This will mean a performance hit, due to increased process creation overhead. Also, make sure that MaxSpareServers is set to a sensible value, so that idle processes don't hang around doing nothing.

If nothing works, then maybe you've found a new bug in httpd, and you need to take this to the apache mailing lists.

Expert Comment

ID: 12102380
P.S. You should probably also lower your settings for MaxKeepAliveRequests and KeepAliveTimeout; 15 seconds is a long time for a process to be idle. Generally, a client downloading all the elements of a webpage issues its requests almost immediately one after another. At present, you could have a process idling for 15 seconds while the user is busy reading the text of the webpage; meanwhile, there are no free processes to handle other users' requests.

Author Comment

ID: 12102423

Well, there's a lot of stuff to digest in there.

(a) For some reason I've had this thought in my mind that 1.3 development stopped with 1.3.27. I didn't realize it was up to 1.3.31 ..! So I guess you're right, we have to upgrade at some point. Actually downtime is not an issue, it's just a lot of work to compile apache the way we need it (php/ssl/.. etc.).

(b) Just about all the vhosts are ones we developed ourselves, and we use PHP, not CGI. Could a PHP page with an infinite loop just as well cause this as a CGI? (Actually it can't be an infinite loop since there are no processes in spin-loops; but maybe some sort of block state, maybe waiting for a mysql connection, for example?)

(c) Other than lowering MaxRequestsPerChild (which I was thinking of doing anyway), are there any other directives to tell apache (the main root apache that forks off the children) to kill off unresponsive child procs quickly? Obviously there are downsides to that too - killing a legit child - but we could tune it. Things like TimeOut/KeepAlive/MaxKeepAliveRequests/KeepAliveTimeout?

(d) Min/MaxSpareServers are set to 5/20 respectively. Is that "reasonable" ?



Expert Comment

ID: 12102425
P.P.S. I notice that the time your server remains deadlocked is around 5 mins - this is also the default value for the TimeOut directive - perhaps setting TimeOut lower (60 seconds perhaps?) would limit the time before the server rights itself.

Accepted Solution

mishagale earned 500 total points
ID: 12102583
(b) Sometimes, if you aren't using mod_php, then PHP pages are interpreted in the same way as CGI scripts. Even with mod_php, there is no reason a PHP page cannot become deadlocked like any other program. Maybe you should check whether MySQL hangs at the same time as httpd.

(c) I can't find a directive to do exactly what we want, but sending the main parent process a SIGHUP will cause it to kill all its children, and then spawn new ones according to StartServers. SIGUSR1 does something similar, but won't kill off any children that are in the middle of something. Maybe tweaking the TimeOut directive will help?

(d) I would have thought that setting is perfectly OK.

Author Comment

ID: 12102744

Yeah, we're using mod_php but you're certainly right that a PHP page can get blocked like any other. What is puzzling is how one PHP page could cause ALL the children to lock up; I doubt that all the requests could be for that one page.

I'll check into MySQL next time it happens.

OK, thanks for all this (and sorry some of my answers came while you were writing more thoughts). It is definitely helpful and along the lines of what I was hoping someone would walk me through.

I'm going to monitor the server closely this week and try doing some of the things you suggested, and we'll see what happens.  I'll also investigate upgrading apache, either to latest 1.3.x, or maybe see if apache 2.0.x is ready for primetime yet.

Please do me one last favor, ping me on Wednesday if I haven't awarded any points yet and I'll take care of that. And if you happen to think of anything in the interim, please post it, and if I observe anything interesting I'll post it here as well.



Author Comment

ID: 12103709
Hi there,

Well, the issue came up again soon after our discussion this morning. I was alerted as soon as number of procs crossed 200, so I saw it was maxed at 256 (current MaxClients).

This time I did a netstat -a and found 200+ connections like this:

tcp        1  13032 SOMEHOST:64787 CLOSE_WAIT
tcp        1  13032 SOMEHOST:64788 CLOSE_WAIT
tcp        1  13032 SOMEHOST:64789 CLOSE_WAIT
tcp        1  13032 SOMEHOST:64790 CLOSE_WAIT
tcp        1  13032 SOMEHOST: 2312 CLOSE_WAIT
tcp        1  13032 SOMEHOST:64777 CLOSE_WAIT
tcp        1  13032 SOMEHOST:64778 CLOSE_WAIT

..slightly edited, where "SOMEHOST" is all the same host from a major corporation, so I prefer not to name them here.
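A quick way to spot this kind of pattern is to tally connections by remote host and TCP state. The sketch below assumes the standard six-column Linux `netstat -an` layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), not the trimmed lines shown above; the function name and the addresses in the usage note are hypothetical.

```python
from collections import Counter


def connections_by_host_and_state(netstat_output):
    """Tally TCP connections by (remote host, state) from `netstat -an` text.

    Header lines and non-TCP entries are skipped. The remote port is
    dropped so that many connections from one host are counted together.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign_address, state = fields[4], fields[5]
        host = foreign_address.rsplit(":", 1)[0]
        counts[(host, state)] += 1
    return counts
```

Calling `connections_by_host_and_state(text).most_common(5)` on the captured output would list the busiest host and state pairs first, so a single host holding hundreds of CLOSE_WAIT connections stands out immediately.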

Any idea what could be going on? I'm pretty sure this is the problem though.  Did some searching and found:

Of course, further research is strictly for academic curiosity. I blasted that sucker into an iptables DROP rule so "SOMEHOST" can forget trying to access any of our sites. I'll keep watching, and whoever else wants to try this same trick will be easy enough to deny; I certainly have no problem denying access to a whole x.y.z.w/16 until their sysadmin asks to be reconnected.

Listen, your answers helped me think through a lot of this and gave me a better understanding of how all this works (I've been running this server + vhosts for 3 years now and still feel like I know 10% of what there is to know..).  With your help, maybe I know 11% now. I'll investigate the different settings for TimeOut and such, to possibly defend against such things better in the future. Beyond the technical advice, it's just real comforting knowing that someone knowledgeable out there is thinking about your problem and trying to help.

Thank you! I really appreciate it.


Expert Comment

ID: 12104013
The fact that your connections are in a CLOSE_WAIT state suggests that either you or SOMEHOST are failing to acknowledge the other party's TCP FIN packets, i.e. one of you is attempting to close the connection, and the other is simply ceasing to send data, rather than sending an acknowledgement and terminating gracefully. Without looking at a detailed tcpdump, I couldn't tell which end was the problem, but I'd guess it isn't Apache.
Anyway, I think Apache is finally aborting the connection after the number of seconds specified by TimeOut, by default 300. The Apache docs say this value is "far more than necessary in most situations." I'd suggest lowering it to around 30-60 seconds.
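
The timeout and keep-alive suggestions made in this thread could be written into httpd.conf roughly as follows. This is only a sketch: the TimeOut and MaxRequestsPerChild values are taken from the ranges suggested here, while the MaxKeepAliveRequests and KeepAliveTimeout values are assumptions picked for illustration, since no specific numbers were proposed for them.

```apache
# Abort stalled connections sooner than the 300-second default
# (30-60 seconds is the range suggested in this thread).
TimeOut 60

# Keep persistent connections, but limit how long an idle client
# can hold on to a child process.
KeepAlive On
# Assumed value: the thread only says to go lower than 200.
MaxKeepAliveRequests 100
# Assumed value: the thread only says 15 seconds is too long.
KeepAliveTimeout 5

# Recycle children sooner (25 or 50 was suggested, down from 200).
MaxRequestsPerChild 50
```

These would need to be tried one at a time under real traffic, since shorter timeouts can also cut off legitimately slow clients.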

If SOMEHOST are at fault, then it could reasonably be construed as evidence of a DoS, but since it was only for a few minutes at a time, more likely a badly configured router or operating system. Either way, it might not be a bad idea to notify their sysadmins.

Glad you were able to solve your problem.

Author Comment

ID: 12104211

You're probably right that it is not a DoS,  I suppose someone can do us a lot worse if they want to.

I was thinking about the TimeOut adjustment in particular (that you had earlier recommended). Will play around with that.

I'd notify the sysadmins of SOMEHOST, but it's a giant company and all I have for them is customer service e-mails. If someone loses access (and is a legit user), we will presumably hear about it, and then can take it from there to find out what's going on.



