andrewfx asked:
MaxClients limit of httpd/Apache processes consistently reached, then connections refused

We have a Red Hat Linux 7.3 server running Apache 1.3.27 with MaxClients=256. We have occasionally seen the number of httpd processes climb to 256, stay there for a few minutes, then drop back to well under 100. During those few minutes some connections are not accepted. The server hosts about 80-100 active sites across about 500 virtual domains (most of which are unused); about 5 of the sites generate 90% of the traffic.

We recompiled httpd to allow setting MaxClients higher and set it to 512 (I know it is risky to do so, but we have 1 GB RAM and decided to take the chance). MaxRequestsPerChild = 200, KeepAlive is On, MaxKeepAliveRequests = 200, KeepAliveTimeout = 15.

The MySQL maximum-connections setting was increased from 400 to 600 when we raised MaxClients to 512.

After all this we still see the same behavior. We run a script that counts httpd processes every 15 seconds; here is one such episode. Note the sudden jump at 10:17:05, the climb to MaxClients processes for a few minutes, and then the drop back down.

42       09/17/04 10:15:34
42       09/17/04 10:15:49
40       09/17/04 10:16:04
36       09/17/04 10:16:19
36       09/17/04 10:16:34
36       09/17/04 10:16:50
36       09/17/04 10:17:05
178      09/17/04 10:17:37
242      09/17/04 10:17:53
243      09/17/04 10:18:11
243      09/17/04 10:18:27
272      09/17/04 10:18:43
397      09/17/04 10:19:14
405      09/17/04 10:19:34
513      09/17/04 10:20:07
514      09/17/04 10:20:23
514      09/17/04 10:20:41
514      09/17/04 10:20:58
(... repeats every 15 seconds)
514      09/17/04 10:24:49
486      09/17/04 10:25:48
72       09/17/04 10:26:09
73       09/17/04 10:26:24
72       09/17/04 10:26:39
65       09/17/04 10:26:54
56       09/17/04 10:27:10

(Note: when it says 514 it is really 512; we count with "ps aux | fgrep httpd | wc -l", which picks up 2 extra lines.)
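
For reference, the counting script is just a trivial loop, roughly like this (a sketch, not the exact script):

  #!/bin/sh
  # crude monitor: print the httpd process count every 15 seconds
  # (the fgrep pipeline over-counts by a couple of lines, as noted above)
  while true; do
      echo "`ps aux | fgrep httpd | wc -l`       `date '+%D %T'`"
      sleep 15
  done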

We have looked at the traffic logs for all sites and can't see anything that would cause this kind of load, IF traffic is in fact the problem. But most of our sites are PHP+MySQL based, so maybe something in some site's code is causing this behavior. In any case, we see at most around 5,000-10,000 hits (summed across all server logs) in a 10-minute interval (I will verify this number again shortly). That doesn't seem like too much, but maybe it is?

Other details:
* It seems to happen about 2-3 times a day.
* It does seem to happen during times when you would expect more use, i.e., it doesn't happen in the early a.m. hours.
* None of the httpd processes, when maxed out, is in any sort of spin loop; they all have 0 to 5 seconds of CPU time (see the ps sketch below).
* System load average stays under 2.
* When this happens we have looked carefully at top, memory, swap, etc., and all seem fine. The system is not thrashing or paging to disk while these MaxClients processes are working. It's just... "stuck" for a while, then spontaneously resolves and the Apache processes die off rapidly.
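
The per-process CPU check is nothing more elaborate than something like this (a sketch; exact ps options may vary):

  # cumulative CPU time and state for every httpd process
  ps -C httpd -o pid,time,stat,args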

Whatever help you can provide in tracking down this problem would be appreciated. I will award points if I feel significant help has been rendered, even if it doesn't result in a solution per se.

Thanks!

Andrew


mishagale:

Have you done an analysis of the actual traffic logs? When you say 5,000-10,000 hits, do you mean requests, or unique IP addresses? More importantly, since this is a regular occurrence, do all these requests come from the same IP address or subnet? Or is there a spike in activity on a particular vhost? Perhaps someone is periodically mirroring a site or, worse, benchmarking the server. It might even be a DDoS attack.

It does seem unlikely that a single individual could generate this much traffic on their own due to bandwidth limitations, but is it possible that there is a rogue machine on your local network? The bandwidth of your LAN would probably be enough to saturate you with requests in the way you describe.

Obviously, all this is just wild speculation, but I think a good look at your traffic logs would reveal more.
andrewfx (Asker):

Thanks for the response. For some reason I did not get any email from EE like I normally do, otherwise I'd have responded earlier.

Wild speculation is just fine!

Yes, I have now analyzed the logs around the incident above. Here are the per-minute numbers of HITS recorded across ALL vhost logfiles. (Defining HIT = one line in a logfile: we count the number of lines in all vhost logfiles with the corresponding date-time-minute, then add them up across all vhosts.) So this is not unique users or anything, just hits. If it were unique users, that would be a lot of traffic.

TIME  HITS
------  -----
10:05 489
10:06 383
10:07 441
10:08 551
10:09 445
10:10 600
10:11 542
10:12 911
10:13 720
10:14 907
10:15 786
10:16 587
10:17 596
10:18 881
10:19 455
10:20 638
10:21 133
10:22 0
10:23 0
10:24 0
10:25 0
10:26 1347
10:27 468
10:28 550
10:29 463
10:30 457
10:31 421
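
A tally like the one above can be produced with a quick pipeline over the vhost logs, roughly like the following sketch (the log path is only illustrative):

  # hits per minute across all vhost access logs for the 10 o'clock hour
  # (common-log timestamps look like [17/Sep/2004:10:17:05 -0400])
  grep -h '17/Sep/2004:10:' /var/log/httpd/*-access_log | cut -d: -f2,3 | sort | uniq -c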

As in my first posting, the MaxClients limit on httpd was hit around 10:20. Soon after that, everything is STUCK; no more hits are recorded.

Around 10:25 is when (in this case) I did an apachectl restart to get it unstuck.

Questions:

(a) Does the traffic leading up to the incident seem like a lot? To me it doesn't. The server is a 1.7 GHz AMD with 1 GB RAM and SCSI disks. Like I said, swap is not being touched during any of this. But maybe 900+ hits in a minute is enough to touch off some problem? There just seems to be very little information out there on this type of thing.

But we didn't get any such incident over the weekend, when traffic is typically light; the number of servers exceeded 100 for a few minutes but was usually below 30. We shall see what happens this week.

(b) If this IS purely a traffic issue, it should still be recording some large number of hits during 10:21-10:25, right? (Even if some accesses are being denied).

(c) Could this be a DoS issue? (It seems dumb for someone to DoS us for such a short time and so rarely.) Is it possible to detect when httpd is in a DoS state? I.e., if we hit MaxClients again, how can I tell from netstat or something else whether we're being DoS'ed? (See the netstat sketch after question (d) for the kind of thing I have in mind.)

(d) Are there any dangers in triggering an apachectl restart from a script when this situation is detected?
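
For (c), what I have in mind is a quick per-host tally of connections to port 80, something like this sketch (a single remote host holding hundreds of sockets would be the giveaway):

  # count established http connections per remote host (numeric addresses)
  netstat -ant | awk '$6 != "LISTEN" && $4 ~ /:80$/ {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn | head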

A new week, more traffic, we'll see what happens.

Thanks for the answer!

Andrew
mishagale:

I don't think the number of hits you are experiencing is particularly unusual, given the number of sites you are hosting. Interestingly, the number of hits doesn't seem to correlate exactly with the number of processes; for instance, at 10:17, when the number of processes is increasing, the number of hits is actually lower than a minute earlier. What is really odd is that you get *no* hits while the processes are maxed out: one would expect each existing process to handle at least one request at a time, with only requests above that limit being rejected.

I offer the theory that all these processes are therefore hung and cannot service any clients, while Apache is unable to create new processes to serve new clients. This may well be a bug in Apache, or possibly a CGI program which enters an infinite loop. I think httpd periodically kills off unresponsive processes, which would explain why the problem eventually fixes itself.

Looking at the Apache changelog, I see one issue (fixed in v1.3.28, I think) which might be causing this:



  *) Update timeout algorithm in free_proc_chain. If a subprocess
     did not exit immediately, the thread would sleep for 3 seconds
     before checking the subprocess exit status again. In a very
     common case when the subprocess was an HTTP server CGI script,
     the CGI script actually exited a fraction of a second into the 3
     second sleep, which effectively limited the server to serving one
     CGI request every 3 seconds across a persistent connection.
     PRs 6961, 8664 [Bill Stoddard]

Also, fixed in v1.3.29:

  *) Prevent creation of subprocess Zombies when using CGI wrappers
     such as suExec and cgiwrap. PR 21737. [Numerous]

Most interesting is

  *) SECURITY: CAN-2004-0174 (cve.mitre.org)
     Fix starvation issue on listening sockets where a short-lived
     connection on a rarely-accessed listening socket will cause a
     child to hold the accept mutex and block out new connections until
     another connection arrives on that rarely-accessed listening socket.
     Enabled for some platforms known to have the issue (accept()
     blocking after select() returns readable).  Define
     NONBLOCK_WHEN_MULTI_LISTEN if needed for your platform and not
     already defined.  [Jeff Trawick, Brad Nicholes, Joe Orton]

More info on this last issue is available here: http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CAN-2004-0174
Although this bug is not *known* to affect Linux systems, it just might, and it would probably cause the problems you are having.

I suggest two possible remedies:
1. Upgrade Apache to the latest 1.x version (1.3.31). Obviously this isn't as easy as it sounds, since you'll need some downtime, but since several security advisories affect your current version, it's something you will have to do at some point anyway.

2. Lower MaxRequestsPerChild from 200 down to maybe 25 or 50. Each process will then terminate sooner, making it less likely to leak memory or other resources, at the cost of a performance hit from increased process-creation overhead. Also make sure that MaxSpareServers is set to a sensible value, so that idle processes don't hang around doing nothing.

If nothing works, then maybe you've found a new bug in httpd, and you need to take this to the apache mailing lists.
P.S. You should probably also lower your settings for MaxKeepAliveRequests and KeepAliveTimeout: 15 seconds is a long time for a process to sit idle. A client downloading all the elements of a web page generally issues its requests almost back to back. At present you could have a process idling for 15 seconds while the user is busy reading the text of the page, and meanwhile there are no free processes to handle other users' requests.

andrewfx (Asker):

Well, there's a lot of stuff to digest in there.

(a) For some reason I had it in my head that 1.3 development stopped with 1.3.27; I didn't realize it was up to 1.3.31! So I guess you're right, we will have to upgrade at some point. Actually downtime is not an issue; it's just a lot of work to compile Apache the way we need it (PHP/SSL/etc.).

(b) Just about all the vhosts are ones we developed ourselves, and we use PHP, not CGI. Could a PHP page with an infinite loop cause this just as well as a CGI? (Actually it can't be an infinite loop, since no processes are in spin loops; but maybe some sort of blocked state, for example waiting for a MySQL connection?)

(c) Other than lowering MaxRequestsPerChild (which I was thinking of doing anyway), are there any other directives to tell Apache (the main root process that forks off the children) to kill off unresponsive child processes quickly? Obviously there are downsides to that too (killing a legitimate child), but we could tune it. Things like TimeOut / KeepAlive / MaxKeepAliveRequests / KeepAliveTimeout?


(d) Min/MaxSpareServers are set to 5/20 respectively. Is that "reasonable"?

Thanks,

Andrew
mishagale:

P.P.S. I notice that the time your server remains deadlocked is around 5 minutes, which is also the default value of the TimeOut directive. Perhaps setting TimeOut lower (60 seconds, perhaps?) would limit the time before the server rights itself.
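
Pulling the suggestions above together, the relevant httpd.conf lines would look roughly like this (illustrative numbers in the spirit of the suggestions, not tested values):

  # illustrative values only -- tune for your own load
  # recycle children sooner (was 200)
  MaxRequestsPerChild 50
  MinSpareServers 5
  MaxSpareServers 20
  KeepAlive On
  MaxKeepAliveRequests 100
  # 15 seconds ties a child up too long per idle client
  KeepAliveTimeout 5
  # the default of 300 is far more than necessary
  TimeOut 60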
ASKER CERTIFIED SOLUTION
mishagale
(The accepted solution is only available to Experts Exchange members.)

andrewfx (Asker):

Yeah, we're using mod_php, but you're certainly right that a PHP page can get blocked like any other. What is puzzling is how one PHP page could cause ALL the children to lock up; I doubt that all the requests could be for that one page.

I'll check into MySQL next time it happens.

OK, thanks for all this (and sorry some of my answers came while you were writing more thoughts). It is definitely helpful and along the lines of what I was hoping someone would walk me through.

I'm going to monitor the server closely this week and try some of the things you suggested, and we'll see what happens. I'll also investigate upgrading Apache, either to the latest 1.3.x, or maybe see whether Apache 2.0.x is ready for prime time yet.

Please do me one last favor, ping me on Wednesday if I haven't awarded any points yet and I'll take care of that. And if you happen to think of anything in the interim, please post it, and if I observe anything interesting I'll post it here as well.

Thanks,

Andrew
andrewfx (Asker):

Hi there,

Well, the issue came up again soon after our discussion this morning. I was alerted as soon as the number of procs crossed 200, and I saw it max out at 256 (the current MaxClients).

This time I ran netstat -a and found 200+ connections like this:

tcp        1  13032 ourserver.com:http SOMEHOST:64787 CLOSE_WAIT
tcp        1  13032 ourserver.com:http SOMEHOST:64788 CLOSE_WAIT
tcp        1  13032 ourserver.com:http SOMEHOST:64789 CLOSE_WAIT
tcp        1  13032 ourserver.com:http SOMEHOST:64790 CLOSE_WAIT
tcp        1  13032 ourserver.com:http SOMEHOST: 2312 CLOSE_WAIT
tcp        1  13032 ourserver.com:http SOMEHOST:64777 CLOSE_WAIT
tcp        1  13032 ourserver.com:http SOMEHOST:64778 CLOSE_WAIT

...slightly edited; "SOMEHOST" is in every case the same host, belonging to a major corporation, so I prefer not to name them here.

Any idea what could be going on? I'm pretty sure this is the problem, though. I did some searching and found:
http://archive.apache.org/gnats/5412

Of course, further research is strictly for academic curiosity. I blasted that sucker into an iptables DROP rule, so "SOMEHOST" can forget about accessing any of our sites. I'll keep watching; whoever else wants to try this same trick will be easy enough to deny, and I certainly have no problem denying access to a whole x.y.z.w/16 until their sysadmin asks to be reconnected.
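
(The rule itself is nothing exotic; roughly:)

  # drop everything from the offending host (substitute the real address for x.y.z.w)
  iptables -I INPUT -s x.y.z.w -j DROP
  # or, if it comes to that, the whole /16:
  # iptables -I INPUT -s x.y.0.0/16 -j DROP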

Listen, your answers helped me think through a lot of this and gave me a better understanding of how all this works (I've been running this server and its vhosts for 3 years now and still feel like I know 10% of what there is to know). With your help, maybe I know 11% now. I'll investigate the different settings for TimeOut and such, to defend against this kind of thing better in the future. Beyond the technical advice, it's just really comforting to know that someone knowledgeable out there is thinking about your problem and trying to help.

Thank you! I really appreciate it.

Andrew
mishagale:

The fact that your connections are in the CLOSE_WAIT state suggests that either you or SOMEHOST are failing to acknowledge the other party's TCP FIN packets, i.e. one of you is attempting to close the connection and the other is simply ceasing to send data rather than acknowledging and terminating gracefully. Without looking at a detailed tcpdump I couldn't tell which end was the problem, but I'd guess it isn't Apache.
Anyway, I think Apache is finally aborting the connection after the number of seconds specified by TimeOut, 300 by default. The Apache docs say this value is "far more than necessary in most situations." I'd suggest lowering it to around 30-60 seconds.
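
If you want to dig further next time, a capture of just the teardown packets should show which side stops responding; something along these lines (the interface name is a guess, and SOMEHOST stands for the real address):

  # watch FIN/RST packets to or from the suspect host on port 80
  # (byte 13 of the TCP header holds the flags; 0x01 = FIN, 0x04 = RST)
  tcpdump -n -i eth0 'host SOMEHOST and port 80 and tcp[13] & 0x05 != 0'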

If SOMEHOST is at fault, this could reasonably be construed as evidence of a DoS, but since it lasted only a few minutes at a time, a badly configured router or operating system is more likely. Either way, it might not be a bad idea to notify their sysadmins.

Glad you were able to solve your problem.

andrewfx (Asker):

You're probably right that it is not a DoS; I suppose someone could do us a lot worse if they wanted to.

I was thinking about the TimeOut adjustment in particular (which you had recommended earlier). I will play around with that.

I'd notify the sysadmins of SOMEHOST, but it's a giant company and all I have for them is customer-service email addresses. If a legitimate user loses access, we will presumably hear about it, and then we can take it from there to find out what's going on.

Thanks,

Andrew