I have been trying to solve a bottleneck on my server whereby Apache reaches its MaxClients value and then crashes. I have done significant research into the log files - the Apache error log (at LogLevel debug), the MySQL error log, the MySQL slow query log, the MySQL general log and the server messages log. There is no detail about the crash. Occasionally the Apache error_log will record "MaxClients reached, consider raising MaxClients", but in many cases my automatic restart script detects the bottleneck and restarts Apache before the (otherwise inevitable) crash.
What is happening is that Apache threads build up in the W ("Sending Reply") state until MaxClients is hit.
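For anyone who wants to reproduce the measurement, this is roughly how the W count can be watched (a sketch; it assumes mod_status with ExtendedStatus On, and the localhost URL in the usage comment is an assumption):

```shell
#!/bin/sh
# count_state: count how many workers are in a given scoreboard state.
# The scoreboard string comes from mod_status's machine-readable
# ?auto output, e.g. "Scoreboard: __WWK_W...".
count_state() {
    # $1 = scoreboard string, $2 = single state character (W, K, _, ...)
    printf '%s' "$1" | tr -dc "$2" | wc -c | tr -d '[:space:]'
}

# Assumed usage (the server-status endpoint URL is an assumption):
#   SB=$(curl -s http://localhost/server-status?auto | sed -n 's/^Scoreboard: //p')
#   echo "workers in W state: $(count_state "$SB" W)"
```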
I have performed the following during my week-long investigation of the problem:
1. Initially I assumed it was database related - I tracked down and fixed a lot of slow queries and reduced the time a MySQL connection is open in scripts.
2. I rebuilt Apache to 2.2.11 with the worker MPM instead of prefork (the difference being that worker crashes when it reaches MaxClients, while prefork just grinds to a halt with threads stuck in the W state).
3. I installed a newer version of APC (3.0.18 -> 3.0.19)
4. I received "expert" advice from liveperson.com where I was advised to comment out the mod_bwlimited Apache module even though it wasn't being used to throttle any virtual hosts (I use WHM).
5. I disabled munin, a graphing plugin for WHM.
Over the course of this I have been changing the httpd.conf variables to all sorts of different values based on various hunches:
- Tried a high MaxClients value (1000), which just increased how long the bottleneck lasted before the crash.
- Tried low MaxClients values (100-300) to see whether a high MaxClients value was saturating my bandwidth (I have a 100 Mbit port).
- Tried a high MaxRequestsPerChild value (0, i.e. unlimited) to reduce the CPU usage associated with destroying and respawning processes too often.
- Tried low MaxRequestsPerChild values (10-500) to reduce potential memory leaks.
- Tried KeepAlive On (I normally keep it off) with a KeepAliveTimeout of 2 to test the theory that in-page AJAX calls and requests for JS and CSS includes were causing too much overhead.
- Tried a Timeout of 5 seconds (down from 10) to test the theory that Apache threads were staying in the W state for too long.
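For concreteness, one round of those hunches looked roughly like this in httpd.conf (values are illustrative of what was tried, not a recommendation):

```apache
# worker MPM - one of the combinations tried (values illustrative)
<IfModule worker.c>
    MaxClients          300
    MaxRequestsPerChild 500   # low, to limit any leak per child
</IfModule>

KeepAlive        On
KeepAliveTimeout 2     # short, so idle keepalives free workers quickly
Timeout          5     # was 10; shortens how long a W thread can hang
```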
I am certain this is not a database issue: when the bottleneck occurs, the MySQL threads attached to many of the Apache connections are in the Sleep state. I initially suspected that a heavy query on a table was locking out other threads and causing the wait, but that isn't the case. I assure you I always assume a problem is database-related first, and I looked at the database extensively before moving on to Apache and system resource usage.
Some other theories I had that were database related:
- I thought the MySQL query cache might be causing bottlenecks when a table invalidation removed many queries from the cache at once - I have now configured the query cache not to cache by default.
- I thought MySQL connections were being held open for too long, so on many pages (though not all yet) I now perform all the queries at the top and close the MySQL connection immediately afterwards.
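The query-cache change amounts to this in my.cnf (if I understand the setting correctly, query_cache_type=2 means "on demand": only queries explicitly tagged SQL_CACHE are cached):

```ini
# my.cnf - cache only on demand, so a table invalidation no longer
# evicts large numbers of passively cached queries at once
[mysqld]
query_cache_type = 2   # 0 = off, 1 = on, 2 = on demand (SQL_CACHE only)
```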
I have looked at the IPs and request URLs of the Apache threads when the bottlenecks occur - I am certain it is not a DDoS attack, as there is no concentration of any single IP or request URL.
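This is the sort of check I use to rule out a DDoS (a sketch; it assumes a common/combined log format where the client IP is the first field, and the log path in the usage comment is an assumption):

```shell
#!/bin/sh
# top_ips: list the most frequent client IPs in an access log.
# Assumes the IP is the first whitespace-separated field on each line.
top_ips() {
    # $1 = access log path, $2 = how many entries to show (default 10)
    awk '{print $1}' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Assumed usage (log path is an assumption):
#   top_ips /usr/local/apache/logs/access_log 20
```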
Finally, I have done significant investigation into log files. I wrote a script that logs the output of vmstat, ps aux, top, netstat, Apache extended status, the MySQL processlist and iostat to files every 4 seconds. I can show you the output of each of these commands at and around the point of a crash. Nothing looks obviously wrong with memory or CPU usage to my eyes, but I am not a Unix expert.
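The logging script is roughly this shape (a sketch; LOGDIR is an assumption, and the command list here is abbreviated - the real one also captures netstat, Apache extended status and the MySQL processlist):

```shell
#!/bin/sh
# take_snapshot: append one timestamped sample of each diagnostic
# command to its own log file. A missing command is logged as an
# error in its file rather than aborting the run.
LOGDIR=${LOGDIR:-/tmp/diag-snapshots}   # assumption: pick your own path

take_snapshot() {
    mkdir -p "$LOGDIR"
    stamp=$(date +%H:%M:%S)
    for cmd in "vmstat" "ps aux" "iostat"; do
        name=${cmd%% *}   # first word of the command names the log file
        { echo "=== $stamp ==="; $cmd; } >> "$LOGDIR/$name.log" 2>&1
    done
}

# Sample every 4 seconds, matching the interval used above:
#   while true; do take_snapshot; sleep 4; done
```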
My system specs:
CentOS 4.7 i686 standard
cPanel 11.24.4-R35075 - WHM 11.24.2 - X 3.9
- The server serves almost entirely dynamically generated content - all static files are served off other servers.
- The server transfers between 90 and 120 GB per day.
- I use mod_deflate.
- Apache handles between 5 and 7 million requests per day.
- I use APC extensively in my scripts.
- My scripts are fairly CPU intensive with many includes and classes/objects.
- I use little or no third party PHP scripts.
- While there are multiple virtual hosts listed in httpd.conf the server almost exclusively handles one website (fpsbanana.com)
I have attached a ZIP file containing the following:
- my httpd.conf.
- the last 1000 lines of the Apache error_log after a crash.
- logs of various Unix commands - search for 06:14:36, which is the point of a crash in these logs (netstat was logged at larger intervals).
Your help would be much appreciated. I have lost a lot of sleep over this one!