• Status: Solved

Apache Bottleneck


I have been trying to solve a bottleneck in my server whereby Apache reaches the MaxClients value then crashes. I have done significant research into log files - the Apache error log (at LogLevel debug), MySQL error log, MySQL slow queries log, MySQL general log and the server messages log. There is no detail about the crash. Occasionally the Apache error_log will write "MaxClients reached, consider raising MaxClients" but in many cases my automatic restart script detects the bottleneck and restarts Apache before the crash (which is inevitable).

What is happening is Apache threads are building up in the W or "Sending Reply" state until MaxClients is hit.
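A minimal sketch of how the W build-up could be tracked over time, assuming mod_status is enabled and its machine-readable page (server-status?auto) is reachable from localhost. The URL and the function names are illustrative assumptions, not taken from the attached config.

```shell
#!/bin/sh
# Count how many scoreboard slots are in a given state.
# $1 = single state character (W = "Sending Reply"), $2 = scoreboard string.
count_state() {
    # Delete every character except the requested one, then count the rest.
    printf '%s' "$2" | tr -cd "$1" | wc -c | tr -d ' '
}

# Pull the scoreboard string out of "server-status?auto" output on stdin.
scoreboard_from_status() {
    sed -n 's/^Scoreboard: //p'
}

# Example use (assumed URL, adjust to the real status location):
# sb=$(lynx -dump 'http://localhost/server-status?auto' | scoreboard_from_status)
# echo "$(date) W=$(count_state W "$sb") slots=${#sb}"
```

Logging that one line every few seconds shows how quickly the W count climbs toward MaxClients before a restart.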

I have performed the following during my week-long investigation into the problem:
1. Initially I assumed it was database related - I tracked down and fixed a lot of slow queries and reduced the time a MySQL connection is open in scripts.
2. I rebuilt Apache to 2.2.11 with worker MPM instead of prefork (the difference is worker crashes when it reaches MaxClients and prefork just grinds to a halt with W state threads)
3. I installed a newer version of APC (3.0.18 -> 3.0.19)
4. I received "expert" advice from liveperson.com where I was advised to comment out the mod_bwlimited Apache module even though it wasn't being used to throttle any virtual hosts (I use WHM).
5. I disabled munin, a graphing plugin for WHM.

Over the course of this I have been changing the httpd config variables to all sorts of different values based off different hunches:

- Tried a high MaxClients value (1000) which just increases the length of bottlenecks before crashes.
- Tried low MaxClients values (100-300) in order to see if a high MaxClients value was causing bandwidth to saturate (I have a 100mbit port).
- Tried high MaxRequestsPerChild value (0 - infinite) to reduce CPU usage associated with destroying processes too often.
- Tried low MaxRequestsPerChild values (10-500) to reduce potential memory leaks.
- Tried KeepAlive On (I usually keep it off) with a KeepAliveTimeout of 2 in order to test the theory that in-page AJAX calls and calls to JS and CSS includes were causing too much overhead.
- Tried a Timeout of 5 seconds to test the theory that the Apache threads were staying in W state for too long (formerly Timeout was 10 seconds).

I am certain this is not a database issue as when the bottleneck occurs the MySQL threads attached to many Apache connections are in Sleep state. I initially suspected that a heavy query on a table was locking out other threads and causing the wait, but it isn't the case. I assure you I always assume a problem is database related but I looked extensively at it before I moved on to looking at Apache and system resource usage.

Some other theories I had that were database related:
- I thought the MySQL query cache might be causing bottlenecks when a table is invalidated causing many queries to be removed from the cache - I have now disabled the MySQL query cache from caching by default.
- I thought MySQL connections were being maintained for too long so I am now performing all the queries at the top and immediately closing the MySQL connection on many pages (but not all pages yet).

I have looked at the IPs and request URLs of Apache threads when the bottlenecks occur - I am certain it is not a DDOS attack as there is no indication of many of the same IP or request URL.

Finally, I have done significant investigation into log files. I created a script that logs the output of vmstat, ps aux, top, netstat, Apache extended status, MySQL processlist and iostat to files every 4 seconds. I can show you the output of each of these commands at and around the point of a crash. There is nothing obviously wrong with memory or CPU usage to my eyes but I am not a Unix expert.

My system specs:

Centos 4.7 i686 standard
Apache 2.2.11
PHP 5.2.9
APC 3.0.19
MySQL 5.0.67-community-log
cPanel 11.24.4-R35075 - WHM 11.24.2 - X 3.9

Other facts:
- The server serves almost entirely dynamically generated content - all static files are served off other servers.
- The server transfers between 90 and 120GB per day.
- I use mod_deflate.
- Apache handles between 5 and 7 million requests per day.
- I use APC extensively in my scripts.
- My scripts are fairly CPU intensive with many includes and classes/objects.
- I use little or no third party PHP scripts.
- While there are multiple virtual hosts listed in httpd.conf the server almost exclusively handles one website (fpsbanana.com)

I have attached a ZIP file containing the following:
- my httpd.conf.
- the last 1000 lines of the Apache error_log after a crash.
- logs of various unix commands - search for 06:14:36 - this is the point of a crash in these logs (netstat logged at larger intervals).

Your help would be much appreciated. I have lost a lot of sleep over this one!
7 Solutions
You went into great detail about the steps you have taken to diagnose the problem, but you did not include the hardware information, e.g. CPU information.
vmstat and iostat usually should not be run continuously; when they are run, have them produce a short sequential report.
For example, run iostat -xtc 5 5 out of cron every five minutes, i.e. collect 5 data samples 5 seconds apart.
Similarly, vmstat -n 5 5 gives five data points five seconds apart.
echo "iostat run" `date`
iostat -xtc 5 5
echo "Vmstat run" `date`
vmstat -n 5 5
echo "-----" `date`

You have several things running on the system which might benefit from separation, e.g. a dedicated MySQL server and a dedicated mail server.

You might want to tune your TCP settings, keep_alive and time_wait in particular. See:
http://performancewiki.com/wordpress/main/tuning-linux-systems-for-websphere-application-server-60x
Reduce the TIME_WAIT from 60 seconds to 30 or fewer.
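Before changing any TCP setting it may help to see how many sockets actually sit in each state. A small sketch, assuming the usual Linux "netstat -ant" column layout where the state is the sixth field; the function name is made up for illustration.

```shell
#!/bin/sh
# Summarise TCP connection states from "netstat -ant" output on stdin.
# Prints one "STATE count" line per state, sorted by state name.
tcp_state_counts() {
    # Only lines starting with "tcp" are connections; field 6 is the state.
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example use:
# netstat -ant | tcp_state_counts
```

If TIME_WAIT dominates the output during a bottleneck, that supports looking at the TCP settings; if it does not, the cause is probably elsewhere.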

You have four drives that are receiving a large number of writes.
Is sda7 the log partition? If so, relocating the logs to sdc1 could help by taking the log writes off the disk the OS is using.

tomp_gl (Author) Commented:

Sorry, attached is my CPU, memory and disk information.

I am currently logging the commands you recommended and will post them when I experience another crash.

tomp_gl (Author) Commented:
Hi Arnold,

Attached are the vmstat and iostat logs you recommended. A crash occurred between 05:02:02 and 05:03:02. I have also asked my host to make the changes to the logging location and tcp settings you suggested.

What do you make of these latest logs?


Enabling syncookies in the operating system, or at least setting a large somaxconn value, can help.
Your system is not swapping.
Try a precompiled Apache; there might be something wrong in your build.
Basically, keep the Apache processes in memory, and enlarge the connection queue if they are slow to handle requests.
Could you change the iostat options to -xt, removing the -c?

You have two physical drives listed for every logical drive /dev/sda.
Is this a typo, or are you using Software RAID? Which Filesystem, ext2, ext3?

Do you have crash dumps set up? Does the system crash, or does it just become unresponsive, forcing a reboot?
If the system is actually crashing and you have crash dumps set up, have you analyzed the dump files?

Run httpd -M to see which modules are loaded.

Is the primary use of the site for cPanel access?

The problem is trying to determine the conditions leading to the crash.
The issue could be a race condition in which a server process locks resources, preventing any further responses.
Proxy modules are a huge resource hog.
Use Squid to get some efficiency for static content.

tomp_gl (Author) Commented:
Hi Arnold,

I'm not sure why the disks are listed twice (that info was obtained from cPanel/WHM), none are in a RAID configuration. I ran the mount command which might serve better:

/dev/sda6 on / type ext3 (rw,usrquota)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sda8 on /home type ext3 (rw,usrquota)
/dev/sda7 on /tmp type ext3 (rw,noexec)
/dev/sda2 on /usr type ext3 (rw,usrquota)
/dev/sda3 on /var type ext3 (rw,usrquota)
/dev/sdd1 on /database type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/tmp on /tmp type none (rw,noexec,nosuid,bind)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sde1 on /files1 type ext3 (rw)
/dev/sdf1 on /backup1 type ext3 (rw)
/dev/sdb1 on /backup type ext3 (rw)

The primary use of the server is for a web application (fpsbanana.com, a PHP/MySQL heavy community with forums, user profiles etc). I use cPanel/WHM because it simplifies many areas of server administration for me (DNS, kernel updates etc etc...). I understand that it has bloat and ideally I'd love a bare OS with just the LAMP essentials but I don't feel confident or skilled enough to do that just yet.

Apache modules:

core_module (static)
authn_file_module (static)
authn_default_module (static)
authz_host_module (static)
authz_groupfile_module (static)
authz_user_module (static)
authz_default_module (static)
auth_basic_module (static)
include_module (static)
filter_module (static)
deflate_module (static)
log_config_module (static)
logio_module (static)
env_module (static)
expires_module (static)
headers_module (static)
setenvif_module (static)
proxy_module (static)
proxy_connect_module (static)
proxy_ftp_module (static)
proxy_http_module (static)
proxy_ajp_module (static)
proxy_balancer_module (static)
ssl_module (static)
mpm_worker_module (static)
http_module (static)
mime_module (static)
status_module (static)
autoindex_module (static)
asis_module (static)
info_module (static)
suexec_module (static)
cgid_module (static)
negotiation_module (static)
dir_module (static)
actions_module (static)
userdir_module (static)
alias_module (static)
rewrite_module (static)
so_module (static)
bwlimited_module (shared)
php5_module (shared)

I have a script that writes a timestamp to a file, and I run this script with lynx every minute. I then have another script that checks the timestamp in this file every minute. If the timestamp is older than 60 seconds the checking script waits a further 30 seconds and re-opens the file and checks the timestamp again. If it's still older than 60 seconds (after 90 seconds) the checking script restarts httpd.
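For readers following along, the check described above could look roughly like this. It is a sketch only: the heartbeat file path, the function names and the restart command are assumptions, not the author's actual script, and the file is assumed to hold a Unix timestamp in seconds.

```shell
#!/bin/sh
# Hypothetical location of the file that the lynx-triggered script updates.
HEARTBEAT_FILE=/tmp/heartbeat.txt

# Succeed (exit 0) if the heartbeat is older than the allowed age.
# $1 = heartbeat epoch seconds, $2 = current epoch seconds, $3 = max age.
is_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Check once, wait 30 seconds, check again, and only then restart httpd.
watchdog() {
    if is_stale "$(cat "$HEARTBEAT_FILE")" "$(date +%s)" 60; then
        sleep 30
        if is_stale "$(cat "$HEARTBEAT_FILE")" "$(date +%s)" 60; then
            # Assumed restart command for a CentOS init-script setup.
            service httpd restart
        fi
    fi
}

# Run from cron every minute:
# watchdog
```

Keeping the age comparison in its own function makes the threshold easy to change without touching the restart logic.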

This means that Apache could be restarted if it's crashed or still up but not properly handling requests. Sometimes the error_log will record "MaxClients reached..." (if it crashed), other times it won't (if it's restarted during the bottleneck and before the crash). Typically it restarts during the bottleneck. It also depends on the value of MaxClients as this influences the duration of the bottleneck before the crash.

I wasn't performing crash dumps (they're new to me). I've added the necessary lines to httpd.conf and will report back when I get a crash.

I haven't had as many crashes in the last few hours, as my peak period (daytime in the US, and particularly daytime in the US on weekends) is closing. I am currently logging the new iostat command and will report back when I have the results of a crash.


This server serves almost entirely dynamic content (PHP scripts). I already have other servers serving my static content. I noticed mod_proxy in the list of compiled modules - apparently this is required by cPanel. In regards to your suggestion of trying a precompiled build, I am limited by what cPanel/WHM offers - it will probably break things if I do this.

sysctl -w net.core.somaxconn=4096 (assuming your load, a socket waits at most about 3 seconds in the queue before Apache takes it)
sysctl net.core.somaxconn >> /etc/sysctl.conf

tomp_gl (Author) Commented:
Attached are the new iostat and vmstat logs. A bottleneck occurred between 21:16:01 and 21:17:01.

I added CoreDumpDirectory /tmp to httpd.conf but I can't find anything in /tmp other than php session files and temporary file uploads.

What is reading from and writing to sdd, sde and sdf?
What process, memory and resource limits do you have on the system?

What does ulimit show for the nobody user (httpd)?

You are running many things on the system, so it might be running out of resources.
The process list always seems to show 20 httpd processes.
Try raising the maximum from 20 to 30 and see whether the issue persists; you have the memory resources to add more httpd handlers. The problem might be that everything you are running on the system (web, database, mail server, IMAP server, Nessus, IDS, iptables, etc.) at one point or another hits the resource limits.
To secure the system, you might reduce the resource usage by using SELinux.

The issue with httpd might be a symptom rather than the cause.

Do you use web-based email access through IMAP?
Can you check the Courier-IMAP server's log for the same time period?
Your one-minute log check might not be useful.
Increase the delay you can tolerate from one minute to two minutes. Does the process log what it does? One option is to have this process collect iostat, vmstat and similar data points at the moment the event is detected. Your current data points are always around the "event", i.e. a few minutes before or a few minutes after. Collecting them in parallel (in the background) will add to the resource use, but it might provide a clearer picture of what the system was doing, with netstat, vmstat and iostat data all covering the same time period.
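A rough sketch of that idea: when the check detects the event, start all the collectors at once in the background so their samples cover the same window. The function name and the output directory in the usage line are hypothetical.

```shell
#!/bin/sh
# Run several collectors in parallel and save each one's output under a
# directory stamped with the time the event was detected.
# $1 = base directory; remaining arguments = command lines to capture.
collect_snapshot() {
    base=$1; shift
    dir="$base/event-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1
    i=0
    for cmd in "$@"; do
        i=$((i + 1))
        name=${cmd%% *}   # first word of the command line names the log
        sh -c "$cmd" > "$dir/$i-$name.log" 2>&1 &
    done
    wait                  # block until every background collector finishes
    echo "$dir"
}

# Example use from the checking script, before restarting httpd:
# collect_snapshot /var/log/httpd-events 'vmstat -n 5 5' 'iostat -xt 5 5' 'netstat -ant'
```

Because the collectors run concurrently, the whole capture takes about as long as the slowest one (roughly 20 seconds for the 5 5 sampling above), so the restart would be delayed by that much.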

The only disk that can be causing waits is sde. What is written out to it regularly? Could it be a sort of deadlock, i.e. while the database is busy Apache is unable to write its logs? Can you split the output going there across multiple disks?

Is your last output from worker or from prefork?

tomp_gl (Author) Commented:
Thank you for your advice. I have decided to create a new server without cPanel, run only Apache on it, and connect it directly to my current main server. I think getting to the bottom of this issue is more trouble than it's worth, particularly with the interplay of cPanel.
