asked on

Apache Bottleneck

Hi,

I have been trying to solve a bottleneck in my server whereby Apache reaches the MaxClients value then crashes. I have done significant research into log files - the Apache error log (at LogLevel debug), MySQL error log, MySQL slow queries log, MySQL general log and the server messages log. There is no detail about the crash. Occasionally the Apache error_log will write "MaxClients reached, consider raising MaxClients" but in many cases my automatic restart script detects the bottleneck and restarts Apache before the crash (which is inevitable).

What is happening is Apache threads are building up in the W or "Sending Reply" state until MaxClients is hit.

I have performed the following during my week long investigation into the problem:
1. Initially I assumed it was database related - I tracked down and fixed a lot of slow queries and reduced the time a MySQL connection is open in scripts.
2. I rebuilt Apache to 2.2.11 with worker MPM instead of prefork (the difference is worker crashes when it reaches MaxClients and prefork just grinds to a halt with W state threads)
3. I installed a newer version of APC (3.0.18 -> 3.0.19)
4. I received "expert" advice from liveperson.com where I was advised to comment out the mod_bwlimited Apache module even though it wasn't being used to throttle any virtual hosts (I use WHM).
5. I disabled munin, a graphing plugin for WHM.

Over the course of this I have been changing the httpd config variables to all sorts of different values based off different hunches:

- Tried a high MaxClients value (1000) which just increases the length of bottlenecks before crashes.
- Tried low MaxClients values (100-300) in order to see if a high MaxClients value was causing bandwidth to saturate (I have a 100mbit port).
- Tried high MaxRequestsPerChild value (0 - infinite) to reduce CPU usage associated with destroying processes too often.
- Tried low MaxRequestsPerChild values (10-500) to reduce potential memory leaks.
- Tried KeepAlives On (I usually always keep them off) with a KeepAliveTimeout of 2 in order to test the theory that in-page AJAX calls and calls to JS and CSS includes were causing too much overhead.
- Tried a Timeout of 5 seconds to test the theory that the Apache threads were staying in W state for too long (formerly Timeout was 10 seconds).

I am certain this is not a database issue as when the bottleneck occurs the MySQL threads attached to many Apache connections are in Sleep state. I initially suspected that a heavy query on a table was locking out other threads and causing the wait, but it isn't the case. I assure you I always assume a problem is database related but I looked extensively at it before I moved on to looking at Apache and system resource usage.

Some other theories I had that were database related:
- I thought the MySQL query cache might be causing bottlenecks when a table is invalidated causing many queries to be removed from the cache - I have now disabled the MySQL query cache from caching by default.
- I thought MySQL connections were being maintained for too long so I am now performing all the queries at the top and immediately closing the MySQL connection on many pages (but not all pages yet).

I have looked at the IPs and request URLs of Apache threads when the bottlenecks occur - I am certain it is not a DDOS attack as there is no indication of many of the same IP or request URL.

Finally, I have done significant investigation into log files. I created a script that logs the output of vmstat, ps aux, top, netstat, Apache extended status, MySQL processlist and iostat to files every 4 seconds. I can show you the output of each of these commands at and around the point of a crash. There is nothing obviously wrong with memory or CPU usage to my eyes but I am not a Unix expert.

My system specs:

Centos 4.7 i686 standard
Apache 2.2.11
PHP 5.2.9
APC 3.0.19
MySQL 5.0.67-community-log
cPanel 11.24.4-R35075 - WHM 11.24.2 - X 3.9

Other facts:
- The server serves almost entirely dynamically generated content - all static files are served off other servers.
- The server transfers between 90 and 120GB per day.
- I use mod_deflate.
- Apache handles between 5 and 7 million requests per day.
- I use APC extensively in my scripts.
- My scripts are fairly CPU intensive with many includes and classes/objects.
- I use little or no third party PHP scripts.
- While there are multiple virtual hosts listed in httpd.conf the server almost exclusively handles one website (fpsbanana.com)

I have attached a ZIP file containing the following:
- my httpd.conf.
- the last 1000 lines of the Apache error_log after a crash.
- logs of various unix commands - search for 06:14:36 - this is the point of a crash in these logs (netstat logged at larger intervals).

Your help would be much appreciated. I have lost a lot of sleep over this one!
logs.zip

ASKER CERTIFIED SOLUTION

arnold

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

tomp_gl

ASKER

Hi,

Sorry, attached is my CPU, memory and disk information.

I am currently logging the commands you recommended and will post them when I experience another crash.

Thanks
system-specs.txt

tomp_gl

ASKER

Hi Arnold,

Attached are the vmstat and iostat logs you recommended. A crash occurred between 05:02:02 and 05:03:02. I have also asked my host to make the changes to the logging location and tcp settings you suggested.

What do you make of these latest logs?

Thanks
5min-vmstat.txt
5min-iostat.txt

SOLUTION

gheist

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

arnold

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

gheist

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

tomp_gl

ASKER

Hi Arnold,

I'm not sure why the disks are listed twice (that info was obtained from cPanel/WHM), none are in a RAID configuration. I ran the mount command which might serve better:

/dev/sda6 on / type ext3 (rw,usrquota)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sda8 on /home type ext3 (rw,usrquota)
/dev/sda7 on /tmp type ext3 (rw,noexec)
/dev/sda2 on /usr type ext3 (rw,usrquota)
/dev/sda3 on /var type ext3 (rw,usrquota)
/dev/sdd1 on /database type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/tmp on /tmp type none (rw,noexec,nosuid,bind)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sde1 on /files1 type ext3 (rw)
/dev/sdf1 on /backup1 type ext3 (rw)
/dev/sdb1 on /backup type ext3 (rw)

The primary use of the server is for a web application (fpsbanana.com, a PHP/MySQL heavy community with forums, user profiles etc). I use cPanel/WHM because it simplifies many areas of server administration for me (DNS, kernel updates etc etc...). I understand that it has bloat and ideally I'd love a bare OS with just the LAMP essentials but I don't feel confident or skilled enough to do that just yet.

Apache modules:

core_module (static)
authn_file_module (static)
authn_default_module (static)
authz_host_module (static)
authz_groupfile_module (static)
authz_user_module (static)
authz_default_module (static)
auth_basic_module (static)
include_module (static)
filter_module (static)
deflate_module (static)
log_config_module (static)
logio_module (static)
env_module (static)
expires_module (static)
headers_module (static)
setenvif_module (static)
proxy_module (static)
proxy_connect_module (static)
proxy_ftp_module (static)
proxy_http_module (static)
proxy_ajp_module (static)
proxy_balancer_module (static)
ssl_module (static)
mpm_worker_module (static)
http_module (static)
mime_module (static)
status_module (static)
autoindex_module (static)
asis_module (static)
info_module (static)
suexec_module (static)
cgid_module (static)
negotiation_module (static)
dir_module (static)
actions_module (static)
userdir_module (static)
alias_module (static)
rewrite_module (static)
so_module (static)
bwlimited_module (shared)
php5_module (shared)

I have a script that writes a timestamp to a file, and I run this script with lynx every minute. I then have another script that checks the timestamp in this file every minute. If the timestamp is older than 60 seconds the checking script waits a further 30 seconds and re-opens the file and checks the timestamp again. If it's still older than 60 seconds (after 90 seconds) the checking script restarts httpd.

This means that Apache could be restarted if it's crashed or still up but not properly handling requests. Sometimes the error_log will record "MaxClients reached..." (if it crashed), other times it won't (if it's restarted during the bottleneck and before the crash). Typically it restarts during the bottleneck. It also depends on the value of MaxClients as this influences the duration of the bottleneck before the crash.

I wasn't performing crash dumps (they're new to me). I've added the necessary lines to httpd.conf and will report back when I get a crash.

I haven't had as many crashed in the last few hours as my peak period (daytime in the US, and particularly, daytime in the US on weekends) is closing. I am currently logging the new iostat command and will report back when I have the results of a crash.

gheist,

This server serves almost entirely dynamic content (PHP scripts). I already have other servers serving my static content. I noticed mod_proxy in the list of compiled modules - apparently this is required by cPanel. In regards to your suggestion of trying a precompiled build, I am limited by what cPanel/WHM offers - it will probably break things if I do this.

SOLUTION

gheist

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

tomp_gl

ASKER

Attached are the new iowait and vmstat logs. A bottleneck occurred between 21:16:01 and 21:17:01.

I added CoreDumpDirectory /tmp to httpd.conf but I can't find anything in /tmp other than php session files and temporary file uploads.

Thanks
5min-iostat2.txt
5min-vmstat2.txt

SOLUTION

arnold

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

gheist

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

tomp_gl

ASKER

Thank you for your advice. I have decided to create a new server without cpanel and run only Apache on it, and connect it directly to my current main server. I think getting to the bottom of this issue is more trouble than it's worth, particularly with the interplay of cPanel .