Apache gets slow until restart - why?

We are running a web app written in Perl under Apache 2.0 on Linux (RHEL4) within an intranet. Users complained that the app was slower than usual. After testing everything from the network to the server drives, we found that the only thing that helps is restarting Apache. Every time users report performance problems and we restart Apache, they say the app is very fast again for a day or so.

How would you recommend that we begin our search to figure out why restarting Apache improves performance?

Thanks!

spiroc

SOLUTION

AdamsConsulting

This solution is only available to members.

cspiro

ASKER

In this case, 'slow' means that a web app screen takes 6 seconds to load versus 1 second. So, by the time we would try to run an strace on a pid it would be gone.

Is there a more generic way to output straces of all active pages served by Apache over, say, a 5-minute period and then compare between a slow and a fast period?

AdamsConsulting

You could try an strace -f on the parent Apache process and dump stdout and stderr to a file. The -f option makes it follow children. I agree that a 6-second window is going to make this difficult to troubleshoot. :(

AdamsConsulting

Also, did this just start happening, and what changed around the time that it started happening?

cspiro

ASKER

Would you be able to show the command that would accomplish "strace -f on the parent apache process and dump stdout and stderr to a file"? Thanks!

AdamsConsulting

ps -efww |grep httpd

You want the httpd process that has a parent process of 1 and is likely running as root. For example:

root 30680 1 0 05:42 ? 00:00:00 /usr/local/apache/bin/httpd
apache 30681 30680 0 05:42 ? 00:01:13 /usr/local/apache/bin/httpd
apache 30697 30680 0 05:42 ? 00:01:11 /usr/local/apache/bin/httpd

You want the PID from the first process, as the third column (parent process) is 1.
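If you would rather not pick the PID out by eye, the same selection can be sketched with awk. This is only an illustration: find_apache_parent is a hypothetical helper name, and it assumes the ps -ef column layout shown above (PID in column 2, parent PID in column 3, command in column 8) and a command path ending in httpd.

```shell
# Hypothetical helper: read `ps -ef` style lines on stdin and print the PID
# of every httpd process whose parent PID (third column) is 1.
find_apache_parent() {
    awk '$3 == 1 && $8 ~ /httpd$/ { print $2 }'
}

# Example use:
#   ps -efww | find_apache_parent
```

If your Apache binary has a different name, adjust the pattern accordingly.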

Then use that PID to run your strace, directing both standard output and standard error to a log file:

strace -f -p 30680 > output.txt 2>&1

Let that run until the incident reoccurs. Then press CTRL-c to exit strace and read the output.txt file.

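Optionally, if you don't want to keep your session open while this runs, or if your session seems to be timing out, you can have the trace survive your session by running it inside screen. Wrap the command in sh -c so that the redirection applies to strace rather than to screen itself:

screen sh -c 'strace -f -p 30680 > output.txt 2>&1'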

If you don't have screen installed, install it with:

up2date screen

If you use the method with screen, you can leave the session by typing:

CTRL-a CTRL-d

Then feel free to log out, and the process will still be running, attached to a virtual terminal. To get back to the session, type:

screen -r

Then once attached, follow the instructions above by pressing CTRL-c to exit the strace.

I know this sounds confusing but you can't go wrong if you just follow my instructions. :)

AdamsConsulting

I forgot that you'll want to add the -t parameter to strace to log a timestamp if you'll be reviewing the logs later instead of in real time:

strace -t -f -p 30680 > output.txt 2>&1
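To compare a slow period against a fast one, as asked earlier in the thread, one rough approach is to capture a trace for each period and tally the system calls by name. The sketch below is an assumption-laden illustration, not a definitive tool: count_syscalls is a hypothetical helper, and it assumes each line carries an HH:MM:SS timestamp (from -t) immediately followed by the syscall name, optionally preceded by a PID prefix.

```shell
# Hypothetical helper: tally syscalls by name in `strace -t` output.
# Reads the named files (or stdin) and prints "COUNT NAME", busiest first.
count_syscalls() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Look for the timestamp field, then take the syscall name after it.
            if ($i ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $(i+1) ~ /^[a-z_0-9]+\(/) {
                name = $(i+1)
                sub(/\(.*/, "", name)   # strip the argument list
                count[name]++
                break
            }
        }
    }
    END { for (n in count) print count[n], n }' "$@" | sort -k1,1nr -k2,2
}

# Example use (file names are placeholders):
#   count_syscalls slow.txt > slow.counts
#   count_syscalls fast.txt > fast.counts
#   diff slow.counts fast.counts
```

A large difference in a particular call between the two captures would be a reasonable place to start looking.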

cspiro

ASKER

The client told me that the system was slow, and I recorded the trace below. It seems normal to me.
Process 32177 attached - interrupt to quit
14:13:12 select(0, NULL, NULL, NULL, {0, 337000}) = 0 (Timeout)
14:13:12 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:12 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:13 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:13 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:14 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:14 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:15 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:15 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:16 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:16 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:17 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:17 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:18 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:18 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:19 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:19 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:20 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:20 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:21 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:21 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:22 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:22 select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
Process 32177 detached

How would you interpret this?

AdamsConsulting

That looks like the strace from the parent process. I'm not sure why it didn't follow the child and show what the child was doing. Did you forget to type the "-f" parameter?
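One likely explanation, offered as an assumption since nothing in this thread confirms it: strace -f follows only children that are forked after strace attaches. When you attach to an already-running parent with -p, the existing Apache worker processes are not picked up, so you see only the parent's idle loop. A possible workaround is to attach to every running httpd process explicitly with repeated -p options. The helper below is a sketch: pid_args is a hypothetical name, and the example assumes pgrep is installed and the processes are named httpd.

```shell
# Hypothetical helper: read PIDs (one per line) on stdin and print them
# as "-p PID -p PID ..." arguments for strace.
pid_args() {
    awk 'NF { printf "%s-p %s", sep, $1; sep = " " } END { if (sep) print "" }'
}

# Example use (assumes pgrep exists and the processes are named httpd):
#   strace -t -f $(pgrep httpd | pid_args) > output.txt 2>&1
```

Some strace versions limit how many -p options they accept, so check man strace if you have many workers.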

cspiro

ASKER

I am including the -f parameter. man strace shows that -f is the correct parameter for child processes but the output is what I showed you above.

However, I did notice something unusual. The first Apache process had a PID of 476, whereas all the others started after 19480. When I straced only that PID, I got:
22:22:09 semop(9928711, 0xb7e42b44, 1 <unfinished ...>

Is that a lingering stuck pid?

Am I fishing too far?? Why isn't -f working?? Are we having fun yet??

ASKER CERTIFIED SOLUTION

cspiro

This solution is only available to members.