cspiro
asked on
Apache gets slow until restart - why?
We are running a web app written in Perl via Apache 2.0 on Linux RHEL4 within an intranet. Users complained that the app was slower than usual. After testing everything from the network to the server drives, we found that restarting Apache fixes it. Every time users report performance problems and we restart Apache, they say the app is very fast again for a day or so.
How would you recommend that we begin our search to figure out why restarting Apache improves performance?
Thanks!
spiroc
You could try running strace -f against the parent Apache process and redirecting stdout and stderr to a file. The -f flag makes strace follow child processes. I agree that having only 6 seconds to troubleshoot the problem is going to make it difficult. :(
Also, did this just start happening, and what changed around the time that it started happening?
ASKER
Would you be able to show the command that would accomplish "strace -f on the parent apache process and dump stdout and stderr to a file"? Thanks!
ps -efww |grep httpd
You want the httpd process whose parent process is 1; it is likely running as root. For example:
root 30680 1 0 05:42 ? 00:00:00 /usr/local/apache/bin/httpd
apache 30681 30680 0 05:42 ? 00:01:13 /usr/local/apache/bin/httpd
apache 30697 30680 0 05:42 ? 00:01:11 /usr/local/apache/bin/httpd
You want the PID from the first process, as the third column (parent process) is 1.
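If picking the row out by eye is error-prone, the same rule (parent process is 1, command is httpd) can be applied with awk. This is only a sketch: the function name find_apache_parent is made up for illustration, and it assumes the ps -ef column layout shown above (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```shell
# Hypothetical helper: print the PID of every httpd process whose parent is PID 1.
# Expects `ps -ef`-style lines on stdin: UID PID PPID C STIME TTY TIME CMD
find_apache_parent() {
    awk '$3 == 1 && $8 ~ /httpd$/ { print $2 }'
}

# Demo using the sample listing from this thread; only the root-owned parent matches.
printf '%s\n' \
    'root 30680 1 0 05:42 ? 00:00:00 /usr/local/apache/bin/httpd' \
    'apache 30681 30680 0 05:42 ? 00:01:13 /usr/local/apache/bin/httpd' \
    'apache 30697 30680 0 05:42 ? 00:01:11 /usr/local/apache/bin/httpd' |
    find_apache_parent

# Against the live process table you would run:
#   ps -efww | find_apache_parent
```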
Then use that PID to run your strace, and direct standard output, as well as standard error, to a log file:
strace -f -p 30680 > output.txt 2>&1
Let that run until the incident reoccurs. Then press CTRL-c to exit strace and read the output.txt file.
Optionally, if you want to keep your session open to do this or if your session seems to be timing out, you can have it survive your session with:
screen sh -c 'strace -f -p 30680 > output.txt 2>&1'
If you don't have screen installed, install it with:
up2date screen
If you use the method with screen, you can leave the session by typing:
CTRL-a CTRL-d
Then feel free to log out, and the process will still be running, attached to a virtual terminal. To get back to the session, type:
screen -r
Then once attached, follow the instructions above by pressing CTRL-c to exit the strace.
I know this sounds confusing but you can't go wrong if you just follow my instructions. :)
I forgot to mention that you'll want to add the -t parameter to strace so each line is logged with a timestamp, in case you'll be reviewing the log later instead of in real time:
strace -t -f -p 30680 > output.txt 2>&1
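Once output.txt grows large, reading it line by line gets tedious. One rough way to review it later is to count how often each system call appears. The sketch below assumes the log lines look like the `-t` output format (an optional "[pid N]" prefix, an HH:MM:SS timestamp, then the call); summarize_trace is a hypothetical name, not part of strace.

```shell
# Hypothetical helper: count syscalls by name in a timestamped strace log on stdin.
# Lines that do not start with a timestamp and a call (for example
# "<... select resumed>" continuations) are ignored.
summarize_trace() {
    sed -n -E 's/^(\[pid +[0-9]+\] )?[0-9]{2}:[0-9]{2}:[0-9]{2} ([a-z_0-9]+)\(.*/\2/p' |
        sort | uniq -c | sort -rn
}

# Demo on three lines in the same format as strace -t output.
printf '%s\n' \
    '14:13:12 select(0, NULL, NULL, NULL, {0, 337000}) = 0 (Timeout)' \
    '14:13:12 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0' \
    '14:13:12 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)' |
    summarize_trace

# On a real log:
#   summarize_trace < output.txt
```

Running the same summary on a log captured during a slow period and one captured during a fast period gives two short lists that are easier to compare than the raw traces.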
ASKER
The client told me that the system was slow, and I recorded the trace below. It seems normal to me.
Process 32177 attached - interrupt to quit
14:13:12 select(0, NULL, NULL, NULL, {0, 337000}) = 0 (Timeout)
14:13:12 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:12 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:13 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:13 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:14 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:14 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:15 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:15 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:16 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:16 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:17 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:17 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:18 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:18 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:19 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:19 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:20 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:20 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:21 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:21 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
14:13:22 waitpid(-1, 0xbffffb70, WNOHANG|WSTOPPED) = 0
14:13:22 select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
Process 32177 detached
How would you interpret this?
That looks like the strace of the parent process. I'm not sure why it didn't follow the children and show what they were doing. Did you forget to type the "-f" parameter?
ASKER
I am including the -f parameter. man strace shows that -f is the correct parameter for child processes but the output is what I showed you above.
However, I did notice something unusual. The first apache process had a pid of 476 where all the others started after 19480. When I straced only that pid I get
22:22:09 semop(9928711, 0xb7e42b44, 1 <unfinished ...>
Is that a lingering stuck pid?
Am I fishing too far?? Why isn't -f working?? Are we having fun yet??
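One possible explanation, offered as an assumption rather than a diagnosis of this server: strace -f follows children that the traced process forks after strace attaches, so Apache workers that already existed are not picked up. Under that assumption, each existing worker needs its own -p option. A small sketch for building those options, where pid_args is a made-up helper name:

```shell
# Hypothetical helper: turn PIDs (one per line on stdin) into "-p PID -p PID ..."
pid_args() {
    awk '{ printf "%s-p %s", (NR > 1 ? " " : ""), $1 } END { if (NR) print "" }'
}

# Demo with the PIDs from the ps listing earlier in the thread.
printf '%s\n' 30680 30681 30697 | pid_args

# Against a live server (assumes pgrep is installed and the binary is named httpd):
#   strace -t -f -o output.txt $(pgrep httpd | pid_args)
```

The command substitution is left unquoted on purpose so the shell splits it into separate -p arguments.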
ASKER
Is there a more generic way to output straces of all active pages served by Apache over, let's say, a 5-minute period, and then compare between a slow and a fast period?