Where does the high load on a web server come from?


I have this chain of top outputs here: http://www.hostme.ro/top_mare
Also, I have a single output of top, at another time, here: http://www.hostme.ro/top2

Which process(es) cause the high load? How can I find out more about what exactly is taking down my server?

Thank you.

Digg.com or Slashdot?
softexp23Author Commented:
nqailfus, what do you mean?
I was referring to websites that tend to drive a lot of traffic to other websites.
softexp23Author Commented:
Oh, yes, I understand. But I don't think that's it. I want a script or a piece of code that shows exactly what is increasing the load: which processes are involved and how much each one contributes. You can't tell that from top (see the top2 output for an example).

Also, if it's httpd, I need to know which users (the web users, i.e. the virtual hosts, not the unix users) consume the most CPU or I/O.
softexp23Author Commented:
To answer my question, please also look at the links to my specific top outputs.
Well, before anyone comes up with some weird answer: the server load comes from what you're running. It looks like Apache (httpd) and PHP are using A LOT of memory. I would try stopping those two and see if that helps at all. I'd like to see top/uptime output without httpd and php running.
nociSoftware EngineerCommented:
Well, your top output indicates there are >100 concurrently active processes, of which a lot are httpd and exim.

That probably means you get a lot of requests for web pages and quite a lot of mail is being delivered.

Both applications have logfiles.
Please check the apache access_log (if it is not available, turn it on) and analyze it.
Then you know which URLs are being hit, and that might give a clue.
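
As a rough illustration of "analyze those", here is a minimal sketch that counts hits per URL. It assumes the access_log uses the usual Common/Combined Log Format; the function name `top_urls` and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Request part of a Common/Combined Log Format line,
# e.g. "GET /index.php HTTP/1.1" (the log format is an assumption)
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')

def top_urls(lines, n=10):
    """Return the n most requested URLs as (url, hits) pairs."""
    hits = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits.most_common(n)

# Made-up sample lines, only to show the shape of the input
sample = [
    '10.0.0.1 - - [01/Jan/2007:15:16:40 +0200] "GET /forum/index.php HTTP/1.1" 200 5120',
    '10.0.0.2 - - [01/Jan/2007:15:16:41 +0200] "GET /forum/index.php HTTP/1.1" 200 5120',
    '10.0.0.3 - - [01/Jan/2007:15:16:42 +0200] "POST /contact.php HTTP/1.0" 302 0',
]
for url, count in top_urls(sample):
    print(count, url)
```

For a real log you would pass the open file instead of `sample`. Note that this only tells you which URLs are requested most often, not how much CPU each request costs.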

The same goes for exim: it logs either to its own logfiles or to syslog. That can tell you what mail you receive.

Based on that knowledge you may conclude that the traffic contains unwanted items and take measures, or decide that it's OK.

Besides that you could try to take a snapshot of your network traffic and see what hits your server.
softexp23Author Commented:
noci, if I try to tail -f the access_log of apache or the exim_mainlog, I get a lot of output which I can't decipher on the fly. And the logs don't tell me how busy the CPU is processing one request or another. Apache's mod_status is closer to what I need, but it doesn't make much sense to me. I don't really understand its CPU-related info.

I think I need 3 things:
1. To see how the CPU time was sliced over a period of 1 minute or 60 minutes. Also, I don't understand the difference between the CPU percentage and the CPU time columns in top. What contributes to the load: a process with a high %CPU value in top, or one with a high CPU time?

2. To see which processes were active during that period, keeping the CPU busy.

3. Sometimes the load comes from heavy I/O. How can I tell, at a particular moment, which processes are writing like crazy to my hard disk?
nociSoftware EngineerCommented:
#1. First, an explanation of the top output.
>> top - 15:16:44 up 3 days,  1:59,  1 user,  load average: 130.41, 101.71, 49.23
This line gives the current time (15:16:44), says the system has been up for 3 days and nearly 2 hours, and shows that there is one interactive user.
The load average is the average number of processes that were runnable or blocked waiting for I/O: about 130 over the last minute, 101 over the last 5 minutes, and 49 over the last 15 minutes.
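
If you want to track these three numbers from a script, a small sketch like the following could pull them out of top's first line. The helper name `parse_load_average` is made up for the example; on a live system Python's `os.getloadavg()` returns the same three values directly.

```python
import re

def parse_load_average(top_header):
    """Extract the 1, 5 and 15 minute load averages from top's first line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no load average found in: " + top_header)
    return tuple(float(value) for value in match.groups())

header = "top - 15:16:44 up 3 days,  1:59,  1 user,  load average: 130.41, 101.71, 49.23"
one, five, fifteen = parse_load_average(header)

# A 1-minute value well above the 15-minute value means the load is still climbing
trend = "rising" if one > fifteen else "steady or falling"
print(one, five, fifteen, trend)
```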

Every few seconds top takes a snapshot. In this snapshot:
Tasks: 465 total,   2 running, 458 sleeping,   0 stopped,   5 zombie
There are 465 processes in the system, of which 2 are active on the CPU or waiting for it. 458 are sleeping (doing nothing), and
5 are zombies: they have exited and are waiting for their parent process to collect their exit status.

Cpu(s): 23.9% us,  4.9% sy,  2.1% ni, 50.9% id, 18.0% wa,  0.1% hi,  0.0% si
Of your system's CPU time:
  23.9% is user-mode time (real work done by processes)
   4.9% is kernel (system) time
   2.1% is time used by processes running at a non-standard priority (niced)
  50.9% is idle
  18.0% is time spent waiting for I/O with no runnable process available
   0.1% and 0.0% are hardware and software interrupt handling
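
To see how the CPU time was sliced over any period you choose (1 minute, 60 minutes), one option on Linux is to read the aggregate `cpu` line of /proc/stat twice and compare the counters. The following is a minimal sketch of that idea; the function names are made up, and the two sample lines are invented so that the result resembles the snapshot above.

```python
# Order of the first seven counters on the "cpu" line of /proc/stat
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si")

def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))

def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: round(100.0 * d / total, 1) for name, d in deltas.items()}

# Invented samples; on a real system read the first line of /proc/stat,
# wait for the period you are interested in, then read it again.
first = parse_cpu_line("cpu  1000 100 200 5000 800 10 0")
second = parse_cpu_line("cpu  1239 121 249 5509 980 11 1")
print(cpu_percentages(first, second))
```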

Mem:   2065200k total,  2044300k used,    20900k free,    24816k buffers
Swap:  2096440k total,   582996k used,  1513444k free,   164520k cached
These lines give a breakdown of memory usage: 2GB of memory, 2GB of swap, about 580MB of swap in use.
Of the 2GB of memory, 24MB is used for buffers and 164MB is cache.

Then follows a list of the processes at the moment of the snapshot. It might not show everything that was running: processes that were active but ended (exit, kill, ...) just before the snapshot will not show up.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  GROUP    SWAP COMMAND                                                                                                                  
24016 swissray  18   0 12908 4524 2168 D    9  0.2   0:00.05 swissray 8384 php                                                                                                                      
 6560 nobody    16   0 85604  31m  17m D    7  1.6   0:38.99 nobody    51m httpd                                                                                                                    
24001 root      16   0  2968 1124  712 R    5  0.1   0:00.07 root     1844 top                                                                                            

Below these, the processes had 0% (or only a small fraction) of CPU. The %CPU column adds up to 21%,
so about 8% of the busy time is not accounted for: some of it is small fractions of activity further down the list,
and some belongs to processes that used the CPU and exited just before the snapshot.

CPU time is measured per process, and the share of it used during the interval that just ended is shown as %CPU; TIME+ is the total CPU time the process has used since it started. The percentages on the third line (Cpu(s)) always add up to 100%, regardless of whether this is a single CPU, an 8-way dual-core machine, or anything else.

If you add up the %CPU column of the process table, however, the maximum is (number of CPUs) * 100% (200% on a dual-CPU or dual-core machine).
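
The relation between the two columns can be sketched in a few lines: %CPU is the growth of TIME+ during one refresh interval, divided by the length of that interval. The function names, the earlier TIME+ value and the 3-second interval below are assumptions made for the illustration; only the 0:38.99 value comes from the httpd line above.

```python
def parse_time_plus(value):
    """Convert a simple TIME+ value such as '0:38.99' (minutes:seconds) to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)

def cpu_percent(time_before, time_after, interval_seconds):
    """Share of one CPU used during the interval, in percent."""
    used = parse_time_plus(time_after) - parse_time_plus(time_before)
    return 100.0 * used / interval_seconds

# Hypothetical: httpd showed 0:38.78 one refresh earlier and 0:38.99 now,
# with top refreshing every 3 seconds.
print(round(cpu_percent("0:38.78", "0:38.99", 3.0), 1))
```

So a process with a large TIME+ has used a lot of CPU over its whole lifetime, while a large %CPU means it is using a lot right now; it is the current activity that feeds the load.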

#2 Any process with the CPU column >0% AND the processes that were not in the process table anymore at the time of snapshot.

#3. The acct package (http://www.gnu.org/directory/acct.html) can break down which program uses what (process accounting). The program to use is sa, but it doesn't keep statistics for processes that are still running,
just the amounts of real time / CPU time / I/O a program has used. So it will tell you how much ls, httpd or exim used, summed over all users.
Accounting per process is hard to do because unix essentially writes to RAM, and a background writer flushes the data to disk once every minute or so (the sync job), unless all buffer space is used up, in which case an extra sync is triggered. That's why the filesystem is so fast. You just need to be sure the system doesn't go down abruptly between syncs, because you might end up with corrupted disks; hence a UPS is a required add-on. There are also filesystems that write a journal before the action is taken, which means the filesystem state stays consistent, but your data writes might still get lost.
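
As a possible complement to sa: Linux kernels from 2.6.20 on, when built with task I/O accounting, expose per-process byte counters in /proc/<pid>/io. The sketch below assumes such a kernel and enough privileges (usually root) to read other users' entries; all function names are made up. Taking two snapshots a few seconds apart and comparing them shows who wrote the most in between.

```python
import os

def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters

def write_bytes_by_pid():
    """Snapshot of write_bytes for every process we are allowed to read."""
    totals = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as handle:
                totals[int(pid)] = parse_proc_io(handle.read()).get("write_bytes", 0)
        except OSError:
            continue  # process exited meanwhile, or permission denied
    return totals

def biggest_writers(before, after, n=5):
    """Processes with the largest growth in write_bytes between two snapshots."""
    deltas = {pid: after[pid] - before[pid] for pid in after if pid in before}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:n]
```

Call `write_bytes_by_pid()` twice with a pause in between and pass both results to `biggest_writers`. Because of the write-back caching described above, the counters show who dirtied the data, not the exact moment it reached the disk.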

Another big holdup on systems is swapping or paging. Adding memory helps relieve this a bit, and paging to a separate physical disk also helps a lot, especially if it has its own controller.

Quite some of the load must come from heavy I/O, as the system is waiting for I/O 18% of its time in this sample.

iostat gives you the I/O totals per disk, so you can determine which disk is hit the most.
Maybe you can get much better performance by spreading the I/O over multiple disks (and, if possible, multiple controllers). Also, a SCSI/SAS/FC architecture handles I/O load quite a bit better than IDE/ATAPI-like environments.

Using vmstat you can find out how badly swapping is hitting you: look at the si & so columns (swap in / swap out).
bi / bo (blocks in / blocks out) is the amount of normal I/O done to the disks (driver level).
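
A minimal sketch of reading those columns from a script follows. It assumes the usual procps vmstat layout with a header line naming the columns; the sample output is invented for the illustration and the function name is made up.

```python
def parse_vmstat(text):
    """Turn vmstat output into a list of dicts keyed by column name."""
    rows, columns = [], None
    for line in text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            columns = fields  # header line with the column names
        elif columns and len(fields) == len(columns) and fields[0].isdigit():
            rows.append(dict(zip(columns, map(int, fields))))
    return rows

# Invented sample in the layout printed by `vmstat 10`
sample = """
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  9 582996  20900  24816 164520  310  450  1200   900 1500 2100 24  5 51 18
 1 11 583100  19800  24700 164300  280  520  1100   950 1480 2050 25  5 50 19
"""
rows = parse_vmstat(sample)
for row in rows:
    # Non-zero si/so over several intervals means the box is actively swapping
    print("swap in/out:", row["si"], row["so"], " blocks in/out:", row["bi"], row["bo"])
```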

nociSoftware EngineerCommented:
Regarding tail -f: if the output scrolls by too fast, that means there is a lot of traffic.

But tail can also be used to list just the last, say, 100 lines:

tail -n 100 access_log
softexp23Author Commented:
Thanks for your detailed answer.
softexp23Author Commented:
noci: "below this the process had 0% (or a small fraction of CPU used.) , the cpu column addsup to 21%,
so there is still ~8% not accounted for, maybe small fractions of activity further on in the list"

Thanks noci. One more thing: in the output you've analyzed, if the %CPU column adds up to 21%, there is still 79% left, right? (Not ~8%.) So does this 79% come from little pieces under 1%? Because top doesn't show it.

In your opinion, looking at the top2 output, what caused the huge load?
nociSoftware EngineerCommented:
23% user
 4% system
 2% non-normal priority (nice)
------ +
29% (31% if the fractions are taken into account): the CPU is BUSY, i.e. really being used.
Sum of the per-process %CPU values => 21%, so 8% (10%) of the BUSY time is spread over many processes but not shown per process.

51% idle + 18% I/O wait = 69% (roughly 70%): the CPU is doing nothing worthwhile.
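
The same arithmetic, written out with the exact figures from the snapshot (the variable names are only for illustration):

```python
# Values copied from the Cpu(s) line of the analyzed snapshot
cpu = {"us": 23.9, "sy": 4.9, "ni": 2.1, "id": 50.9, "wa": 18.0, "hi": 0.1, "si": 0.0}
# %CPU of the three processes listed with a non-zero value: php, httpd, top
listed_percent_cpu = [9, 7, 5]

busy = round(cpu["us"] + cpu["sy"] + cpu["ni"] + cpu["hi"] + cpu["si"], 1)
not_working = round(cpu["id"] + cpu["wa"], 1)
unaccounted = round(busy - sum(listed_percent_cpu), 1)

print("busy:", busy)                       # CPU really in use
print("idle + I/O wait:", not_working)     # CPU doing nothing worthwhile
print("busy but not listed:", unaccounted)
```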

From top2 I really can't tell.

It might well be swapping.
That is usually hard to see, except in a 'vmstat 10' (in place of 10 you can give another interval in seconds).
But it will cause long queues for the swap space (and for other I/O, if that goes to the same physical disk).

It will be a job of trying to locate a bottleneck and then removing it.
It might work very well if you are able to limit the number of processes, e.g. do not accept an unlimited number of apache/exim connections.

It might be better to keep a few on hold and get better throughput.
softexp23Author Commented:
Thank you noci.