Solved

Where does the big load of a web server come from?

Posted on 2008-01-31
Medium Priority
289 Views
Last Modified: 2013-12-16
Hello,

I have this chain of top outputs here: http://www.hostme.ro/top_mare
Also, I have a single output of top, at another time, here: http://www.hostme.ro/top2

Which process(es) cause the big load? How can I find out more details about what exactly is taking down my server?

Thank you.
0
Question by:softexp23
14 Comments
 
LVL 6

Expert Comment

by:ngailfus
ID: 20789357
Digg.com or Slashdot?
0
 

Author Comment

by:softexp23
ID: 20789383
ngailfus, what do you mean?
0
 
LVL 6

Expert Comment

by:ngailfus
ID: 20789447
I was referring to websites that tend to drive a lot of traffic to other websites.
0
 

Author Comment

by:softexp23
ID: 20789512
Oh, yes, I understand. But I don't think that's it. I want to find a way, with a script or a piece of code, to see exactly what is increasing the load and how: which processes, and how much each of them contributes to the load. From top alone you can't tell (see the top2 output in the example).

Also, if it's httpd, I need to know which users (the web users, the virtual hosts, not the unix users) consume the most CPU or I/O.
0
 

Author Comment

by:softexp23
ID: 20789642
To answer my question, please also consult the links to my specific top outputs.
0
 
LVL 1

Expert Comment

by:evilaim
ID: 20791895
Well, before anyone comes up with some weird answer, I'll tell you that the server load comes from what you're running. It would seem that your Apache (httpd) and your PHP processes are taking up A LOT of memory. I would try killing those two and see if that helps at all. I'd like to see a top/uptime without httpd/apache and PHP running.
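For example (init-script names and paths differ per distribution, so these commands are only illustrative):

# temporarily stop the web stack, then look at the load again
/etc/init.d/httpd stop
uptime
top -b -n 1 | head

If the load averages drop sharply with Apache/PHP stopped, you know where to dig.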
0
 
LVL 41

Expert Comment

by:noci
ID: 20792561
Well, your top output indicates there are >100 concurrent active processes, of which a lot are httpd and exim.

That probably means you get a lot of requests for web pages and quite some mail gets delivered.

Both applications have logfiles.
Please check the Apache access_log (if it is not available, turn logging on) and analyze it.
Then you know which URLs are hit; that might give a clue.

Same for exim: exim logs into its own logfiles or into syslog. That can also tell you what mail you receive.

Based on that knowledge you might come to the conclusion that these contain unwanted items, and then you can take measures, or decide that it's ok....

Besides that, you could try to take a snapshot of your network traffic and see what hits your server.
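For example, assuming a default common/combined LogFormat where the request path is the 7th whitespace-separated field, something like this lists the most requested URLs:

# top 20 requested URLs from the Apache access_log
# (adjust the field number if you use a custom LogFormat)
awk '{print $7}' access_log | sort | uniq -c | sort -rn | head -20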
0
 

Author Comment

by:softexp23
ID: 20795137
noci, if I try to tail -f Apache's access_log or the exim_mainlog, I get a lot of output which I can't decipher on the fly. And the logs don't tell me how busy the CPU is processing one request or another. Apache's mod_status is closer to what I need, but it doesn't make much sense to me; I don't really understand its CPU-related info.

I think I need 3 things:
1. Over a period of 1 minute or 60 minutes, to see how the CPU time was sliced. Also, I don't understand the difference between the CPU percentage and the CPU time columns in top. What contributes to the load: a process with a high %CPU column in top, or one with a high CPU time?

2. Over that period, which processes were active and keeping the CPU busy.

3. Sometimes the load comes from heavy I/O time. How can I know, at a particular moment, which processes are writing like crazy to my hard disk?
0
 
LVL 41

Accepted Solution

by:
noci earned 1500 total points
ID: 20796179
#1, first an explanation of the top output.
>> top - 15:16:44 up 3 days,  1:59,  1 user,  load average: 130.41, 101.71, 49.23
This line tells the time = 15:16:44, the system has been up for 3 days and nearly 2 hours, and there is one interactive user.
The load averages mean that over the last 1, 5 and 15 minutes there were on average about 130, 101 and 49 processes runnable or waiting for IO.

Every X seconds a snapshot (slice) is taken. In this slice:
Tasks: 465 total,   2 running, 458 sleeping,   0 stopped,   5 zombie
There are 465 processes in the system, of which 2 are active on the CPU or waiting for it, 458 do nothing,
and 5 are zombies waiting for cleanup by the kernel.

Cpu(s): 23.9% us,  4.9% sy,  2.1% ni, 50.9% id, 18.0% wa,  0.1% hi,  0.0% si
Of your system: 23.9% is usermode time (real work done by processes)
                 4.9% is kernel overhead
                 2.1% is time used by processes that have nonstandard (nice) priorities
                50.9% is idle
                18.0% of the time processes were waiting for IO with no runnable process available

Mem:   2065200k total,  2044300k used,    20900k free,    24816k buffers
Swap:  2096440k total,   582996k used,  1513444k free,   164520k cached
These give a breakdown of memory usage: 2GB of memory, 2GB of swap, of which ~580MB of swap is used.
Of the 2GB of memory, 24MB is used for buffers and 164MB for cache.

Then follows a list of processes at the moment of the snapshot. It might not show all processes: processes that were active but stopped (exit, kill, ...) just before the snapshot will not show up.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  GROUP    SWAP COMMAND
24016 swissray  18   0 12908 4524 2168 D    9  0.2   0:00.05 swissray 8384 php
 6560 nobody    16   0 85604  31m  17m D    7  1.6   0:38.99 nobody    51m httpd
24001 root      16   0  2968 1124  712 R    5  0.1   0:00.07 root     1844 top

Below these, the processes had 0% (or a small fraction of a percent) of CPU. The %CPU column adds up to ~21%, so there is still roughly 8-10% of the busy time not accounted for: partly small fractions of activity further down in the list, partly processes that used the CPU but exited just before the snapshot.

Well, CPU time is measured, and the relative use of that CPU time in the interval that just ended is translated to a CPU usage percentage. The system summary on the 3rd line (Cpu(s)) will always add up to 100%, regardless of whether this is a single CPU, an 8-way dual core, or whatever.

If you add up the %CPU column of the process table, the maximum can reach #CPUs * 100% (200% on a dual-CPU or dual-core machine).
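To see both figures outside of top, something like this procps ps one-liner (the field names and the --sort option are assumed to be available on your ps version) shows the current %CPU next to the accumulated CPU time:

# %CPU here is cpu-time/elapsed-time over the whole life of the process,
# while top's %CPU covers only the last refresh interval;
# TIME is the total CPU time the process has accumulated so far
ps -eo pid,user,pcpu,time,comm --sort=-pcpu | head -15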

#2 Any process with the %CPU column > 0%, plus the processes that were no longer in the process table at the time of the snapshot (they had already exited).

#3 The acct package (http://www.gnu.org/directory/acct.html) can split out which program uses what (accounting...). The program to use is sa (system accounting), but it doesn't keep statistics of running processes,
just the amounts of real time / CPU time / IO a program used once it finished. So it will tell you how much ls, httpd or exim used, summed over all users.
Accounting per process is hard to do, as unix essentially writes to RAM and a background job writes the data to disk once every minute or so (the sync job), unless all buffer space is used, in which case an extra sync is called. That's why the filesystem is so fast; you just need to be sure the system doesn't go down abruptly between syncs, or you might end up with corrupted disks, hence a UPS is a required add-on. There are also journaling filesystems that write a log before the action is taken, which keeps the filesystem state consistent, but your data writes might still get lost.
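A minimal sketch of using the acct tools; the package name and the accounting file path vary per distribution, so treat these as an example:

# turn process accounting on, writing records to the accounting file
accton /var/log/account/pacct
# later: per-command totals of real time, CPU time and IO
sa
# per-user totals instead of per-command
sa -m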

Another big holdup on systems is swapping or paging. Adding memory relieves this a bit, and paging to a separate physical disk also helps a lot, especially if it has its own controller.

Quite some of the load must come from a big IO load, as the system is waiting for IO 18% of its time in this sample.

iostat gives you the IO totals, so you can determine which disk is hit the most.
You may get much better performance by splitting the IO over multiple disks (and, if possible, multiple controllers). Also, a SCSI/SAS/FC architecture handles IO load quite a bit better than IDE/ATAPI-like environments.
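For example (iostat is part of the sysstat package; -x for extended per-device statistics should be available on most Linux installations):

# extended device statistics every 10 seconds; the device with the highest
# await / %util values is the most likely IO bottleneck
iostat -x 10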

Using vmstat you can find out how badly swapping hits you; look at the si & so columns (swap in / swap out).
bi / bo (blocks in / blocks out) is the amount of normal IO done to the disks (driver level).
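For example:

# one report every 10 seconds; si/so = pages swapped in/out per second,
# bi/bo = blocks read from / written to disk, wa = % of time waiting for IO
vmstat 10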
0
 
LVL 41

Expert Comment

by:noci
ID: 20796536
w.r.t. tail -f: if it scrolls too fast to read, then there simply is a lot of traffic.

But tail can also be used to make a listing of the last, say, 100 lines:

tail -n 100 access_log
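And combined with the usual text tools, such a static listing can already be summarized, e.g. (assuming the client address is the first field of your log format):

# busiest client addresses in the last 1000 requests
tail -n 1000 access_log | awk '{print $1}' | sort | uniq -c | sort -rn | head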
0
 

Author Closing Comment

by:softexp23
ID: 31426857
Thanks for your detailed answer.
0
 

Author Comment

by:softexp23
ID: 20797391
noci: "below this the process had 0% (or a small fraction of CPU used.) , the cpu column addsup to 21%,
so there is still ~8% not accounted for, maybe small fractions of activity further on in the list"

Thanks noci. One more thing...in the output you've analyzed, if the cpu column addsup to 21%, there is still 79% left, right ? (not ~8%). So does this 79% come from little pieces under 1% ? Because the top doesn't show it.

In your opinion, looking at top2 output, what caused the huge load ?
0
 
LVL 41

Expert Comment

by:noci
ID: 20809645
23% user
 4% system
 2% non-normal (nice) priority
------ +
29% (31% if fractions are taken into account) of the CPU is BUSY,
i.e. 29% of the CPU is really used.
The per-process slices sum to 21%, so 8 (10)% of the BUSY time is split over many processes but not really accounted for.

51% idle + 18% iowait = 69% of the CPU is doing nothing worthwhile
(= roughly 70%).

From top2 I really can't tell.

It might well be swapping.
That is mostly hard to see, except in a 'vmstat 10' (in place of 10 you can give another interval in seconds).
But it will cause long queues for the swap space (and for other IO if it goes to the same physical disk).

It will be a job of trying to locate a bottleneck and then removing it.
It might work very well if you are able to limit the number of processes, e.g. do not accept an unlimited number of apache/exim connections (see the sketch below).

It might be better to keep a few connections on hold and have better throughput.
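A rough sketch of what limiting Apache could look like with the prefork MPM (directive names and sensible values depend on your Apache version and your memory, so take this only as an illustration; MaxClients is the pre-2.4 name):

# httpd.conf (prefork MPM) - cap the number of simultaneous Apache processes
<IfModule prefork.c>
    StartServers          8
    MinSpareServers       5
    MaxSpareServers      20
    MaxClients          150
    MaxRequestsPerChild  4000
</IfModule>

Exim has similar knobs (e.g. smtp_accept_max) to cap the number of simultaneous SMTP connections.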
0
 

Author Comment

by:softexp23
ID: 20813139
Thank you noci.
0
