Solved

High load on an AIX 5.3 server with no CPU or I/O usage

Posted on 2010-08-13
Last Modified: 2013-11-17
Hi,

I have a strange, constantly high load (between 2 and 5), yet I see no CPU or I/O usage on that server.
From topas:

CPU  User%  Kern%  Wait%  Idle%  Physc   Entc
ALL    0.1    0.3    0.0   99.6   0.01    0.6

Name            PID  CPU%  PgSp Owner
topas       1757256   0.1   1.9 root
clstrmgr     331942   0.0   5.0 root
getty        241812   0.0   0.5 root
gil           69666   0.0   0.9 root
hats_dis     389356   0.0   1.8 root
hats_nim     507962   0.0   1.9 root
hatsd        704634   0.0   9.3 root
hats_nim     487456   0.0   1.9 root
hats_nim     450660   0.0   1.9 root

Disk    Busy%     KBPS     TPS KB-Read KB-Writ
dac0      0.0      2.0     4.0     1.0     1.0
hdisk2    0.0      1.0     2.0     0.5     0.5
hdisk17   0.0      1.0     2.0     0.5     0.5
dac1utm   0.0      0.0     0.0     0.0     0.0
dac0utm   0.0      0.0     0.0     0.0     0.0


 EVENTS/QUEUES    FILE/TTY
 Cswitch     312  Readch     7346
 Syscall     344  Writech    3152
 Reads        19  Rawin         8
 Writes       25  Ttyout     1059
 Forks         0  Igets         0
 Execs         0  Namei        24
 Runqueue    1.5  Dirblk        0
 Waitqueue   0.0

See attached cpu image.

/# vmstat 1 3
System configuration: lcpu=4 mem=6144MB ent=2.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 2  0 515253 819301   0   0   0   0    0   0  43  459 313  0  0 99  0  0.02   1.1
 2  0 515253 819301   0   0   0   0    0   0  28  254 319  0  0 99  0  0.01   0.6
 2  0 515254 819300   0   0   0   0    0   0  14  149 242  0  0 99  0  0.01   0.5

# uptime
03:15PM   up 99 days,  12:35,  5 users,  load average: 2.70, 3.27, 3.08

Where can I look to find the cause of this issue?

Thanks
cpu.png
Question by:sminfo

Expert Comment

by:woolmilkporc
ID: 33429947
Salut,

now that I'm back from France let's look at this one.

The only thing which could catch one's eye is the gil process.

It's a kernel process ("Generalized Interrupt Level") which deals with TCP network acknowledgements and, more important, retransmissions.

So please examine your network traffic, e.g. the "errs" columns of netstat 1 (meaning a 1-second interval),
or the "packets" columns for high values.
Or check topas (left middle).

Is this perhaps an NFS/Samba server with a lot of network traffic?

wmp


Author Comment

by:sminfo
ID: 33430023
See:

bsa550q2:/# netstat 1
    input   (en0)      output           input   (Total)    output
 packets  errs  packets  errs colls  packets  errs  packets  errs colls
1713251640     0 312324300     3     0 2856788509     0 1004924055    13     0
       7     0        3     0     0       15     0        9     0     0
       4     0        2     0     0       14     0       11     0     0
       7     0        4     0     0       13     0        7     0     0
       6     0        3     0     0       20     0       13     0     0
       3     0        4     0     0       13     0       18     0     0
       7     0        3     0     0       18     0       14     0     0
       5     0        2     0     0       11     0        8     0     0
       3     0        5     0     0       12     0       13     0     0
       3     0        2     0     0       11     0       11     0     0
       2     0        2     0     0        6     0        5     0     0
       3     0        2     0     0        9     0        8     0     0
       3     0        4     0     0       12     0       12     0     0
       5     0        2     0     0       16     0       12     0     0
      12     0        8     0     0       19     0       11     0     0
       3     0        4     0     0       13     0       14     0     0
       8     0        6     0     0       20     0       15     0     0
       6     0        6     0     0       14     0       15     0     0
       3     0        3     0     0        8     0        7     0     0
       7     0        3     0     0       14     0        9     0     0
       4     0        5     0     0       13     0       13     0     0
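
For a longer capture, the per-interval rows can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the 10-column netstat 1 layout shown above (the last five columns being the totals across all interfaces) and that the first numeric row is the cumulative count since boot. The function name check_netstat_errs is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `netstat 1` style output on stdin and counts
# the intervals whose total input/output error counters are non-zero.
# Assumes 10 numeric columns per data row, as in the capture above.
check_netstat_errs() {
    awk '
        # Data rows only: exactly 10 fields, digits and whitespace only.
        NF == 10 && $0 !~ /[^0-9 \t]/ {
            rows++
            if (rows == 1) next        # first row: totals since boot
            if ($7 + $9 > 0) {         # $7 = total input errs, $9 = total output errs
                bad++
                printf "interval %d: in_errs=%d out_errs=%d\n", rows - 1, $7, $9
            }
        }
        END {
            n = (rows > 0) ? rows - 1 : 0
            printf "%d of %d intervals had errors\n", bad + 0, n
        }
    '
}

# Example: the header, the cumulative row and two interval rows from above.
check_netstat_errs <<'EOF'
    input   (en0)      output           input   (Total)    output
 packets  errs  packets  errs colls  packets  errs  packets  errs colls
1713251640     0 312324300     3     0 2856788509     0 1004924055    13     0
       7     0        3     0     0       15     0        9     0     0
       4     0        2     0     0       14     0       11     0     0
EOF
```

The cumulative row is skipped on purpose: its 3 and 13 output errors accumulated over the whole uptime and say nothing about the current intervals.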


topas.bmp

Author Comment

by:sminfo
ID: 33430121
... and there's no NFS on that server. This issue is really, really odd.

regards
Israel.

Expert Comment

by:gheist
ID: 33430195
Swapping is happening.
It causes the highest-priority I/O.

Setting vmtune -M 1024 -m 1000
fixes it.

A longer-lived fix is taming processes with WLM
AND buying extra memory.

Author Comment

by:sminfo
ID: 33430300
wmp, sorry: NFS is running on that server, but it's not in use. Look at the proctree output:

# proctree
131244   /usr/sbin/srcmstr
   94382   /usr/sbin/portmap
   127206   /usr/sbin/snmpd
   1830962   /usr/sbin/tftpd -n
      200874   /usr/sbin/tftpd -n
   217222   /usr/sbin/syslogd
   233604   /usr/es/sbin/cluster/clcomd -d
   655478   /usr/sbin/gsclvmd
      270460   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2b337 -v 0
      307216   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b256e0 -v 0
      397548   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b1fb14 -v 0
      438494   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d600000001181bdd4d5e -v 0
      491538   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b30eb5 -v 0
      524428   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b228fc -v 0
      540782   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b33c6a -v 0
      565468   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b19f89 -v 0
      577748   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b171b2 -v 0
      643282   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b1cd3d -v 0
      700644   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2e0eb -v 0
      712866   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b142a7 -v 0
      766146   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d600000001181bdd7d54 -v 0
      782558   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b28578 -v 0
   295058   /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
   299156   /usr/java5/bin/java -Xbootclasspath/a:/var/websm/lwi/runtime/core/rcp/eclipse/p
   303224   /usr/sbin/rpc.statd -d 0 -t 50
   327870   /usr/sbin/rsct/bin/IBM.DRMd
   331942   /usr/es/sbin/cluster/clstrmgr
      544982   run_rcovcmd
   335896   /usr/sbin/snmpmibd
   704634   /usr/sbin/rsct/bin/hatsd -n 4 -o deadManSwitch
      339970   /usr/sbin/rsct/bin/hats_nim
      389356   /usr/sbin/rsct/bin/hats_diskhb_nim
      425984   /usr/sbin/rsct/bin/hats_diskhb_nim
      450660   /usr/sbin/rsct/bin/hats_nim
      487456   /usr/sbin/rsct/bin/hats_nim
      507962   /usr/sbin/rsct/bin/hats_nim
   348332   /usr/sbin/rsct/bin/vac5/IBM.CSMAgentRMd
   368866   /usr/sbin/rpc.lockd -d 0
   376990   /usr/es/sbin/cluster/clinfo
   413704   /usr/sbin/muxatmd
   434220   hagsd grpsvcs
   471134   /usr/sbin/nfsd 3891
   499800   /usr/sbin/qdaemon
   503884   /usr/sbin/biod 6
   528568   /usr/sbin/rsct/bin/IBM.HostRMd
   548942   /usr/sbin/xntpd
   626866   /usr/sbin/writesrv
   630870   /usr/sbin/rpc.mountd
   634996   haemd HACMP 4 xxcccc_cluster SECNOSUPPORT
   639210   /usr/sbin/aixmibd
   688292   harmad -t HACMP -n xxcccc_cluster
   770228   /usr/sbin/hostmibd
   794768   sendmail: accepting connections  nnections
   893026   /usr/sbin/inetd
      1228886   rlogind rlogind
         663642   -ksh
            1609826   proctree
      1786052   telnetd telnetd -a
         884840   -ksh
      1011938   rlogind rlogind
         1712352   -ksh
      1347782   telnetd telnetd -a
         938142   -ksh
            622756   -sh
   1261722   /usr/sbin/sshd a
98456   /usr/sbin/cron
106710   AtapeManager
114892   /usr/sbin/syncd 60
123132   random
147532   aioserver
163996   /usr/lib/errdemon
172116   /usr/dt/bin/dtlogin -daemon
184410   /usr/ccs/bin/shlap64
204992   /usr/bin/xmwlm -T -s 300 -R 1 -r 6 -o /etc/perf/daily/ -ypersistent=1 -ystart_t
213120   /usr/sbin/uprintfd
237610   aioserver
241812   /usr/sbin/getty /dev/console
249992   /opt/IBM_DS4000/jre/bin/java -Djava.compiler=NONE -Ddevmgr.datadir=/var/opt/SM
262286   /usr/opt/db2_08_01/bin/db2fmcd
278672   /opt/IBM_DS4000/jre/bin/java -Djava.compiler=NONE -Djava.library.path=/usr/SMag
286908   xmtopas -p3
311376   /usr/tivoli/tsm/server/bin/dsmserv quiet
   442412
319672   auditbin
356588   /home/db2as/das/adm/db2dasrrm
372918   /home/db2as/das/bin/db2fmd -i db2as -m /home/db2as/das/lib/libdb2dasgcf.a
381006   aioserver
385248   aioserver
405574   /bin/bsh /usr/lib/sa/sa1 300 12
   1089742   /usr/lib/sa/sadc 300 12 /var/adm/sa/sa13
417992   aioserver
462976   aioserver
520290   aioserver
536678   rpc.lockd
569558   aioserver
589832   nfsd
659546   aioserver
827580   aioserver
864432   aioserver
909422   aioserver
925916   aioserver
958694   aioserver
1007866   aioserver
1016050   aioserver
1024244   aioserver
1028342   aioserver
1032440   aioserver
1048630   aioserver
1179808   aioserver
1274040   aioserver
1310796   aioserver
1323236   aioserver
1356016   aioserver
1380552   aioserver
1454288   aioserver
1458388   aioserver
1462482   aioserver
1466582   aioserver
1470680   aioserver
1474786   aioserver
1478872   aioserver
1482978   aioserver
1487076   aioserver
1490970   aioserver
1499258   aioserver
1523822   aioserver
1536020   aioserver
1548480   aioserver

I also ran entstat -d on all Ethernet interfaces but don't see any errors.

NOTE: This is the next server to be hardened, so don't scold me. :-) <-- I don't know if that's the right word; I just translated it.

Any other command to run?

Thanks indeed.



Expert Comment

by:woolmilkporc
ID: 33430313
There is no swapping at all. pi/po is zero, PgspIn/PgspOut as well.

There must be a process waking up very often, but doing very little work.

Maybe you should activate PROC_Create for root in audit/config (for a short time, of course) to see what's going on.
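
To check the paging columns over a longer sample than three lines, they can simply be summed. A rough sketch, assuming the 19-column vmstat layout from the question (pi in column 6, po in column 7); paging_activity is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: sums the pi (page-ins from paging space) and po
# (page-outs to paging space) columns of `vmstat` output read from stdin.
# Column positions are an assumption based on the layout in the question.
paging_activity() {
    awk '
        # Data rows start with a number (the r column); header rows do not.
        $1 ~ /^[0-9]+$/ && NF >= 7 {
            samples++
            pi += $6
            po += $7
        }
        END { printf "samples=%d pi=%d po=%d\n", samples, pi, po }
    '
}

# Example: the vmstat 1 3 output from the question.
paging_activity <<'EOF'
System configuration: lcpu=4 mem=6144MB ent=2.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 2  0 515253 819301   0   0   0   0    0   0  43  459 313  0  0 99  0  0.02   1.1
 2  0 515253 819301   0   0   0   0    0   0  28  254 319  0  0 99  0  0.01   0.6
 2  0 515254 819300   0   0   0   0    0   0  14  149 242  0  0 99  0  0.01   0.5
EOF
```

Non-zero pi or po totals over a busy period would point towards paging; zeros throughout, as here, do not.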







Author Comment

by:sminfo
ID: 33430384
Hi gheist,

I don't think any swap is in use:

monitor@: /home/monitor # lsps -a
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv


Accepted Solution

by:
woolmilkporc earned 500 total points
ID: 33430395
OK,

do you really have as many as 14 (fourteen) concurrent volume groups in your cluster??

If so, I think a load of 2-4 is not that surprising.
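
The count comes from the gsclvmd children in the proctree listing. A small sketch for repeating it, assuming (as the listing suggests, but not confirmed here) that each child started with a -c argument corresponds to one concurrent VG; count_gsclvmd_children is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: counts gsclvmd child processes in `proctree` (or
# `ps -ef`) output read from stdin. Assumption: one child per concurrent
# volume group, identified by its "-c <id>" argument.
count_gsclvmd_children() {
    awk '/gsclvmd .*-c [0-9a-f]+/ { n++ } END { print n + 0 }'
}

# Example: the parent and the first two children from the listing.
count_gsclvmd_children <<'EOF'
   655478   /usr/sbin/gsclvmd
      270460   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2b337 -v 0
      307216   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b256e0 -v 0
EOF
```

The parent gsclvmd has no -c argument, so it is not counted.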

"scold" is quite a nice word (yet not a nice act). I would never do that, btw.



Expert Comment

by:woolmilkporc
ID: 33430409
As for swapping: See my comment http:#a33430313 above.

Expert Comment

by:woolmilkporc
ID: 33430458
And further this one:

544982   run_rcovcmd

There is something going on with your cluster. Please check hacmp.out.
Some failover or verify/sync action is not complete, and the cluster is complaining about it.

Author Comment

by:sminfo
ID: 33430608
the message on the cluster is:

Aug 13 00:00:16 local0:info /usr/es/sbin/cluster/godmd[1757392]: Failed operation(1) return status 9.
Aug 13 00:00:16  local0:info /usr/es/sbin/cluster/godmd[1089668]: Failed operation(1) return status 9.
Aug 13 00:00:17  local0:info /usr/es/sbin/cluster/godmd[1228940]: Failed operation(1) return status 9.

The documentation says it should be ignored...

So it seems the 14 VGs are causing the high load, no?

Author Comment

by:sminfo
ID: 33430755
By the way, what's run_rcovcmd? It doesn't have a man page. I'm about to leave, but I'll try to connect from home.

thanks to both of you.

Israel.

Expert Comment

by:woolmilkporc
ID: 33430769
Yep,

that message is from nightly auto-verification and can be ignored.

And, sorry, I overlooked that you're obviously running HACMP 5.4.1 or later.
From these releases on, run_rcovcmd does not necessarily indicate an error.

I never saw a cluster containing 14 concurrent VGs.
I think it might very well be that this poor "Group Services Concurrent Logical Volume Management Daemon" (gsclvmd) could cause this high load, the more so because you're probably running more than one or two LVs per VG, am I right?





Expert Comment

by:woolmilkporc
ID: 33430835
run_rcovcmd:

In earlier releases it was used to control the duration of any event (and thus started along with the corresponding event) and complain if this duration was considered "too long".

In the newer releases it seems to run permanently, for whatever reason.

Author Comment

by:sminfo
ID: 33430915

Well, they say 14 VGs is not a high load for the cluster. And yes, it has a lot of LVs inside the VGs.

Have a nice weekend wmp.

Expert Comment

by:gheist
ID: 33435564
A load average of 2 is not high.
I have seen something like 50 on a normally working 10-CPU 32-bit system.
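
One way to put the number in perspective is to divide it by the logical CPU count (lcpu=4 in the vmstat header of the question). A minimal sketch; treating load divided by logical CPUs as the yardstick is an assumption, not an AIX rule, and load_per_cpu is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: reads an `uptime` line on stdin and prints the three
# load averages divided by the CPU count given as the first argument.
load_per_cpu() {
    awk -v ncpu="$1" '
        /load average:/ {
            sub(/.*load average: */, "")      # keep only "2.70, 3.27, 3.08"
            split($0, la, /, */)
            printf "%.2f %.2f %.2f\n", la[1] / ncpu, la[2] / ncpu, la[3] / ncpu
        }
    '
}

# Example: the uptime line from the question, with lcpu=4.
echo "03:15PM   up 99 days,  12:35,  5 users,  load average: 2.70, 3.27, 3.08" | load_per_cpu 4
```

All three values come out below 1 per logical CPU for the figures in the question.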

Expert Comment

by:gheist
ID: 33435567
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv


Delete paging00 and extend hd6.
Having two paging spaces on the same drive is very bad for performance.
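
A quick way to spot this layout is to group the lsps -a rows by physical volume. A sketch, assuming the column order shown above (paging space name first, physical volume second); paging_spaces_per_disk is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: reads `lsps -a` output on stdin and lists physical
# volumes that hold more than one paging space.
paging_spaces_per_disk() {
    awk '
        $1 == "Page" { next }                 # skip the header row
        NF >= 2 { count[$2]++ }
        END {
            for (pv in count)
                if (count[pv] > 1)
                    printf "%s: %d paging spaces\n", pv, count[pv]
        }
    '
}

# Example: the lsps -a output from above.
paging_spaces_per_disk <<'EOF'
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv
EOF
```

No output means every paging space sits on its own physical volume.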

Expert Comment

by:gheist
ID: 33435622
You might need to tune the aioservers (smitty aio).
Probably the AIO request queue gets full and the (Oracle) process is waiting, when AIO should offload this waiting to the kernel.