sminfo

asked on

High load on an AIX 5.3 server with no CPU or IO used

Hi,

I have a strangely high load (between 2 and 5) constantly, yet I see no CPU or I/O being used on that server.
From topas:

CPU  User%  Kern%  Wait%  Idle%  Physc   Entc  
ALL    0.1    0.3    0.0   99.6   0.01    0.6

Name            PID  CPU%  PgSp Owner
topas       1757256   0.1   1.9 root
clstrmgr     331942   0.0   5.0 root
getty        241812   0.0   0.5 root
gil           69666   0.0   0.9 root
hats_dis     389356   0.0   1.8 root
hats_nim     507962   0.0   1.9 root
hatsd        704634   0.0   9.3 root
hats_nim     487456   0.0   1.9 root
hats_nim     450660   0.0   1.9 root

Disk    Busy%     KBPS     TPS KB-Read KB-Writ
dac0      0.0      2.0     4.0     1.0     1.0
hdisk2    0.0      1.0     2.0     0.5     0.5
hdisk17   0.0      1.0     2.0     0.5     0.5
dac1utm   0.0      0.0     0.0     0.0     0.0
dac0utm   0.0      0.0     0.0     0.0     0.0


 EVENTS/QUEUES    FILE/TTY
 Cswitch     312  Readch     7346
 Syscall     344  Writech    3152
 Reads        19  Rawin         8
 Writes       25  Ttyout     1059
 Forks         0  Igets         0
 Execs         0  Namei        24
 Runqueue    1.5  Dirblk        0
 Waitqueue   0.0

See attached cpu image.

/# vmstat 1 3
System configuration: lcpu=4 mem=6144MB ent=2.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 2  0 515253 819301   0   0   0   0    0   0  43  459 313  0  0 99  0  0.02   1.1
 2  0 515253 819301   0   0   0   0    0   0  28  254 319  0  0 99  0  0.01   0.6
 2  0 515254 819300   0   0   0   0    0   0  14  149 242  0  0 99  0  0.01   0.5

# uptime
03:15PM   up 99 days,  12:35,  5 users,  load average: 2.70, 3.27, 3.08

Where can I look to find the cause of this issue?

Thanks
cpu.png
woolmilkporc

Salut,

now that I'm back from France, let's look at this one.

The only thing which could catch one's eye is the gil process.

It's a kernel process ("Generalized Interrupt Level") which deals with TCP network acknowledgements and, more importantly, retransmissions.

So please examine your network traffic, e.g. the "errs" column of netstat 1 (meaning a 1-second interval),
or the "packets" column for high values.
Or check "topas" (left middle).
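
For example (a sketch; en0 stands for whichever interface carries your traffic):

netstat 1                        # per-second packet/error counts; watch the "errs" columns
netstat -s | grep -i retrans     # cumulative TCP retransmission counters
entstat -d en0 | grep -i error   # adapter-level error counters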

Is this perhaps an NFS/Samba server with a lot of network traffic?

wmp

sminfo

ASKER

See:

bsa550q2:/# netstat 1
    input   (en0)      output           input   (Total)    output
 packets  errs  packets  errs colls  packets  errs  packets  errs colls
1713251640     0 312324300     3     0 2856788509     0 1004924055    13     0
       7     0        3     0     0       15     0        9     0     0
       4     0        2     0     0       14     0       11     0     0
       7     0        4     0     0       13     0        7     0     0
       6     0        3     0     0       20     0       13     0     0
       3     0        4     0     0       13     0       18     0     0
       7     0        3     0     0       18     0       14     0     0
       5     0        2     0     0       11     0        8     0     0
       3     0        5     0     0       12     0       13     0     0
       3     0        2     0     0       11     0       11     0     0
       2     0        2     0     0        6     0        5     0     0
       3     0        2     0     0        9     0        8     0     0
       3     0        4     0     0       12     0       12     0     0
       5     0        2     0     0       16     0       12     0     0
      12     0        8     0     0       19     0       11     0     0
       3     0        4     0     0       13     0       14     0     0
       8     0        6     0     0       20     0       15     0     0
       6     0        6     0     0       14     0       15     0     0
       3     0        3     0     0        8     0        7     0     0
       7     0        3     0     0       14     0        9     0     0
       4     0        5     0     0       13     0       13     0     0


topas.bmp
sminfo

ASKER

... and there's no NFS on that server... this issue is really, really odd.

regards
Israel.
Swapping is happening; it causes the highest-priority I/O.

Setting vmtune -M 1024 -m 1000 fixes it.

A longer-lived fix is taming processes with WLM AND buying extra memory.
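
Before changing any tunables it's worth confirming the paging activity first; a quick sketch using standard AIX commands:

lsps -s                      # paging-space usage summary
vmstat -s | grep -i page     # cumulative paging statistics since boot
vmstat 2 5                   # watch the pi/po columns live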
sminfo

ASKER

wmp, sorry... NFS is running on that server, but it's not in use. Look at proctree:

# proctree
131244   /usr/sbin/srcmstr
   94382   /usr/sbin/portmap
   127206   /usr/sbin/snmpd
   1830962   /usr/sbin/tftpd -n
      200874   /usr/sbin/tftpd -n
   217222   /usr/sbin/syslogd
   233604   /usr/es/sbin/cluster/clcomd -d
   655478   /usr/sbin/gsclvmd
      270460   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2b337 -v 0
      307216   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b256e0 -v 0
      397548   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b1fb14 -v 0
      438494   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d600000001181bdd4d5e -v 0
      491538   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b30eb5 -v 0
      524428   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b228fc -v 0
      540782   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b33c6a -v 0
      565468   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b19f89 -v 0
      577748   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b171b2 -v 0
      643282   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b1cd3d -v 0
      700644   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2e0eb -v 0
      712866   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b142a7 -v 0
      766146   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d600000001181bdd7d54 -v 0
      782558   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b28578 -v 0
   295058   /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
   299156   /usr/java5/bin/java -Xbootclasspath/a:/var/websm/lwi/runtime/core/rcp/eclipse/p
   303224   /usr/sbin/rpc.statd -d 0 -t 50
   327870   /usr/sbin/rsct/bin/IBM.DRMd
   331942   /usr/es/sbin/cluster/clstrmgr
      544982   run_rcovcmd
   335896   /usr/sbin/snmpmibd
   704634   /usr/sbin/rsct/bin/hatsd -n 4 -o deadManSwitch
      339970   /usr/sbin/rsct/bin/hats_nim
      389356   /usr/sbin/rsct/bin/hats_diskhb_nim
      425984   /usr/sbin/rsct/bin/hats_diskhb_nim
      450660   /usr/sbin/rsct/bin/hats_nim
      487456   /usr/sbin/rsct/bin/hats_nim
      507962   /usr/sbin/rsct/bin/hats_nim
   348332   /usr/sbin/rsct/bin/vac5/IBM.CSMAgentRMd
   368866   /usr/sbin/rpc.lockd -d 0
   376990   /usr/es/sbin/cluster/clinfo
   413704   /usr/sbin/muxatmd
   434220   hagsd grpsvcs
   471134   /usr/sbin/nfsd 3891
   499800   /usr/sbin/qdaemon
   503884   /usr/sbin/biod 6
   528568   /usr/sbin/rsct/bin/IBM.HostRMd
   548942   /usr/sbin/xntpd
   626866   /usr/sbin/writesrv
   630870   /usr/sbin/rpc.mountd
   634996   haemd HACMP 4 xxcccc_cluster SECNOSUPPORT
   639210   /usr/sbin/aixmibd
   688292   harmad -t HACMP -n xxcccc_cluster
   770228   /usr/sbin/hostmibd
   794768   sendmail: accepting connections
   893026   /usr/sbin/inetd
      1228886   rlogind rlogind
         663642   -ksh
            1609826   proctree
      1786052   telnetd telnetd -a
         884840   -ksh
      1011938   rlogind rlogind
         1712352   -ksh
      1347782   telnetd telnetd -a
         938142   -ksh
            622756   -sh
   1261722   /usr/sbin/sshd a
98456   /usr/sbin/cron
106710   AtapeManager
114892   /usr/sbin/syncd 60
123132   random
147532   aioserver
163996   /usr/lib/errdemon
172116   /usr/dt/bin/dtlogin -daemon
184410   /usr/ccs/bin/shlap64
204992   /usr/bin/xmwlm -T -s 300 -R 1 -r 6 -o /etc/perf/daily/ -ypersistent=1 -ystart_t
213120   /usr/sbin/uprintfd
237610   aioserver
241812   /usr/sbin/getty /dev/console
249992   /opt/IBM_DS4000/jre/bin/java -Djava.compiler=NONE -Ddevmgr.datadir=/var/opt/SM
262286   /usr/opt/db2_08_01/bin/db2fmcd
278672   /opt/IBM_DS4000/jre/bin/java -Djava.compiler=NONE -Djava.library.path=/usr/SMag
286908   xmtopas -p3
311376   /usr/tivoli/tsm/server/bin/dsmserv quiet
   442412
319672   auditbin
356588   /home/db2as/das/adm/db2dasrrm
372918   /home/db2as/das/bin/db2fmd -i db2as -m /home/db2as/das/lib/libdb2dasgcf.a
381006   aioserver
385248   aioserver
405574   /bin/bsh /usr/lib/sa/sa1 300 12
   1089742   /usr/lib/sa/sadc 300 12 /var/adm/sa/sa13
417992   aioserver
462976   aioserver
520290   aioserver
536678   rpc.lockd
569558   aioserver
589832   nfsd
659546   aioserver
827580   aioserver
864432   aioserver
909422   aioserver
925916   aioserver
958694   aioserver
1007866   aioserver
1016050   aioserver
1024244   aioserver
1028342   aioserver
1032440   aioserver
1048630   aioserver
1179808   aioserver
1274040   aioserver
1310796   aioserver
1323236   aioserver
1356016   aioserver
1380552   aioserver
1454288   aioserver
1458388   aioserver
1462482   aioserver
1466582   aioserver
1470680   aioserver
1474786   aioserver
1478872   aioserver
1482978   aioserver
1487076   aioserver
1490970   aioserver
1499258   aioserver
1523822   aioserver
1536020   aioserver
1548480   aioserver

I also ran entstat -d on all Ethernet interfaces but don't see any errors.

NOTE: This is the next server to be hardened, so don't scold me. :-) <-- Don't know if that's the right word, I just translated it.

Any other command to run?

Thanks indeed


There is no swapping at all. pi/po is zero, PgspIn/PgspOut as well.

There must be a process waking up very often, but doing very little work.

Maybe you should activate PROC_Create for root in audit/config (for a short time, of course) to see what's going on.
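
A minimal sketch of that, assuming the default bin-mode audit setup (the class name "procmon" is made up; use any name you like). In /etc/security/audit/config add:

classes:
        procmon = PROC_Create

users:
        root = procmon

Then:

audit start                          # begin collecting
sleep 120                            # let it run for a short while
audit shutdown                       # stop collecting
auditpr -v < /audit/trail | more     # look for bursts of PROC_Create events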

sminfo

ASKER

Hi hgeist,

I don't think there's any swap in use:

monitor@: /home/monitor # lsps -a
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv

ASKER CERTIFIED SOLUTION
woolmilkporc
As for swapping: See my comment http:#a33430313 above.
And furthermore, this one:

544982   run_rcovcmd

There is something going on with your cluster. Please check hacmp.out.
Some failover or verify/sync action is not complete, and the cluster is complaining about it.
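
For example (a sketch; this assumes the usual default log location, /tmp/hacmp.out, on these releases):

grep 'EVENT START' /tmp/hacmp.out | tail     # the most recent cluster events
grep -i error /tmp/hacmp.out | tail          # recent complaints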
sminfo

ASKER

the message on the cluster is:

Aug 13 00:00:16 local0:info /usr/es/sbin/cluster/godmd[1757392]: Failed operation(1) return status 9.
Aug 13 00:00:16  local0:info /usr/es/sbin/cluster/godmd[1089668]: Failed operation(1) return status 9.
Aug 13 00:00:17  local0:info /usr/es/sbin/cluster/godmd[1228940]: Failed operation(1) return status 9.

The doc says it should be ignored...

So it seems the 14 VGs are causing the high load, no?
sminfo

ASKER

btw, what's run_rcovcmd? It doesn't have a man page. I'm about to leave, but I'll try to connect from home.

thanks to both of you.

Israel.
Yep,

that message is from the nightly auto-verification and can be ignored.

And, sorry, I overlooked that you're obviously running HACMP 5.4.1 or later.
From these releases on, run_rcovcmd is not necessarily related to an error.

I have never seen a cluster containing 14 concurrent VGs.
I think it might very well be that this poor "Group Services Concurrent Logical Volume Management Daemon" (gsclvmd) is causing the high load, the more so because you're probably running more than one or two LVs per VG. Am I right?
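
You can count them quickly like this (a sketch; it covers the currently active VGs only):

for vg in $(lsvg -o); do
    echo "$vg: $(lsvg -l $vg | awk 'NR>2' | wc -l) LVs"    # skip the two header lines
done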




run_rcovcmd:

In earlier releases it was used to control the duration of any event (it was started along with the corresponding event) and to complain if that duration was considered "too long".

In the newer releases it seems to run permanently, for whatever reason.
sminfo

ASKER


Well, they say 14 VGs is not a high load for the cluster. And yes, there are a lot of LVs inside the VGs.

Have a nice weekend wmp.
A load average of 2 is not high.
I have seen around 50 on a normally working 10-CPU 32-bit system.
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv


Delete paging00 and extend hd6; having two paging areas on the same drive is very bad for performance.
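
For example (a sketch; size the increase to your needs):

swapoff /dev/paging00        # deactivate the extra paging space
rmps paging00                # remove it
chps -s 32 hd6               # grow hd6 by 32 logical partitions (pick your own number)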
You might need to tune the aioservers (smitty aio); probably the AIO request queue gets full and the (Oracle) process waits, when AIO should offload that waiting to the kernel.
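
To check before changing anything (a sketch; aio0 is the legacy AIO device):

lsattr -El aio0              # current minservers/maxservers/maxreqs
iostat -A 2 3                # async I/O statistics, if AIO is enabled
ps -k | grep -c aioserver    # how many aioserver kprocs are running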