sminfo

asked on

High load on an AIX 5.3 server with no CPU or IO used

Hi,

I have a strangely high load (between 2 and 5) constantly, yet I see no CPU or I/O being used on that server.
From topas:

CPU  User%  Kern%  Wait%  Idle%  Physc   Entc  
ALL    0.1    0.3    0.0   99.6   0.01    0.6

Name            PID  CPU%  PgSp Owner
topas       1757256   0.1   1.9 root
clstrmgr     331942   0.0   5.0 root
getty        241812   0.0   0.5 root
gil           69666   0.0   0.9 root
hats_dis     389356   0.0   1.8 root
hats_nim     507962   0.0   1.9 root
hatsd        704634   0.0   9.3 root
hats_nim     487456   0.0   1.9 root
hats_nim     450660   0.0   1.9 root

Disk    Busy%     KBPS     TPS KB-Read KB-Writ
dac0      0.0      2.0     4.0     1.0     1.0
hdisk2    0.0      1.0     2.0     0.5     0.5
hdisk17   0.0      1.0     2.0     0.5     0.5
dac1utm   0.0      0.0     0.0     0.0     0.0
dac0utm   0.0      0.0     0.0     0.0     0.0


 EVENTS/QUEUES    FILE/TTY
 Cswitch     312  Readch     7346
 Syscall     344  Writech    3152
 Reads        19  Rawin         8
 Writes       25  Ttyout     1059
 Forks         0  Igets         0
 Execs         0  Namei        24
 Runqueue    1.5  Dirblk        0
 Waitqueue   0.0

See attached cpu image.

/# vmstat 1 3
System configuration: lcpu=4 mem=6144MB ent=2.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 2  0 515253 819301   0   0   0   0    0   0  43  459 313  0  0 99  0  0.02   1.1
 2  0 515253 819301   0   0   0   0    0   0  28  254 319  0  0 99  0  0.01   0.6
 2  0 515254 819300   0   0   0   0    0   0  14  149 242  0  0 99  0  0.01   0.5

# uptime
03:15PM   up 99 days,  12:35,  5 users,  load average: 2.70, 3.27, 3.08

Where can I look to find the cause of this issue?

Thanks
cpu.png
woolmilkporc

Salut,

now that I'm back from France, let's look at this one.

The only thing which could catch one's eye is the gil process.

It's a kernel process ("Generalized Interrupt Level") which deals with TCP network acknowledgements and, more importantly, retransmissions.

So please examine your network traffic, e.g. the "errs" column of netstat 1 (meaning a 1-second interval),
or the "packets" column for high values.
Or check "topas" (left middle).
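
For example (a sketch; en0 stands for whichever interface carries your traffic):

netstat 1                        # per-second packet/error counts; watch the "errs" columns
netstat -s | grep -i retrans     # cumulative TCP retransmission counters
entstat -d en0 | grep -i error   # adapter-level error counters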

Is this perhaps an NFS/Samba server with a lot of network traffic?

wmp

sminfo

ASKER

See:

bsa550q2:/# netstat 1
    input   (en0)      output           input   (Total)    output
 packets  errs  packets  errs colls  packets  errs  packets  errs colls
1713251640     0 312324300     3     0 2856788509     0 1004924055    13     0
       7     0        3     0     0       15     0        9     0     0
       4     0        2     0     0       14     0       11     0     0
       7     0        4     0     0       13     0        7     0     0
       6     0        3     0     0       20     0       13     0     0
       3     0        4     0     0       13     0       18     0     0
       7     0        3     0     0       18     0       14     0     0
       5     0        2     0     0       11     0        8     0     0
       3     0        5     0     0       12     0       13     0     0
       3     0        2     0     0       11     0       11     0     0
       2     0        2     0     0        6     0        5     0     0
       3     0        2     0     0        9     0        8     0     0
       3     0        4     0     0       12     0       12     0     0
       5     0        2     0     0       16     0       12     0     0
      12     0        8     0     0       19     0       11     0     0
       3     0        4     0     0       13     0       14     0     0
       8     0        6     0     0       20     0       15     0     0
       6     0        6     0     0       14     0       15     0     0
       3     0        3     0     0        8     0        7     0     0
       7     0        3     0     0       14     0        9     0     0
       4     0        5     0     0       13     0       13     0     0


topas.bmp
sminfo

ASKER

... and there's no NFS on that server... this issue is really, really odd.

regards
Israel.
Swapping is happening; it causes the highest-priority I/O.

Setting vmtune -M 1024 -m 1000 fixes it.

A longer-lived fix is taming processes with WLM AND buying extra memory.
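
Before changing any tunables it's worth confirming the paging activity first; a quick sketch using standard AIX commands:

lsps -s                      # paging-space usage summary
vmstat -s | grep -i page     # cumulative paging statistics since boot
vmstat 2 5                   # watch the pi/po columns live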
sminfo

ASKER

wmp, sorry... NFS is running on that server, but it's not in use. Look at proctree:

# proctree
131244   /usr/sbin/srcmstr
   94382   /usr/sbin/portmap
   127206   /usr/sbin/snmpd
   1830962   /usr/sbin/tftpd -n
      200874   /usr/sbin/tftpd -n
   217222   /usr/sbin/syslogd
   233604   /usr/es/sbin/cluster/clcomd -d
   655478   /usr/sbin/gsclvmd
      270460   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2b337 -v 0
      307216   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b256e0 -v 0
      397548   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b1fb14 -v 0
      438494   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d600000001181bdd4d5e -v 0
      491538   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b30eb5 -v 0
      524428   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b228fc -v 0
      540782   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b33c6a -v 0
      565468   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b19f89 -v 0
      577748   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b171b2 -v 0
      643282   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b1cd3d -v 0
      700644   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b2e0eb -v 0
      712866   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b142a7 -v 0
      766146   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d600000001181bdd7d54 -v 0
      782558   /usr/sbin/gsclvmd -r 30 -i 300 -t 300 -c 000030910000d6000000011819b28578 -v 0
   295058   /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
   299156   /usr/java5/bin/java -Xbootclasspath/a:/var/websm/lwi/runtime/core/rcp/eclipse/p
   303224   /usr/sbin/rpc.statd -d 0 -t 50
   327870   /usr/sbin/rsct/bin/IBM.DRMd
   331942   /usr/es/sbin/cluster/clstrmgr
      544982   run_rcovcmd
   335896   /usr/sbin/snmpmibd
   704634   /usr/sbin/rsct/bin/hatsd -n 4 -o deadManSwitch
      339970   /usr/sbin/rsct/bin/hats_nim
      389356   /usr/sbin/rsct/bin/hats_diskhb_nim
      425984   /usr/sbin/rsct/bin/hats_diskhb_nim
      450660   /usr/sbin/rsct/bin/hats_nim
      487456   /usr/sbin/rsct/bin/hats_nim
      507962   /usr/sbin/rsct/bin/hats_nim
   348332   /usr/sbin/rsct/bin/vac5/IBM.CSMAgentRMd
   368866   /usr/sbin/rpc.lockd -d 0
   376990   /usr/es/sbin/cluster/clinfo
   413704   /usr/sbin/muxatmd
   434220   hagsd grpsvcs
   471134   /usr/sbin/nfsd 3891
   499800   /usr/sbin/qdaemon
   503884   /usr/sbin/biod 6
   528568   /usr/sbin/rsct/bin/IBM.HostRMd
   548942   /usr/sbin/xntpd
   626866   /usr/sbin/writesrv
   630870   /usr/sbin/rpc.mountd
   634996   haemd HACMP 4 xxcccc_cluster SECNOSUPPORT
   639210   /usr/sbin/aixmibd
   688292   harmad -t HACMP -n xxcccc_cluster
   770228   /usr/sbin/hostmibd
   794768   sendmail: accepting connections
   893026   /usr/sbin/inetd
      1228886   rlogind rlogind
         663642   -ksh
            1609826   proctree
      1786052   telnetd telnetd -a
         884840   -ksh
      1011938   rlogind rlogind
         1712352   -ksh
      1347782   telnetd telnetd -a
         938142   -ksh
            622756   -sh
   1261722   /usr/sbin/sshd a
98456   /usr/sbin/cron
106710   AtapeManager
114892   /usr/sbin/syncd 60
123132   random
147532   aioserver
163996   /usr/lib/errdemon
172116   /usr/dt/bin/dtlogin -daemon
184410   /usr/ccs/bin/shlap64
204992   /usr/bin/xmwlm -T -s 300 -R 1 -r 6 -o /etc/perf/daily/ -ypersistent=1 -ystart_t
213120   /usr/sbin/uprintfd
237610   aioserver
241812   /usr/sbin/getty /dev/console
249992   /opt/IBM_DS4000/jre/bin/java -Djava.compiler=NONE -Ddevmgr.datadir=/var/opt/SM
262286   /usr/opt/db2_08_01/bin/db2fmcd
278672   /opt/IBM_DS4000/jre/bin/java -Djava.compiler=NONE -Djava.library.path=/usr/SMag
286908   xmtopas -p3
311376   /usr/tivoli/tsm/server/bin/dsmserv quiet
   442412
319672   auditbin
356588   /home/db2as/das/adm/db2dasrrm
372918   /home/db2as/das/bin/db2fmd -i db2as -m /home/db2as/das/lib/libdb2dasgcf.a
381006   aioserver
385248   aioserver
405574   /bin/bsh /usr/lib/sa/sa1 300 12
   1089742   /usr/lib/sa/sadc 300 12 /var/adm/sa/sa13
417992   aioserver
462976   aioserver
520290   aioserver
536678   rpc.lockd
569558   aioserver
589832   nfsd
659546   aioserver
827580   aioserver
864432   aioserver
909422   aioserver
925916   aioserver
958694   aioserver
1007866   aioserver
1016050   aioserver
1024244   aioserver
1028342   aioserver
1032440   aioserver
1048630   aioserver
1179808   aioserver
1274040   aioserver
1310796   aioserver
1323236   aioserver
1356016   aioserver
1380552   aioserver
1454288   aioserver
1458388   aioserver
1462482   aioserver
1466582   aioserver
1470680   aioserver
1474786   aioserver
1478872   aioserver
1482978   aioserver
1487076   aioserver
1490970   aioserver
1499258   aioserver
1523822   aioserver
1536020   aioserver
1548480   aioserver

I also ran entstat -d on all Ethernet interfaces but don't see any errors.

NOTE: This is the next server to be hardened, so don't scold me. :-) <-- Don't know if that's the right word, I just translated it.

Any other command to run?

Thanks indeed


There is no swapping at all. pi/po is zero, PgspIn/PgspOut as well.

There must be a process waking up very often, but doing very little work.

Maybe you should activate PROC_Create for root in audit/config (for a short time, of course) to see what's going on.
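
A minimal sketch of that, assuming the default bin-mode audit setup (the class name "procmon" is made up; use any name you like). In /etc/security/audit/config add:

classes:
        procmon = PROC_Create

users:
        root = procmon

Then:

audit start                          # begin collecting
sleep 120                            # let it run for a short while
audit shutdown                       # stop collecting
auditpr -v < /audit/trail | more     # look for bursts of PROC_Create events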

sminfo

ASKER

Hi hgeist,

I don't think there's any swap in use:

monitor@: /home/monitor # lsps -a
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv

ASKER CERTIFIED SOLUTION
woolmilkporc
As for swapping: See my comment http:#a33430313 above.
And furthermore, this one:

544982   run_rcovcmd

There is something going on with your cluster. Please check hacmp.out.
Some failover or verify/sync action is not complete, and the cluster is complaining about it.
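
For example (a sketch; this assumes the usual default log location, /tmp/hacmp.out, on these releases):

grep 'EVENT START' /tmp/hacmp.out | tail     # the most recent cluster events
grep -i error /tmp/hacmp.out | tail          # recent complaints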
sminfo

ASKER

the message on the cluster is:

Aug 13 00:00:16 local0:info /usr/es/sbin/cluster/godmd[1757392]: Failed operation(1) return status 9.
Aug 13 00:00:16  local0:info /usr/es/sbin/cluster/godmd[1089668]: Failed operation(1) return status 9.
Aug 13 00:00:17  local0:info /usr/es/sbin/cluster/godmd[1228940]: Failed operation(1) return status 9.

The doc says it should be ignored...

So it seems the 14 VGs are causing the high load, no?
sminfo

ASKER

btw, what's run_rcovcmd? It doesn't have a man page. I'm about to leave, but I'll try to connect from home.

thanks to both of you.

Israel.
Yep,

that message is from the nightly auto-verification and can be ignored.

And, sorry, I overlooked that you're obviously running HACMP 5.4.1 or later.
From these releases on, run_rcovcmd is not necessarily related to an error.

I have never seen a cluster containing 14 concurrent VGs.
I think it might very well be that this poor "Group Services Concurrent Logical Volume Management Daemon" (gsclvmd) is causing the high load, the more so because you're probably running more than one or two LVs per VG. Am I right?
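
You can count them quickly like this (a sketch; it covers the currently active VGs only):

for vg in $(lsvg -o); do
    echo "$vg: $(lsvg -l $vg | awk 'NR>2' | wc -l) LVs"    # skip the two header lines
done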




run_rcovcmd:

In earlier releases it was used to control the duration of any event (it was started along with the corresponding event) and to complain if that duration was considered "too long".

In the newer releases it seems to run permanently, for whatever reason.
sminfo

ASKER


Well, they say 14 VGs is not a high load for the cluster. And yes, there are a lot of LVs inside the VGs.

Have a nice weekend wmp.
A load average of 2 is not high.
I have seen around 50 on a normally working 10-CPU 32-bit system.
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg        4096MB     1   yes   yes    lv
hd6             hdisk0            rootvg        4096MB     1   yes   yes    lv


Delete paging00 and extend hd6; having two paging areas on the same drive is very bad for performance.
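
For example (a sketch; size the increase to your needs):

swapoff /dev/paging00        # deactivate the extra paging space
rmps paging00                # remove it
chps -s 32 hd6               # grow hd6 by 32 logical partitions (pick your own number)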
You might need to tune the aioservers (smitty aio); probably the AIO request queue gets full and the (Oracle) process waits, when AIO should offload that waiting to the kernel.
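
To check before changing anything (a sketch; aio0 is the legacy AIO device):

lsattr -El aio0              # current minservers/maxservers/maxreqs
iostat -A 2 3                # async I/O statistics, if AIO is enabled
ps -k | grep -c aioserver    # how many aioserver kprocs are running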