
AIX 6.1 strange Hangs

questil asked
Hi,
I have an AIX 6.1 machine that has been hanging very often lately.
When it hangs, it doesn't respond when I try to ssh to the server. With telnet I can enter user and password, but that's all; it doesn't proceed after I type the password. It responds the same way when I try to log in from the console (command-line login).
It hung this morning, and after a hard reset I checked errpt, but there was nothing there that might provide a lead to the cause.
After the reboot I left an ssh session to this server open, and now it has hung again (after 4.5 hours): I can't open new sessions, but the session I left open is still alive.
Naturally, I checked errpt again, and again there was nothing that might provide a lead to the cause; the last entry was from 4.5 hours ago, after this morning's boot.

Any idea how to troubleshoot this problem and what might cause it?

Thanks,
Tal

CERTIFIED EXPERT
Most Valuable Expert 2013
Top Expert 2013

Commented:
Hi,

if you still have a working session -

Run  "topas" and check

- CPU consumption
- top processes
- paging (Pgspin/Pgspout)
- memory (%Comp/%Noncomp)
- disk busy (particularly the disks in rootvg)

Any hints?

Further, issue "netstat -m"

Do you see any failed malloc calls?

and "netstat -a | grep -v ^f"

Many open ports?
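
To put a number on "many open ports", the sockets can be grouped by TCP state. This is only a minimal sketch: the function name tcp_states is made up for illustration, and it assumes that the state (LISTEN, ESTABLISHED, CLOSE_WAIT, ...) is the last field of every line that starts with "tcp", which is how "netstat -an" normally prints it.

```shell
# Summarize TCP sockets by state from "netstat -an" output read on stdin.
# Assumption: the state is the last field of each line whose first field
# starts with "tcp" (tcp, tcp4, tcp6). Other lines are ignored.
tcp_states() {
    awk '$1 ~ /^tcp/ { n[$NF]++ }
         END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

# Usage on the affected machine (not executed here):
#   netstat -an | tcp_states
```

A large and growing count in one state (for example CLOSE_WAIT) would be worth a closer look.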

wmp

CERTIFIED EXPERT

Commented:
Also:

ps -ef | wc -l
(number of processes)

ps -ef | grep "[d]efunct" | wc -l
(number of zombies)

ps -e -o vsz,comm | sort -n
(most memory-consuming processes (KB) at the bottom of the list)
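
The two counts can also be taken from a single "ps -ef" run, which helps when every command is slow to come back. A minimal sketch; the helper name ps_summary is hypothetical, and it assumes that zombies carry the text "<defunct>" in the command column.

```shell
# Read "ps -ef" output on stdin and print the total number of processes
# and the number of defunct (zombie) processes.
# Assumption: the first line is the column header and zombies are shown
# with "<defunct>" in the command column.
ps_summary() {
    awk 'NR > 1 { total++; if (index($0, "<defunct>")) z++ }
         END { printf "processes=%d defunct=%d\n", total, z }'
}

# Usage on the affected machine (not executed here):
#   ps -ef | ps_summary
```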

Author

Commented:
"netstat -m" shows no failed malloc calls.
I ran all the other commands, including topas, and waited over 10 minutes for each command, but there was no output. A screenshot of topas is attached.
6.JPG
CERTIFIED EXPERT

Commented:
Although we don't see anything at all - I've got the feeling that there is a memory/paging problem.

Two things you could do (after resetting the machine, of course)

-  Collect debug information

Add to /etc/syslog.conf

*.debug /var/adm/debug.log rotate size 1m files 10

Issue

touch /var/adm/debug.log

and

refresh -s syslogd

-  Collect vmstat data:

Issue

nohup vmstat 60 >> /var/adm/vmstat.out &


Next time after restarting the system examine both files above:

- /var/adm/debug.log for general hints

- The pi/po columns of /var/adm/vmstat.out, looking for a rise to abnormally high values.
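
Scanning a long vmstat.out by eye is tedious, so a small filter can help. This is a sketch under assumptions: the function name high_paging is invented, the limit of 100 is an arbitrary illustrative value, and pi and po are taken to be columns 6 and 7 as in the default vmstat layout (r b avm fre re pi po ...). Because the file has no timestamps, the sample number is printed; with "vmstat 60" sample n is roughly n minutes after the start.

```shell
# Print vmstat samples whose page-in (pi) or page-out (po) value exceeds
# a limit. Header and blank lines are skipped because their first field
# is not a number.
# Assumptions: pi is column 6 and po is column 7 (default vmstat layout);
# the default limit of 100 is only an illustrative value.
high_paging() {
    limit=${1:-100}
    awk -v limit="$limit" '
        $1 ~ /^[0-9]+$/ && NF >= 7 {
            n++
            if ($6 + 0 > limit + 0 || $7 + 0 > limit + 0)
                printf "sample %d: pi=%s po=%s\n", n, $6, $7
        }'
}

# Usage after the next restart (not executed here):
#   high_paging 100 < /var/adm/vmstat.out
```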

Also consider enabling system activity reports (if you didn't do it already):

Issue (as root):

crontab -e adm

and uncomment (activate) the 5 "SYSTEM ACTIVITY REPORTS" lines:

0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvmd &


Save the crontab.

The "sar" utility will show you what's going on with your system, particularly see

sar -r
(paging)

and

sar -d
(disk activity)


wmp

Author

Commented:
Thanks WMP,

I did everything you suggested, but as for the "system activity reports" there were only 4 lines to uncomment:

#=================================================================
#      SYSTEM ACTIVITY REPORTS
#  8am-5pm activity reports every 20 mins during weekdays.
#  activity reports every an hour on Saturday and Sunday.
#  6pm-7am activity reports every an hour during weekdays.
#  Daily summary prepared at 18:05.
#=================================================================
0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvm &
#=================================================================

I ran sar -r 1 10:
bash-3.2# sar -r 1 10

AIX israix06 1 6 00C587044C00    11/03/11

System configuration: lcpu=4 mem=7936MB  mode=Capped

10:46:20   slots cycle/s fault/s  odio/s
10:46:21 4189766    0.00  399.50    0.00
10:46:22 4189766    0.00    0.50    0.00
10:46:23 4189766    0.00   95.00    0.00
10:46:24 4189766    0.00   94.50    0.00
10:46:25 4189766    0.00  478.02    0.00
10:46:26 4189766    0.00 1030.47    0.00
10:46:27 4189748    0.00 1531.95    0.00
10:46:28 4189748    0.00 2105.50    0.00
10:46:29 4189748    0.00  179.50    0.00
10:46:30 4189748    0.00  114.50    0.00

Average  4189759       0    1211       0

I also ran sar -d, but there is no disk activity right now.

BTW - the machine has 8GB RAM, and after I installed it I increased the paging space to 16GB.

Thanks,
Tal
CERTIFIED EXPERT

Commented:
OK,

seems I forgot how to count. Of course there are only 4 lines to uncomment.

Your paging rate looks rather normal (please note that fault/s is not a count of page faults that generate I/O, because some page faults can be resolved without I/O).

The paging space size of 16 GB is most probably sufficient.
My suspicion is not a "page space full" condition (this would have been reported in errpt anyway),
but rather high paging I/O rates.

I fear we will have to wait for the next hang. The sar data will survive a reboot, so we can easily check afterwards.
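
For checking afterwards, sar can read a saved daily file with its -f flag instead of sampling live. A minimal sketch, assuming the collector writes to the usual /var/adm/sa/saDD files (DD = day of the month); the helper name sa_file is made up for illustration.

```shell
# Build the path of the daily sar data file for a given day of the month.
# Assumption: sa1 writes its data to /var/adm/sa/saDD.
sa_file() {
    day=${1#0}                      # drop a leading zero so "08" is not read as octal
    printf '/var/adm/sa/sa%02d\n' "$day"
}

# Usage after a hang on the 6th (not executed here):
#   sar -r -f "$(sa_file 6)"
#   sar -d -f "$(sa_file 6)"
```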

wmp

Author

Commented:
Thanks WMP!
I'll keep you posted.

Author

Commented:
Hi,

The system crashed again on Nov 6th. The log files are attached.
Attachments: debug.log, vmstat.out.log
CERTIFIED EXPERT
Commented:
Well,

it's not a paging problem, contrary to what I first assumed.

But another thing:

Is your DB2 database on the server in question always up and running, or do you start it only from time to time?

I'm asking because your log shows unusually heavy activity of DB2 STMM (ca. 15 messages per half-minute interval), but after system restart at ca. 14:16 there were no messages anymore!

The self-tuning memory manager should only run once every 3 minutes, not 30 times per minute!

So I strongly suspect an STMM/DB2 problem here, which you should discuss with IBM support soon!
Keep the "debug.log" for their reference, I think it could be useful.
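
If support asks for the message rate, it can be counted per minute straight from the log. This is only a sketch: the function name stmm_per_minute is hypothetical, and it assumes classic syslog lines ("Mon DD HH:MM:SS host ...") in which the DB2 messages contain the literal string "STMM". Adjust the pattern to whatever the real lines look like.

```shell
# Count log lines mentioning STMM, grouped by minute.
# Assumptions: each line starts with "Mon DD HH:MM:SS" and the DB2
# self-tuning memory manager messages contain the text "STMM".
# Output is sorted as text, which is chronological within one day.
stmm_per_minute() {
    awk '/STMM/ { n[$1 " " $2 " " substr($3, 1, 5)]++ }
         END { for (k in n) printf "%s %d\n", k, n[k] }' | sort
}

# Usage (not executed here):
#   stmm_per_minute < /var/adm/debug.log
```

Counts far above the expected one run every 3 minutes would back up the suspicion above.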

By the way, on another occasion you should clean up /etc/rc.d/init.d and /etc/rc.d/rc2.d in order to remove the obsolete starting and stopping of "fglam", which seems not to be installed on your machine.

Good luck!

wmp