questil

AIX 6.1 strange Hangs

Hi,
I have an AIX 6.1 machine that has been hanging very often lately.
When it hangs, it doesn't respond to ssh. With telnet I can enter the user name and password, but nothing happens after I type the password. It behaves the same way when I try to log in from the console (command-line login).
It hung this morning, and after a hard reset I checked errpt, but there was nothing there that might point to the cause.
After the reboot I left an ssh session to this server open, and now it has hung again (after 4.5 hours): I can't open new sessions, but the session I left open is still alive.
Naturally, I checked errpt again, and again there was nothing that might point to the cause; the last entry was from 4.5 hours ago, right after this morning's boot.

Any idea how to troubleshoot this problem and what might cause it?

Thanks,
Tal
woolmilkporc

Hi,

if you still have a working session -

Run  "topas" and check

- CPU consumption
- top processes
- paging (Pgspin/Pgspout)
- memory (%Comp/%Noncomp)
- disk busy (particularly the disks in rootvg)

Any hints?

Further, issue "netstat -m"

Do you see any failed malloc calls?

and "netstat -a | grep -v ^f"

Many open ports?

wmp

Also:

ps -ef | wc -l
(number of processes)

ps -ef | grep '[d]efunct' | wc -l
(number of zombies)

ps -e -o vsz,comm | sort -n
(most memory consuming process (KB) at the bottom of the list)
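
The three ps checks above can be wrapped into small helpers so they are quick to run from the one session that still works. This is only a sketch: the function names count_defunct and top_vsz are made up for illustration, and it assumes zombies show up as "<defunct>" in the ps -ef listing. Unlike the sort -n above, top_vsz lists the largest process first.

```shell
#!/bin/sh
# Count zombie entries in `ps -ef` output read from stdin.
count_defunct() {
    # grep -c prints 0 (and exits non-zero) when nothing matches,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c '<defunct>' || true
}

# Print the N largest processes (default 5) from `ps -e -o vsz,comm`
# output read from stdin, largest VSZ (KB) first.
top_vsz() {
    # Drop the header line, then sort numerically in reverse.
    sed 1d | sort -rn | head -n "${1:-5}"
}

# Typical use on the affected machine:
#   ps -ef | count_defunct
#   ps -e -o vsz,comm | top_vsz 10
```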
questil

ASKER

"netstat -m" no failed malloc calls
I ran all the other commands, including topas, and waited over 10 minutes for each one, but there was no output... A screenshot of topas is attached.
6.JPG
Although we don't see anything at all - I've got the feeling that there is a memory/paging problem.

Two things you could do (after resetting the machine, of course)

-  Collect debug information

Add to /etc/syslog.conf

*.debug /var/adm/debug.log rotate size 1m files 10

Issue

touch /var/adm/debug.log

and

refresh -s syslogd

-  Collect vmstat data:

Issue

nohup vmstat 60 >> /var/adm/vmstat.out &
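
If the syslog step has to be repeated on several machines, the rule can be added idempotently so that running it twice does not duplicate the line. This is a sketch only: add_syslog_rule is a hypothetical helper name, and it assumes the local grep supports the POSIX -q, -x and -F options.

```shell
#!/bin/sh
# Append a syslog rule to a config file unless the exact line is already there.
# Usage: add_syslog_rule FILE RULE
add_syslog_rule() {
    conf=$1
    rule=$2
    # -x: match the whole line, -F: fixed string (the rule contains "*" and ".").
    grep -qxF -e "$rule" "$conf" 2>/dev/null || printf '%s\n' "$rule" >> "$conf"
}

# On the AIX box (as root) the steps above would then become:
#   add_syslog_rule /etc/syslog.conf '*.debug /var/adm/debug.log rotate size 1m files 10'
#   touch /var/adm/debug.log
#   refresh -s syslogd
```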


After the next hang and restart, examine both files above:

- /var/adm/debug.log for general hints

- /var/adm/vmstat.out: check whether the pi/po columns rise to abnormally high values.
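
Scanning a long vmstat.out by eye is tedious, so a small awk filter can pick out the suspicious samples. This is a sketch under assumptions: high_paging is a made-up name, the threshold is arbitrary, and the pi/po columns are located from the vmstat header line rather than hard-coded, in case the column layout differs on your system.

```shell
#!/bin/sh
# Print vmstat sample lines whose pi or po value exceeds a threshold.
# Usage: high_paging THRESHOLD < /var/adm/vmstat.out
high_paging() {
    awk -v limit="$1" '
        # Non-data lines (banner, blank lines, headers): look for the
        # pi and po column names and remember their positions.
        $1 !~ /^[0-9]+$/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "pi") pi = i
                if ($i == "po") po = i
            }
            next
        }
        # Data lines: report the sample if either value is above the limit.
        pi && po && ($pi + 0 > limit + 0 || $po + 0 > limit + 0) {
            printf "line %d: pi=%s po=%s\n", NR, $pi, $po
        }
    '
}

# Example: high_paging 100 < /var/adm/vmstat.out
```

The reported line numbers refer to lines in the file, so with a 60-second interval the distance between two hits gives a rough idea of how long the paging burst lasted.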

Also consider enabling system activity reports (if you didn't do it already):

Issue (as root):

crontab -e adm

and uncomment (activate) the 5 "SYSTEM ACTIVITY REPORTS" lines:

0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvmd &


Save the crontab.

The "sar" utility will show you what's going on with your system, particularly see

sar -r
(paging)

and

sar -d
(disk activity)


wmp

questil

ASKER

Thanks WMP,

I did everything you suggested, but as for the "system activity reports", there were only 4 lines to uncomment:

#=================================================================
#      SYSTEM ACTIVITY REPORTS
#  8am-5pm activity reports every 20 mins during weekdays.
#  activity reports every an hour on Saturday and Sunday.
#  6pm-7am activity reports every an hour during weekdays.
#  Daily summary prepared at 18:05.
#=================================================================
0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvm &
#=================================================================

I ran sar -r 1 10:
bash-3.2# sar -r 1 10

AIX israix06 1 6 00C587044C00    11/03/11

System configuration: lcpu=4 mem=7936MB  mode=Capped

10:46:20   slots cycle/s fault/s  odio/s
10:46:21 4189766    0.00  399.50    0.00
10:46:22 4189766    0.00    0.50    0.00
10:46:23 4189766    0.00   95.00    0.00
10:46:24 4189766    0.00   94.50    0.00
10:46:25 4189766    0.00  478.02    0.00
10:46:26 4189766    0.00 1030.47    0.00
10:46:27 4189748    0.00 1531.95    0.00
10:46:28 4189748    0.00 2105.50    0.00
10:46:29 4189748    0.00  179.50    0.00
10:46:30 4189748    0.00  114.50    0.00

Average  4189759       0    1211       0

I also ran sar -d, but there is no disk activity right now.

BTW - the machine has 8GB of RAM, and after I installed it I increased the paging space to 16GB.

Thanks,
Tal
OK,

seems I forgot how to count. Of course there are only 4 lines to uncomment.

Your paging rate looks rather normal (please note that fault/s is not a count of page faults that generate I/O, because some page faults can be resolved without I/O).

The paging space size of 16 GB is most probably sufficient.
My suspicion is not a "page space full" condition (this would have been reported in errpt anyway),
but rather high paging I/O rates.
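
As a rough cross-check of the "not full" assumption, the slots column of the sar -r output can be converted to a size. This sketch assumes that slots counts free paging-space pages of 4 KiB each; slots_to_mib is just an illustrative name.

```shell
#!/bin/sh
# Convert a sar -r "slots" value (free paging-space pages) to MiB,
# assuming 4 KiB per slot.
slots_to_mib() {
    echo $(( $1 * 4 / 1024 ))
}

slots_to_mib 4189766   # prints 16366
```

Under that assumption, about 16366 MiB of the 16384 MiB (16 GB) paging space were free at the time of the sample, i.e. practically nothing was paged out when the system was behaving normally.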

I fear we will have to wait for the next hang. The sar data will survive a reboot, so we can easily check afterwards.

wmp
questil

ASKER

Thanks WMP!
I'll keep you posted.
questil

ASKER

Hi,

The system crashed again on Nov 6th. The log files are attached.
 debug.log vmstat.out.log
ASKER CERTIFIED SOLUTION
woolmilkporc
