Troubleshooting hung logins on several AIX servers for several minutes

assistunix
Hello,

I need assistance in troubleshooting and finding the root cause of hung logins that occurred on several servers at the same time for about 5 minutes. All of these servers are LPARs.

Please advise on how to troubleshoot this issue and find its root cause. The network team says that they do not see any errors in their logs.
Which logs and reports can I check?

Thanks
Most Valuable Expert 2013
Top Expert 2013
Commented:
Well,

one could think of a DNS outage (login tries to write out "Last login ... ... from hostname" and has to do a DNS lookup if the IP address is not present in /etc/hosts).
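One way to test the DNS theory is to time a reverse lookup for the address a slow login came from. The following is only a minimal Python sketch, assuming Python 3 is available on the LPAR; the helper names `timed_lookup` and `reverse_lookup_time` are made up for illustration.

```python
import socket
import time


def timed_lookup(resolver, target):
    """Call resolver(target) and return (seconds_elapsed, result_or_error)."""
    start = time.monotonic()
    try:
        result = resolver(target)
    except OSError as exc:  # socket.herror and socket.gaierror derive from OSError
        result = exc
    return time.monotonic() - start, result


def reverse_lookup_time(ip_address):
    """Time the reverse (IP to hostname) lookup that a login may trigger."""
    return timed_lookup(socket.gethostbyaddr, ip_address)
```

Calling `reverse_lookup_time()` with a client IP (not one listed in /etc/hosts) during normal operation and again during a slowdown would show whether name resolution itself takes seconds rather than milliseconds.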

Or maybe there's been an overload or other issue on the complete managed system at that very time?
Do you have "serviceable events" at your HMC dating back to the time of the issue?

If all the concerned LPARs get their network connections over VIO - maybe there's been something that kept the VIOS from servicing the clients in time? What do you get with "errlog | more" on the VIO server(s)?

wmp

Author

Commented:
The user logins in question are connections from the end-user side to the Oracle server.
Established connections stayed intact and were not affected, but for about 5 minutes new connections were delayed, very slow, or hung.

Oracle has a utility called tnsping, which works much like the ping command, and the tnsping results showed the delay in connecting.

The response time of the Oracle listener for new connections should be less than 10 milliseconds; in this case it ranged from 16 seconds to 200 seconds on several servers.
Network access to the Oracle database servers was very slow.

I have checked the LPARs' nmon reports for the time in question, and the CPU utilization seems fine.
LPARs 1, 2 and 3 are served by one pair of VIO servers (VIO1 and VIO2), and LPARs 4 and 5 by another pair (VIO3 and VIO4), but all of them are managed by the same HMC.
There is no serviceable event in the HMC for that time.

On VIO1 and VIO2, I found plenty of DCB47997 disk operation errors. They started before this connectivity issue arose and are still occurring. That's the only error in recent weeks.

On VIO3 and VIO4, I also found some DCB47997 disk operation errors.

I will check the VIO nmon reports to see if CPU utilization was high at the time of the issue.
I still believe it is a LAN (network) team issue, but they say that they do not see any issue on their side.

Could the disk operation error have caused the issue?
Most Valuable Expert 2013
Top Expert 2013

Commented:
DCB47997 errors? That rings a bell!

Which AIX version do you run? Are we talking about EMC Powerpath or Hitachi HDLM?

The OS can seem to hang due to heavy I/O waits during frequent I/O retries, caused e.g. by a "flapping" SAN port/trunk, often in conjunction with PowerPath or HDLM, but with other SAN drivers as well.

There should have been kind of an improvement with 5.3 TL 7 or 8, but I'm not really sure about this.

What was the frequency of the disk errors? If it was considerably high I'm rather sure that they could well be the reason for your "login hung" issues.
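To put a number on that frequency, the error log summary can be grouped by hour. This is a rough Python sketch, assuming the default `errpt` summary layout in which the second column is a timestamp in MMDDhhmmYY form; the sample lines are made up for illustration and are not from the systems discussed here.

```python
from collections import Counter


def errors_per_hour(errpt_lines, identifier="DCB47997"):
    """Count errpt summary entries for one identifier, grouped by hour.

    Assumes the default errpt summary layout, where the second column
    is a timestamp in MMDDhhmmYY form.
    """
    counts = Counter()
    for line in errpt_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != identifier:
            continue  # skips the header line and other identifiers
        stamp = fields[1]
        if len(stamp) != 10 or not stamp.isdigit():
            continue  # not a timestamp in the expected form
        month, day, hour, year = stamp[0:2], stamp[2:4], stamp[4:6], stamp[8:10]
        counts[f"20{year}-{month}-{day} {hour}:00"] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    "IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION",
    "DCB47997   0612143213 T H hdisk5         DISK OPERATION ERROR",
    "DCB47997   0612140513 T H hdisk5         DISK OPERATION ERROR",
    "DCB47997   0612095913 T H hdisk2         DISK OPERATION ERROR",
    "A6DF45AA   0611230113 I O RMCdaemon      The daemon is started.",
]
for hour, count in sorted(errors_per_hour(sample).items()):
    print(hour, count)
```

A burst of entries in the hour of the incident would support the disk theory; a steady trickle that does not line up with the incident time would argue against it.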

How do you use tnsping? Do you specify the servicename (TNSNAMES adapter) or the hostname (EZCONNECT adapter) as the target?

If it's the hostname, the issue could still result from a DNS outage, but with TNSNAMES DNS doesn't get involved, as long as the tnsnames.ora entry lists an IP address rather than a hostname.
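To separate name resolution from the listener connection itself, one could time a plain TCP connect to the listener, once by IP address and once by hostname. This is a minimal Python sketch and not a replacement for tnsping; the function name `connect_time` is made up, and the listener port (Oracle's default is 1521) is an assumption to be checked against your listener configuration.

```python
import socket
import time


def connect_time(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the host (if it is a name) and connects.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

If connecting by IP address is fast while connecting by hostname is slow, name resolution is the likely culprit; if both are slow, the delay lies in the network path or on the server side.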

wmp

Most Valuable Expert 2013
Top Expert 2013

Commented:
What I forgot to ask: What do the error logs of your SAN switches say?

Author

Commented:
The timestamps of the disk operation errors didn't match the time in question, so I didn't check the SAN logs.
I did check the VIO servers as well, but didn't find anything there either. The LAN team is looking into it again.

Author

Commented:
Thanks. It was a network issue.
