assistunix
asked on
Troubleshooting login hang issues on several AIX servers lasting several minutes
Hello,
I need assistance in troubleshooting and finding the root cause of the hung login issues that took place on several servers at the same time for about 5 minutes. All these servers are LPARs.
Please advise how to go about troubleshooting this issue to find its root cause and what could have caused it. The network team says that they do not see any errors in their logs on their end.
What logs and reports can I check?
Thanks
DCB47997 errors? This is going to ring a bell!
Which AIX version do you run? Are we talking about EMC PowerPath or Hitachi HDLM?
The OS can seem to hang due to heavy I/O waits during frequent I/O retries, caused e.g. by a "flapping" SAN port/trunk, often in conjunction with PowerPath or HDLM, but with other SAN drivers as well.
There should have been some improvement with 5.3 TL 7 or 8, but I'm not really sure about this.
What was the frequency of the disk errors? If it was considerably high, I'm fairly sure they could well be the reason for your "login hung" issues.
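To put a number on the frequency, one option is to bucket the error log entries by hour and compare the buckets with the time of the hang. The following is only a rough Python sketch, assuming the text comes from the summary output of `errpt`, where the second column is a timestamp in MMDDhhmmYY form. The function name `errors_per_hour` and the sample lines are made up for illustration and are not taken from your servers.

```python
from collections import Counter
from datetime import datetime

def errors_per_hour(errpt_text, identifier="DCB47997"):
    """Count errpt summary lines for one error identifier, bucketed by hour."""
    counts = Counter()
    for line in errpt_text.splitlines():
        fields = line.split()
        # Summary lines look like: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
        if len(fields) < 2 or fields[0] != identifier:
            continue
        try:
            # errpt prints the timestamp as MMDDhhmmYY
            stamp = datetime.strptime(fields[1], "%m%d%H%M%y")
        except ValueError:
            continue  # skip anything that does not parse as a timestamp
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(counts.items()))

# Hypothetical sample data, for illustration only
sample = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
DCB47997   0310101512 T H hdisk1         DISK OPERATION ERROR
DCB47997   0310100912 T H hdisk1         DISK OPERATION ERROR
DCB47997   0310094512 T H hdisk2         DISK OPERATION ERROR
A6DF45AA   0310093012 I O RMCdaemon      The daemon is started.
"""

print(errors_per_hour(sample))
# {'2012-03-10 09:00': 1, '2012-03-10 10:00': 2}
```

If the hour of the hang shows no more entries than any other hour, the disk errors are less likely to be the trigger.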
How do you use tnsping? Do you specify the service name (TNSNAMES adapter) or the hostname (EZCONNECT adapter) as the target?
If it's the hostname, the issue could still result from a DNS outage, but with TNSNAMES, DNS doesn't get involved (unless the tnsnames.ora entry itself uses a hostname rather than an IP address).
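One way to separate a DNS delay from a plain network delay is to time the name lookup and the TCP connect independently. Below is a minimal Python sketch under that assumption. The helper name `time_lookup_and_connect` is hypothetical, and the demo talks to a throwaway local listener only so that it runs anywhere.

```python
import socket
import time

def time_lookup_and_connect(host, port, timeout=5.0):
    """Return (lookup_seconds, connect_seconds) for one new TCP connection."""
    t0 = time.monotonic()
    # Name resolution step; a literal IP address resolves almost instantly
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    t1 = time.monotonic()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)  # TCP handshake only, no Oracle traffic is sent
    t2 = time.monotonic()
    return t1 - t0, t2 - t1

# Demo against a temporary local listener (stands in for the real server)
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    demo_port = server.getsockname()[1]
    lookup_s, connect_s = time_lookup_and_connect("127.0.0.1", demo_port)
    print(f"lookup {lookup_s:.4f}s, connect {connect_s:.4f}s")
```

Against a real database host you would pass its hostname and the listener port (1521 is the Oracle default). A slow lookup with a fast connect points at name resolution, while a fast lookup with a slow connect points at the network path or the server itself.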
wmp
What I forgot to ask: What do the error logs of your SAN switches say?
ASKER
The disk operation error entries didn't line up with the time in question, so I didn't check the SAN logs.
I did check the VIOs as well, but didn't find anything there either. The LAN team is looking at it again.
ASKER
Thanks, it was a network issue.
ASKER
The established connections were intact and not affected, but for about 5 minutes new connections were delayed, very slow, and hung.
Oracle software has a utility called tnsping, which works much like the ping command, and the tnsping results showed the delay in reaching the listener and connecting.
Response time for connections to the Oracle listener on the Oracle server should be less than 10 milliseconds; however, in this case the response time ranged from 16 seconds to 200 seconds on several servers.
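For anyone who wants to track this over time, the check can be sketched in Python as below. It assumes tnsping reports each successful attempt as a line of the form `OK (N msec)`. The function name `slow_attempts` and the sample values are illustrative only (chosen to mirror the 16 to 200 second range above), not captured output.

```python
import re

# Matches the per-attempt result line, e.g. "OK (10 msec)"
OK_LINE = re.compile(r"OK \((\d+) msec\)")

def slow_attempts(tnsping_output, threshold_ms=10):
    """Return the response times (in ms) that are at or above the threshold."""
    times = [int(ms) for ms in OK_LINE.findall(tnsping_output)]
    return [t for t in times if t >= threshold_ms]

# Hypothetical output from repeated attempts, for illustration only
sample = """\
OK (0 msec)
OK (5 msec)
OK (16240 msec)
OK (200010 msec)
"""

print(slow_attempts(sample))
# [16240, 200010]
```

Attempts that fail outright print an error instead of an `OK` line, so they are not counted here and would need separate handling.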
Network access to the Oracle database servers was very slow.
I have checked the LPARs' NMON reports for the time in question, and the CPU utilization seems fine.
LPARs 1, 2 and 3 are served by one set of VIOs (VIO1 and VIO2), and LPARs 4 and 5 are served by another set (VIO3 and VIO4). But they are all connected to the same HMC.
There is no serviceable event in the HMC for that time.
On VIO1 and VIO2, I got plenty of DCB47997 disk operation errors going back to before this connectivity issue arose, and they are still occurring. That's the only error in recent weeks.
On VIO3 and VIO4, I also got some DCB47997 disk operation errors.
I will check the VIO NMON reports to see if CPU utilization was high at the time of the issue.
I still believe it is a LAN (network) team issue, but they indicate that they do not see any issue on their side.
Could the disk operation errors have caused the issue?