RHEL 3.0 rebooting when FC to SAN is removed in testing/verification stage.
Posted on 2006-05-15
I'm having a problem in my testing phase: if I pull the fibre cable out of the second HBA, that server reboots after about a minute. Here are the details:
IBM x346 with two QLogic HBAs.
IBM H16 SAN Switch with Zoning
IBM DS4300, with all 4 host ports in use across both controllers; two LUNs have been carved out (100 GB and 1 GB), both assigned to Controller A only.
8 Fiber Cables used between DS4300, SAN Switch & the Servers.
SAN Switch Ports 0, 1, 4, 5 used as Storage Ports, connected to SAN Controller A/Host 1, Controller A/Host 2, Controller B/Host 1, Controller B/Host 2 respectively.
SAN Switch Ports 2, 3, 6, 7 used as Host Ports and connected to Server 1, Server 2, Server 1, Server 2 respectively.
RHEL 3.0 Update 4 (2.4.21-27 kernel)
QLogic 7.05 Driver
RDAC-09.00.A5.13 for 2.4 Kernels
The setup supports Oracle RAC.
During testing, I'm pulling out the cable from SAN Switch Port 7 or from Server 2 HBA2. After about 1 minute 10 seconds, that server reboots. Messages from /var/log/messages are below:
May 15 13:01:03 ussd143 kernel: scsi(2): LOOP DOWN detected.
May 15 13:01:03 ussd143 kernel: scsi(1): RSCN database changed -0x1,0x700.
May 15 13:01:03 ussd143 kernel: scsi(1): Waiting for LIP to complete...
May 15 13:01:03 ussd143 kernel: scsi(1): Topology - (F_Port), Host Loop address 0xffff
...it waits for some time, then reboots after logging the following messages:
May 15 13:02:12 ussd143 kernel: 94 [RAIDarray.mpp]Tayana:0:2:1 Selection Retry count exhausted
May 15 13:02:12 ussd143 kernel: 7 [RAIDarray.mpp]Tayana:0:2 Path Failed
May 15 13:02:12 ussd143 kernel: 495 [RAIDarray.mpp]Tayana:0:2:1 Cmnd failed. Try a new path. vcmnd SN 79758 pdev H2:C0:T0:L1 0x00/0x00/0x00 0x00010000
May 15 13:02:12 ussd143 kernel: 94 [RAIDarray.mpp]Tayana:0:3:1 Selection Retry count exhausted
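To put a number on the delay, here is a small Python sketch that measures the gap between the LOOP DOWN message and the first RDAC path failure in the log above. The year 2006 is taken from the post date because syslog timestamps carry none, and the helper names are only illustrative.

```python
from datetime import datetime

# Two lines copied from the /var/log/messages excerpt above.
LOOP_DOWN = "May 15 13:01:03 ussd143 kernel: scsi(2): LOOP DOWN detected."
PATH_FAILED = "May 15 13:02:12 ussd143 kernel: 7 [RAIDarray.mpp]Tayana:0:2 Path Failed"


def syslog_time(line, year=2006):
    """Parse the 15-character syslog timestamp at the start of a line.

    Syslog omits the year, so it is supplied here (assumed from the post date).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(first, second):
    """Seconds between the timestamps of two syslog lines."""
    return (syslog_time(second) - syslog_time(first)).total_seconds()


print(elapsed_seconds(LOOP_DOWN, PATH_FAILED))  # 69.0
```

That is 69 seconds, which matches the roughly 1 minute 10 seconds I see before the reboot.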
I have 8 zones set up, listed by switch port: 0, 2, 3 : 1, 2, 3 : 4, 6, 7 : 5, 6, 7 : 0, 6, 7 : 1, 6, 7 : 4, 2, 3 : 5, 2, 3.
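As a sanity check on the zoning, the following Python sketch models the port layout and zones described above and lists which storage ports each host port can still reach when one host port goes away. It only models zone membership, not the driver or RDAC behaviour, and the function names are made up for this illustration.

```python
# Switch port roles as described in this post.
STORAGE_PORTS = {
    0: "Controller A/Host 1",
    1: "Controller A/Host 2",
    4: "Controller B/Host 1",
    5: "Controller B/Host 2",
}
HOST_PORTS = {2: "Server 1", 3: "Server 2", 6: "Server 1", 7: "Server 2"}

# The 8 zones, each a set of switch port numbers.
ZONES = [
    {0, 2, 3}, {1, 2, 3}, {4, 6, 7}, {5, 6, 7},
    {0, 6, 7}, {1, 6, 7}, {4, 2, 3}, {5, 2, 3},
]


def reachable_storage(host_port, zones=ZONES):
    """Storage ports that share at least one zone with host_port."""
    reachable = set()
    for zone in zones:
        if host_port in zone:
            reachable |= zone & set(STORAGE_PORTS)
    return reachable


def surviving_paths(server, failed_port):
    """Map each remaining host port of a server to its reachable storage ports."""
    return {
        port: reachable_storage(port)
        for port, owner in HOST_PORTS.items()
        if owner == server and port != failed_port
    }


# Pulling the cable on switch port 7 (Server 2 HBA2):
print(surviving_paths("Server 2", 7))
```

By this model, Server 2 still reaches all four storage ports (0, 1, 4, 5) through switch port 3 after port 7 is unplugged, so on paper the zoning leaves a path to both controllers.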
Both servers behave the same way. Oracle services on the other server keep working. Even after the rebooted server comes back up with just one HBA/FC link connected, it can see the LUNs and Oracle services are up.
Why is this reboot happening? Can someone please help me out with some ideas on this?