  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 2473

RHEL 3.0 rebooting when FC to SAN is removed in testing/verification stage.

Hi,
I'm having a problem in my testing phase: if I pull the fiber cable out of the second HBA, the server reboots after about a minute. Here are the details:

Hardware:
IBM x346 with two QLogic HBAs.
IBM H16 SAN Switch with zoning.
IBM DS4300, all 4 host ports used on both controllers; two LUNs have been carved out (100 GB and 1 GB), assigned to Controller A only.

8 Fiber Cables used between DS4300, SAN Switch & the Servers.

SAN Switch Ports 0, 1, 4, 5 used as Storage Ports connected to SAN Controller A/Host 1, Controller A/Host2, Controller B/Host1, Controller B/Host2 respectively.

SAN Switch Ports 2, 3, 6, 7 used as Host Ports and connected to Server 1, Server 2, Server 1, Server 2 respectively.

Software:
RHEL 3.0 Update 4 (2.4.21-27 kernel)
QLogic 7.05 Driver
RDAC-09.00.A5.13 for 2.4 Kernels

supporting Oracle RAC.

During testing, I pull the cable out of SAN Switch Port 7 or out of Server 2 HBA2. After about 1 minute 10 seconds, that server reboots. The messages from /var/log/messages are below:

May 15 13:01:03 ussd143 kernel: scsi(2): LOOP DOWN detected.
May 15 13:01:03 ussd143 kernel: scsi(1): RSCN database changed -0x1,0x700.
May 15 13:01:03 ussd143 kernel: scsi(1): Waiting for LIP to complete...
May 15 13:01:03 ussd143 kernel: scsi(1): Topology - (F_Port), Host Loop address 0xffff
 
...it waits for some time, then reboots, logging the following messages:
 
May 15 13:02:12 ussd143 kernel: 94 [RAIDarray.mpp]Tayana:0:2:1 Selection Retry count exhausted
May 15 13:02:12 ussd143 kernel: 7 [RAIDarray.mpp]Tayana:0:2 Path Failed
May 15 13:02:12 ussd143 kernel: 495 [RAIDarray.mpp]Tayana:0:2:1 Cmnd failed. Try a new path. vcmnd SN 79758 pdev H2:C0:T0:L1 0x00/0x00/0x00 0x00010000
May 15 13:02:12 ussd143 kernel: 94 [RAIDarray.mpp]Tayana:0:3:1 Selection Retry count exhausted
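
To put a number on the delay between the two groups of messages above, here is a minimal Python sketch. The helper name `seconds_between` and the placeholder year are assumptions made for illustration (syslog lines carry no year); only the two timestamps come from the log.

```python
from datetime import datetime

# Timestamps copied from the /var/log/messages excerpt above.
LOOP_DOWN = "May 15 13:01:03"    # scsi(2): LOOP DOWN detected
PATH_FAILED = "May 15 13:02:12"  # [RAIDarray.mpp] ... Path Failed

# ASSUMPTION: syslog omits the year, so a placeholder year is prepended
# purely to make the timestamps parseable. It does not affect the difference.
PLACEHOLDER_YEAR = "2000"

def seconds_between(start: str, end: str) -> int:
    """Return whole seconds between two syslog-style timestamps."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{PLACEHOLDER_YEAR} {start}", fmt)
    t1 = datetime.strptime(f"{PLACEHOLDER_YEAR} {end}", fmt)
    return int((t1 - t0).total_seconds())

elapsed = seconds_between(LOOP_DOWN, PATH_FAILED)
print(elapsed)  # 69
```

So roughly 69 seconds pass between the loop going down and RDAC declaring the path failed, which matches the "about 1 minute 10 seconds" seen before the reboot.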

I have 8 zones set up, listed here by switch port: {0, 2, 3}, {1, 2, 3}, {4, 6, 7}, {5, 6, 7}, {0, 6, 7}, {1, 6, 7}, {4, 2, 3}, {5, 2, 3}.
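
For anyone trying to picture this layout, the following is a small Python sketch that maps each zone back to the port roles given earlier in the post and counts how many host (initiator) ports each zone contains. The variable and function names are made up for illustration; the port numbers and zone membership are taken from the description above.

```python
# Switch port roles as described in the post:
# ports 0, 1, 4, 5 are storage ports; ports 2, 3, 6, 7 are host ports.
PORT_ROLE = {
    0: "Controller A/Host 1", 1: "Controller A/Host 2",
    4: "Controller B/Host 1", 5: "Controller B/Host 2",
    2: "Server 1", 3: "Server 2", 6: "Server 1", 7: "Server 2",
}
HOST_PORTS = {2, 3, 6, 7}

# The 8 zones exactly as listed in the post.
ZONES = [
    (0, 2, 3), (1, 2, 3), (4, 6, 7), (5, 6, 7),
    (0, 6, 7), (1, 6, 7), (4, 2, 3), (5, 2, 3),
]

def initiators(zone):
    """Return the host (initiator) ports that are members of a zone."""
    return sorted(p for p in zone if p in HOST_PORTS)

# Zones that contain more than one initiator port.
multi_initiator = [z for z in ZONES if len(initiators(z)) > 1]

for zone in ZONES:
    members = ", ".join(f"{p}={PORT_ROLE[p]}" for p in zone)
    print(f"zone {zone}: {members} -> {len(initiators(zone))} initiators")

print(len(multi_initiator))  # 8
```

Every zone pairs one storage port with two host ports, one from each server.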

Both servers show the same behaviour. Oracle services on the other server keep working. Even after the rebooted server comes back with just one HBA/FC connected, it can see the LUNs and Oracle services are up.

Why is this reboot happening? Can someone please help me out with some ideas on this?

Thanks,
anand_2000v
 
anand_2000v (Author) commented:
An update:
We have also observed that the machines reboot even when the controller is disconnected directly from the SAN.
 
David_Fong commented:
I'm not an expert on this, but are you using the failover driver for the QLogic card? I ask because the non-failover version is also version 7.05.
http://support.qlogic.com/support/oem_detail_all.asp?oemid=304
 
anand_2000v (Author) commented:
We are using the non-failover version.
 
anand_2000v (Author) commented:
We are using RDAC for failover, but I don't know where the problem is occurring.
FYI, here is the /etc/mpp.conf file:

VirtualDiskProductId=VirtualDisk
DebugLevel=0x0
NotReadyWaitTime=270
BusyWaitTime=270
QuiescenceWaitTime=270
InquiryWaitTime=60
MaxLunsPerArray=256
MaxPathsPerController=4
ScanInterval=60
InquiryInterval=1
MaxArrayModules=30
ErrorLevel=3
SelectionTimeoutRetryCount=0
UaRetryCount=10
RetryCount=10
SynchTimeout=60
FailOverQuiescenceTime=20
FailoverTimeout=120
FailBackToCurrentAllowed=1
DoUaRetry=1
ControllerIoWaitTime=300
ArrayIoWaitTime=600
DisableLUNRebalance=0
ArrayFailoverWaitTime=300
S2ToS3Key=7074f70602eafc1b
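
To pull the timing-related settings out of a file like this for review, here is a minimal Python sketch. It assumes the file is plain `key=value` lines (as in the listing above) and that lines starting with `#` can be skipped; the function name `parse_conf` and the name-suffix filter are illustrative choices, not part of RDAC. The excerpt below copies only some of the values from the listing.

```python
# Excerpt of the /etc/mpp.conf values listed above.
MPP_CONF_EXCERPT = """\
NotReadyWaitTime=270
BusyWaitTime=270
QuiescenceWaitTime=270
InquiryWaitTime=60
SelectionTimeoutRetryCount=0
SynchTimeout=60
FailOverQuiescenceTime=20
FailoverTimeout=120
ControllerIoWaitTime=300
ArrayIoWaitTime=600
ArrayFailoverWaitTime=300
"""

def parse_conf(text):
    """Parse simple key=value lines into a dict of strings.

    ASSUMPTION: blank lines and lines starting with '#' are ignored.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

settings = parse_conf(MPP_CONF_EXCERPT)

# Keep only settings whose names suggest a wait or timeout value.
TIMING_SUFFIXES = ("WaitTime", "Timeout", "QuiescenceTime")
timing = {k: int(v) for k, v in settings.items() if k.endswith(TIMING_SUFFIXES)}

for name, value in sorted(timing.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {value}")
```

The units of each setting are not stated in this thread, so the sketch only lists the raw numbers.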
 
durindil commented:
One thing to keep in mind when zoning: to keep events segregated, you want to use single-initiator zones, that is, one host per zone. Right now, you have both HBAs from each host in a single zone, which causes state change notifications and interruptions from one HBA to be sent to the other as well. Instead of zoning 0,2,3, you should use 2,0,1,4,5 in one zone, or 2,0 - 2,1 - 2,4 - 2,5 as single-pair zones.
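
As a rough illustration of the two layouts suggested above, here is a Python sketch that generates them for every host port. Extending the port 2 example to ports 3, 6, and 7, and letting each host port see all four storage ports, are assumptions; the function names are made up for this sketch.

```python
# Port numbers from the original post.
STORAGE_PORTS = [0, 1, 4, 5]  # DS4300 controller host ports
HOST_PORTS = [2, 3, 6, 7]     # server HBA ports (initiators)

def single_initiator_zones(hosts, targets):
    """One zone per initiator, containing that initiator and all targets."""
    return [(h, *targets) for h in hosts]

def pair_zones(hosts, targets):
    """One zone per initiator/target pair."""
    return [(h, t) for h in hosts for t in targets]

by_initiator = single_initiator_zones(HOST_PORTS, STORAGE_PORTS)
by_pair = pair_zones(HOST_PORTS, STORAGE_PORTS)

print(by_initiator[0])  # (2, 0, 1, 4, 5)
print(by_pair[:4])      # [(2, 0), (2, 1), (2, 4), (2, 5)]
print(len(by_initiator), len(by_pair))  # 4 16
```

Either way, no zone contains more than one HBA port, so a state change on one HBA is not delivered to another.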
 
anand_2000v (Author) commented:
OK, I found the problem.
Oracle RAC is also installed, and it has a setting called CSS misscount. Because this was set lower than the time RDAC needs to switch over to the new controller, the machines were rebooting.
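
The reasoning can be sketched in a few lines of Python. The 69-second figure is the gap between the LOOP DOWN and Path Failed timestamps in the log excerpt in the question; the misscount values are assumptions chosen only for illustration, since the actual setting is not given in this thread, and the function name is hypothetical.

```python
def rebooted_before_failover(css_misscount_s: float, failover_s: float) -> bool:
    """Return True if the CSS timeout expires before path failover completes."""
    return css_misscount_s < failover_s

# From the log in the question: 13:01:03 (LOOP DOWN) to 13:02:12 (Path Failed).
OBSERVED_FAILOVER_S = 69

# ASSUMPTION: illustrative misscount values, not taken from this system.
ASSUMED_LOW_MISSCOUNT_S = 60
ASSUMED_HIGH_MISSCOUNT_S = 120

print(rebooted_before_failover(ASSUMED_LOW_MISSCOUNT_S, OBSERVED_FAILOVER_S))   # True
print(rebooted_before_failover(ASSUMED_HIGH_MISSCOUNT_S, OBSERVED_FAILOVER_S))  # False
```

In other words, the cluster timeout has to be longer than the worst-case time the storage stack needs to fail over, or the node is restarted while I/O is still stalled.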
 
anand_2000v (Author) commented:
Sorry about the Admin comment. Just a browser problem.