MS Cluster does not fail over
Posted on 2005-03-19
I inherited a two-node MS Cluster with a SCSI quorum drive and shared drives. The cluster runs one SQL instance.
Symptoms: I cannot move any groups. The virtual server does not fail over when node A is lost. Everything "appears" to run correctly on node A. Cluster log errors are numerous, but the main thrust "seems" to be that drives are "not ready", "Wait operation timed out", "overlapped I/O operation", etc., plus some discomfort with "TCPIP not bound to ..." on rare occasions. Lots of error 1117 and many status/error 21, plus several status 1169, error 258, status 997, with a little dab of status 2 and 5 for good measure.
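To get a feel for which of these codes dominates, a small script can tally them and attach the standard Win32 names. This is only a rough sketch: the code table holds the standard winerror.h meanings for the numbers listed above, but the regular expression assumes the codes show up in the log as "status N" or "error N", and `tally_codes`, `describe`, and `report` are names made up for this example.

```python
import re
from collections import Counter

# Standard Win32 error codes (winerror.h) for the numbers seen in the log.
WIN32_ERRORS = {
    2: ("ERROR_FILE_NOT_FOUND", "The system cannot find the file specified."),
    5: ("ERROR_ACCESS_DENIED", "Access is denied."),
    21: ("ERROR_NOT_READY", "The device is not ready."),
    258: ("WAIT_TIMEOUT", "The wait operation timed out."),
    997: ("ERROR_IO_PENDING", "Overlapped I/O operation is in progress."),
    1117: ("ERROR_IO_DEVICE",
           "The request could not be performed because of an I/O device error."),
    1169: ("ERROR_NO_MATCH",
           "There was no match for the specified key in the index."),
}

# Assumption: codes appear in the log text as "status 21" or "error 1117".
CODE_RE = re.compile(r"\b(?:status|error)\s+(\d+)\b", re.IGNORECASE)


def tally_codes(lines):
    """Count every status/error code found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in CODE_RE.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


def describe(code):
    """Return 'code NAME: message', or an UNKNOWN marker for unlisted codes."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in the table above"))
    return f"{code} {name}: {text}"


def report(lines):
    """Return one summary string per code, most frequent first."""
    return [f"{n:6d}  {describe(code)}"
            for code, n in tally_codes(lines).most_common()]
```

To use it, open the cluster log (normally cluster.log under the Cluster folder in the Windows directory; adjust the path for your system) with `open(path, errors="replace")`, pass the file object to `report`, and print the resulting lines. Codes 21 and 1117 at the top of the list would point at the shared storage path rather than at the cluster service itself.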
If you look at Device Manager under both Drives and Controllers you get this curious result:
Node A -> Disk Drives: 3 entries
Node A -> SCSI/RAID controllers:
    SMART 532, slot 2 -> 0 Hard Drives (HD), 0 Logical Drives (LD)
    SMART 5i, slot -> 7 HD, 3 LD
Node B -> Disk Drives: 5 entries
Node B -> SCSI/RAID controllers:
    SMART 532, slot 2 -> 5 HD
    SMART 5i, slot 0 -> 7 HD
I know you can't "see" the drives from node B until it fails over, but I thought I read somewhere that the "numbers" in Device Manager should match. Also note that the node that works is node A. (It looks backwards to me, but I have checked it 5 times now.)
Question: How do I fix it (of course) or, lacking an actual fix, are there any well-known test procedures or tricky command-line magic that will help troubleshoot this beast? (I have read until I am blue in the face. There are a lot of articles that tell you how to build it and a few that tell you how to "verify that it works correctly", but I haven't seen any that tell you what to do if it doesn't work correctly!)
I have tried "many" things from TechNet, the web, etc., and all of them tell me what I already know: it doesn't work. (This is a 24/7 "enterprise" application, so I would really prefer to fix it, not rebuild it, if at all possible.)