• Status: Solved

MS Cluster does not fail over

I inherited an MS Cluster: 2 nodes, SCSI quorum and shared drives. The cluster runs one SQL instance.

Symptoms: I cannot move any groups. The virtual server does not fail over when node A is lost. Everything "appears" to run correctly on node A. Cluster log errors are numerous, but the main thrust "seems" to be that drives are "not ready", "Wait operation timed out", "overlapped I/O operation", etc., plus some discomfort with "TCPIP not bound to ..." on rare occasions. Lots of error 1117 and many status/error 21, plus several status 1169, error 258, status 997, with a little dab of status 2 and 5 for good measure.
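One way to get a handle on a noisy cluster log is to tally the codes and attach their standard Win32 names. The numbers quoted above correspond to well-known Win32 codes (21 is "device not ready", 258 is "wait operation timed out", 997 is "overlapped I/O in progress", 1117 is an I/O device error). The sketch below is only an illustration: the regular expression assumes the codes appear in the text as "status N" or "error N", and the sample lines are made up, not taken from a real cluster log.

```python
import re
from collections import Counter

# Standard Win32 names for the codes mentioned in the post.
WIN32_CODES = {
    2: "ERROR_FILE_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    21: "ERROR_NOT_READY",
    258: "WAIT_TIMEOUT",
    997: "ERROR_IO_PENDING",
    1117: "ERROR_IO_DEVICE",
    1169: "ERROR_NO_MATCH",
}

# Assumption: codes show up as "status 21" or "error 1117" in the log text.
CODE_PATTERN = re.compile(r"\b(?:status|error)\s+(\d+)\b", re.IGNORECASE)


def tally_codes(lines):
    """Count every status/error code found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in CODE_PATTERN.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


def describe(counts):
    """Return (code, name, count) tuples, most frequent first."""
    return [
        (code, WIN32_CODES.get(code, "unknown"), n)
        for code, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "Physical Disk: reservation failed, status 21",
        "Physical Disk: read failed, error 1117.",
        "Physical Disk: status 21, error 258",
    ]
    for code, name, n in describe(tally_codes(sample)):
        print(f"{code:>5}  {name:<22} x{n}")
```

Sorting by frequency makes it easier to see whether the disk-related codes (21, 1117) dominate or whether something else is drowning them out.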

If you look at Device Manager under both Disk Drives and SCSI/RAID controllers, you get this curious result:

Node A
  Disk Drives:            3 entries
  SCSI/RAID controllers:
    SMART 532, slot 2  -> 0 Hard Drives (HD), 0 Logical Drives (LD)
    SMART 5i,  slot    -> 7 HD, 3 LD

Node B
  Disk Drives:            5 entries
  SCSI/RAID controllers:
    SMART 532, slot 2  -> 5 HD, 2 LD
    SMART 5i,  slot 0  -> 7 HD, 3 LD
       
I know you can't "see" the drives from node B until it fails over, but I thought I read somewhere that the "numbers" in Device Manager should match. Also note that the node that works is Node A. (Looks backwards to me, but I have checked it 5 times now.)
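To make the mismatch explicit, the counts from the Device Manager listing above can be compared side by side. This is just a sketch: the dictionary keys are labels invented for the illustration, and the numbers are transcribed from the listing.

```python
# Counts transcribed from the Device Manager listing in the post.
# The key names are made up for this illustration.
NODE_A = {
    "Disk Drives entries": 3,
    "SMART 532 hard drives": 0,
    "SMART 532 logical drives": 0,
    "SMART 5i hard drives": 7,
    "SMART 5i logical drives": 3,
}

NODE_B = {
    "Disk Drives entries": 5,
    "SMART 532 hard drives": 5,
    "SMART 532 logical drives": 2,
    "SMART 5i hard drives": 7,
    "SMART 5i logical drives": 3,
}


def mismatches(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    for key, (on_a, on_b) in mismatches(NODE_A, NODE_B).items():
        print(f"{key}: node A = {on_a}, node B = {on_b}")
```

The SMART 5i counts agree on both nodes; every difference involves the SMART 532 controller and the disk entries behind it.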

Question: How do I fix it (of course) or, lacking an actual fix, are there any well-known test procedures or tricky command-line magic that will help troubleshoot this beast? (I have read until I am blue. There are a lot of articles that tell you how to build it and a few that tell you how to "verify that it works correctly", but I haven't seen any that tell you what to do if it doesn't work correctly!)

I have tried "many" things from TechNet, the web, etc., and all of them tell me what I already know: it doesn't work. (This is a 24/7 "enterprise" application, so I would really prefer to fix it, not rebuild it, if at all possible.)

Regards,
Mike

Asked by: MUnderhood
1 Solution
 
tmack Commented:
Have you checked your cable for the "heartbeat"? I have found those to go bad for whatever reason and cause all kinds of havoc.
 
MUnderhood (Author) Commented:
I haven't checked it yet, but it is on my list to look at. I don't "see" anything consistent that points to the HB cable; however, I have seen a few errors that made me suspicious. The HB was misconfigured (according to the install notes from MS) at one time. I think I have cleaned all that up. It did not seem to change anything.

I will be touching it soon and I will let you know. Thanks for the response.

Regards,
Mike
 
 
MUnderhood (Author) Commented:
tmack, et al.

I have checked the HB cable and, as far as I can tell, all appears to be in place and well. I haven't seen any more errors related to the NICs (either App or HB).

Further investigation into the logs revealed that it does appear to be trying to fail over (move group). I suspect that it was never installed correctly and probably never worked correctly.

Thanks for the assistance.
Mike
 
 
MUnderhood (Author) Commented:
General Comment:

I was finally able to fire up the Array Configuration Utility (which gives a much better picture of the actual array configuration) and the drives/volumes look fine. (Although I was only able to get the utility to run on the node that doesn't work! The "good node" complained that the browser was unsupported!) Apparently this cluster was built by Roseanne Roseannadanna.

Regards,
mike
 
MUnderhood (Author) Commented:
Final Comment: The problem was eventually traced to the heartbeat or possibly the binding order. I cannot be sure which, since I changed two things at once. (That is why you shouldn't do that.) At any rate, the heartbeat NIC does need to be set as per the MS spec at 10 Mbps half duplex, and the binding order needs to be per their instructions. Apparently latency is a potential problem.
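For anyone who wants to sanity-check the same two settings, here is a minimal sketch of the comparison. Everything in it is hypothetical: the `NicSettings` type, the connection names, and the way the values are supplied (you would read them off each node by hand). The expected binding order is passed in as a parameter; the public-before-heartbeat order in the usage example is my reading of Microsoft's guidance and should be checked against their instructions.

```python
from dataclasses import dataclass


@dataclass
class NicSettings:
    """Hypothetical record of one adapter's settings, filled in by hand."""
    name: str
    speed_mbps: int
    duplex: str  # "half" or "full"


def heartbeat_problems(heartbeat, binding_order, expected_order):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The post says the heartbeat should be fixed at 10 Mbps half duplex.
    if heartbeat.speed_mbps != 10 or heartbeat.duplex.lower() != "half":
        problems.append(
            f"{heartbeat.name}: expected 10 Mbps half duplex, "
            f"found {heartbeat.speed_mbps} Mbps {heartbeat.duplex} duplex"
        )
    # Compare the observed binding order with the expected one.
    if list(binding_order) != list(expected_order):
        problems.append(
            f"binding order is {list(binding_order)}, "
            f"expected {list(expected_order)}"
        )
    return problems


if __name__ == "__main__":
    # Made-up values for illustration; the expected order is an assumption.
    hb = NicSettings(name="Heartbeat", speed_mbps=100, duplex="full")
    observed = ["Heartbeat", "Public"]
    expected = ["Public", "Heartbeat"]
    for problem in heartbeat_problems(hb, observed, expected):
        print(problem)
```

Running the check on each node separately, after changing one setting at a time, avoids the "changed two things at once" ambiguity.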

Regards,
Mike
