MS Cluster does not fail over

Posted on 2005-03-19
Medium Priority
Last Modified: 2013-11-15
I inherited a two-node MS Cluster with a SCSI quorum and shared drives. The cluster runs one SQL instance.

Symptoms: I cannot move any groups, and the virtual server does not fail over when node A is lost. Everything "appears" to run correctly on node A. Cluster log errors are numerous, but the main thrust "seems" to be that drives are "not ready", "Wait operation timed out", "overlapped I/O operation", etc., plus some discomfort with "TCPIP not bound to ..." on rare occasions. Lots of error 1117 and many status/error 21, plus several status 1169, error 258, status 997, with a little dab of status 2 and 5 for good measure.
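With this many codes in play, it may help to tally them and put names to them. The sketch below maps the codes listed above to their standard Win32 names and counts how often each appears. The sample log lines and the "status N" / "error N" wording the regex matches are assumptions for illustration, not the real cluster.log format; adjust the pattern to whatever your log actually prints.

```python
import re
from collections import Counter

# Standard Win32 names for the codes mentioned in the post.
WIN32_CODES = {
    2: "ERROR_FILE_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    21: "ERROR_NOT_READY",       # "The device is not ready."
    258: "WAIT_TIMEOUT",         # "The wait operation timed out."
    997: "ERROR_IO_PENDING",     # "Overlapped I/O operation is in progress."
    1117: "ERROR_IO_DEVICE",     # I/O device error
    1169: "ERROR_NO_MATCH",
}

# Assumed wording: the word "status" or "error" followed by a number.
CODE_RE = re.compile(r"\b(?:status|error)\s+(\d+)\b", re.IGNORECASE)


def tally_codes(lines):
    """Count every status/error code found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in CODE_RE.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


def summarize(counts):
    """Return one readable line per code, most frequent first."""
    return [
        f"{code:>5}  {WIN32_CODES.get(code, 'unknown')}  x{n}"
        for code, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Hypothetical lines, made up for this example.
    sample = [
        "Physical Disk: IsAlive failed, error 1117",
        "Physical Disk: device not ready, status 21",
        "Physical Disk: reserve failed, status 21",
    ]
    for row in summarize(tally_codes(sample)):
        print(row)
```

To run it against a real log, pass an open file object to `tally_codes` instead of the sample list. Codes 21 and 1117 dominating would point at the storage path rather than the network.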

If you look at Device Manager under both Drives and Controllers you get this curious result:

Node A
  Disk Drives: 3 entries
  SCSI/RAID controllers:
    SMART 532, slot 2 -> 0 Hard Drives (HD), 0 Logical Drives (LD)
    SMART 5i,  slot   -> 7 HD, 3 LD

Node B
  Disk Drives: 5 entries
  SCSI/RAID controllers:
    SMART 532, slot 2 -> 5 HD, 2 LD
    SMART 5i,  slot 0 -> 7 HD, 3 LD

I know you can't "see" the drives from node B until it fails over, but I thought I read somewhere that the "numbers" in Device Manager should match. Also note that the node that works is Node A. (Looks backwards to me, but I have checked it 5 times now.)
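To make the mismatch explicit, here is a minimal sketch that compares the two nodes' Device Manager counts as reported above. The dictionary layout and key names are made up for this example; the numbers are the ones from the post.

```python
# Counts copied from the Device Manager listing above.
# Controller values are (hard drives, logical drives).
INVENTORY = {
    "Node A": {"disk_drives": 3, "SMART 532": (0, 0), "SMART 5i": (7, 3)},
    "Node B": {"disk_drives": 5, "SMART 532": (5, 2), "SMART 5i": (7, 3)},
}


def mismatches(a, b):
    """Return the entries whose values differ between two nodes."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


def logical_drive_total(node):
    """Sum the logical drives across a node's controllers."""
    return sum(value[1] for value in node.values() if isinstance(value, tuple))


if __name__ == "__main__":
    diff = mismatches(INVENTORY["Node A"], INVENTORY["Node B"])
    for key, (a_val, b_val) in diff.items():
        print(f"{key}: Node A={a_val}, Node B={b_val}")
    for name, node in INVENTORY.items():
        print(name, "logical drives:", logical_drive_total(node),
              "disk entries:", node["disk_drives"])
```

On these numbers only the SMART 532 and the Disk Drives count differ, and each node's Disk Drives count equals its total of logical drives (3 on Node A, 5 on Node B). That would be consistent with Node A simply not seeing anything behind the SMART 532, though the listing alone does not prove it.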

Question: How do I fix it (of course)? Or, lacking an actual fix, are there any well-known test procedures or tricky command-line magic that will help troubleshoot this beast? (I have read until I am blue. There are a lot of articles that tell you how to build it and a few that tell you how to "verify that it works correctly", but I haven't seen any that tell you what to do if it doesn't work correctly!)

I have tried "many" things from TechNet, the web, etc., and all of them tell me what I already know: it doesn't work. (This is a 24/7 "enterprise" application, so I would really prefer to fix it, not rebuild it, if at all possible.)


Question by: MUnderhood

Accepted Solution

tmack earned 1500 total points
ID: 13597335
Have you checked your cable for the "heartbeat"? I have found those to go bad for whatever reason and cause all kinds of havoc.

Author Comment

ID: 13597405
I haven't checked it yet, but it is on my list to look at. I don't "see" anything consistent that points to the HB cable; however, I have seen a few errors that made me suspicious. The HB was misconfigured (according to the install notes from MS) at one time. I think I have cleaned all that up. It did not seem to change anything.

I will be touching it soon and I will let you know. Thanks for the response.


Author Comment

ID: 13637617
tmack, et al

I have checked the HB cable and, as far as I can tell, all appears to be in place and well. I haven't seen any more errors related to the NIC cards (either App or HB).

Further investigation into the logs revealed that it does appear to be trying to fail over (move group). I suspect that it was never installed correctly and probably never worked correctly.

Thanks for the assistance.

Author Comment

ID: 13637641
General Comment:

I was finally able to fire up the Array Configuration Utility (which gives a much better picture of the actual array configuration) and the drives/volumes look fine. (Although I was only able to get the utility to run on the node that doesn't work! The "good node" complained that the browser was unsupported!) Apparently this cluster was built by Roseanne Roseannadanna.


Author Comment

ID: 13793703
Final Comment: The problem was eventually traced to the heartbeat or possibly the binding order. I cannot be sure, since I changed two things at once. (That is why you shouldn't do that.) At any rate, the heartbeat NIC does need to be set per the MS spec to 10 Mbps, half duplex, and the binding order needs to follow their instructions. Apparently latency is a potential problem.


Question has a verified solution.
