nmxsupport

asked on

2 Node Windows 2012 Cluster issues with Quorum

Hello,

We have a 2-node cluster (N1 & N2) running Hyper-V VMs, with CSVs, using the "Node and Disk Majority" quorum configuration.

In order to perform maintenance on N1 we live migrated all VMs across to N2, and everything kept working fine. However, shutting down N1 then brought down the cluster on N2: no VMs were running and the cluster was unreachable. The only way to get things working again was to quickly restart N1.

After connecting to the cluster the following error message was observed on N2,

"The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk."

The owner of the CSV disk resources and the quorum disk was N1, which could explain the issue.

Questions from here:

1) If a shutdown is initiated on N1, why does it only migrate the VMs automatically and not the cluster resources that are required? Reading further, I think we should "pause" the node so that it enters maintenance mode.
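For reference, my understanding is that on 2012 pausing a node can drain its roles automatically; a sketch using our node names (not something I have run yet):

    # Pause N1 and drain its clustered roles (live migrates the VMs off the node)
    Suspend-ClusterNode -Name N1 -Drain

    # After maintenance, resume the node and fail the roles back
    Resume-ClusterNode -Name N1 -Failback Immediate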

However, the bigger (and more worrying) question is the following:

2) In a 2-node cluster, if the node that owns the disk witness fails, then the entire cluster will fail, providing no resilience whatsoever. Is this correct?

If this is true, my 2-node cluster only has a 50/50 chance of staying up after the failure of either node!

Thanks
Mahesh

A cluster needs more than half of the votes in order to remain alive.

In a two-node cluster with a single quorum disk, each node and the quorum disk contribute one vote each, so three votes in total. So if you lose the quorum disk, your cluster will still remain alive because it has two votes from the two cluster nodes. Likewise, if a single node goes down, the cluster still has one vote from the quorum disk and one vote from the other node.
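You can see the votes with PowerShell; a quick sketch, run on either node:

    Import-Module FailoverClusters

    # Each node's vote: NodeWeight = 1 means the node gets a vote
    Get-ClusterNode | Format-Table Name, State, NodeWeight

    # Current quorum model and the witness resource, if any
    Get-ClusterQuorum | Format-List QuorumType, QuorumResource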

Now you need to check the properties of all cluster resources and confirm that both servers are selected as possible owners; otherwise, once the server hosting a resource fails, the resource will not come up on the other node. Also check the dependencies of the resources and ensure that the quorum disk does not have any dependencies. A sketch of both checks follows below.
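The checks from PowerShell (the witness disk name is an example; use your actual resource name):

    # List the possible owners of every cluster resource
    Get-ClusterResource | ForEach-Object {
        "{0}: {1}" -f $_.Name, (($_ | Get-ClusterOwnerNode).OwnerNodes -join ", ")
    }

    # The witness disk should have an empty dependency expression
    Get-ClusterResourceDependency -Resource "Cluster Disk Witness"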

During cluster node maintenance, when you live migrate the VMs from one node to another, also move the quorum disk resource manually to the other server (and ensure in advance that the other server is a possible owner of all resources, to avoid any unfortunate problems).
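A sketch of the manual move, assuming the default group name (the witness disk lives in the core "Cluster Group"):

    # Move the core cluster group, which carries the witness disk, to N2
    Move-ClusterGroup -Name "Cluster Group" -Node N2

    # Confirm the witness followed the move
    Get-ClusterGroup -Name "Cluster Group" | Format-Table Name, OwnerNode, State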

Lastly, I hope you have two network cards per server: one for the heartbeat and one for cluster traffic. If you have only one NIC per cluster node, the chances of failure are much higher when you reboot any cluster node.

Mahesh.
nmxsupport

ASKER

Thank you Mahesh.

I checked and the Quorum disk has no dependencies and the Advanced Policies tab shows both N1 and N2 as possible owners.

In the Policies tab I have the following set (the same values can be read from PowerShell; see the sketch after this list):
* If resource fails, attempt restart on current node
Period for restarts = 15:00
Maximum restarts in the specified period = 1
Delay between restarts = 0.5
* If restart is unsuccessful, fail over all resources in this role
* If all the restart attempts fail, begin restarting again after the specified period = 01:00
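A minimal sketch of reading those values, assuming "Cluster Disk Witness" as the resource name:

    # Read the restart policy of the witness disk resource
    Get-ClusterResource "Cluster Disk Witness" |
        Format-List RestartAction, RestartPeriod, RestartThreshold, RestartDelay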

There is no way to access any properties of the CSV volumes.

The critical error "The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk." seems to indicate that if the witness disk fails over, the quorum is automatically lost and the cluster is shut down.
Mahesh

Cluster Shared Volumes allow nodes to share access to storage, which means that the applications on that storage can run on any node, or on different nodes, at any time. CSV breaks the dependency between application resources (the VMs) and disk resources (the CSV disks), so in a CSV environment it does not matter where a disk is mounted: it appears local to every node in the cluster. CSV manages storage access differently than regular clustered disks.
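You can see where each CSV is currently mounted from PowerShell; a quick sketch:

    # Each CSV's current coordinator (owner) node
    Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State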

You cannot add or change the possible owners of a CSV from the GUI; you need to use the cluster.exe command line for that.
http://virtuallyaware.wordpress.com/2011/11/28/blog-highlight-add-possible-owner-to-a-cluster-shared-volume/
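The commands from that post look roughly like this ("Cluster Disk 2" stands in for your actual CSV resource name):

    # Show the current possible owners of the CSV disk resource
    cluster.exe resource "Cluster Disk 2" /listowners

    # Add the other node as a possible owner
    cluster.exe resource "Cluster Disk 2" /addowner:N2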

How many network cards do you have per server / cluster node?

As I said above, I hope you have two network cards per server, one for the heartbeat and one for cluster traffic. If you have only one NIC per cluster node, the chances of failure are much higher when you reboot any node.

Mahesh.
nmxsupport

ASKER

Hello, each server has 4 NICs: 2 teamed for the LAN and 2 teamed for cluster/management. The switches are cross-linked and provide additional resilience.

All NICs are connected to switches, so shutting a node down would not put the surviving node's NICs into a "disconnected media" state.

I am struggling to see how the number of NICs matters here: the point is that I have 2 nodes, and if an entire node fails then all NIC connections between the 2 nodes are down, whether I had 2 or 20, wouldn't they?

It appears to me to be a side effect of a clustering split-brain. If the node that owns the quorum fails, shouldn't ownership transfer to the remaining node?
Mahesh

In your case the quorum is not being transferred to the other node when the quorum owner reboots; that is the problem.

Can you check that the quorum disk is presented correctly from the storage end (I mean, that cluster support is enabled on the storage side), and also that you have installed the MPIO feature on both cluster nodes?
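A sketch of that check, to run on each node:

    # Confirm the Multipath I/O feature is installed
    Get-WindowsFeature -Name Multipath-IO

    # List the disks MPIO has claimed (mpclaim ships with the feature)
    mpclaim.exe -s -d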

Mahesh.
nmxsupport

ASKER

Storage is a fibre channel SAN and MPIO is enabled.
All CSVs and the quorum disk are visible from both nodes. I expect (but have not confirmed) that if one node fails, the other node will still be able to see all the resources.

Tell me, Mahesh: in your view, in a 2-node 2012 cluster, given a failure of either node (bearing in mind only one will own the quorum at any time), should the cluster continue to operate correctly?
Interestingly, re-running the quorum wizard I get the following; I will investigate further.

Quorum Configuration:  Node Majority
Cluster Managed Voting:  Enabled

The recommended setting for your number of voting nodes is Node and Disk Majority, however Node Majority was selected because an appropriate disk could not be found.
Your cluster quorum configuration will be changed to the configuration shown above.
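If the wizard cannot find an appropriate disk, the witness has dropped out of the quorum configuration, which would explain the outage: with Node Majority and only two nodes, losing either node loses quorum. Once a suitable disk is available again, restoring the recommended model would look something like this (a sketch; the disk resource name is an example):

    # Show the current quorum model
    Get-ClusterQuorum

    # Reinstate Node and Disk Majority using the witness disk resource
    Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk Witness"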
ASKER CERTIFIED SOLUTION
Mahesh