pault01

Failover cluster fails when 1 node goes offline

Hi Experts,

I have a Microsoft Failover Cluster with 2 nodes and a quorum witness (file share witness, FSW). Lately, when I've taken one node offline for maintenance, the cluster has shut down all my VMs. I've had to reconnect to the cluster, and then my VMs restart and come online. I always test the FSW before I take a node offline and it always responds. Right now I can no longer trust the cluster. Below are the 2 events recorded...

Critical

File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\<SERVERNAME>\WitnessCluster2'. Please ensure that file share '\\<SERVERNAME>\WitnessCluster2' exists and is accessible by the cluster.

Error

Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.


How do I stop this happening on my cluster? It does not make sense, because the FSW is available and online, but as soon as I restart a host node my cluster falls apart. It's a real problem when you have 20 VMs that depend on it working.

The 2 nodes are running with Windows Server 2012 R2 Datacenter
The FSW is located on a server running Server 2016

Any assistance is greatly appreciated.

Paul
Cliff Galiher

My first guess is that <servername> is failing to resolve when you shut down a node. Make sure all nodes use redundant DNS servers, append the domain name suffix properly, and, if the DNS servers are guests in the cluster, that they never end up on the same node.

Ideally I'd check the file share witness AFTER a node is down and the cluster has failed. nslookup should still work and the file share should still be reachable. If either fails, you have a basis for troubleshooting.
The file share witness is totally separate, right? Not a VM on the cluster?
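
The check suggested above (name resolution first, then share reachability, run from the surviving node while the other node is down) could be scripted roughly as follows. This is only a sketch: the function name `check_witness`, the server name `FS01`, and the injectable `resolve` parameter are hypothetical, and a reachable share for the logged-on user does not prove the cluster's own account has the permissions it needs.

```python
import os
import socket


def check_witness(server, share, resolve=socket.gethostbyname):
    """Resolve the witness server name, then test the UNC path.

    `resolve` is injectable so the logic can be exercised without DNS.
    """
    result = {"server": server, "address": None, "share_reachable": False}
    try:
        # Step 1: the same question nslookup answers. Does the name resolve?
        result["address"] = resolve(server)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result["dns_error"] = str(exc)
        return result
    # Step 2: is the share visible from this machine?
    unc_path = "\\\\" + server + "\\" + share
    result["share_reachable"] = os.path.isdir(unc_path)
    return result


if __name__ == "__main__":
    # "FS01" is a placeholder; substitute the real witness server name.
    print(check_witness("FS01", "WitnessCluster2"))
```

Running it once with both nodes up and once with a node down makes it easy to compare the two results side by side.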

Why did you choose a file share witness instead of a disk witness? You have shared storage, right? What kind? How about a disk witness instead?

What does the cluster diagnostic report say?
pault01

ASKER

Hi Cliff Galiher & kevinhsieh,

To answer both your questions:

The FSW is on a physical server outside the cluster. The FSW stays online throughout and is on the same subnet as the 2 node servers. From the node server that remains online I can access the FSW. Something is happening to the cluster as soon as the other node server turns off or restarts.
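
For readers following along, the vote arithmetic behind this symptom can be sketched as below. It is a deliberately simplified static-majority model and an assumption on my part: it ignores the dynamic quorum and dynamic witness adjustments that Windows Server 2012 R2 applies, and the function names are hypothetical. It only illustrates why a witness that cannot be arbitrated matters so much in a 2-node cluster.

```python
def has_quorum(votes_present, total_votes):
    """Strict majority: more than half of all configured votes must be present."""
    return votes_present > total_votes // 2


def surviving_votes(nodes_up, witness_arbitrated):
    """Count one vote per running node plus one if the witness vote is obtained."""
    return nodes_up + (1 if witness_arbitrated else 0)


# Two nodes plus one witness, as in the cluster described in this thread.
TOTAL_VOTES = 2 + 1

if __name__ == "__main__":
    # One node down, witness arbitration succeeds: 2 of 3 votes.
    print(has_quorum(surviving_votes(1, True), TOTAL_VOTES))
    # One node down, witness arbitration fails: 1 of 3 votes.
    print(has_quorum(surviving_votes(1, False), TOTAL_VOTES))
```

Under this model the remaining node keeps the cluster running only if it wins the witness vote, which fits the "failed to arbitrate" event appearing at the moment the cluster goes down.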

Thanks for comments so far.
ASKER CERTIFIED SOLUTION
kevinhsieh

ASKER

One of the DNS servers is a physical server; the other is a VM in the cluster.

Is this a problem? Am I best to change to a disk witness?
I have always used disk witness of say 100 MB. Never had a problem with quorum.
Sounds like a permissions issue on the share the cluster service is connecting to. Make sure the permissions are set up correctly.

Use Azure. It's built into the Cluster service. It costs pennies per month. We have a lot of cluster witnesses set up this way.

ASKER

Hi All,

Thank you all for your comments. In the end I changed from a file share witness to a disk witness. I have since tested numerous scenarios with this and the cluster remains stable. Thank you all for your input. A special thanks to kevinhsieh; your insight has helped solve my problem.

Thanks again,

Paul