Jeff Pittman

asked on

Failover Cluster: CSV goes offline when a node is paused

Two-node Windows Server 2016 Hyper-V cluster connecting to a SAN via iSCSI.
The SAN has a total of 3 LUNs: 2 for the VMs to "live" on and one for quorum.

When I pause Node1 and drain the roles, the VMs drain to Node2.

When I pause Node2 and drain the roles, both cluster disks go offline and the VMs auto-pause.

During the outage, I was able to ping the iSCSI ports on the SAN. I also disabled AV and the firewall on the hosts, but that did not resolve anything.
Only resuming Node2 brought the CSVs back online.

This cluster was able to failover properly in the past and my lab is failing over correctly as well.

I don't want to run the ClusterVal wizard until my next maintenance window (JUST IN CASE). Before I do run ClusterVal, has anyone run into this scenario before?

Cluster.log attached
Philip Elder

Are there two switches in the SAN path? Is MPIO enabled and configured for Least Block Depth?

When the CSVs pause, what does Failover Cluster Manager say?

Also, I don't see any attachment.
Jeff Pittman

ASKER

Each node is directly connected to the SAN via two 10 Gb SFP+ cables (networks 10.0.0.x/24 and 10.0.1.x/24), for a total of 4 connections to the SAN.

Good catch on the MPIO.

On Node 2 (the node that takes the CSVs offline when drained):

One LUN is Fail Over Only
The second LUN and the quorum LUN are Round Robin with Subset

On Node 1:
All LUNs are set to Round Robin with Subset

Both lab servers are Round Robin with Subset

The SAN does have an Active-Active controller. Let me research the best option (I'm open to your recommendations).
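
For anyone wanting to compare MPIO policies across nodes, here is a minimal Python sketch of the check. It assumes the per-node policies have been copied by hand into a dictionary. The LUN labels (`LUN1`, `LUN2`, `Quorum`) are placeholders, and which data LUN is the Fail Over Only one on Node 2 is an assumption made for illustration only.

```python
from collections import defaultdict

# Policies as described in this thread. The LUN labels are placeholders, and
# assigning "Fail Over Only" to LUN1 on Node2 is an assumption for illustration.
policies = {
    "Node1": {
        "LUN1": "Round Robin with Subset",
        "LUN2": "Round Robin with Subset",
        "Quorum": "Round Robin with Subset",
    },
    "Node2": {
        "LUN1": "Fail Over Only",
        "LUN2": "Round Robin with Subset",
        "Quorum": "Round Robin with Subset",
    },
}


def find_mismatches(policies):
    """Return {lun: {node: policy}} for LUNs whose policy differs between nodes."""
    by_lun = defaultdict(dict)
    for node, luns in policies.items():
        for lun, policy in luns.items():
            by_lun[lun][node] = policy
    return {
        lun: nodes
        for lun, nodes in by_lun.items()
        if len(set(nodes.values())) > 1
    }


if __name__ == "__main__":
    for lun, nodes in find_mismatches(policies).items():
        print(f"{lun}: policy differs between nodes -> {nodes}")
```

With the values above, only the one LUN that is Fail Over Only on Node 2 is reported. A LUN that is missing from one node entirely is not flagged by this sketch.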


Node Drain Failure Event:

Node drain failed on Cluster node ClusterV2.

Reference the node's System and Application event logs and cluster logs to investigate the cause of the drain failure.  When the problem is resolved, you can retry the drain operation.


Attached is the FailOverCluster-CsvFs event log, showing the LUNs going from Draining to AutoPause.
FailOverCluster-CsvFs.evtx
Round Robin should not be set in a cluster setting IMO (I may be corrected on this, as we do DAS, not SAN). That basically tells MPIO to go around in circles, so when a path is offline things may not operate as expected.
Finally had a window for more testing. I don't have definitive proof that this was the cause, but when the CSVs went offline, Veeam Backup & Replication was backing up the VMs. I performed another failover when Veeam was not running and everything failed over properly.