We seem to be having issues with our 8-node Windows Server 2016 Failover Cluster where it becomes basically inoperable after disconnecting from our CSV volumes. The following message appears quite a few times for every CSV on this NAS until things settle down.
EVENT ID 5120
Cluster Shared Volume has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
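For anyone wanting to correlate timestamps across Hosts, this is roughly how we pull the 5120 occurrences out of the System log. A diagnostic sketch only — it assumes the standard FailoverClustering provider name and needs to run on each Host (or with `-ComputerName`):

```powershell
# Pull recent CSV pause events (Event ID 5120) from the System log so we can
# line the timestamps up across all 8 Hosts.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    Id           = 5120
    ProviderName = 'Microsoft-Windows-FailoverClustering'
} -MaxEvents 50 |
    Select-Object TimeCreated, MachineName, Message |
    Format-List
```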
This seems to shut down all affected VMs on that CSV, and they don't come back up for quite a long time.
A rundown of our current setup:
- 8-Node 2016 Failover Cluster
- QNAP NAS for VM Storage
- 192.168.101.X is our Stacked Switch Production Network. On each Host we have a NIC team (VMTeam) set up and are using it for our Hyper-V Switch (vSwitch).
- 192.168.103.X is our first dedicated iSCSI network. This is a 10Gb switch with a connection going to one of the 10Gb ports on each of our Hosts and a fiber link going to our NAS. One port goes to our 101 network for management.
- 192.168.104.X is our second dedicated iSCSI network. This is a 10Gb switch with a connection going to the other 10Gb port on each of our Hosts and a fiber link going to our NAS. One port goes to our 101 network for management.
- 192.168.105.X is our Live Migration Network, on a dedicated 1Gb switch that connects to 2 ports on each of our Hosts to form a LIVE-MIGRATION_TEAM.
- Inside of Failover Cluster we have 4 networks:
1. PROD is our 101 network; it's set to allow Cluster Communication and can be used by Clients
2. Both of our iSCSI networks are set to NO Cluster Communication
3. LIVE-MIGRATION is set for Cluster Communication Traffic only.
For Live Migration Settings:
1. Our LIVE-MIGRATION network is at the very top of the priority list and is the only network checked.
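For reference, the network roles described above can be verified with the Failover Clustering cmdlets. A sketch, assuming the default role encodings (0 = no cluster communication, 1 = cluster only, 3 = cluster and client):

```powershell
# List each cluster network with its role:
#   0 = none (our two iSCSI networks), 1 = cluster only (LIVE-MIGRATION),
#   3 = cluster and client (PROD)
Get-ClusterNetwork | Select-Object Name, Address, Role

# Confirm which networks Live Migration is excluded from
Get-ClusterResourceType -Name 'Virtual Machine' |
    Get-ClusterParameter -Name MigrationExcludeNetworks
```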
We were working on our Cluster last night on separate tasks and noticed it went down when the above message started happening again.
This has happened about 3 times in the last month and we are trying to find out why.
We found out that our previous admin had installed AV on each of our Hosts; we have since uninstalled it. We also now have our Hosts on the latest Windows Updates, and we upgraded our QNAP NAS to the latest firmware too.
We would like to find the cause of this, but we're not sure what it is at this point. The configuration on our NAS, switches, and Failover Cluster appears to be fine, yet we're unable to pin down the culprit or explain why our Cluster appears to be so sensitive. Looking back at logs from last night, it appears the trouble started around the time we were gathering information from our Cluster via a script. The other times, we didn't run a script, so those would have had a different trigger. The script in the below link is the template script we used; it has worked in other environments without the Cluster being impacted.
Any suggestions? If I need to provide any additional information, please let me know.