
2016 Failover Cluster Failures

High Priority
83 Views
Last Modified: 2020-06-08
We seem to be having issues with our 8-Node Windows Server 2016 Failover Cluster where it becomes basically inoperable when our CSV Volumes disconnect with the message below. The message appears quite a few times on all CSVs on this NAS until things settle down.

EVENT ID 5120

Cluster Shared Volume has entered a pause state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

This seems to shut down all affected VMs on that CSV, and they really don't come back up for quite a long time.

A run down of our current setup:
- 8-Node 2016 Failover Cluster
- QNAP NAS for VM Storage
- 192.168.101.X is our Stacked Switch Production Network. On each Host we have a VMTeam set up and are using it for our Hyper-V Switch (vSwitch).
- 192.168.103.X is our first dedicated iSCSI network. This switch is a 10GB switch with a connection going to one of the 10GB ports on each of our Hosts and a Fiber going to our NAS. 1 port goes to our 101 network for management.
- 192.168.104.X is our second dedicated iSCSI network. This switch is a 10GB switch with a connection going to the other 10GB port on each of our Hosts and a Fiber going to our NAS. 1 port goes to our 101 network for management.
- 192.168.105.X is our Live Migration Network that is on a dedicated 1GB switch that connects to 2 ports on each of our Hosts to create a LIVE-MIGRATION_TEAM.

- Inside of Failover Cluster we have 4 networks:
1. PROD is our 101 network, it's set for Cluster Communication and can be used for Clients
2. Both of our iSCSI networks are set to NO Cluster Communication
3. LIVE-MIGRATION is set for Cluster Communication Traffic only.
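A quick sanity check on the two dedicated iSCSI networks above: the fabrics should be distinct, non-overlapping subnets with every initiator/target interface sitting in exactly one of them. A small sketch (the individual host/NAS addresses here are made-up examples, not the real ones):

```python
# Sketch: verify the two iSCSI fabrics (192.168.103.x / 192.168.104.x)
# are separate, non-overlapping subnets. Host/NAS IPs are hypothetical.
import ipaddress

fabric_a = ipaddress.ip_network("192.168.103.0/24")
fabric_b = ipaddress.ip_network("192.168.104.0/24")

# Example iSCSI interface addresses (assumed, for illustration only)
paths = {
    "host1-nicA": "192.168.103.11",
    "host1-nicB": "192.168.104.11",
    "nas-portA":  "192.168.103.250",
    "nas-portB":  "192.168.104.250",
}

assert not fabric_a.overlaps(fabric_b), "iSCSI fabrics must not overlap"

for name, ip in paths.items():
    addr = ipaddress.ip_address(ip)
    in_a, in_b = addr in fabric_a, addr in fabric_b
    # Each interface should live in exactly one fabric
    assert in_a != in_b, f"{name} is not in exactly one iSCSI subnet"
    print(f"{name}: {'fabric A' if in_a else 'fabric B'}")
```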

For Live Migration Settings:
1. Our Live Migration is at the very top of the priority list and it's the only network checked.

We were working on separate tasks on our Cluster last night and noticed it go down; we could no longer manage the Cluster once the above message started happening again.

This has happened about 3 times in the last month and we are trying to find out why.

We found out that our previous admin had installed AV on each of our hosts. We have uninstalled it now. We also now have our Hosts updated with the latest Windows Updates and we upgraded our QNAP NAS to the latest firmware too.

We would like to find the cause of this, but we're not sure what it is at this point. Our configuration within our NAS, switches, and Failover Cluster appears to be fine, yet we're unable to pin down the culprit or why our Cluster appears to be extremely sensitive. Looking back at logs from last night, it appears to have started around the time we were gathering information from our Cluster via script. The other times, we didn't run a script, so it would have been a different trigger. The script in the below link is the template script we used. This script has worked in other environments without the Cluster being impacted.

https://gallery.technet.microsoft.com/scriptcenter/Hyper-V-Reporting-Script-4adaf5d0

Any suggestions? If I need to provide any additional information, please let me know.


Author

Commented:
I haven't heard of a QNAP for an 8-node cluster prior either. However, it appears to work. We have a TVS-EC1680U-SAS-RP model. Our QNAP NAS appears to be okay as far as CPU and RAM are concerned: CPU sits around 9% and RAM about 18%.

I have noticed quite high latency on the Disk Pools in the past, and yes, iSCSI/MPIO both appear to be set up according to QNAP's recommendation. Both of our iSCSI networks are talking to our iSCSI Targets.

Our iSCSI switches are 10GbE. However, regarding the topic of Jumbo Frames: we don't have that enabled. Our switches are Netgear XS716E switches. I see that the QNAP has an MTU of 9000 set up on its 2 Fiber 10GbE connections, but I don't see anything reflecting this on our switches. Our Hosts also don't have Jumbo Packets enabled on any of the ports. Looking at the iSCSI ports on our Hosts, those ports have all the default connection items selected, excluding IPv6. Should all checkboxes except for IPv4 be unchecked on our iSCSI ports?
      For example, Client for Microsoft Networks, File and Printer Sharing for Microsoft Networks, QoS Packet Scheduler, Microsoft LLDP Protocol Driver, Link-Layer Topology Responder, and Link-Layer Topology Discovery Mapper I/O Driver are all checked on our iSCSI adapters.

Author

Commented:
Also, yes, we do have QNAP Support. I sent them a message last night inside of a previous ticket I had with them. I should hear back from them sometime Monday.
kevinhsieh, Network Engineer
CERTIFIED EXPERT

Commented:
What kind of disks do you have for the data? RAID configuration? How many VMs are you running on those 8 hosts?

How many CSV are you running? How large are they?

How many IOPS are you consuming? How many can the array support?

So, you have a ticket in from Friday night, and you expect to hear back on Monday. This is why people in the know don't run large production clusters on QNAP.

Netgear switches also generally aren't considered production storage switches.

When I call my storage vendor, I am speaking with a L3 engineer in less than 60 seconds.

Author

Commented:
I inherited this network about 3 weeks ago so I was not present for the decision-making process of the equipment that is in place. However, I am here to resolve the inconsistencies that have begun to arise.

Although I agree that the equipment may not be enterprise-grade, it has been stated to me that it has run smoothly before.

To answer your questions now.

The previous admin configured all VMs to be in 3 CSVs, yes I know, also not best practice.
Basically all production CSVs are within 1 big RAID 6 Storage Pool. The main CSV is 29.3TB and stores about 90% of our 79 VMs.

Right now the IOPS seem okay but that's because no one is really using our systems right now. I'd need to check back when production is occurring. When I looked a couple weeks ago, I saw the latency upwards of 1770 ms which I thought was really high.

To circle back on the original post, what does the issue seem to be? Does it appear to sound more networking related or NAS related?

I don't recall a need to set up Jumbo Framing on other networks I've worked on.
Looking at the iSCSI ports on our Hosts, those ports have all the default connection items selected, excluding IPv6. Should all checkboxes except for IPv4 be unchecked on our iSCSI ports?
      For example, Client for Microsoft Networks, File and Printer Sharing for Microsoft Networks, QoS Packet Scheduler, Microsoft LLDP Protocol Driver, Link-Layer Topology Responder, and Link-Layer Topology Discovery Mapper I/O Driver are all checked on our iSCSI adapters.

Author

Commented:
Also, what does the error point towards in my case? I looked it up but found varying reasons.

EVENT ID 5120

Cluster Shared Volume has entered a pause state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Author

Commented:
Switching-wise, the switch model that is in place appears to be fine. Somehow I feel like there might be a configuration change or modification I'm missing.
kevinhsieh, Network Engineer
CERTIFIED EXPERT

Commented:
How many disks in the RAID 6, and what kind?

It is quite possible that you don't have enough disks to support the workload. The super-high latency would be an indication of that.

RAID 6 is the slowest RAID level for performance.

For 7.2K drives in RAID 5, I would hope that you have maybe on the order of 48 drives for suitable performance.

If you had 15K drives, then you would need fewer drives.

Is there any SSD as cache or storage tier? That would help. SSD in RAID 6 wouldn't be a problem.

Author

Commented:
We have all SAS drives. The particular RAID 6 pool has 10 disks in it. The disks are ST12000NM0027, which are 7.2K RPM (just found that out; a little shocked at that).

This NAS does have 2 SSD cache drives as well. Although they said they are in Good standing, they have an Estimated Life Remaining of 0%. That statistic is new in the new firmware applied to the NAS last night. I have also asked QNAP about this.

We also have a Storage Pool 2 set up with RAID 5 that contains 4 drives. This pool holds the Cluster Quorum and another CSV that also went offline. So all of the CSVs appeared to go offline.
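For scale, a rough front-end IOPS estimate for a 10-disk RAID 6 pool like this can be sketched as below. The ~200 IOPS per 7.2K spindle and the 70/30 read/write mix are assumptions for illustration, not measurements from this NAS:

```python
# Back-of-envelope: usable front-end IOPS of a 10-disk RAID 6 pool.
# Per-disk IOPS and the read/write mix are assumed values.
def frontend_iops(disks, iops_per_disk, write_penalty, read_frac=0.7):
    backend = disks * iops_per_disk
    write_frac = 1.0 - read_frac
    # Every front-end write costs `write_penalty` back-end operations
    # (RAID 6 = 6, RAID 5 = 4), so the mix dilutes the raw spindle total.
    return backend / (read_frac + write_frac * write_penalty)

pool1 = frontend_iops(disks=10, iops_per_disk=200, write_penalty=6)  # RAID 6
print(f"RAID 6, 10 x 7.2K NL-SAS: ~{pool1:.0f} front-end IOPS")
```

Under those assumptions the pool lands somewhere around 800 mixed front-end IOPS, which is thin for ~79 VMs once the SSD cache stops absorbing writes, and would be consistent with the very high latency observed.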
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
5120 is generally a network error. Cluster is complaining about not being able to talk to storage.

Author

Commented:
I see. I guess I'm not sure what networking is going wrong specifically.

The switches are not producing any CRC Error Packets from what I can see.

Author

Commented:
Let me ask this, looking through the switches I see that the additional networks that are configured like 192.168.103.X, 192.168.104.X, and 192.168.105.X (our ISCSI and Live Migration networks) all are IP networks but they aren't specified within a VLAN anywhere. The IPs are just configured on the NIC Adapters within the device. Could that be a possible cause to this?

Author

Commented:
Also, since Jumbo Packets are not configured, should the 2 adapters on the QNAP NAS not be set to 9000 MTU and instead 1500?
kevinhsieh, Network Engineer
CERTIFIED EXPERT

Commented:
If you are not set for Jumbo Frames on your hosts and switches, then the MTU on your QNAP should also be set to 1500, if that is a supported option.

Author

Commented:
Okay, I can do that. What's the disadvantage of having it the way it currently is, where the MTU is different on the NAS versus the Hosts and switches? Just so I know.

Also, is the additional IP Network setup an issue I mentioned in the last comment? Or is that okay as is?
kevinhsieh, Network Engineer
CERTIFIED EXPERT

Commented:
As long as you can ping each NAS interface from each host, then the IP configuration should be okay.

Having additional protocols bound to the host interfaces doing iSCSI isn't great, but I don't think would cause you these problems.

If the QNAP thinks it can use jumbo frames, then traffic can get fragmented or worse, just drop, causing TCP retransmits. Bad.
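The fragmentation cost is easy to put numbers on. A sketch of how a single jumbo-sized IP packet from a 9000-MTU NAS port gets chopped up crossing a 1500-MTU path (and is simply dropped if the Don't Fragment bit is set):

```python
# Sketch: fragment count for a jumbo-sized IP packet crossing a
# smaller-MTU path. Header sizes are standard IPv4 values.
import math

IP_HEADER = 20  # bytes, IPv4 header without options

def fragments(payload_bytes, path_mtu):
    # Each fragment carries at most (path_mtu - IP_HEADER) payload bytes,
    # rounded down to a multiple of 8 as IP fragmentation requires.
    per_frag = (path_mtu - IP_HEADER) // 8 * 8
    return math.ceil(payload_bytes / per_frag)

# One 9000-byte frame's worth of payload (9000 - 20 header = 8980)
# crossing a 1500-MTU segment:
print(fragments(8980, 1500))  # -> 7 fragments instead of 1 frame
```

Seven packets (plus reassembly on the far side) for every one the NAS thinks it sent, or outright drops and TCP retransmits, either way the iSCSI latency the cluster sees gets ugly.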
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
Although they said they are in Good standing, they have an Estimated Life Remaining of 0%. That statistic is new in the new firmware applied to the NAS last night. I have also asked QNAP about this.
I'm going to wager that those SSDs hitting 0% remaining, meaning they've exhausted the spare NAND that is there to replace NAND that wears out, are the cause of the performance hit.

Cache is key to high performance.

A 7200 RPM NearLine SAS drive, as is the case here, has a sustained write rate of about 220MB/Second. IOPS wise, maybe 300 IOPS per disk depending on how the storage stack is set up.

The SSDs, on the other hand, will have 10K IOPS and 500MB/Second for SATA, and 25K IOPS or more and 1GB/Second or more per SAS SSD disk.

Author

Commented:
Kevinhsieh - I will see about changing the MTU on our QNAP to match what our switches currently have.


Philip - Our Cache drives are SSD XM21E 128GB drives.

Do you think this could also be the cause of our issues since they are showing as 0% Estimated Life Remaining?

Author

Commented:
Also, it appears as though when these CSVs go into that pause state, we still are able to ping our QNAP.

Although we can ping, could we be running into a scenario that the NAS maybe isn't responding fast enough and puts these CSVs into a paused state? Or, could it be that some packets are being dropped on the switching end? It looks like our 1GbE switch is registering some packet errors coming from the 2 ISCSI switches. However, within those switches none of the ports are showing any packet errors?
CERTIFIED EXPERT

Commented:
Hey Phil, are you sure about those numbers?
A 7200 RPM NearLine SAS drive, as is the case here, has a sustained write rate of about 220MB/Second. IOPS wise, maybe 300 IOPS per disk depending on how the storage stack is set up.
I would have gone about half that for a 7200 RPM drive.
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
@Gerald Yeah, we've tested both SATA and SAS NearLine drives with 4Kn bringing in about 250MB/Second to 280MB/Second depending on storage stack format. NearLine drives suck at latency and are dismal for IOPS, so we only deploy them in 60-bay and 102-bay JBODs as a rule.

@Joe: https://www.adata.com/us/feature/403

Ugh, no Power Loss Protection. Plus, they are consumer mSATA drives to boot. :(

From their site:
SSD cache acceleration
The TVS-EC1680U-SAS-RP features 2 pre-installed 128GB mSATA modules, and there are no limits on using SSDs for caching.
That is a very misleading statement as you are now finding out.

When we build out clusters or standalone virtualization hosts we make sure that the rated endurance for the solid-state drives being installed in the system(s) will live through the life of the solution.

Endurance is expressed in Drive Writes Per Day (DWPD). We aim for 3 to 5 DWPD rating with 10 DWPD being the go-to for ultra-high performance and IOPS setups.

1 DWPD = the total capacity of the drive being written to it in one day. So, 1TB SSD can have 1 TB written to it every day for five years which is the usual warranty life for enterprise endurance drives.

Eight nodes writing to consumer grade SSDs? The SSD drives are now duds. Time to replace them.
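The DWPD arithmetic above can be sketched out. The consumer-drive figure is a guess for illustration; check the actual drive's rated TBW/DWPD on its datasheet:

```python
# Sketch of the DWPD-to-total-writes arithmetic described above.
def rated_tbw(capacity_tb, dwpd, warranty_years=5):
    # 1 DWPD = the drive's full capacity written once per day,
    # sustained over the warranty period.
    return capacity_tb * dwpd * 365 * warranty_years

# A 128GB consumer cache module (assumed ~0.3 DWPD) vs a
# 1TB enterprise 3-DWPD drive:
print(f"128GB consumer:   ~{rated_tbw(0.128, 0.3):.0f} TB written")
print(f"1TB enterprise 3D: {rated_tbw(1.0, 3):.0f} TB written")
```

Roughly 70 TB of lifetime writes versus about 5,475 TB; with eight nodes funneling VM writes through two tiny cache modules, burning through the former is not surprising.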

Author

Commented:
Wow, that is unfortunate.

Thank you very much for pointing this out Philip. Since we're moving the VMs off of this NAS and onto a reconfigured and designed replica of it, maybe we should find a more suitable SSD cache for it.

For SSD sizing, is there a guideline for sizing SSD cache based on capacity? Ours is about 80TB across 3 Storage Pools with separate CSVs under them.

I'm assuming this is still unrelated to the bigger issue that the CSVs are disconnecting right? I saw your previous comment that it's definitely network related?
kevinhsieh, Network Engineer
CERTIFIED EXPERT

Commented:
How much active data do you have?

A normal rule of thumb is at least 5% SSD by capacity. If your SSDs were used as read cache, they're way too small to be effective. If they were used as write buffer, they were too small to have any longevity. I would say you need a few TB of flash. I wouldn't go any smaller than 3.84 TB SSDs.
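The 5%-of-capacity rule of thumb above, applied to this environment's ~80 TB, looks like this (a sizing sketch, not a vendor recommendation):

```python
# Sketch: minimum flash cache under the ~5%-of-capacity rule of thumb.
def min_cache_tb(usable_capacity_tb, pct=0.05):
    return usable_capacity_tb * pct

need = min_cache_tb(80)
print(f"~{need:.1f} TB of flash suggested for 80 TB")
# The existing 2 x 128 GB mSATA modules total 0.256 TB, a small
# fraction of that; a pair of 3.84 TB SSDs would clear the bar.
```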
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
@Joe That was before the zero endurance on the cache SSDs was known.

With them offline, the disconnect is the cluster waiting for the QNAP to respond. With no SSDs, the drive queue depth and latency numbers must be extreme.

EDIT: On the hosts open ResMon.EXE and have a look at the DISK tab with the drop downs open and sorted to latency. Queue Depth will be on the right.

Author

Commented:
I see. So when I open Resource Monitor on my Hosts, it doesn't show my CSVs listed there, only the C: drive, and the Quorum on the one Host.

I can see the latency within the QNAP but not within my Hosts.

Side note - I tested communication with 1400-byte vs 8000-byte packets from each of our hosts to one another and to the NAS, and found that the hosts are not getting jumbo-packet responses from some other hosts and the NAS. I believe this is tied to the current 10GbE Netgear switch that's in place; it doesn't have MTU settings in it. I think this could possibly be why we are getting the CSV disconnects at times. If the adapters on the Hosts and QNAP were configured back to the default 1500, that should solve our issue with our Cluster going down.

What do you guys think?
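For anyone repeating that ping test, the arithmetic behind which sizes fit in one frame: Windows `ping -l` sets the ICMP payload, and the IPv4 (20-byte) and ICMP (8-byte) headers ride on top, so the largest unfragmented payload on a given MTU is:

```python
# Largest `ping -l` payload that fits in a single frame at a given MTU.
IP_HDR, ICMP_HDR = 20, 8

def max_ping_payload(mtu):
    return mtu - IP_HDR - ICMP_HDR

print(max_ping_payload(1500))  # 1472 -> why 1400-byte pings succeed
print(max_ping_payload(9000))  # 8972 -> why 8000-byte pings need jumbo
```

So `ping -f -l 1472 <target>` (with `-f` setting Don't Fragment) should succeed end-to-end on a standard 1500-MTU path, while anything larger will fail unless every hop supports jumbo frames.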

Author

Commented:
Quick update on this: we have since disabled Jumbo Framing on all of our Hosts so that it's consistent across all of them. Our NAS, however, still has it enabled. We'll leave it as is until we migrate to the newly configured NAS, which has it disabled.

Thank you guys for the SSD Cache suggestions. That has totally helped. We have ordered new SSD Cache drives for the newly configured NAS so that issue is corrected.

Another question for you guys:
1. I saw the "Cluster Shared Volume has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished." message again today but apparently didn't impact our environment. It seemed to have happened randomly and only affected 1 of the 3 CSVs on that NAS. Any reason for this?

2. I know there's a best practice of setting the MPIO Policy on a Disk to be "Least Queue Depth" opposed to the default "Round Robin with Subset". It appears this has to be changed on each host individually. Is that correct?
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
1: Check your switch monitoring to see if there are dropped packets.
2: Least Queue Depth. That's the proper setting. I suggest sticking with it.

Author

Commented:
I checked out one of our 10GbE switches but didn't see any CRC errors. I'll check again and the other switch tomorrow. 

I saw that Least Queue Depth is the best practice. Right now the disks are set to Round Robin with Subset, but when making the adjustment, it appears the policy needs to be set individually on each host. Is that correct?
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
Use the mpclaim command to set LQD. I think all nodes need the setting done but it's been a while since I've worked with a disaggregate cluster.
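Since the policy does appear to need setting per node, one way to keep it consistent is to generate the same `mpclaim` command for every node/disk pair and push it out remotely. A sketch (node names and MPIO disk numbers below are hypothetical; run `mpclaim -s -d` on a node to list its actual MPIO disks, and verify the `-l -d <disk> <policy>` syntax and policy code against Microsoft's mpclaim documentation):

```python
# Sketch: build per-node mpclaim invocations to set Least Queue Depth
# on each MPIO disk. Node names and disk numbers are examples only.
LQD = 4  # mpclaim's load-balance policy code for Least Queue Depth

def lqd_commands(nodes, mpio_disks):
    cmds = []
    for node in nodes:
        for disk in mpio_disks:
            # winrs runs the command remotely on each cluster node
            cmds.append(f"winrs -r:{node} mpclaim -l -d {disk} {LQD}")
    return cmds

for cmd in lqd_commands(["HV-NODE1", "HV-NODE2"], [0, 1, 2]):
    print(cmd)
```

Generating the full list first and eyeballing it before running anything is cheap insurance against fat-fingering a disk number on one node.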

Author

Commented:
Got it, okay thanks!

I'll send an update tomorrow on the switching errors.

Author

Commented:
So I logged on to check the switches, and the 10GbE switches are fine. However, when I ping from a host to our Host 6, there is some periodic packet loss on the vEthernet interface that our Cluster Communication and VMs use. That's strange. The switch for that interface doesn't report any errors.

Looking at the vEthernet adapter's settings, the only main difference I see between it and the others is that it has a "Reliable Multicast Protocol" option with a checkbox next to it. I'm not sure I've heard of that and I checked a few other Hosts and they don't have this option even though they're the same model. What is this feature and should I have this off? Could it be causing the issue?
Lawrence Tse, Principal Consultant
CERTIFIED EXPERT

Commented:
You might want to check something:

  1. Are you using Broadcom NICs? If so, that's a well-known Broadcom issue. They have released fixes for it, but even after the fixes the problem is still there. I have personally avoided Broadcom for this reason.
  2. Did you configure or disable VMQ? If you have not configured VMQ, you might want to disable it.
  3. Please make sure the iSCSI paths are not on the same subnet ID and are not inter-routable.
  4. Jumbo Frames are important, but that is not likely to cause disconnection.
  5. Did you install and enable MPIO? Did you configure multiple connection sessions to the iSCSI target?
  6. Is there some kind of IO-intensive task running in the background when the disconnection occurs (e.g. backup, Live Migration, checkpoint creation, etc.)?
Philip Elder, Technical Architect - HA/Compute/Storage
CERTIFIED EXPERT

Commented:
Thank you. :)