Joe Lowe asked:

2016 Failover Cluster Failures

We're having issues with our 8-node Windows Server 2016 Failover Cluster: it becomes essentially inoperable when it disconnects from our CSV volumes with the message below. The message appears repeatedly on every CSV on this NAS until things settle down.

EVENT ID 5120

Cluster Shared Volume has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

This seems to shut down all affected VMs on that CSV, and they don't come back up for quite a long time.
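For reference, this is roughly how I've been pulling the 5120 events from every node to see how often it happens (a sketch; it assumes the FailoverClusters PowerShell module is available where it runs):

    # Sketch: pull Event ID 5120 (CSV entered a paused state) from each cluster
    # node for the last 7 days. Assumes the FailoverClusters module is installed.
    Import-Module FailoverClusters
    foreach ($node in (Get-ClusterNode).Name) {
        Get-WinEvent -ComputerName $node -FilterHashtable @{
            LogName   = 'System'
            Id        = 5120
            StartTime = (Get-Date).AddDays(-7)
        } -ErrorAction SilentlyContinue |
            Select-Object MachineName, TimeCreated, Message
    }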

A rundown of our current setup:
- 8-Node Windows Server 2016 Failover Cluster
- QNAP NAS for VM storage
- 192.168.101.X is our Stacked Switch Production Network. On each Host we have a VMTeam set up, and we use it for our Hyper-V Switch (vSwitch).
- 192.168.103.X is our first dedicated iSCSI network. This 10GbE switch has a connection to one of the 10GbE ports on each of our Hosts and a fiber link to our NAS. 1 port goes to our 101 network for management.
- 192.168.104.X is our second dedicated iSCSI network. This 10GbE switch has a connection to the other 10GbE port on each of our Hosts and a fiber link to our NAS. 1 port goes to our 101 network for management.
- 192.168.105.X is our Live Migration Network, on a dedicated 1GbE switch that connects to 2 ports on each of our Hosts to create a LIVE-MIGRATION_TEAM.

- Inside of Failover Cluster Manager we have 4 networks (a quick PowerShell check is sketched after the Live Migration settings below):
1. PROD is our 101 network; it's set for Cluster Communication and can be used by Clients.
2. Both of our iSCSI networks are set to no Cluster Communication.
3. LIVE-MIGRATION is set for Cluster Communication traffic only.

For Live Migration Settings:
1. Our Live Migration is at the very top of the priority list and it's the only network checked.
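In case it's useful, this is roughly how the networks and roles look from PowerShell (a sketch; the network names are whatever we set them to in Failover Cluster Manager):

    # Sketch: verify the cluster network roles from any node.
    # Role 3 = Cluster and Client (PROD), 1 = Cluster only (LIVE-MIGRATION),
    # 0 = None (the two iSCSI networks).
    Get-ClusterNetwork | Select-Object Name, Role, Address

    # Which physical adapter backs each cluster network on each node:
    Get-ClusterNetworkInterface | Select-Object Node, Network, Adapter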

We were working on our Cluster last night and noticed it went down while we were working on separate, non-Cluster tasks, when the above message started happening again.

This has happened about 3 times in the last month and we are trying to find out why.

We found out that our previous admin had installed antivirus software on each of our Hosts; we have since uninstalled it. We have also updated our Hosts with the latest Windows Updates and upgraded our QNAP NAS to the latest firmware.

We would like to find the cause of this, but we're not sure what it is at this point. Our configuration within the NAS, switches, and Failover Cluster appears to be fine, yet we're unable to pinpoint the culprit or why our Cluster appears to be so sensitive. Looking back at logs from last night, it appears to have started around the time we were gathering information from the Cluster via a script. The other times it happened, we didn't run a script, so the trigger must have been different. The link below is the template script we used; it has worked in other environments without impacting the Cluster.

https://gallery.technet.microsoft.com/scriptcenter/Hyper-V-Reporting-Script-4adaf5d0

Any suggestions? If I need to provide any additional information, please let me know.
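If it helps, I can generate the cluster log from around the failure window; this is roughly what I'd run (a sketch; the destination folder is just an example):

    # Sketch: dump the cluster log from every node for the last 2 hours
    # into one folder for review. Times inside the log are UTC by default.
    Get-ClusterLog -TimeSpan 120 -Destination 'C:\Temp\ClusterLogs'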
SOLUTION by kevinhsieh

SOLUTION
Joe Lowe (ASKER):
I hadn't heard of a QNAP backing an 8-node cluster prior to this either; however, it does appear to work. We have a TVS-EC1680U-SAS-RP model. Our QNAP NAS appears to be okay as far as CPU and RAM are concerned: CPU sits around 9% and RAM around 18%.

I have noticed quite high latency on the Disk Pools in the past, and yes, iSCSI/MPIO both appear to be set up according to QNAP's recommendations. Both of our iSCSI networks are talking to our iSCSI Targets.

Our iSCSI switches are 10GbE. However, regarding Jumbo Frames: we don't have them enabled. Our switches are Netgear XS716E switches. I see that the QNAP has an MTU of 9000 set on its 2 fiber 10GbE connections, but I don't see anything reflecting this on our switches. Our Hosts also don't have Jumbo Packets enabled on any of the ports. Looking at the iSCSI ports on our Hosts, those ports have all the default connection items selected, excluding IPv6. Should all checkboxes except IPv4 be unchecked on our iSCSI ports?
      For example, Client for Microsoft Networks, File and Printer Sharing for Microsoft Networks, QoS Packet Scheduler, Microsoft LLDP Protocol Driver, Link-Layer Topology Discovery Responder and Link-Layer Topology Discovery Mapper I/O Driver are all checked on our iSCSI adapters.
Also, yes, we do have QNAP Support. I sent them a message last night inside a previous ticket I had with them; I should hear back sometime Monday.
What kind of disks do you have for the data? What RAID configuration? How many VMs are you running on those 8 hosts?

How many CSVs are you running? How large are they?

How many IOPS are you consuming? How many can the array support?

So, you have a ticket in from Friday night, and you expect to hear back on Monday. This is why people in the know don't run large production clusters on QNAP.

Netgear switches also generally aren't considered production storage switches.

When I call my storage vendor, I am speaking with a L3 engineer in less than 60 seconds.
I inherited this network about 3 weeks ago so I was not present for the decision-making process of the equipment that is in place. However, I am here to resolve the inconsistencies that have begun to arise.

Although I agree that the equipment may not be enterprise-grade, it has been stated to me that it has run smoothly before.

To answer your questions now.

The previous admin configured all VMs into 3 CSVs; yes, I know, also not best practice.
Basically all production CSVs are within 1 big RAID 6 Storage Pool. The main CSV is 29.3TB and stores about 90% of our 79 VMs.

Right now the IOPS seem okay, but that's because no one is really using our systems at the moment; I'd need to check again during production hours. When I looked a couple of weeks ago, I saw latency upwards of 1,770 ms, which I thought was really high.

To circle back to the original post, what does the issue seem to be? Does it sound more networking-related or NAS-related?

I don't recall needing to set up Jumbo Frames on other networks I've worked on.
To re-ask my earlier question: looking at the iSCSI ports on our Hosts, those ports have all the default connection items selected, excluding IPv6. Should all checkboxes except IPv4 be unchecked on our iSCSI ports?
      For example, Client for Microsoft Networks, File and Printer Sharing for Microsoft Networks, QoS Packet Scheduler, Microsoft LLDP Protocol Driver, Link-Layer Topology Discovery Responder and Link-Layer Topology Discovery Mapper I/O Driver are all checked on our iSCSI adapters.
Also, what does the error point towards in my case? I looked it up but found varying reasons:

EVENT ID 5120

Cluster Shared Volume has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Switching-wise, the switch model that is in place appears to be fine. Still, I feel like there might be a configuration change or modification that I'm missing.
How many disks are in the RAID 6, and what kind?

It is quite possible that you don't have enough disks to support the workload. The super high latency would be an indication of that.

RAID 6 is the slowest RAID level for performance.

For 7.2K drives in RAID 5, I would hope that you have maybe on the order of 48 drives for suitable performance.

If you had 15K drives, then you would need fewer drives.

Is there any SSD as cache or storage tier? That would help. SSD in RAID 6 wouldn't be a problem.
We have all SAS drives. This particular RAID 6 pool has 10 disks in it. The disks are ST12000NM0027, which are 7.2K RPM (I just found that out, and I'm a little shocked by it).

This NAS also has 2 SSD cache drives. Although they are reported as in Good standing, they show an Estimated Life Remaining of 0%. That statistic is new in the firmware applied to the NAS last night. I have asked QNAP about this as well.

We also have a Storage Pool 2 setup with RAID 5 that contains 4 drives. This pool consists of the Cluster Quorum and another CSV that also went offline. So all of the CSVs appeared to go offline.
5120 is generally a network error. Cluster is complaining about not being able to talk to storage.
I see. I guess I'm not sure what networking is going wrong specifically.

The switches are not producing any CRC Error Packets from what I can see.
Let me ask this: looking through the switches, I see that the additional networks configured (192.168.103.X, 192.168.104.X, and 192.168.105.X, our iSCSI and Live Migration networks) are all IP networks, but they aren't assigned to a VLAN anywhere; the IPs are just configured on the NIC adapters within each device. Could that be a possible cause of this?
Also, since Jumbo Packets are not configured, should the 2 adapters on the QNAP NAS be set to an MTU of 1500 instead of 9000?
If you are not set for Jumbo Frames on your hosts and switches, then the MTU on your QNAP should also be set to 1500, if that is a supported option.
Okay, I can do that. What's the disadvantage of having it the way it currently is, where the MTU is different on the NAS versus the Hosts and switches? Just so I know.

Also, is the additional IP Network setup an issue I mentioned in the last comment? Or is that okay as is?
As long as you can ping each NAS interface from each host, then the IP configuration should be okay.

Having additional protocols bound to the host interfaces doing iSCSI isn't great, but I don't think it would cause you these problems.
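If you do want to trim those bindings later, something like this should do it (a sketch; the component IDs are the standard in-box ones, but the adapter names below are placeholders for whatever your iSCSI NICs are called):

    # Sketch: leave only IPv4 bound on the two iSCSI adapters.
    # Adapter names are placeholders - substitute your own.
    $iscsiNics = 'iSCSI-103', 'iSCSI-104'
    $bindings = @(
        'ms_msclient'   # Client for Microsoft Networks
        'ms_server'     # File and Printer Sharing for Microsoft Networks
        'ms_pacer'      # QoS Packet Scheduler
        'ms_lldp'       # Microsoft LLDP Protocol Driver
        'ms_lltdio'     # Link-Layer Topology Discovery Mapper I/O Driver
        'ms_rspndr'     # Link-Layer Topology Discovery Responder
        'ms_tcpip6'     # IPv6 (already unchecked here)
    )
    foreach ($nic in $iscsiNics) {
        foreach ($component in $bindings) {
            Disable-NetAdapterBinding -Name $nic -ComponentID $component
        }
    }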

If the QNAP thinks it can use jumbo frames, then traffic can get fragmented or, worse, just dropped, causing TCP retransmits. Bad.
Although they said they are in Good standing, they have an Estimated Life Remaining of 0%. That statistic is new in the new firmware applied to the NAS last night. I have also asked QNAP about this.
I'm going to wager that those SSDs hitting 0% remaining, meaning they've exhausted the spare NAND that replaces worn-out NAND, are the cause of the performance hit.

Cache is key to high performance.

A 7200 RPM NearLine SAS drive, as is the case here, has a sustained write rate of about 220MB/Second. IOPS wise, maybe 300 IOPS per disk depending on how the storage stack is set up.

The SSDs, on the other hand, will do around 10K IOPS and 500MB/second for SATA, and 25K IOPS or more and 1GB/second or more per SAS SSD.
Kevinhsieh - I will see about changing the MTU on our QNAP to match what our switches currently have.


Philip - Our Cache drives are SSD XM21E 128GB drives.

Do you think this could also be the cause of our issues since they are showing as 0% Estimated Life Remaining?
Also, it appears that when these CSVs go into that paused state, we are still able to ping our QNAP.

Although we can ping, could we be running into a scenario where the NAS isn't responding fast enough and that puts the CSVs into a paused state? Or could some packets be getting dropped on the switching end? It looks like our 1GbE switch is registering some packet errors coming from the 2 iSCSI switches; however, within those switches none of the ports show any packet errors.
Hey Phil, are you sure about those numbers?
A 7200 RPM NearLine SAS drive, as is the case here, has a sustained write rate of about 220MB/Second. IOPS wise, maybe 300 IOPS per disk depending on how the storage stack is set up.
I would have gone about half that for a 7200 RPM drive.
@Gerald Yeah, we've tested both SATA and SAS NearLine drives with 4Kn formatting bringing in about 250MB/Second to 280MB/Second depending on the storage stack format. NearLine drives suck at latency and are dismal for IOPS, so we only deploy them in 60-bay and 102-bay JBODs as a rule.

@Joe: https://www.adata.com/us/feature/403

Ugh, no Power Loss Protection. Plus, they are consumer mSATA drives to boot. :(

From their site:
SSD cache acceleration
The TVS-EC1680U-SAS-RP features 2 pre-installed 128GB mSATA modules, and there are no limits on using SSDs for caching.
That is a very misleading statement as you are now finding out.

When we build out clusters or standalone virtualization hosts we make sure that the rated endurance for the solid-state drives being installed in the system(s) will live through the life of the solution.

Endurance is expressed in Drive Writes Per Day (DWPD). We aim for a 3 to 5 DWPD rating, with 10 DWPD being the go-to for ultra-high-performance and IOPS-heavy setups.

1 DWPD = the total capacity of the drive written to it in one day. So, a 1TB SSD can have 1TB written to it every day for five years, which is the usual warranty life for enterprise-endurance drives.
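As a rough worked example of what an endurance rating translates to (the numbers below are illustrative only, not a specific drive):

    # Sketch: total terabytes written implied by a DWPD rating over a
    # 5-year warranty. Illustrative values, not a particular model.
    $capacityTB = 1      # drive capacity in TB
    $dwpd       = 3      # rated drive writes per day
    $years      = 5      # warranty period
    $capacityTB * $dwpd * 365 * $years   # = 5475 TB written over the warranty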

Eight nodes writing to consumer grade SSDs? The SSD drives are now duds. Time to replace them.
Wow, that is unfortunate.

Thank you very much for pointing this out, Philip. Since we're moving the VMs off of this NAS and onto a reconfigured and redesigned replica of it, maybe we should find a more suitable SSD cache for it.

For SSD sizing, is there a guideline for sizing SSD cache based on total storage? Ours is about 80TB across 3 Storage Pools with separate CSVs beneath them.

I'm assuming this is still unrelated to the bigger issue of the CSVs disconnecting, right? I saw your previous comment that it's definitely network related.
How much active data do you have?

A normal rule of thumb is at least 5% SSD by capacity. If your SSDs were used as read cache, they're way too small to be effective. If they were used as write buffer, they were too small to have any longevity. I would say you need a few TB of flash. I wouldn't go any smaller than 3.84 TB SSDs.
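As a quick back-of-the-envelope on the numbers you gave (80 TB total; the 5% is just the rule of thumb):

    # Sketch: 5%-of-capacity rule of thumb applied to this environment.
    $poolTB = 80             # total capacity across the 3 storage pools
    $poolTB * 0.05           # = 4 TB of flash, hence nothing smaller than 3.84 TB SSDs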
@Joe That was before the zero endurance on the cache SSDs was known.

With them effectively offline, the disconnect is the cluster waiting for the QNAP to respond. With no working SSD cache, the drive queue depth and latency numbers must be extreme.

EDIT: On the hosts, open ResMon.exe and have a look at the Disk tab with the drop-downs expanded and sorted by latency. Queue depth will be on the right.
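If ResMon doesn't show what you need, the same numbers are available as performance counters (a sketch; the PhysicalDisk counter set is always there, and there is also a Cluster CSVFS set on CSV-enabled nodes, though I'm going from memory on that name):

    # Sketch: sample disk latency and queue depth on a host for one minute.
    Get-Counter -Counter @(
        '\PhysicalDisk(*)\Avg. Disk sec/Read'
        '\PhysicalDisk(*)\Avg. Disk sec/Write'
        '\PhysicalDisk(*)\Current Disk Queue Length'
    ) -SampleInterval 5 -MaxSamples 12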
I see. When I open Resource Monitor on my Hosts, it doesn't show my CSVs listed there, only the C: drive and the Quorum on one Host.

I can see the latency within the QNAP but not within my Hosts.

Side note - I tested communication with 1400-byte vs 8000-byte packets from each of our Hosts to one another and to the NAS, and found that hosts are not responding to jumbo packets from some other hosts or from the NAS. I believe this is tied to the 10GbE Netgear switch that's in place, which doesn't have MTU settings in it. I think this could be why we are getting the CSV disconnects at times. I think if the adapters on the Hosts and the QNAP were configured back to the default 1500, that should solve the issue with our Cluster going down.

What do you guys think?
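For reference, the test I ran was roughly this (the IP below is a placeholder for a NAS iSCSI interface; -f sets Don't Fragment and -l sets the payload size):

    # Sketch of the jumbo-frame test, run from each host against the other
    # hosts and the NAS. 192.168.103.50 is a placeholder address.
    ping -f -l 1400 192.168.103.50    # normal-size payload: replies fine
    ping -f -l 8000 192.168.103.50    # jumbo-size payload: times out

    # What the adapters think their jumbo packet setting is:
    Get-NetAdapterAdvancedProperty -RegistryKeyword '*JumboPacket' |
        Select-Object Name, DisplayName, DisplayValue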
Quick update on this: we have since disabled Jumbo Frames on all of our Hosts so that it's consistent across them. Our NAS still has it enabled; we'll leave it as is until we migrate to the newly configured NAS, which has it disabled.

Thank you guys for the SSD Cache suggestions. That has totally helped. We have ordered new SSD Cache drives for the newly configured NAS so that issue is corrected.

Another question for you guys:
1. I saw the "Cluster Shared Volume has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished." message again today but apparently didn't impact our environment. It seemed to have happened randomly and only affected 1 of the 3 CSVs on that NAS. Any reason for this?

2. I know there's a best practice of setting the MPIO Policy on a Disk to be "Least Queue Depth" opposed to the default "Round Robin with Subset". It appears this has to be changed on each host individually. Is that correct?
1: Check your switch monitoring to see if there are dropped packets.
2: Least Queue Depth. That's the proper setting. I suggest sticking with it.

I checked one of our 10GbE switches but didn't see any CRC errors. I'll check again, and the other switch, tomorrow.

I saw that Least Queue Depth is the best practice. Right now the disks are set to Round Robin with Subset, but when making the adjustment it appears the policy needs to be set individually on each host. Is that correct?
Use the mpclaim command to set LQD. I think all nodes need the setting done, but it's been a while since I've worked with a disaggregated cluster.
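From memory, something along these lines on each node; verify the policy numbers against the mpclaim documentation before running it:

    # Sketch: set the MPIO load-balance policy to Least Queue Depth.
    # Policy number 4 = LQD in mpclaim's numbering, if I recall correctly.
    mpclaim.exe -s -d        # list MPIO disks and their current policy
    mpclaim.exe -L -M 4      # set the MSDSM-wide default policy to LQD

    # Newer PowerShell equivalent for the default policy, if the MPIO module is present:
    Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy LQD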
Got it, okay thanks!

I'll send an update tomorrow on the switching errors.
So I logged on to check the switches, and the 10GbE switches are fine. However, when I ping from a host to our Host 6, there is periodic packet loss on the vEthernet interface that our Cluster Communication and VMs use. That's strange; the switch for that interface doesn't show any errors.

Looking at the vEthernet adapter's settings, the only main difference I see between it and the others is that it has a "Reliable Multicast Protocol" item with a checkbox next to it. I'm not sure I've heard of that, and when I checked a few other Hosts they don't have this item even though they're the same model. What is this feature, and should I have it off? Could it be causing the issue?
You might want to check a few things (a couple of quick checks are sketched after this list):

  1. Are you using Broadcom NICs? If so, that's a well-known Broadcom issue. They have released fixes for it, but the problem persists even after the fixes; I have personally avoided Broadcom for this reason.
  2. Did you configure or disable VMQ? If you have not configured VMQ, you might want to disable it.
  3. Please make sure the iSCSI paths are not on the same subnet and are not inter-routable.
  4. Jumbo Frames are important, but they are not likely to cause disconnections.
  5. Did you install and enable MPIO? Did you configure multiple connection sessions to the iSCSI target?
  6. Is there some kind of I/O-intensive task running in the background when the disconnections occur (e.g., backup, Live Migration, checkpoint creation)?
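A couple of the checks above from PowerShell, as a sketch (the cmdlets are the in-box NetAdapter/MPIO/iSCSI ones; the adapter names are placeholders):

    # 2. VMQ state on the team members feeding the Hyper-V vSwitch
    #    (names are placeholders for your VMTeam members):
    Get-NetAdapterVmq | Select-Object Name, Enabled, BaseProcessorNumber, MaxProcessors
    Disable-NetAdapterVmq -Name 'VMTeam-NIC1', 'VMTeam-NIC2'

    # 5. MPIO installed and claiming iSCSI, plus the sessions to the target:
    Get-WindowsFeature -Name Multipath-IO
    Get-MSDSMAutomaticClaimSettings
    Get-IscsiSession | Select-Object TargetNodeAddress, IsConnected, NumberOfConnections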
ASKER CERTIFIED SOLUTION
Thank you. :)