Hello guys, I have a problem that I've been pulling my hair out over for quite some time.
What we have is a Windows Failover Cluster set up across a number of blade servers. Each server runs Windows Server 2012 R2 with the Hyper-V and Failover Clustering roles installed. Each server has 4 physical network interfaces: two HP NC373i Integrated and two HP NC373m Mezzanine. Only 2 of the 4 physical NICs are connected to our switch at this time; they are teamed through Windows in switch-independent mode. On top of this team we have eight virtual NICs connected to a Hyper-V virtual switch:
Access/Management, Cluster network, Migration network, Replica network, and four SMB data transfer networks (for accessing VHDs on a storage server)
Each virtual NIC is on a separate VLAN and all of them have statically assigned IP addresses. Occasionally one of the vNICs stops working and we lose Live Migration on that blade, or the blade loses communication with the cluster, depending on which virtual interfaces have failed.
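For reference, the host networking is built up roughly like this (the adapter names, team/switch names, VLAN IDs, and IP addresses below are just placeholders, not our actual values):

```powershell
# Rough sketch of the host networking config (names, VLAN IDs and IPs are placeholders)

# Switch-independent team over the two connected physical NICs
New-NetLbfoTeam -Name "ConvergedTeam" -TeamMembers "NIC1","NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic

# Hyper-V virtual switch on top of the team
New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "ConvergedTeam" -AllowManagementOS $false

# Host (ManagementOS) vNICs, one per network
"Management","Cluster","Migration","Replica","SMB1","SMB2","SMB3","SMB4" | ForEach-Object {
    Add-VMNetworkAdapter -ManagementOS -Name $_ -SwitchName "ConvergedSwitch"
}

# Each vNIC is tagged with its own VLAN and given a static IP (Cluster network shown as an example)
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Cluster" -Access -VlanId 20
New-NetIPAddress -InterfaceAlias "vEthernet (Cluster)" -IPAddress 10.0.20.11 -PrefixLength 24
```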
I've tried updating the drivers on both sets of physical NICs, reflashing the firmware, and turning features such as VMQ and RSC on/off, but nothing has solved this. The interesting thing is that toggling VMQ on or off sometimes causes the affected NICs to start responding again, but only for a limited time. I should mention that nowhere in Network Connections are these NICs shown as malfunctioning or disconnected; Event Viewer does, however, log the problem as a clustering failure.
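For what it's worth, this is roughly how I've been checking and toggling those features (the adapter names are placeholders for the physical team members):

```powershell
# Check the current VMQ and RSC state on the physical team members
Get-NetAdapterVmq -Name "NIC1","NIC2"
Get-NetAdapterRsc -Name "NIC1","NIC2"

# Toggling VMQ off and back on - this sometimes brings the dead vNICs back, but only temporarily
Disable-NetAdapterVmq -Name "NIC1","NIC2"
Enable-NetAdapterVmq -Name "NIC1","NIC2"

# RSC can be toggled the same way
Disable-NetAdapterRsc -Name "NIC1","NIC2"
Enable-NetAdapterRsc -Name "NIC1","NIC2"
```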
Edit: when I say a NIC is not responding, I mean the other hosts cannot ping it even though they should be able to. Yes, the firewall is off.
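Concretely, the check that fails from another node looks like this (the IPs below are placeholders for our static vNIC addresses); the affected vNIC's address just stops replying to pings:

```powershell
# Run from another cluster node: ping each static vNIC address of the affected blade
$vNicAddresses = "10.0.10.11", "10.0.20.11", "10.0.30.11", "10.0.40.11"
foreach ($ip in $vNicAddresses) {
    "{0} reachable: {1}" -f $ip, (Test-Connection -ComputerName $ip -Count 2 -Quiet)
}
```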