Cluster network behavior while switch reboots, network team vs. failover / failback
Posted on 2013-01-30
I'd like to ask for your help in finding a solution to the following problem.
Situation: Let's say we're running a W2k8-based cluster with two nodes (active/passive). Network connections configured:
1. public LAN (team of two NICs) - allowed for client and cluster communication
2. heartbeat LAN (team of two NICs) - for cluster use only
Each NIC of a team is connected to a different switch, using Static Link Aggregation on the team side and Virtual Port Channel on the switch side. The NICs are Intel and the switches are Cisco; the particular models are not relevant, I guess.
Everything works well until one of the switches reboots. Once the switch goes down, both the public and heartbeat connections fail over correctly and the cluster is not affected; only network redundancy is degraded. So it works as expected at this point.
The problem occurs once the switch comes back. Immediately after Windows (the driver) detects that the team member is up again, the cluster starts reporting a failure on the network resource, and because of the dependency configuration the rest of the resources fail soon afterwards.
We've found that while the switch is booting it doesn't bring all network layers up at the same time; the timing is somewhat random, depending on the switch's CPU utilization. In our case the physical layer was up 30 seconds before all the others. The status of the network resource on the cluster side was restored automatically once all network layers had started, but this 30-second delay was enough to bring all dependent resources down.
Our explanation of why the network resource failed after the link came back online is:
Because Static Link Aggregation judges link availability by the physical layer only, it tried to forward traffic over the link that wasn't yet ready for IP communication, which caused packet loss. From the cluster's perspective this looked like a network link failure.
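To make that explanation concrete, here is a minimal Python sketch of the timeline, not a model of the real Intel driver or the cluster service. The names (Member, pick_member, simulate), the one-second heartbeat interval, the threshold of 5 missed heartbeats, and the fixed flow hash are all assumptions chosen for illustration. Only the 30-second gap between "link up" and "ready to forward" comes from what we observed. The "readiness-aware" policy is hypothetical and is there only for comparison.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    link_up_at: int      # second at which the physical layer comes up
    forwarding_at: int   # second at which the switch port really forwards traffic

    def link_up(self, t: int) -> bool:
        return t >= self.link_up_at

    def forwarding(self, t: int) -> bool:
        return t >= self.forwarding_at


def pick_member(members, t, flow_hash, readiness_aware):
    """Choose the team member that carries the flow at time t."""
    if readiness_aware:
        # Hypothetical policy: use a member only once it can forward traffic.
        eligible = [m for m in members if m.forwarding(t)]
    else:
        # Static aggregation as we understand it: physical link state only.
        eligible = [m for m in members if m.link_up(t)]
    if not eligible:
        return None
    return eligible[flow_hash % len(eligible)]


def simulate(members, readiness_aware, duration=60, max_missed=5, flow_hash=1):
    """Send one heartbeat per second; return (worst loss streak, resource failed?)."""
    missed = 0
    worst = 0
    for t in range(duration):
        member = pick_member(members, t, flow_hash, readiness_aware)
        delivered = member is not None and member.forwarding(t)
        missed = 0 if delivered else missed + 1
        worst = max(worst, missed)
    return worst, worst >= max_missed


# Switch A stays up the whole time. Switch B's link returns at t=10,
# but it only starts forwarding at t=40 (the 30-second gap we saw).
team = [Member("nic-to-switch-A", 0, 0), Member("nic-to-switch-B", 10, 40)]

print(simulate(team, readiness_aware=False))  # (30, True): 30 heartbeats lost in a row
print(simulate(team, readiness_aware=True))   # (0, False): no loss
```

In this sketch the heartbeat flow happens to hash onto the returning member as soon as its link is up, so every heartbeat is lost until the switch starts forwarding. That matches the failure appearing right after the reboot rather than during it.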
Now to the question. According to our investigation and practical experience, SLA (Static Link Aggregation) provides failover/failback only from the perspective of the physical layer, so it can't be relied on as a high availability solution.
We needed to perform maintenance on a switch, followed by its reboot, and we trusted that the second team member, connected to the second switch, would be enough to keep the cluster network available and usable. Unfortunately we faced a network resource failure, paradoxically right after, not during, the reboot of one of the switches. For us this sets a serious precedent, and currently we don't know how to perform seamless maintenance in our network environment.
Is there a better way to configure network teaming, other than SLA, that keeps the network team usable during a switch reboot?
Thank you very much.