T-cko asked:
Cluster network behavior during switch reboot: network teaming vs. failover / failback

Hi All,
I'd like to ask for your help in finding a solution to the following problem.

Situation: We're running a Windows Server 2008 (W2k8) based cluster with two nodes (active/passive). The network connections are configured as follows:
1. public LAN (team of two NICs) - allowed for client and cluster communication
2. heartbeat LAN (team of two NICs) - for cluster use only

Each NIC of a team is connected to a different switch, using Static Link Aggregation on the team side and Virtual Port Channel (vPC) on the switch side. The NICs are Intel and the switches are Cisco; the particular models are probably not relevant.
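For reference, the switch side of one team looks roughly like this on each of the two switches; the interface, port-channel and vPC numbers below are only placeholders, not our exact configuration:

    ! one member port of the public LAN team, repeated on each switch (vPC peer)
    interface port-channel 10
      description public LAN team
      vpc 10
    interface ethernet 1/10
      description NIC 1 of the public LAN team
      channel-group 10 mode on
    ! "mode on" = static aggregation, no LACP negotiation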

Everything works well until one of the switches reboots. Once the switch goes down, both the Public and Heartbeat connections correctly fail over and the cluster is not affected; only network redundancy is degraded. So far it behaves as expected.

The problem occurs once the switch comes back. Immediately after Windows (the teaming driver) detects that the team member is up again, the cluster starts to report a failure on the network resource, and due to the dependency configuration the rest of the resources soon fail as well.

We've found out that while the switch is booting it doesn't bring all network layers up at the same time; the timing is somewhat random due to CPU utilization on the switch. In our case the physical layer was up about 30 seconds before everything else. The status of the network resource on the cluster side was restored automatically once all network layers had started, but this 30-second gap was enough to take all dependent resources down.

Our explanation of why the network resource failed after the link came back online:
Because Static Link Aggregation checks only the physical layer to decide link availability, it started forwarding traffic to a link that was not yet ready for IP communication, which caused packet loss. From the cluster's perspective this looked like a network link failure.
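(We can also watch the bundling state from the switch itself with the standard port-channel commands; I won't paste the output here since it varies, but in static mode the member is reported as bundled as soon as the link is physically up, even while the rest of the switch is still converging:)

    show port-channel summary
    show interface port-channel 10
    show vpc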

And now the question. Based on our investigation and practical experience, Static Link Aggregation provides failover/failback only at the physical layer, so on its own it cannot be relied on as a high-availability solution.
We needed to perform switch maintenance followed by a reboot, and we trusted that the second team member, connected to the second switch, would be enough to keep the cluster network available and usable. Unfortunately we faced a network resource failure, paradoxically right after, not during, the reboot of one of the switches. This sets a worrying precedent for us, and at the moment we do not know how to perform seamless maintenance in our network environment.

Is there a better way to configure network teaming, other than Static Link Aggregation, that keeps the team usable while a switch reboots?

Thank you very much
David Johnson, CD:
I'm guessing active switches here?
T-cko (Asker):
What do you mean by 'active switches'?
The switches are Cisco Nexus 5596. The vPC (Virtual Port Channel) function allows each member NIC of a team to be connected to a different switch.
rauenpc:

Not a great solution, but one way to get around this would be to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform the maintenance/reboot, and manually bring the links back up once the switch is fully up and running.
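Roughly, the workflow on the switch being serviced would look like this (the interface range is only an example; substitute your actual server-facing ports):

    ! before maintenance: take the server-facing vPC member ports down
    configure terminal
      interface ethernet 1/1-32
        shutdown
    ! leave the vPC peer-link, the uplinks and mgmt0 untouched
    copy running-config startup-config
    reload
    ! ...after the reboot, once the switch and vPC have fully converged:
    configure terminal
      interface ethernet 1/1-32
        no shutdown
    ! then save again so the ports stay up after the next reload
    copy running-config startup-config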

Configuring the server and switch for port-channel "active" mode (LACP) may resolve the issue as well. The switch might not negotiate LACP until it is fully ready to pass traffic, which would keep the not-yet-ready link out of the bundle.
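The switch-side change per member port would be something along these lines; the channel-group has to be removed and re-added to change its mode, and the numbers are only examples. On the server side the Intel team type would likewise need to change from Static Link Aggregation to IEEE 802.3ad Dynamic Link Aggregation:

    interface ethernet 1/10
      no channel-group 10
      channel-group 10 mode active
    ! "mode active" = LACP; the member only joins the bundle after LACP negotiates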

Another solution depends on the NIC teaming driver and its available settings. If you can find a mode where neither link is preferred and failover happens only on a link-down event, the problem would be resolved. In this situation, say you have NIC1 and NIC2, with NIC1 the active member of the failover team (i.e. fault tolerance mode, NOT a load-balanced mode). When NIC1 detects link down, NIC2 takes over. When NIC1 detects the link is back up, nothing happens except that NIC1 is added back to the list of available NICs in the team. Since traffic is not actively moved back to NIC1, the 30 seconds of link-up without the switch actually passing traffic won't be an issue.

In the Broadcom world, this is called SLB with auto-fallback disabled. This only works in failover/fault tolerance designs and not load balancing.
http://www.broadcom.com/docs/support/ethernet_nic/Broadcom_NetXtremeII_Server_T7.4.pdf    (page 20)
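For completeness: if you went with a failover/fault-tolerance team like that, the switch side would no longer be an aggregate at all; each teamed NIC would simply sit on an ordinary edge port of its switch, roughly like this (port number hypothetical, access vs. trunk depends on your design):

    interface ethernet 1/10
      no channel-group 10
      switchport mode access
      spanning-tree port type edge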
My coworker just came across this article which may be the solution! This is great info that I'll make sure to use for myself in the future.

https://supportforums.cisco.com/thread/2095436
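On a related note, one Nexus-side setting that targets exactly this "link up before the switch is ready" window is the vPC delay restore timer. It holds the vPC member ports down for a configurable number of seconds after a rebooted peer rejoins, so the upstream layers can converge before traffic is hashed onto them (whether or not that's what the article above covers; the value below is just an example):

    vpc domain 10
      delay restore 120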
ASKER CERTIFIED SOLUTION by rauenpc

T-cko (Asker):
Hi 'rauenpc'

Thank you very much for your very helpful comments.

@... to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

That's what we were thinking about as well; a sensitive operation, but a sure shot.

@LACP

Hmm, according to Intel's documentation this should resolve the issue too, and it looks like a more feasible solution than disabling the ports. We'll need to test it.

@Active/passive fault tolerance mode: safe, but I'm not sure it's useful for us because it gives only half the throughput capacity. If bandwidth turns out to be the priority, we would exclude this option from the possible solutions.


@rest of the Cisco comments: you're right, according to our analysis the issue occurred mainly after the reboot of the second switch, so this could be the clue. I've forwarded your hints to our network expert for further analysis.

So currently we're analyzing your hints to find out what the optimal solution for our environment could be.

To be honest, I'm a little surprised this issue isn't discussed more widely on the internet. Sources are quite scarce compared with how important this topic is. I would have thought the risk I've described is something every admin should care about.