Solved

Cluster network behavior while switch reboots, network team vs. failover / failback

Posted on 2013-01-30
Last Modified: 2013-11-21
Hi all,
I'd like to ask for your help in finding a solution to the following problem.

Situation: Let's say we're running one W2k8-based cluster with two nodes (active/passive). The network connections are configured as follows:
1. public LAN (team of two NICs) - allowed for client and cluster communication
2. heartbeat LAN (team of two NICs) - for cluster use only

Each NIC of a team is connected to a different switch, using Static Link Aggregation on the team side and Virtual Port Channel on the switch side. The NICs are Intel and the switches are Cisco; the particular model is not relevant, I guess.

Everything works well until one of the switches reboots. Once the switch goes down, both the public and heartbeat connections fail over correctly and the cluster is not affected; only network redundancy is degraded. So it works as expected at this point.

The problem occurs once the switch comes back. Immediately after Windows (the driver) detects that the team member is up again, the cluster starts to report a failure on the network resource, and due to the dependency configuration the rest of the resources fail soon afterwards.

We've found that a booting switch doesn't bring all network layers up at the same time; the timing is somewhat random and depends on the switch's CPU utilization. In our case the physical layer was up 30 seconds before all the others. The status of the network resource on the cluster side was restored automatically once all network layers had started, but this 30-second delay was enough to bring all dependent resources down.

Our explanation of why the network resource failed after the link came back online is as follows:
As Static Link Aggregation checks only the physical layer for link availability, it tried to forward traffic over a link that wasn't yet ready for IP communication, which caused packet loss. From the cluster's perspective, this looked like a network link failure.
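
The explanation above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not a measurement or a model of any real driver: the function name, the number of flows, and the simple modulo hashing are all hypothetical. It counts how many "flow-seconds" are sent to a returning team member whose link is physically up but whose switch port is not forwarding yet.

```python
def lost_flow_seconds(link_up_at, forwarding_at, flows=8, members=2, duration=60):
    """Count flow-seconds black-holed by a static team after a member returns.

    Toy assumptions (not taken from any real teaming driver):
    - member 0 stays healthy for the whole run, member 1 is the returning one;
    - flow i is hashed to member i % members while that member's link is up;
    - if the returning member's link is still down, its flows use member 0.
    """
    lost = 0
    for second in range(duration):
        returning_link_up = second >= link_up_at
        returning_forwards = second >= forwarding_at
        for flow in range(flows):
            target = flow % members
            # A static team trusts link state only, so it sends on member 1
            # as soon as the physical layer is up.
            if target == 1 and returning_link_up and not returning_forwards:
                lost += 1
    return lost


# Physical layer up at t=0, switch really forwarding at t=30 (our case).
print(lost_flow_seconds(link_up_at=0, forwarding_at=30))  # prints 120
```

With 8 flows split over 2 members, half of the flows are lost for the whole 30-second window, which is easily long enough for the cluster to declare the network resource failed.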

And now the question is. As per our investigation and practical experience the SLA works with failover/failback capability only from perspective of the physical layer but can't be used as a high availability solution.
We needed to perform the switch maintenance followed by its reboot and we trusted the second team member connected to the second switch is enough to keep cluster network available and usable but unfortunately we were facing network resource failure, paradoxically right after not during reboot of one of the switches. For us it is the big precedence and currently we don't know how to perform seamless maintenance in our network environment.

Is there any better way how to set the network teaming feature other than SLA to keep network team usable in case of switch reboot?

Thank you very much
Question by:T-cko
6 Comments
 
LVL 78

Expert Comment

by:David Johnson, CD, MVP
I'm guessing active switches here?
 

Author Comment

by:T-cko
What do you mean by "active switches"?
The switches are Cisco Nexus 5596. The vPC function (Virtual Port Channel) allows each member NIC of a team to be connected to a different switch.
 
LVL 20

Expert Comment

by:rauenpc
Not a great solution, but one method to get around this would be to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

Configuring the server and switch for port-channel "active" mode (LACP) may be able to resolve the issue as well. The switch might not negotiate LACP until it is fully ready to pass traffic, which would avoid the problem.
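
The difference between the two modes can be written down as a small Python sketch. It is a simplified model, not the LACP state machine: the names `Mode` and `member_carries_traffic` are made up for illustration, and `partner_in_sync` stands for "the switch has completed LACP negotiation on this port". Whether the Nexus really withholds that until it can forward traffic is an assumption that would have to be tested.

```python
from enum import Enum


class Mode(Enum):
    STATIC = "static"  # port-channel "on": no negotiation at all
    LACP = "lacp"      # port-channel "active": negotiated with the partner


def member_carries_traffic(mode, link_up, partner_in_sync):
    """Return True if the team would send traffic over this member.

    Simplified model: a static team looks at link state only, while an
    LACP team also requires that the partner has negotiated the link.
    """
    if not link_up:
        return False
    if mode is Mode.STATIC:
        return True
    return partner_in_sync


# The critical window: physical link is up, switch is not ready yet.
for mode in Mode:
    print(mode.value, member_carries_traffic(mode, link_up=True, partner_in_sync=False))
```

In this model the static team already uses the member during the critical window, while the LACP team keeps traffic on the other member until negotiation completes.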

Another solution depends on the nic teaming drivers and available settings. If you can find one that specifies that neither link is preferred, and to only fail based on a link down status, the problem would be resolved. In this situation, let's say you have nic1 and nic2 with nic1 being the active member of the failover team (aka fault tolerance mode, NOT load balanced mode). When nic1 detects link down, nic2 takes over. When nic1 detects the link is back up, nothing happens except that nic1 is added to the available nics in the team. Since traffic is not actively moved back to nic1, the 30 seconds worth of link up without traffic passing won't be an issue.

In the Broadcom world, this is called SLB with auto-fallback disabled. This only works in failover/fault tolerance designs and not load balancing.
http://www.broadcom.com/docs/support/ethernet_nic/Broadcom_NetXtremeII_Server_T7.4.pdf    (page 20)
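
The failover-without-failback behaviour described above can be sketched in Python as follows. This is a hypothetical model for illustration only; the class `FailoverTeam` and its `auto_failback` flag are invented names and do not correspond to any Broadcom or Intel driver API.

```python
class FailoverTeam:
    """Toy model of an active/standby NIC team (not a real driver API)."""

    def __init__(self, primary, standby, auto_failback=False):
        self.primary = primary
        self.link_up = {primary: True, standby: True}
        self.active = primary
        self.auto_failback = auto_failback

    def link_event(self, nic, up):
        """Record a link up/down event and return the NIC now carrying traffic."""
        self.link_up[nic] = up
        if up:
            # Move traffic only if the current active NIC has no link, or if
            # fallback to the primary is explicitly enabled.
            if not self.link_up[self.active] or (self.auto_failback and nic == self.primary):
                self.active = nic
        elif nic == self.active:
            # Active NIC lost link: fail over to another member that has link.
            # If none has link, the active NIC stays unchanged.
            for other, is_up in self.link_up.items():
                if other != nic and is_up:
                    self.active = other
                    break
        return self.active


team = FailoverTeam("nic1", "nic2", auto_failback=False)
print(team.link_event("nic1", up=False))  # switch goes down: traffic moves to nic2
print(team.link_event("nic1", up=True))   # switch returns: traffic stays on nic2
```

With `auto_failback=True` the second event would move traffic back to nic1 immediately, which is exactly the moment when the switch may not be forwarding yet.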
 
LVL 20

Expert Comment

by:rauenpc
My coworker just came across this article which may be the solution! This is great info that I'll make sure to use for myself in the future.

https://supportforums.cisco.com/thread/2095436
 
LVL 20

Accepted Solution

by:
rauenpc earned 500 total points
Also, this document has additional options regarding vPC delays and auto-recovery:

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/layer2/503_n2_1/503_n2_1nw/Cisco_n5k_layer2_config_gd_rel_503_N2_1_chapter8.pdf

It always seems to refer to the secondary switch or vPC member, so I guess I would have to test this to see whether you can reboot either switch without interruption, or whether rebooting the primary would cause an issue.
 

Author Comment

by:T-cko
Hi 'rauenpc',

Thank you very much for your very helpful comments.

@... to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

That's what we're thinking about as well. It's a sensitive operation, but a sure shot.

@LACP

Hmm, according to Intel's documentation this would resolve the issue too, and it looks like a more feasible solution than disabling the ports. We'll need to test it.

@Active/passive fault tolerance mode

Safe, but I'm not sure it's useful for us because it offers only half the throughput capacity. If bandwidth is the priority, we would exclude this option from the possible solutions.

@rest of the Cisco comments

You're right. According to our analysis the issue occurred mainly after the reboot of the second switch, so this could be the clue. I've forwarded your hints to our network expert for further analysis.

So currently we're analyzing your hints and suggestions to find out what could be the optimal solution for our environment.

To be honest, I'm a little surprised this issue isn't discussed more intensively around the internet. Sources are quite poor considering how important this topic is in general. I thought the risk I've described is something every admin should care about.