Solved

Cluster network behavior while switch reboots, network team vs. failover / failback

Posted on 2013-01-30
Last Modified: 2013-11-21
Hi All,
I'd like to ask for your help in finding a solution to the following problem.

Situation: Let's say we're running one W2k8-based cluster with two nodes (active/passive). Network connections configured:
1. public LAN (team of two NICs) - allowed for client and cluster communication
2. heartbeat LAN (team of two NICs) - for cluster use only

Each NIC of a team is connected to a different switch, using Static Link Aggregation on the team side and Virtual Port Channel on the switch side. The NICs are Intel and the switches are Cisco; the particular model is not relevant, I guess.

Everything works well until one of the switches reboots. Once the switch goes down, both the public and heartbeat connections fail over correctly and the cluster is not affected; only network redundancy is degraded. So it works as expected at this point.

The problem occurs once the switch comes back. Immediately after Windows (the driver) detects that the team member is up again, the cluster starts to report a failure on the network resource and, due to the dependency configuration, the rest of the resources fail soon after.

We've found that when the switch is booting it doesn't bring all network layers up at the same time; the order is somewhat random, depending on the switch's CPU utilization. In our case the physical layer was up 30 seconds before all the others. The status of the network resource on the cluster side was restored automatically after all network layers had started, but this 30-second delay was enough to bring all dependent resources down.

Our explanation of why the network resource failed after the link came back online is:
Because Static Link Aggregation checks only the physical layer for link availability, it tried to forward traffic over a link that wasn't ready for IP communication yet, which caused packet loss. From the cluster's perspective this looked like a network link failure.
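
To make the timing easier to reason about, here is a minimal Python sketch of the behavior described above. It is only a toy model, not anything taken from the Intel driver or the Cisco switches: the `Link` class, the `simulate_static_team` function, the one-packet-per-second traffic and the alternating member selection (a stand-in for the real hash-based distribution) are all assumptions made for illustration. The 30-second gap between physical link-up and forwarding is the value we observed.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    phys_up_at: int      # second at which the physical layer reports link up
    forwarding_at: int   # second at which the switch port really forwards traffic

    def phys_up(self, t: int) -> bool:
        return t >= self.phys_up_at

    def forwarding(self, t: int) -> bool:
        return t >= self.forwarding_at


def simulate_static_team(links, duration):
    """Send one packet per second; return the seconds in which it was lost.

    A static team (assumption of this model) uses every member whose
    physical layer is up, without checking that the far end forwards.
    """
    lost = []
    for t in range(duration):
        members = [link for link in links if link.phys_up(t)]
        if not members:
            lost.append(t)
            continue
        # Alternate between members as a stand-in for hash distribution.
        chosen = members[t % len(members)]
        if not chosen.forwarding(t):
            lost.append(t)
    return lost


# nic1 stays healthy; nic2's switch gets link at t=60 but forwards only at t=90.
links = [Link("nic1-switchA", 0, 0), Link("nic2-switchB", 60, 90)]
lost = simulate_static_team(links, 120)
print(f"{len(lost)} packets lost between t={lost[0]} and t={lost[-1]}")
```

In this model nothing is lost while the switch is down, and roughly half of the traffic is lost during the 30 seconds after the physical link returns, which matches what the cluster reported.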

And now the question. According to our investigation and practical experience, Static Link Aggregation (SLA) provides failover/failback only from the perspective of the physical layer, so it can't be relied on as a high-availability solution.
We needed to perform switch maintenance followed by a reboot, and we trusted that the second team member, connected to the second switch, would be enough to keep the cluster network available and usable. Unfortunately we faced a network resource failure, paradoxically right after, not during, the reboot of one of the switches. For us this is a serious precedent, and currently we don't know how to perform seamless maintenance in our network environment.

Is there a better way to configure network teaming, other than SLA, that keeps the team usable when a switch reboots?

Thank you very much
Question by:T-cko
6 Comments
 
LVL 81

Expert Comment

by:David Johnson, CD, MVP
ID: 38834719
I'm guessing active switches here?
 

Author Comment

by:T-cko
ID: 38834813
What do you mean by "active switches"?
The switches are Cisco Nexus 5596. The vPC function (Virtual Port Channel) allows each member NIC of a team to be connected to a different switch.
 
LVL 20

Expert Comment

by:rauenpc
ID: 38835503
Not a great solution, but one method to get around this would be to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

Configuring the server and switch for port-channel "active" mode (LACP) may be able to resolve the issue as well. The switch might not negotiate LACP until it is fully ready to pass traffic and therefore resolve the issue.

Another solution depends on the NIC teaming drivers and available settings. If you can find one that specifies that neither link is preferred, and to fail over only on a link-down status, the problem would be resolved. In this situation, let's say you have nic1 and nic2, with nic1 being the active member of the failover team (aka fault tolerance mode, NOT load balanced mode). When nic1 detects link down, nic2 takes over. When nic1 detects the link is back up, nothing happens except that nic1 is added to the available NICs in the team. Since traffic is not actively moved back to nic1, the 30 seconds of link up without traffic passing won't be an issue.

In the Broadcom world, this is called SLB with auto-fallback disabled. This only works in failover/fault tolerance designs and not load balancing.
http://www.broadcom.com/docs/support/ethernet_nic/Broadcom_NetXtremeII_Server_T7.4.pdf    (page 20)
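
As a rough illustration of why disabling auto-fallback sidesteps the 30-second window, here is a small Python sketch of an active/standby team. It is a toy model only: `active_member`, the event list and the timings are made up for the example and do not come from the Broadcom or Intel drivers.

```python
def active_member(events, auto_fallback):
    """Track which NIC carries traffic in a two-member failover team.

    events: list of (time, nic, link_is_up) tuples in time order.
    Returns a list of (time, active_nic) entries, one per change.
    nic1 is treated as the preferred member (assumption of this model).
    """
    link = {"nic1": True, "nic2": True}
    active = "nic1"
    history = [(0, active)]
    for t, nic, up in events:
        link[nic] = up
        if not link[active]:
            # The active member lost link: move to any member that has link.
            candidates = [n for n in ("nic1", "nic2") if link[n]]
            if candidates:
                active = candidates[0]
        elif auto_fallback and link["nic1"]:
            # With auto-fallback, traffic returns as soon as nic1 has link.
            active = "nic1"
        if history[-1][1] != active:
            history.append((t, active))
    return history


# The switch behind nic1 goes down at t=10 and reports link up again at t=60.
events = [(10, "nic1", False), (60, "nic1", True)]
print(active_member(events, auto_fallback=True))
print(active_member(events, auto_fallback=False))
```

With `auto_fallback=True` the model moves traffic back to nic1 at t=60, that is, straight into the period where the port has link but is not yet forwarding. With `auto_fallback=False` traffic stays on nic2 and nic1 simply becomes available as a standby again.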

 
LVL 20

Expert Comment

by:rauenpc
ID: 38836181
My coworker just came across this article which may be the solution! This is great info that I'll make sure to use for myself in the future.

https://supportforums.cisco.com/thread/2095436
 
LVL 20

Accepted Solution

by:
rauenpc earned 500 total points
ID: 38836348
Also:

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/layer2/503_n2_1/503_n2_1nw/Cisco_n5k_layer2_config_gd_rel_503_N2_1_chapter8.pdf

It has additional options regarding vPC delays and auto-recovery. It always seems to refer to the secondary switch or vPC member, so I guess I would have to test this to see if you can reboot either switch without interruption, or if rebooting the primary would cause an issue.
 

Author Comment

by:T-cko
ID: 38838596
Hi 'rauenpc'

Thank you very much for your very helpful comments.

@... to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

That's what we're thinking about as well: a sensitive operation, but a sure shot.

@LACP

Hmm, according to Intel's documentation this would resolve the issue too, and it looks like a more feasible solution than disabling the ports. We'll need to test it.

@Active/passive fault tolerance mode: safe, but I'm not sure it's useful for us because it has half the throughput capacity. If bandwidth is the priority, we would exclude this option from the possible solutions.


@rest of the Cisco comments: you're right. According to our analysis the issue occurred mainly after the reboot of the second switch, so this could be the clue. I've forwarded your hints to our network expert for further analysis.

So currently we're analyzing your hints and suggestions to find the optimal solution for our environment.

To be honest, I'm a little surprised this issue isn't discussed more widely on the internet. Sources are quite sparse compared with how important this topic is in general. I thought the risk I've described is something every admin should care about.