Solved

Cluster network behavior while switch reboots, network team vs. failover / failback

Posted on 2013-01-30
Last Modified: 2013-11-21
Hi All,
I'd like to ask for your help in finding a solution to the following problem.

Situation: Let's say we're running one W2k8-based cluster with two nodes (active/passive). Network connections configured:
1. public LAN (team of two NICs) - allowed for client and cluster communication
2. heartbeat LAN (team of two NICs) - for cluster use only

Each NIC of a team is connected to a different switch, using Static Link Aggregation on the team side and Virtual Port Channel on the switch side. The NICs are Intel and the switches are Cisco; the particular model is not relevant, I guess.

Everything works well until one of the switches reboots. Once the switch goes down, both the public and heartbeat connections fail over correctly and the cluster is not affected; only network redundancy is degraded. So it works as expected at this point.

The problem occurs once the switch comes back. Immediately after Windows (the driver) detects that the team member is up again, the cluster starts to report a failure on the network resource and, due to the dependency configuration, the rest of the resources fail soon after.

We've found that while the switch is booting it doesn't bring all network layers up at the same time; the timing is somewhat random, depending on the CPU utilization of the switch. In our case the physical layer was up 30 seconds before all the others. The status of the network resource on the cluster side was restored automatically once all network layers had started, but this 30-second delay was enough to bring all dependent resources down.

Our explanation of why the network resource failed after the link came back online:
Since Static Link Aggregation looks only at the physical layer to decide whether a link is available, it tried to forward traffic over a link that wasn't ready for IP communication yet, which caused packet loss. From the cluster's perspective this looked like a network link failure.
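The explanation above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the hash function, the flow names and the `forwarding` flag are invented for illustration and do not describe how the Intel driver actually distributes traffic. It shows that a team which trusts link state alone keeps sending a share of the flows to the member whose switch is not forwarding yet.

```python
import zlib

def pick_member(flow_id: str, members: list[str]) -> str:
    """Hash a flow onto one of the members the team considers usable."""
    return members[zlib.crc32(flow_id.encode()) % len(members)]

def lost_flows(flows, link_up, forwarding):
    """Return the flows sent to a member that has link but cannot forward yet."""
    # Static aggregation (as modelled here): a member is usable as soon as link is up.
    usable = [m for m in sorted(link_up) if link_up[m]]
    if not usable:
        return list(flows)  # no member has link, everything is lost
    return [f for f in flows if not forwarding[pick_member(f, usable)]]

# Hypothetical flows; the names are placeholders.
flows = [f"flow-{i}" for i in range(100)]

# The switch behind nic2 has just rebooted: link is up, upper layers are not ready.
link_up = {"nic1": True, "nic2": True}
forwarding = {"nic1": True, "nic2": False}

lost = lost_flows(flows, link_up, forwarding)
print(f"{len(lost)} of {len(flows)} flows are black-holed during the boot window")
```

While nic2 has no link at all, the same function reports no lost flows, which matches the observation that the failure appears after the reboot rather than during it.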

And now the question. As per our investigation and practical experience, SLA provides failover/failback only from the perspective of the physical layer, so on its own it can't be used as a high-availability solution.
We needed to perform switch maintenance followed by a reboot, and we trusted that the second team member, connected to the second switch, would be enough to keep the cluster network available and usable. Unfortunately we faced a network resource failure, paradoxically right after, not during, the reboot of one of the switches. For us this is a serious precedent, and currently we don't know how to perform seamless maintenance in our network environment.

Is there a better way to configure network teaming, other than SLA, that keeps the team usable when a switch reboots?

Thank you very much
Question by:T-cko
6 Comments
 
LVL 82

Expert Comment

by:David Johnson, CD, MVP
ID: 38834719
I'm guessing active switches here?
 

Author Comment

by:T-cko
ID: 38834813
What do you mean by "active switches"?
The switches are Cisco Nexus 5596. The vPC function (Virtual Port Channel) allows each member NIC of a team to be connected to a different switch.
 
LVL 20

Expert Comment

by:rauenpc
ID: 38835503
Not a great solution, but one method to get around this would be to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

Configuring the server and switch for port-channel "active" mode (LACP) may be able to resolve the issue as well. The switch might not negotiate LACP until it is fully ready to pass traffic, in which case the team would not use the link before then.
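To make the difference concrete, here is a minimal sketch of how member eligibility could differ between the two modes. It is a simplification, not the Intel or Cisco implementation: the `partner_in_sync` flag is a hypothetical stand-in for "the LACP partner has negotiated this port", and whether the switch really delays that negotiation until it can forward is exactly what would need testing.

```python
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    link_up: bool
    partner_in_sync: bool  # hypothetical: LACP negotiation with the switch has completed

def active_members(members, mode):
    """Return the names of the members the team would send traffic on."""
    if mode == "static":
        # Static aggregation: physical link is the only criterion.
        return [m.name for m in members if m.link_up]
    if mode == "lacp":
        # LACP: link must be up AND the partner must have negotiated the port.
        return [m.name for m in members if m.link_up and m.partner_in_sync]
    raise ValueError(f"unknown mode: {mode}")

# Boot window: nic2 has link again, but its switch has not negotiated yet.
team = [
    Member("nic1", link_up=True, partner_in_sync=True),
    Member("nic2", link_up=True, partner_in_sync=False),
]

print("static:", active_members(team, "static"))
print("lacp:  ", active_members(team, "lacp"))
```

In this model the static team immediately uses both members, while the LACP team keeps using nic1 alone until the switch side has negotiated nic2.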

Another solution depends on the nic teaming drivers and available settings. If you can find one that specifies that neither link is preferred, and to only fail based on a link down status, the problem would be resolved. In this situation, let's say you have nic1 and nic2 with nic1 being the active member of the failover team (aka fault tolerance mode, NOT load balanced mode). When nic1 detects link down, nic2 takes over. When nic1 detects the link is back up, nothing happens except that nic1 is added to the available nics in the team. Since traffic is not actively moved back to nic1, the 30 seconds worth of link up without traffic passing won't be an issue.

In the Broadcom world, this is called SLB with auto-fallback disabled. This only works in failover/fault tolerance designs and not load balancing.
http://www.broadcom.com/docs/support/ethernet_nic/Broadcom_NetXtremeII_Server_T7.4.pdf    (page 20)
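The failover-without-failback behaviour described above can be sketched as a small state machine. This is an illustrative model only, assuming a two-member fault tolerance team; the class and parameter names are made up and do not correspond to any Broadcom or Intel driver setting.

```python
class FailoverTeam:
    """Toy active/standby team; the first member is the preferred one."""

    def __init__(self, members, auto_failback=False):
        self.members = list(members)
        self.link_up = {m: True for m in self.members}
        self.active = self.members[0]
        self.auto_failback = auto_failback

    def set_link(self, member, up):
        """Record a link state change and update the active member if needed."""
        self.link_up[member] = up
        if not self.link_up[self.active]:
            # The active member has no link: move to the first member that has one.
            for m in self.members:
                if self.link_up[m]:
                    self.active = m
                    break
        elif up and self.auto_failback and member == self.members[0]:
            # Only with failback enabled does traffic return to the preferred member.
            self.active = member

team = FailoverTeam(["nic1", "nic2"], auto_failback=False)
team.set_link("nic1", False)  # switch behind nic1 goes down
print("after link down:", team.active)
team.set_link("nic1", True)   # link returns, switch may not be forwarding yet
print("after link up:  ", team.active)
```

With failback disabled the team stays on nic2 after nic1's link returns, so the 30 seconds of link-up-without-forwarding never carries traffic.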

 
LVL 20

Expert Comment

by:rauenpc
ID: 38836181
My coworker just came across this article which may be the solution! This is great info that I'll make sure to use for myself in the future.

https://supportforums.cisco.com/thread/2095436
 
LVL 20

Accepted Solution

by:
rauenpc earned 1500 total points
ID: 38836348
also

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/layer2/503_n2_1/503_n2_1nw/Cisco_n5k_layer2_config_gd_rel_503_N2_1_chapter8.pdf

It has additional options regarding vPC delays and auto-recovery. It always seems to refer to the secondary switch or vPC member, so I guess I would have to test this to see whether you can reboot either switch without interruption, or whether rebooting the primary would cause an issue.
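As a rough way to reason about how long such a delay would need to be, the arithmetic can be sketched like this. It is only a back-of-the-envelope model: `restore_delay` is a hypothetical hold-down that keeps the server-facing ports shut for that many seconds after they could first come up, and the 30-second figure is the one reported in the question, not a value from the Cisco guide.

```python
def blackhole_seconds(link_up_at: float, forwarding_at: float, restore_delay: float) -> float:
    """Seconds during which a server sees link but the switch cannot forward.

    link_up_at     -- when the port would first show link (seconds)
    forwarding_at  -- when the switch is fully ready to pass traffic (seconds)
    restore_delay  -- hypothetical hold-down added before the port is enabled
    """
    ports_enabled_at = link_up_at + restore_delay
    return max(0.0, forwarding_at - ports_enabled_at)

# Physical layer up 30 seconds before the other layers, as observed in the question.
for delay in (0.0, 20.0, 60.0):
    window = blackhole_seconds(link_up_at=0.0, forwarding_at=30.0, restore_delay=delay)
    print(f"restore_delay={delay:>4}s -> exposed window {window}s")
```

Under this model any delay longer than the observed gap closes the window; since the gap was described as somewhat random, a generous margin would be the safer choice.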
 

Author Comment

by:T-cko
ID: 38838596
Hi 'rauenpc'

Thank you very much for your very helpful comments.

@... to shut down all ports except the peer-link, uplink, and switch management ports. Then save the config, perform maintenance/reboot, and manually turn up the links when the switch is fully up and running.

That's what we're thinking about as well: a sensitive operation, but a sure shot.

@LACP

Hmm, as per Intel's documentation this would resolve the issue too, and it looks like a more feasible solution than disabling the ports. We'll need to test it.

@Active/passive fault tolerance mode: safe, but I'm not sure it's useful for us because it has half the throughput capacity. If bandwidth is the priority, we would exclude this option from the possible solutions.


@rest of the Cisco comments: you're right. As per our analysis the issue occurred mainly after the reboot of the second switch, so this could be the clue. I've forwarded your hints to our network expert for further analysis.

So currently we're analyzing your hints and suggestions to find out what the optimal solution for our environment could be.

To be honest, I'm a little bit surprised this issue isn't discussed more intensively around the internet. Sources are quite poor considering how important this topic is in general. I thought the risk I've described is something every admin should care about.