Losing connectivity to NLB cluster on virtual servers when a server reboots

Posted on 2007-10-16
Medium Priority
Last Modified: 2008-01-09
We have three physical servers we use to host virtual servers. The physical servers run Windows Server 2003 64bit SP2 operating system with Virtual Server 2005 R2 SP1 64bit. All of our virtual servers run Windows Server 2003 SP2 32bit and are hosted by one of the three physical servers.

We have a three-node Microsoft Network Load Balancing cluster, consisting of three virtual servers, each on a different one of the three physical servers. The three virtual servers are clones, each given a new SID with NewSID.

Each of the physical servers has a single physical NIC. All three physical servers connect to the same subnet.

Each of the NLB node virtual servers is configured with two virtual NICs. The NLB cluster is configured in unicast mode using one of the two NICs on each virtual server.

Here's the problem: everything works OK until one of the virtual servers is rebooted (which one doesn't matter). After the rebooted server comes back, connectivity is lost to the NLB NICs on the other two virtual servers for about 10 minutes.


1)      NLB Query from outside cluster reports nodes 1, 2, and 3 as converged.
2)      Reboot node 2.
3)      While node 2 is down, NLB Query from outside cluster reports nodes 1 and 3 converged, as should be the case.
4)      When node 2 comes back, NLB Query from outside cluster momentarily reports nodes 1, 2, and 3 as converged.
5)      Within a few seconds, NLB Query from outside the cluster reports node 2 converged. It gets no reply from nodes 1 and 3. Pings to the NLB NICs on nodes 1 and 3 get no response. Outside connectivity to the NLB cluster on nodes 1 and 3 is lost.
6)      After about 10 minutes, connectivity to the NLB NICs on nodes 1 and 3 is restored. NLB Query from outside the cluster reports nodes 1, 2, and 3 as converged. Everything is fine again.
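For anyone trying to reproduce this, the outage in steps 5 and 6 can be timed instead of watched by hand. Below is a minimal Python sketch, not something used in this thread: it pings one address at a fixed interval from a machine outside the cluster and reports the windows during which there was no reply. The function names, the 5-second interval and the 15-minute duration are assumptions for illustration, and the ping flags are the ones assumed for the Windows and Linux ping commands.

```python
import platform
import subprocess
import time


def probe(host, timeout_s=1):
    """Return True if a single ICMP echo to `host` appears to get a reply.

    Assumption: relies on the exit code of the system ping command, which
    on Windows can be 0 for some "unreachable" replies, so treat the
    result as a rough indicator only.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def outage_windows(samples):
    """Turn time-ordered (timestamp, reachable) samples into outage windows.

    Each window is (start, end): start is the first failed sample of a run,
    end is the first successful sample after it, or None if still down.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends
            start = None
    if start is not None:
        windows.append((start, None))     # still unreachable at the end
    return windows


def watch(host, interval_s=5, duration_s=900):
    """Probe `host` every `interval_s` seconds and return its outage windows."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append((time.time(), probe(host)))
        time.sleep(interval_s)
    return outage_windows(samples)
```

Calling watch() with the NLB NIC address of node 1 or node 3 just before rebooting node 2 should return one window of roughly 10 minutes if the behaviour described above is reproduced.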

However, during the 10 minutes when connectivity to the NLB NICs on the two nodes is lost, an NLB Query command executed directly on one of the cluster nodes reports that the cluster is converged with nodes 1, 2, and 3.

This seems to indicate that the nodes can communicate with each other during the time that outside connectivity is lost and raises the question of whether the problem is in the NLB networking layer or in the virtual server networking layer.

Any ideas what the problem is?
Question by: psyche6

Accepted Solution

Andrew_Wallbank earned 1500 total points
ID: 20095018
On our switches here we have to add static ARP entries for the NLB 'MAC' address. Just a thought: have you done this?

If they aren't there, you should be seeing problems when all 3 servers are up, not just when 1 is rebooted, but it might be worth looking at.
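For reference when checking such entries: in unicast mode, NLB by default derives the cluster MAC address from the cluster's primary IP address, as 02-BF followed by the four octets of that address in hex. The following is a small Python sketch of that derivation. The function name and the example address are hypothetical, and it assumes the default address has not been overridden in the cluster configuration.

```python
import ipaddress


def nlb_unicast_mac(cluster_ip):
    """Return the default NLB unicast cluster MAC for an IPv4 cluster address.

    Assumption: the cluster uses the default unicast MAC, 02-BF followed by
    the four octets of the cluster IP address.
    """
    octets = ipaddress.IPv4Address(cluster_ip).packed  # 4 raw bytes
    return "-".join(f"{b:02X}" for b in (0x02, 0xBF, *octets))


# Hypothetical cluster address, for illustration only.
print(nlb_unicast_mac("192.168.1.10"))
```

The value printed is the one to compare against whatever the router or switch has recorded for the cluster IP address.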

Author Comment

ID: 20098212
We found the problem. This is a little like saying "the butler did it" .... but the problem turned out to be the virus protection software. We use Trend Micro across our network. While working on this problem, one of us noticed the Trend Micro driver on the TCP/IP properties of the NICs. This jogged our memory that the only difference between the production cluster and the test cluster was that Trend Micro hadn't been loaded on the test cluster. We unloaded Trend Micro from the production cluster - problem solved. We loaded Trend Micro onto the test cluster - problem reproduced.

We are working with Trend Micro support to document this bug.

Question closed.

PS: While working on this problem, we discovered that updates in Windows 2003 SP1 and SP2 allow NLB nodes to be set up in unicast mode on servers with one NIC, rather than requiring dual-NIC servers for unicast. See Microsoft KB898867.
