Losing connectivity to NLB cluster on virtual servers when a server reboots
Posted on 2007-10-16
We have three physical servers we use to host virtual servers. The physical servers run 64-bit Windows Server 2003 SP2 with 64-bit Virtual Server 2005 R2 SP1. All of our virtual servers run 32-bit Windows Server 2003 SP2, and each is hosted by one of the three physical servers.
We have a three-node Microsoft Network Load Balancing cluster, consisting of three virtual servers, each on a different one of the three physical servers. The three virtual servers are clones, with their SIDs changed using NewSID.
Each of the physical servers has a single physical NIC. All three physical servers connect to the same subnet.
Each of the NLB node virtual servers is configured with two virtual NICs. The NLB cluster is configured in unicast mode using one of the two NICs on each virtual server.
Here's the problem: everything works OK until one of the virtual servers is rebooted (which one doesn't matter). After the rebooted server comes back, connectivity is lost to the NLB NICs on the other two virtual servers for about 10 minutes.
1) NLB Query from outside cluster reports nodes 1, 2, and 3 as converged.
2) Reboot node 2.
3) While node 2 is down, NLB Query from outside cluster reports nodes 1 and 3 converged, as should be the case.
4) When node 2 comes back, NLB Query from outside cluster momentarily reports nodes 1, 2, and 3 as converged.
5) Within a few seconds, NLB Query from outside the cluster reports node 2 converged. It gets no reply from nodes 1 and 3. Pings to the NLB NICs on nodes 1 and 3 get no response. Outside connectivity to the NLB cluster on nodes 1 and 3 is lost.
6) After about 10 minutes, connectivity to the NLB NICs on nodes 1 and 3 is restored. NLB Query from outside the cluster reports nodes 1, 2, and 3 as converged. Everything is fine again.
However, during the 10 minutes when connectivity to the NLB NICs on the two nodes is lost, an NLB Query command executed directly on one of the cluster nodes reports that the cluster is converged with nodes 1, 2, and 3.
This seems to indicate that the nodes can communicate with each other during the time that outside connectivity is lost and raises the question of whether the problem is in the NLB networking layer or in the virtual server networking layer.
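To pin down exactly when the outage starts and ends, the reboot test above could be timed with something like the following. This is only a minimal sketch in Python, not something from our actual setup: the function names (`ping_once`, `monitor`, `outage_windows`) are made up for illustration, it shells out to the operating system's `ping` command, and the ping flags assume Windows or Linux. Note that Windows `ping` can exit with 0 on a "Destination host unreachable" reply, so the exit-code check is an approximation.

```python
import platform
import subprocess
import time
from typing import Dict, Iterable, List, Tuple

# One observation: (timestamp in seconds, host, was it reachable?)
Sample = Tuple[float, str, bool]


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo via the OS ping command; True if it exits with 0."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Assumes Linux-style flags (-W is a timeout in seconds).
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def monitor(hosts: List[str], duration_s: float, interval_s: float = 5.0) -> List[Sample]:
    """Ping every host once per interval for duration_s seconds and collect samples."""
    samples: List[Sample] = []
    end = time.time() + duration_s
    while time.time() < end:
        for host in hosts:
            samples.append((time.time(), host, ping_once(host)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: Iterable[Sample]) -> Dict[str, List[Tuple[float, float]]]:
    """Collapse samples into (first failure, first success afterwards) pairs per host.

    An outage that is still ongoing at the last sample is not reported.
    """
    down_since: Dict[str, float] = {}
    windows: Dict[str, List[Tuple[float, float]]] = {}
    for ts, host, ok in sorted(samples):
        windows.setdefault(host, [])
        if not ok and host not in down_since:
            down_since[host] = ts          # outage begins
        elif ok and host in down_since:
            windows[host].append((down_since.pop(host), ts))  # outage ends
    return windows
```

Run from a machine outside the cluster, a call such as `outage_windows(monitor(["192.0.2.11", "192.0.2.13"], duration_s=1200))` (placeholder addresses standing in for the NLB NICs on nodes 1 and 3) started just before rebooting node 2 would show whether both nodes drop at the same moment and whether the roughly 10 minute recovery time is consistent from run to run.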
Any ideas what the problem is?