• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1881
  • Last Modified:

NLB and VMs

Hi Guys,

We're currently using blades for our TS environment and have been virtualising our environment slowly over the last couple of months.

We've just received a new server specifically for testing some TS virtual machines, and we have a rather pressing need to deploy some sooner than we'd like (as always happens!!)

To cut a long story short, we're using Windows Server 2003 with built-in NLB, then a round-robin DNS that sits on top to point to 3 separate clusters...

e.g. - the DNS name Company1 round-robins to Cluster1, Cluster2 and Cluster3.
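For illustration only, the round-robin layer behaves roughly like this sketch (the cluster VIPs below are made up, not from the original setup):

```python
from itertools import cycle

# Hypothetical cluster VIPs behind the "Company1" DNS name (made-up addresses).
cluster_vips = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]

# Round-robin DNS rotates the order of the A records it returns, so successive
# lookups of Company1 land on Cluster1, Cluster2, Cluster3, then wrap around.
rotation = cycle(cluster_vips)

def resolve_company1():
    """Return the next VIP, mimicking a rotating DNS answer."""
    return next(rotation)
```

Each cluster VIP is then itself an NLB cluster IP shared by the member terminal servers, so load is spread twice: once by DNS across clusters, then by NLB within a cluster.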

Each physical server has four NICs: Management, NIC1, NIC2 and an NLB adaptor.

I initially used VMware Converter to convert an active terminal server. I then removed all of the adaptors and ran NewSID (although I hear this is no longer required from an MS perspective)... removed the machine from the domain, renamed it and rejoined it to the domain, then re-added all of the required adaptors and their IP configuration. Multiple users can connect to the machine remotely without any problem.

Then the problem - I introduced this server to one of the clusters and that cluster went offline (the cluster IP could not be pinged). One of the servers within the cluster (bearing in mind the server I'd cloned was in a different cluster) started having authentication errors, and other servers were very slow to connect to. I could not connect to NLB Manager via the cluster IP, and users were booted from their sessions on all of that cluster's servers.

Skipping onwards... (I resolved the above problem by shutting down the VM terminal server and rebooting all of the cluster servers.) I thought this might be due to some residual config in the registry tying it to the server I'd cloned, so I built a new Windows 2003 server from scratch and reinstalled all of the applications (an absolute pig!). I spent time configuring everything as I'd done previously for the physical servers and went through the same process as above. I then introduced this server to a different cluster. Same problem! Random authentication errors, the cluster IP cannot be reached via RDP, NLB Manager or ping... users dropped from sessions, etc, etc.

I checked all of the adaptor settings on the new server and they look fine (NLB had bound correctly - for info, I'm joining the cluster via NLB Manager).

Skipping through Event Viewer on the new server, most messages report success; however, one of the servers that had authentication errors shows:

NLB Cluster : Initiating convergence on host 3.  Reason: Host 10 is leaving the cluster.

NLB Cluster : Initiating convergence on host 3.  Reason: Host 10 is converging for an unknown reason.

I guess I'm not really giving enough information, but firstly: is there anything I should know when adding a VM TS host to a cluster... and does anything above stand out to anyone?

P.S. - all of the VM guest adaptors can be seen on our network, can be routed to, and can be connected to via hostname (DNS, as far as I can see, has updated fine...)

1 Solution
Your problem is not TS, it's NLB.  NLB and ESX do not see eye to eye by default.

The problem is with Microsoft's implementation of NLB. In unicast mode, all the servers in the cluster share a single MAC address for the cluster IP. ESX sends RARP packets to the switches as soon as a VM is powered on to inform them of MAC address changes on the VMs (ESX dynamically assigns MAC addresses to VMs - this is necessary so that after a VMotion to another host you don't have to wait for the switches' MAC table entries to expire). So, when the VM is powered on, ESX sends a RARP to the physical switch, which announces the shared cluster MAC on that port in addition to the server's own MAC. As a result, all traffic for the entire NLB group is sent to that single switch port, communication with the rest of the NLB group is interrupted, and the cluster goes down.
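To see why one port absorbing the shared MAC is fatal, note that unicast NLB derives the cluster MAC directly from the cluster IP: 02-bf followed by the IP octets in hex (multicast mode uses 03-bf instead). A quick sketch for computing it - handy when checking what your switch's MAC table or a client's `arp -a` has actually learned (the cluster IP below is made up):

```python
def nlb_cluster_mac(cluster_ip, unicast=True):
    """Derive the NLB cluster MAC address from the cluster IP.

    Unicast NLB uses 02-bf-<ip octets in hex>; multicast uses 03-bf-<...>.
    In unicast mode every host in the cluster answers to this one MAC,
    which is why a single switch port claiming it blackholes the others.
    """
    prefix = "02-bf" if unicast else "03-bf"
    octets = [int(o) for o in cluster_ip.split(".")]
    return prefix + "".join(f"-{o:02x}" for o in octets)

# Hypothetical cluster IP 192.168.1.50:
print(nlb_cluster_mac("192.168.1.50"))         # -> 02-bf-c0-a8-01-32
print(nlb_cluster_mac("192.168.1.50", False))  # -> 03-bf-c0-a8-01-32
```

If a client's ARP cache shows the 02-bf MAC resolving to one physical port's traffic only, you're seeing exactly the RARP problem described above.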

There are a few ways to handle this. The first is to use multicast NLB. The second is to disable RARP on ESX. However, the latter can be a problem when VMs are migrated ONTO that host, as ESX will no longer send RARPs to the physical switches to let them know where the VM now resides. This means that during and shortly after a VM migration (either by manual VMotion or by DRS), you'll see an initial period of no connectivity - around 15 seconds by default, IIRC, until the MAC table entries expire. It depends on the client OS, the switch, etc.
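If you do go the disable-RARP route and want to measure that post-VMotion blackout window for yourself, a rough probe might look like the sketch below (hostnames and the RDP port usage are illustrative; 3389 is the standard RDP listener):

```python
import socket
import time

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def measure_blackout(host, port=3389, interval=1.0, max_wait=60.0):
    """Poll until the port answers again; return roughly how long it was dark.

    Run this against the cluster IP while a member VMotions: the elapsed
    time until the first successful connect approximates how long the
    switches took to relearn the VM's MAC.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if port_open(host, port, timeout=interval):
            return time.monotonic() - start
        time.sleep(interval)
    return None  # still unreachable after max_wait
```

This is only a crude approximation - it measures TCP reachability from one client's point of view, not the switch's MAC table directly.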

Have a look at these for some help:




