ESXi 4.0 U1 - Management network becomes unstable after a few days
Posted on 2010-04-05
I’ve installed ESXi 4.0 Update 1 on two identical machines that reside in the same network segment. On each server, I’ve created two virtual machines: one runs Red Hat Enterprise Linux 5.4 and the other runs a small load balancer appliance (Hercules).
Dell PowerEdge R210
Intel Xeon X3450 2.66GHz HT
2x500 GB in RAID 1
Using ONE port of the internal Broadcom NetXtreme II BCM5716 NIC (this port is shared between the management network and the VMs).
(all hardware is marked as ‘supported’ by VMware)
We applied all available patches, including the recent April 1st patch; we’re at build 244038 now.
After a few days the vSphere client can no longer establish a connection to the ESXi hosts. The virtual machines keep running without any problem, however. Only a full reset (applied through the remote power cycle) restores connectivity to the management network. We experience this issue on both servers: about three days after power-on/reset, the vSphere client cannot connect anymore.
• Only the management network suffers from connectivity problems.
• Restarting the management network (agents) via the physical console doesn’t restore service.
• The physical console offers some basic diagnostics like ‘testing the management network’. The PING tests fail intermittently: about half of the PINGs to the gateway or DNS servers fail. The hardware and the network config MUST be correct, since the management network works for a few days before failing and the VMs keep running without any problem.
• Using a packet sniffer, we’ve investigated the network traffic from a remote vSphere client that is trying to connect to the ESXi server. The ESXi host resets the connection after initial contact, so there IS an exchange of packets.
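To put numbers on the behaviour described in the last two bullets, a small probe could be left running on another machine in the same segment. The following is only a sketch and not something we have run against these hosts: the function names `probe`, `failure_rate` and `monitor` are made up for illustration, and it assumes that the vSphere client reaches the host on TCP port 443.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Open one TCP connection and classify what the peer does with it."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(1)  # wait briefly for a reset or a close
            except socket.timeout:
                return "open"  # nothing received and no reset: still connected
            # An empty read means the peer closed the connection cleanly.
            return "closed" if data == b"" else "open"
    except ConnectionResetError:
        return "reset"  # peer answered with RST after the handshake
    except ConnectionRefusedError:
        return "refused"  # nothing is listening on that port
    except socket.timeout:
        return "timeout"  # no answer to the connection attempt
    except OSError as exc:
        return "error: %s" % exc


def failure_rate(outcomes):
    """Fraction of probes that did not end in a usable connection."""
    if not outcomes:
        return 0.0
    failed = sum(1 for outcome in outcomes if outcome != "open")
    return failed / len(outcomes)


def monitor(host, port, count, interval, timeout=3.0):
    """Run `count` probes, `interval` seconds apart, and return the outcomes."""
    outcomes = []
    for i in range(count):
        outcome = probe(host, port, timeout=timeout)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), host, port, outcome)
        outcomes.append(outcome)
        if i < count - 1:
            time.sleep(interval)
    return outcomes
```

Calling something like `monitor("esxi-host.example", 443, count=60, interval=60)` (the host name is a placeholder) and passing the result to `failure_rate` would show when the resets start and whether they are as intermittent as the PING failures.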
Given the above, I strongly suspect a problem in the ESXi network driver, but I don’t know how to diagnose the issue any further. I’ve exhausted all options on the physical ESXi console. I know how to access the (unsupported) command-line console, but I don’t know what to look for. Could the problem be that the management network shares the same NIC as the VMs?
I’ve been struggling with this issue for several weeks now. Any help or suggestions would be highly appreciated.