Syntax-Montreal

VCenter - ESX Host Frequent Disconnect Issue

Hi everyone,
Below is a problem I am having with our current VMware setup.
Overview:
- 1 Datacenter running 1 ESX cluster
- HA, DRS and VMotion enabled
- 3 ESX hosts in the cluster
- Approximately 21 VMs
- ESX 3.5 Update 4, VC 2.5 U4
- Total CPU resources 72GHz, Total RAM: 144GB, Total processors 24
- ESX software installed on local disk, VMs and data on Fibre SAN.

The problem we are having is random, recurring disconnects of our ESX hosts from VCenter. A disconnect can last anywhere from 3-5 seconds to indefinitely, until someone manually intervenes and reconnects the host. The associated VMs never lose connectivity, but the disconnects do affect our hot backups running on vRanger Pro.

At first we thought it was vRanger Pro causing the disconnects because as soon as the backups would start, the hosts would disconnect and the backups would fail.

VCenter would show two error messages: "An error occurred while communicating with the remote host" and "Unable to communicate with the remote host, since it is disconnected".

We contacted Vizioncore, and they would not take responsibility, stating it was a VMware issue.

We ended up uninstalling the product and restarting the entire datacenter off-hours.

We are still getting random disconnects. The only thing that comes to mind is that when the VI environment was initially set up, the ESX servers were on a different VLAN than they are on currently. All the IP and DNS changes were made to reflect the VLAN change, but somehow I feel there may be something fishy with DNS somewhere.

Unfortunately, I am not sure where to look right now, and VMware support has been no help.


maredzki

Have you noticed all of the VMs on that host getting disconnected as well? You can check that by going to each of the hosts -> Tasks & Events -> Events and searching for "conn" in the "description, type or target contains" field. It should show all of the remote console and actual VM disconnects/reconnects to the VC.
I have that issue currently but cannot upgrade to Update 4, and according to VMware support this has been fixed in a few patches after Update 4. From what they are telling me, it is related to the hostd daemon, which runs and then craps out. That is why we lose connectivity in VC to the host and all VMs on it, even though the host and all of its VMs are up.

The disconnects last 3-5 seconds. I have had it occur twice in the last month, which is acceptable, but if you have this happen multiple times in a week or even a day, a call to VMware is a must, as it may cause serious problems with the host configuration, possibly corrupting the host.

Hope this info helps a bit.
ASKER CERTIFIED SOLUTION
Paul Solovyovsky
1) Is your host firmware up to date? (Important if Dell PowerEdge.)

2) What version of ESX 3.5? Are you at least at Update 3? If not, that's the biggest cause; Update 3 reduces this dramatically.

================== patch vs. rebuild for update 3 ============
Either will work.

3) When I applied Update 3, I took the opportunity to move the guests to alternate hosts and rebuilt the hosts from scratch, enlarging the / root partition to 2x default and the /var/log partition to 2.5x default, and have had better stability.

The / partition had been a bit small to handle, say, installing 3 ESX hotfixes at the same time.

/var/log getting full was contributing to this problem, mostly due to VMs having logging enabled, but I enlarged it anyway after turning logging off.
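
A quick way to spot a nearly full / or /var/log before it causes trouble is sketched below. This is only an illustration, not something from the thread: the function names and the 90% threshold are made up, and `shutil.disk_usage` needs Python 3.3 or later, which is an assumption about where you run it (the Python shipped with the ESX 3.5 service console is much older, so there `df -h` is the practical equivalent).

```python
import shutil


def usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return 100.0 * usage.used / usage.total


def partitions_over(paths, threshold_percent=90.0, measure=usage_percent):
    """Return (path, percent_used) for every path at or above the threshold.

    `measure` is injectable so the check can be exercised without touching
    real filesystems; the 90% default is an arbitrary, hypothetical value.
    """
    flagged = []
    for path in paths:
        percent_used = measure(path)
        if percent_used >= threshold_percent:
            flagged.append((path, percent_used))
    return flagged
```

Calling `partitions_over(["/", "/var/log"])` returns only the mount points that are at or above the threshold, so an empty list means there is nothing to worry about yet.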
Please update us with the partitioning scheme of the ESX hosts (size of root partition, boot, /var/log, vmkcore and swap).

How many NICs are you using in each of the ESX hosts?
Have you checked the version of ESX 3.5 yet?

Is it below update 3?

dnilson,
the version he's running is in his description:
- ESX 3.5 Update 4, VC 2.5 U4
Syntax-Montreal

ASKER

Turned out to be a misconfigured DNS entry. What was documented and what was actually entered in AD were different. Once we corrected this, the problem instantly went away.
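
For anyone who lands here with the same symptoms, a documented-versus-actual DNS comparison like the one that exposed this can be sketched roughly as follows. This is not the script we used, just a minimal illustration: the function name is made up, and you would supply your own host names and the addresses from your documentation. It uses the standard `socket.gethostbyname` call, so it checks whatever the machine you run it on resolves. Run it from the VCenter server, since that is the side whose view of DNS matters here.

```python
import socket


def find_dns_mismatches(documented, resolve=socket.gethostbyname):
    """Compare documented host -> IP entries against what DNS returns.

    `documented` maps host names to the IPs your documentation says they
    should have. Returns {hostname: (documented_ip, resolved_ip)} for every
    entry that differs; resolved_ip is None when the lookup fails.
    `resolve` is injectable so the logic can be tested without a network.
    """
    mismatches = {}
    for hostname, expected_ip in documented.items():
        try:
            actual_ip = resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            actual_ip = None
        if actual_ip != expected_ip:
            mismatches[hostname] = (expected_ip, actual_ip)
    return mismatches
```

Pass it a dictionary of each ESX host name and its documented IP; any host that appears in the result either resolves to a different address than documented or does not resolve at all. It only checks forward lookups, so reverse (PTR) records would need a separate check.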