Syntax-Montreal
asked on
VCenter - ESX Host Frequent Disconnect Issue
Hi everyone,
Below is a problem I am having with our current VMware setup.
Overview:
- 1 Datacenter running 1 ESX cluster
- HA, DRS and VMotion enabled
- 3 ESX hosts in the cluster
- Approximately 21 VMs
- ESX 3.5 Update 4, VC 2.5 U4
- Total CPU resources: 72GHz, total RAM: 144GB, total processors: 24
- ESX software installed on local disk, VMs and data on Fibre SAN.
The problem we are having is random, recurring disconnects of our ESX hosts. A disconnect can last anywhere from 3-5 seconds to indefinitely, in which case someone has to intervene and reconnect the host manually. The associated VMs never lose connectivity, but the disconnects do affect our hot backups running on vRanger Pro.
At first we thought it was vRanger Pro causing the disconnects because as soon as the backups would start, the hosts would disconnect and the backups would fail.
VCenter would show two error messages: "An error occurred while communicating with the remote host" and "Unable to communicate with the remote host, since it is disconnected".
We contacted Vizioncore, and they would not take responsibility, stating it was a VMware issue.
We ended up uninstalling the product and restarting the entire datacenter off-hours.
We are still getting random disconnects. The only thing that comes to mind is that when the VI environment was initially set up, the ESX servers were on a different VLAN than they are on currently. All the IP and DNS changes were made to reflect the VLAN change, but I suspect there may still be something wrong with DNS somewhere.
Unfortunately, I am not sure where to look right now, and VMware support has been no help.
-
ASKER CERTIFIED SOLUTION
1) Is your host firmware up to date? (Important if Dell PowerEdge.)
2) What version of ESX 3.5? Are you at least at Update 3? If not, that's the biggest cause; Update 3 reduces this dramatically.
================== patch vs. rebuild for update 3 ============
Either will work.
3) When I applied Update 3, I took the opportunity to move the guests to alternate hosts and rebuilt the hosts from scratch, enlarging the / (root) partition to 2x default and the /var/log partition to 2.5x default, and have had better stability.
The / partition had been a bit small to handle, say, installing 3 ESX hotfixes at the same time.
/var/log getting full was contributing to this problem, mostly due to VMs having logging enabled, but I enlarged it anyway after turning logging off.
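A quick way to watch for the full-partition condition described above is to flag any mount point that is nearly full. This is only a sketch: it assumes a machine with Python 3 (the ESX 3.5 service console itself ships a much older Python, where plain `df -h` is the usual tool), and the function name, the 90% threshold and the path list are illustrative choices, not anything taken from VMware.

```python
import shutil


def partitions_over_threshold(paths, threshold=0.90, usage=shutil.disk_usage):
    """Return (path, used_fraction) for every path at or above the threshold.

    `usage` is injectable so the check can be exercised without real disks;
    by default it is shutil.disk_usage, which returns (total, used, free).
    """
    nearly_full = []
    for path in paths:
        stats = usage(path)
        if stats.total == 0:
            # Skip pseudo filesystems that report no capacity.
            continue
        fraction = stats.used / stats.total
        if fraction >= threshold:
            nearly_full.append((path, round(fraction, 2)))
    return nearly_full


# Hypothetical usage: check the two partitions mentioned in this thread.
# for path, fraction in partitions_over_threshold(["/", "/var/log"]):
#     print(f"{path} is {fraction:.0%} full")
```

The threshold is deliberately conservative; a partition that hovers near full while hotfixes are being installed is the case worth catching early.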
Please post the partitioning scheme of the ESX host (sizes of the root, boot, /var/log, vmkcore and swap partitions).
How many NICs are you using in each of the ESX Hosts ?
Have you checked the version of ESX 3.5 yet?
Is it below update 3?
dnilson,
the version he's running is in his description:
- ESX 3.5 Update 4, VC 2.5 U4
ASKER
It turned out to be a misconfigured DNS entry. The documented entry and the actual entry in AD were different. Once we corrected this, the problem went away instantly.
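For anyone chasing the same symptom, the mismatch between documented and actual DNS entries can be checked mechanically. The sketch below compares a documented IP with what DNS actually returns, then confirms that the reverse lookup points back to the same name. The host name and address in the usage comment are made up, and the function name is hypothetical; only the standard `socket` calls are real.

```python
import socket


def check_dns(hostname, documented_ip,
              resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems for one host (empty list if consistent).

    `resolve` and `reverse` default to the real resolver but can be replaced
    with stand-ins for testing.
    """
    problems = []
    try:
        actual_ip = resolve(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"{hostname}: forward lookup failed ({exc})"]

    if actual_ip != documented_ip:
        problems.append(
            f"{hostname}: DNS returns {actual_ip}, documentation says {documented_ip}"
        )

    try:
        reverse_name = reverse(actual_ip)[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"{actual_ip}: reverse lookup failed ({exc})")
    else:
        # Compare case-insensitively and ignore a trailing dot.
        if reverse_name.lower().rstrip(".") != hostname.lower().rstrip("."):
            problems.append(
                f"{actual_ip}: reverse lookup returns {reverse_name}, expected {hostname}"
            )
    return problems


# Hypothetical usage with an invented host name and address:
# for line in check_dns("esx01.example.local", "10.0.0.5"):
#     print(line)
```

Running it from both the VirtualCenter server and each ESX host is worthwhile, since the two may be using different DNS servers or stale hosts files.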
I have that issue currently but cannot upgrade to Update 4, and according to VMware support it has been fixed in a few patches released after Update 4. From what they are telling me, it is related to the hostd daemon crashing. That is why we lose connectivity in VC to the host and all VMs on that host, even though the host and all of its VMs are up.
Each disconnect lasts 3-5 seconds. I have had it occur twice in the last month, which is acceptable, but if you have this happen multiple times in a week or even a day, a call to VMware is a must, as it may cause serious problems with the host configuration, possibly corrupting it.
Hope this info helps a bit.
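To tell a brief management-agent hiccup apart from a real network outage, it can help to probe the host's management ports from the VirtualCenter server while a disconnect is in progress. This is a minimal sketch, assuming the host's management services answer on TCP 443 and 902 (the ports commonly used by ESX 3.x; verify against your own firewall configuration). The function name and host name are hypothetical.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical usage with an invented host name:
# for port in (443, 902):
#     state = "open" if port_reachable("esx01.example.local", port) else "unreachable"
#     print(f"esx01.example.local:{port} is {state}")
```

If the ports stay reachable while VC shows the host as disconnected, the management agents on the host are a more likely culprit than the network path.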