egrylls

VMware Host not responding

I thought it might be useful to relay a recent incident I ran into with a host in an ESXi 3.5 cluster. I was on vacation and came back to the office to find that one of the other admins had been troubleshooting a host that showed as "not responding" in VC (VirtualCenter). After attempts to disconnect and reconnect it, HA was showing failures for that node in the Recent Tasks pane. I'm not a VMware guru, so I did the usual and googled the heck out of it, to no avail. I also found out that this production infrastructure was not under a support contract, so I was on my own. I could still connect directly to the host with the vSphere client and manage it from there.

Ultimately, I ended up powering down each VM guest individually from a VIC session connected directly to the host's IP. I then jumped over to the VC session, did a connect on the bad host, and while HA was loading in Recent Tasks (progress bar moving...), migrated the VM. I figured it was dangerous, but I really had no options other than waiting for people to respond to the posts I had made. Anyway, that seemed to work. I had to repeat the process for each VM guest, but once they were all moved, I consoled into the ESXi blade and rebooted it, and it came back into VC as if it had never had a problem.

One curious thing: in the VC session, the LUN that hosts all the company's templates was showing as offline on that host. In the direct-to-IP session, though, it was fine. I never did figure out why, but I suspect it was directly tied to my problem, if it wasn't the root cause itself. Once the server was rebooted, that LUN was back in the VC session as well, as if nothing had been wrong.

If someone has any ideas on why this might have happened and a better way of troubleshooting it in the future, please offer your insights and advice!

Take care!  Earl
SOLUTION
Paul Solovyovsky
This solution is only available to members.
Hi Egrylls,

I have faced this situation before, and it was caused by the performance of the storage attached to the ESX servers.

We have a farm of 16 ESX blades running close to 1,000 VMs. One set of 8 blades was initially set up with a SATA drive array over FC, and everything was fine until we had 200 VMs running. As we added more and more VMs, we started running into issues: the ESX hosts were going offline in VC, and so were the datastores.

We moved away from that to a SCSI drive array, and we have not seen any issues since.

So the best place to start finding out why this happened, and how to keep it from happening again, is to verify the performance of the storage you are using and, if possible, fix it from that end.

regards
bhanu
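
One rough way to act on the "verify the storage performance" advice is to collect latency samples for the suspect datastore (for example, from the VC performance charts or an esxtop capture) and see how often they exceed a threshold you consider unhealthy. The sketch below is only an illustration: the function names, the 20 ms default threshold, and the 10% cutoff are assumptions for this example, not VMware-published limits, and the input is simply a list of numbers you have exported yourself.

```python
from statistics import mean


def summarize_latency(samples_ms, threshold_ms=20.0):
    """Summarize latency samples (in milliseconds) against a threshold.

    threshold_ms is an assumed cutoff for illustration; tune it to your array.
    """
    if not samples_ms:
        raise ValueError("no latency samples supplied")
    over = [s for s in samples_ms if s > threshold_ms]
    return {
        "count": len(samples_ms),
        "mean_ms": mean(samples_ms),
        "max_ms": max(samples_ms),
        "fraction_over": len(over) / len(samples_ms),
    }


def looks_overloaded(samples_ms, threshold_ms=20.0, max_fraction_over=0.1):
    """Return True if more than max_fraction_over of samples exceed the threshold."""
    summary = summarize_latency(samples_ms, threshold_ms)
    return summary["fraction_over"] > max_fraction_over


if __name__ == "__main__":
    # Hypothetical samples, not measurements from the thread.
    samples = [5, 10, 30, 40]
    print(summarize_latency(samples))
    print("overloaded:", looks_overloaded(samples))
```

Comparing the summary for a quiet period with one taken while hosts are dropping out of VC would show whether the disconnects line up with storage latency spikes.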
SOLUTION
egrylls

ASKER

I tried all of the above and nothing has worked. HA is now disabled... waiting on a VMware support contract to get it all fixed up.
Hi egrylls,

Please do update us with the findings.

regards
bhanu
ASKER CERTIFIED SOLUTION
Egrylls, just out of curiosity: not long ago I ran into a problem where HA stopped working. It turned out the network team had changed the switch that ESX was using as its gateway to deny ping requests, a typical security configuration. If ESX can't ping the gateway, HA will fail, so I would check this. Also, I'm not sure whether I asked if you are connected to storage via iSCSI. Are you? If so, be sure you have extended your iSCSI parameters in your VMs; if there is a failover event, your VMs can temporarily lose their connection to the storage.
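
The gateway check described above can be scripted from any machine on the same management network. This is a minimal sketch, not a VMware tool: the function names are made up for this example, the address in the usage section is a documentation placeholder, and it assumes the system `ping` command is available and returns a non-zero exit code on failure (on some platforms the exit code is not fully reliable, so read the actual ping output if the result looks wrong).

```python
import platform
import subprocess


def build_ping_command(host, count=3, system=None):
    """Build a ping command line; Windows uses -n for the count, others use -c."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def gateway_reachable(host, count=3, runner=subprocess.run):
    """Return True if ping exits with status 0 for the given host.

    runner is injectable so the check can be exercised without a network.
    """
    cmd = build_ping_command(host, count)
    result = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    # 192.0.2.1 is a placeholder; substitute the gateway your hosts actually use.
    gateway = "192.0.2.1"
    if gateway_reachable(gateway):
        print("gateway answers ping")
    else:
        print("gateway does not answer ping; check switch/firewall ICMP rules")
```

If the gateway stops answering after a network change, that is worth ruling out before digging into the HA agents themselves.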