VMware Host not responding

I thought it might be good to relay a recent incident I encountered with an ESXi 3.5 cluster host.  I was on vacation and came back to the office to find that one of the other admins had been troubleshooting a host in the cluster that showed as "not responding" in VC (VirtualCenter).  After a disconnect and reconnect attempt, HA was showing failures for that node in the Recent Tasks pane.  I'm not a VMware guru, so I did the usual and googled the heck out of it, to no avail.  I also found out that this production infrastructure was not under a support contract, so I was on my own.  I could still connect directly to the host through the vSphere Client and manage it from there.

Ultimately, I ended up powering down each VM guest individually from a VIC session connected directly to the host's IP.  I then jumped over to the VC session, did a connect on the bad host, and while HA was loading in Recent Tasks (progress bar moving...) I migrated the VM.  I figured it was dangerous, but I really had no options other than waiting for people to respond to posts I had made.  Anyway, doing that seemed to work.  I had to repeat the process for each VM guest, but once they were all moved, I consoled to the ESXi blade and rebooted it, and it came back into VC as if it had never had a problem.

One curious thing: in the VC session, the LUN that hosted all the templates for the company was showing as offline on that host.  In the direct-to-IP session, though, it was fine.  I never did figure out why that happened, but I suspect it was directly tied to my problem, if it wasn't the root cause itself.  Once the server was rebooted, that LUN was back in the VC session as well, as if nothing had been wrong.

If someone has any ideas on why this might have happened and a better way of troubleshooting it in the future, please offer your insights and advice!

Take care!  Earl
egryllsAsked:
 
egryllsAuthor Commented:
Still don't have support, so I am closing this question.  HA is turned off and things have quit failing, but now we have no HA....
 
Paul SolovyovskySenior IT AdvisorCommented:
If you could connect directly to the host but not through vCenter, it is most likely a DNS issue.

In VirtualCenter, make sure name resolution to the host is correct and that the hosts were added by FQDN.

On the ESX host, check the /etc/hosts and /etc/resolv.conf files for correct configuration.

You may need to disable and re-enable HA on the host as well.
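
As a rough illustration of the hosts-file check above, here is a minimal Python sketch that parses hosts-file text and reports whether a given FQDN maps to the IP you expect. The host names and addresses are made up for the example, and it assumes you have copied the file contents off the host to a machine with Python; it is not something that ships with ESX.

```python
def parse_hosts(text):
    """Map each hostname or alias found in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry seen for a name and ignore later duplicates.
            mapping.setdefault(name.lower(), ip)
    return mapping


def check_host_entry(text, fqdn, expected_ip):
    """Return a description of the problem, or None if the entry looks right."""
    found = parse_hosts(text).get(fqdn.lower())
    if found is None:
        return f"{fqdn} has no entry"
    if found != expected_ip:
        return f"{fqdn} maps to {found}, expected {expected_ip}"
    return None


# Hypothetical file contents, for illustration only.
SAMPLE = """\
127.0.0.1   localhost
10.0.0.21   esx01.example.local esx01   # hypothetical host
10.0.0.22   esx02.example.local esx02
"""

if __name__ == "__main__":
    print(check_host_entry(SAMPLE, "esx01.example.local", "10.0.0.21"))
```

Running the same check against the file from every host in the cluster makes a stale or missing entry easier to spot than reading each file by eye.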
 
bhanukir7Commented:
Hi Egrylls,

I have faced this situation before, and it was caused by the performance of the storage attached to the ESX server.

We have a farm of 16 ESX blades running close to 1000 VMs. One set of 8 blades was initially set up with a SATA drive array over FC, and everything was fine until we had 200 VMs running. As we added more and more VMs, we started running into issues: the ESX hosts were going offline from VC, and so were the datastores.

We moved away from that and got a SCSI drive array, and we have not seen any issues since.

So the best place to start finding out why this happened, and how to avoid it happening again, is to verify the performance of the storage you are using and, if possible, fix it from that end.

regards
bhanu
 
VMwareGuyCommented:
The first thing you should always do when you can't connect, or when there is an issue with the connection between vCenter and ESXi / ESX, is:

1)  Check the DNS configuration on the ESXi server and on the DNS server that ESX points to, making sure you have the appropriate entries.

2)  Try to disconnect and reconnect your ESXi host from your vCenter inventory. This uninstalls and reinstalls the vCenter agent. Use the FQDN first, then the IP address if the FQDN didn't work. If that doesn't work, proceed to step 3.

3)  Try restarting both the vCenter management agent on the ESX host and the ESX host management agent.  Learn how to do this here:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003490

4)  If the above didn't do anything for you, and since you mentioned that you lost connectivity to a LUN, which can cause problems with ESX (less now than in earlier versions such as ESX 2.x), connect to the ESXi host directly with the VI Client and perform a rescan of your storage adapters and LUNs.

5)  If none of these worked, you need to do exactly what you said you did. However, you can also attempt to migrate the VMs live using the RCLI before you start powering down VMs. See the documentation on how to do this in the RCLI here:  http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_rcli.pdf

This will most likely not work either, but it is worth a try before you start reaching out to everyone to let them know you have to bring down VMs. If that doesn't work, bring the VMs down, register them on some of your other host servers if you have them, and bring them back up.  Then you can reboot your ESX box.
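
For the DNS check in step 1, a quick forward and reverse lookup run from the vCenter server can show whether the name and the address agree with each other. Below is a minimal Python sketch using only the standard socket module. The host name in the usage line is hypothetical, and the lookups are passed in as parameters only so the logic can be exercised without a live DNS server.

```python
import socket


def check_dns(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found; an empty list means forward and reverse agree."""
    try:
        ip = forward(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"forward lookup of {fqdn} failed: {exc}"]
    try:
        name, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return [f"reverse lookup of {ip} failed: {exc}"]
    # Compare case-insensitively and ignore a trailing dot on either side.
    known = {n.lower().rstrip(".") for n in [name, *aliases]}
    if fqdn.lower().rstrip(".") not in known:
        return [f"{ip} resolves back to {name}, not {fqdn}"]
    return []


if __name__ == "__main__":
    # "esx01.example.local" is a placeholder; substitute the FQDN used in vCenter.
    for problem in check_dns("esx01.example.local"):
        print(problem)
```

A clean result here does not prove the ESX host itself resolves names correctly, since it may point at a different DNS server, so treat it as one data point.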
 
egryllsAuthor Commented:
I tried all of the above and nothing worked.  HA is now disabled... waiting for the VMware support contract to get sorted out.
 
bhanukir7Commented:
Hi egrylls,

Please do update us with your findings.

regards
bhanu
 
VMwareGuyCommented:
Egrylls, just out of curiosity: I ran into a problem not long ago where HA stopped working. It turned out the network team had changed the switch that ESX was using as its gateway to deny ping requests, a typical security configuration.  If ESX can't ping the gateway, HA will fail, so I would check this.  Also, I'm not sure whether I asked: are you connected to storage via iSCSI?  If so, be sure you have extended the iSCSI parameters in your VMs, because during a failover event your VMs can temporarily lose their connection to the storage.
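
If it helps, the gateway check can be scripted from any machine on the same subnet as the host's management network. This is only a sketch in Python: the address shown is a documentation placeholder, the script runs the operating system's own ping command, and a reply from your workstation is just a hint, since the ESX host could still be blocked by a rule that does not apply to you.

```python
import platform
import subprocess


def ping_command(address, count=3):
    """Build the OS ping command; Windows uses -n for the count, others use -c."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), address]


def gateway_answers_ping(address, run=subprocess.run):
    """Return True if ping exits with status 0.

    Note: on some systems ping can exit 0 even when the reply is
    "destination unreachable", so read the output if the result is surprising.
    """
    result = run(
        ping_command(address),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    gateway = "192.0.2.1"  # placeholder; use the host's actual default gateway
    print("gateway answers ping:", gateway_answers_ping(gateway))
```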
Question has a verified solution.
