I thought it might be good to relay a recent incident I ran into with an ESXi 3.5 cluster host. I was on vacation and came back to the office to find that one of the other admins had been troubleshooting a host in the cluster that showed as "not responding" in VC (VirtualCenter). After a disconnect and reconnect attempt, HA was showing failures for that node in the Recent Tasks pane. I'm not a VMware guru, so I did the usual and googled the heck out of it, to no avail. I also found out that this production infrastructure was not under a support contract, so I was on my own. I could still connect directly to the host through the vSphere Client and manage it from there.
Ultimately, I ended up powering down each VM guest individually from a VIC session connected directly to the host's IP. Then I'd jump over to the VC session and do a connect on the bad host, and while the HA task was still loading in Recent Tasks (progress bar moving...), I'd migrate the VM. I figured it was dangerous, but I really had no options other than waiting for people to respond to the posts I'd made. Anyway, doing that seemed to work. I had to repeat the process for each VM guest, but once they were all moved, I consoled into the ESXi blade and rebooted it, and it came back into VC like it never had a problem.
One curious thing: in the VC session, I could see that the LUN hosting all the company's templates was showing as offline on that host. In the direct-to-IP session, though, it was fine. I never did figure out why that happened, but I suspect it was directly tied to my problem, if it wasn't the root cause itself. Once the server was rebooted, that LUN was back in the VC session as well, like nothing had been wrong.
If someone has any ideas on why this might have happened and a better way of troubleshooting it in the future, please offer your insights and advice!
Take care! Earl