Solved

VMware ESX - A possible host failure has been detected by HA on HOST1 in cluster ESXCLUSTER

Posted on 2011-04-29
Last Modified: 2012-05-11
Hello Experts,

I rebooted an ESX server after a hardware upgrade, and now I receive this error every 5 minutes.
________________________
Target: ESXCLUSTER
Stateless event alarm

Alarm Definition:
([Event alarm expression: HA host isolated] OR [Event alarm expression: All HA hosts isolated] OR [Event alarm expression: HA host failed])
 
Event details:
A possible host failure has been detected by HA on HOST1 in cluster ESXCLUSTER in Location.
________________________

I have tried:
- Reconfiguring both hosts for HA.
- Disabling and re-enabling HA.
- Removing HOST1 from the cluster and re-adding it.

These possible fixes were found here: http://communities.vmware.com/message/1187575

Thanks,
Question by:ottobock
11 Comments
 
LVL 124

Accepted Solution

by:
Andrew Hancock (VMware vExpert / EE MVE^2) earned 1000 total points
ID: 35491993
What hardware was changed or added?

Did you add or remove any NICs? If so, the NICs (vmnics) may have changed order.

Check that all the vmnics are correct and correctly connected to the vSwitches and physical switches, and that all IP addresses and gateways are reachable.
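
For anyone following along on the classic ESX 4.x service console, the checks above might look roughly like the sketch below. It assumes the standard esxcfg-nics, esxcfg-vswitch, and esxcfg-vswif tools are present; the run_if_present helper is hypothetical and only there so the script degrades gracefully on a machine that lacks them.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): run a command only if it exists here.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found" >&2
    fi
}

run_if_present esxcfg-nics -l      # physical NICs (vmnicN) with link state and speed
run_if_present esxcfg-vswitch -l   # vSwitches, port groups, and their uplinks
run_if_present esxcfg-vswif -l     # service console interfaces and IP addresses
```

Comparing the vmnic-to-vSwitch mapping on HOST1 against the healthy host is probably the quickest way to spot a reordered NIC.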
 
LVL 7

Author Comment

by:ottobock
ID: 35492051
Added RAM to both nodes (there are 2). The second node is fine; the error always comes from node 1. Will have a look at the NICs and switches...
 
LVL 16

Assisted Solution

by:Danny McDaniel
Danny McDaniel earned 1000 total points
ID: 35492806
Maybe you accidentally unbalanced the NUMA nodes by putting memory in the wrong slots or using different sizes. That's been known to cause strange behavior in ESX.

If you go to the console and run 'cat /proc/vmware/NUMA/hardware', you'll see whether the amount of memory is the same in both nodes (there will be a few MB of difference, but that should be it).
 
LVL 7

Author Comment

by:ottobock
ID: 35492928
Big thanks for the ideas!

I know the RAM modules are identical in each node (HP UDIMMs, same size/type/mfg for each installed module ~ both nodes are identical). Regardless, it's worth a check, I think. I'm not too well versed in console commands. Once in the console, logged in as root, what exactly do I type? 'cat/proc/vmware/NUMA/hardware' doesn't work. (Thanks for the patience too) :-)
 
LVL 124
ID: 35492965
Is this ESXi?
 
LVL 16

Expert Comment

by:Danny McDaniel
ID: 35492966
There should be a space between 'cat' and the first '/'.

cat is a command, and /proc/vmware/NUMA/hardware is a virtual file whose contents it prints.
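
A minimal sketch of the same check with a guard around it, so a missing file produces a clear message. The default path is the one quoted in this thread; the show_numa function name is made up for illustration.

```shell
#!/bin/sh
# Print the NUMA hardware summary if this host exposes it.
# Pass a different file as the first argument to try the function elsewhere.
show_numa() {
    numa_file=${1:-/proc/vmware/NUMA/hardware}
    if [ -r "$numa_file" ]; then
        cat "$numa_file"    # note the space between "cat" and the path
    else
        echo "not found: $numa_file" >&2
        return 1
    fi
}

# On the host, run it with no arguments:
#   show_numa
```

If the function reports "not found", the host simply does not expose that file, and the command was typed correctly.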
 
LVL 7

Author Comment

by:ottobock
ID: 35493031
VMware vSphere 4.0 U2 ~ Managed by vCenter Server.

Thanks for the console details ~ but I get "No such file or directory". Is there a variable in there, i.e. should NUMA be replaced with the host name or something like that? (I need to study up on working with the ESX CLI and console) :-)
 
LVL 16

Expert Comment

by:Danny McDaniel
ID: 35493223
NUMA is used on AMD processors and the later Intel ones, so you probably don't have to worry about this.

Since you are in the console, check the logs for NIC connection errors with 'grep -i down /var/log/vmkernel*'. If there have been any recent losses in connectivity, we should see some indication of them this way.
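
A small sketch of that log search, trimmed to the most recent matches so the output stays readable. The find_link_down name and the MAX_LINES variable are illustrative, not part of ESX.

```shell
#!/bin/sh
# Show the last few log lines that mention "down" (case-insensitive).
# MAX_LINES is an illustrative knob; it defaults to 20.
find_link_down() {
    grep -i 'down' "$@" | tail -n "${MAX_LINES:-20}"
}

# On the ESX service console:
#   find_link_down /var/log/vmkernel*
```

When several files match the wildcard, grep prefixes each line with its file name, which helps tell rotated logs from the current one.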
 
LVL 7

Assisted Solution

by:ottobock
ottobock earned 0 total points
ID: 35493400
I just got off the line with VMware support; they took remote control of my PC and had a look. They said the alarm is improperly configured, which I'm curious about, because it's a default alarm that I did not create (I only added my email address to it).

What I've done for the moment is disable that alarm and create 3 new alarms, one for each of the 3 triggers: [HA host isolated], [All HA hosts isolated], and [HA host failed].

I would like to see if maybe it is just an alarm issue... So let's see. If it triggers again, at least now I'll know exactly which of the 3 triggers is causing it... Will report back shortly.

Thanks again!
 
LVL 7

Author Comment

by:ottobock
ID: 35494857
So far - all good. Still no errors after changing the alarms.
 
LVL 7

Author Closing Comment

by:ottobock
ID: 35688008
Points for the effort and helping out - thanks!