VMware vSphere HA configuration

I have two ESX 4.0 servers (identical HP DL380 G5) connected to a shared SAS datastore, with three VMs configured. I created a cluster and turned on HA. Everything reports as fine (able to ping hosts, verified DNS, no errors). VMotion works fine and I can migrate a VM from one machine to the other. If I test HA (unplug NICs), the VMs do not migrate and restart as expected. I have walked through every HA guide I can find (created the HA-enabled cluster first and then added hosts to it). The only thing I see is that at the point the server goes offline, vCenter records "HA agent has an error: HA agent has failed"; this is the point at which I would expect it to migrate. Any ideas?
TPolk Asked:
vmwarun - Arun Commented:
What setting have you configured for Host Isolation response?

nappy_d Commented:
Have you configured your guests to start up on another host?

Do you have enough RAM to support all your guests running on one host?

What is the constraint setting for your HA cluster?

ryder0707 Commented:
By the way, this is not a new issue; it has happened since 3.x.

You can try removing all hosts from the cluster and recreating it. Then every ESX/VC server must have its hosts file updated to include the entries below:

- Loopback, always 127.0.0.1 localhost.localdomain localhost
- Local Server IP, FQDN, shortname
- Local Server console IP and <hostname>-cons
- Local Server VMotion IP Address, <hostname>-vmotion
- VirtualCenter server IP address, FQDN, shortname
- IP Address and DNS for all hosts in the same HA/DRS configuration
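
To check a hosts file against a list like the one above, here is a minimal Python sketch. It is only an illustration: the function name `missing_hosts_entries`, the sample addresses, and the host names (`esx01`, `esx02`, `vcenter`, `example.com`) are all hypothetical placeholders, not values from this environment. It is meant to run on an admin workstation against a copy of the file.

```python
def missing_hosts_entries(hosts_text, required_names):
    """Return the required names that no hosts-file line maps to an IP."""
    present = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # First field is the IP address; the rest are names and aliases.
        present.update(name.lower() for name in fields[1:])
    return [n for n in required_names if n.lower() not in present]


# Hypothetical sample data: addresses and names are placeholders only.
sample_hosts = """\
127.0.0.1   localhost.localdomain localhost
192.0.2.11  esx01.example.com esx01
192.0.2.12  esx02.example.com esx02
192.0.2.10  vcenter.example.com vcenter
"""
required = ["localhost", "esx01.example.com", "esx01",
            "esx02.example.com", "esx02",
            "vcenter.example.com", "vcenter", "esx01-vmotion"]

print(missing_hosts_entries(sample_hosts, required))
```

Any name printed is one the file does not resolve; in the sample, the VMotion alias is missing.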

Also make sure the HA cluster uses the settings below (these are standard in the environments I usually support):

Number of host failures the cluster can tolerate: 1
Allow VMs to be powered on even if they violate availability constraints: Enabled
VM Restart Priority: Low
Host Isolation response: Leave VM powered on
Enable Virtual machine monitoring: Not enabled

good luck!

TPolk (Author) Commented:
The machines are set to "leave powered on". I don't see where to configure a VM to start on another host. I will try the hosts file edit and see what the results are.

nappy_d Commented:
Check your settings on the properties of your HA cluster. They should look like the images below.

Picture-1.png
Picture-2.png

TPolk (Author) Commented:
Verified the hosts file settings, created a new cluster, and set HA up on it with:

Number of host failures the cluster can tolerate: 1 <cannot set this with setting below>
Allow VMs to be powered on even if they violate availability constraints: Enabled
VM Restart Priority: Low
Host Isolation response: Leave VM powered on
Enable Virtual machine monitoring: Not enabled

Rebooted the host (without placing it in maintenance mode) and the VM did NOT restart on the other host. Other ideas? Is there a good place to look (support log, etc.) to determine why it isn't working?

TPolk (Author) Commented:
When the host came back up, the VM did restart (but it waited until the host was back online). We have moved VMs around with VMotion and that works fine.

nappy_d Commented:
What messages are in your logs? Do you have any exclamation marks appearing in your VI client for the ESX Hosts?

Look at your event logs...

nappy_d Commented:
Also try enabling virtual machine monitoring for HA. Set it to low and test again.

TPolk (Author) Commented:
Nothing shows up as an error (other than a note that we don't have a redundant management NIC). The only thing that shows up is at the point of failure (host is offline), where there is a message that says "HA agent has error: HA agent has failed". Is there any particular log to look in? We have tried VM monitoring both on and off, but it made no difference.

nappy_d Commented:
Is there anything more to that error message? Is "HA agent has error: HA agent has failed" the full and complete error message?

nappy_d Commented:
Try these steps http://www.no-x.org/?p=155

TPolk (Author) Commented:
Nothing more than that error message.

The steps referenced didn't work (we have ESXi, so no full service console), but I found a similar link using uninstall scripts:

(from the tech support console)

The scripts can be found in /opt/vmware/uninstallers.
To get there:

#cd /opt/vmware/uninstallers

Get a directory listing
#ls
-rwxr-xr-x 1 root root 857 VMware-aam-ha-uninstall.sh
-rwxr-xr-x 1 root root 434 VMware-vpxa-uninstall.sh

To run these scripts,

./VMware-aam-ha-uninstall.sh
./VMware-vpxa-uninstall.sh

The agents are now removed, so re-do the HA config for the cluster

After these steps, I set up HA again and retested, but got the same result.

nappy_d Commented:
Have you purchased vCenter? If so, it comes with some support from VMware.

ryder0707 Commented:
probably now is the time to engage vmware support

nappy_d Commented:
I concur. As I mentioned previously, you will have purchased HA with some version of vCenter. If you did so in the past 30 days, you are afforded some technical support.

TPolk (Author) Commented:
Yes, we have VMware support and I think it is time to engage them. I'll update after we resolve it (maybe we missed something).

vmwarun - Arun Commented:
Have you resolved the HA issue?

ryder0707 Commented:
yeah curious to know what is the actual problem

TPolk (Author) Commented:
Currently at level 3 VMware support. They think it is something environmental, but no answer yet.

shankarvetrivel Commented:
The only thing that I see is that at the point the server goes off-line vCenter records "HA agent has an error: HA agent has failed" - this is at the point that I would expect it to migrate.  Any ideas?
When you configure an HA cluster, each ESX host in the cluster sends a heartbeat to the other ESX hosts. If a host's agent heartbeat gets no response for more than 15 seconds, that host is declared failed or isolated from the network.
Please make sure your ESX hosts can reach the service console gateway.
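
To make the timeout rule above concrete, here is a small Python sketch. It is a simplified model of the idea only, not VMware's actual HA agent logic. The 15 second figure comes from this comment; the function name `hosts_declared_failed` and the host names are hypothetical.

```python
HEARTBEAT_TIMEOUT_SECONDS = 15.0  # figure quoted in this comment, not verified


def hosts_declared_failed(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_SECONDS):
    """Return hosts whose last heartbeat is more than `timeout` seconds old.

    last_heartbeat maps a host name to the time (in seconds) it was last heard from.
    """
    return sorted(host for host, seen in last_heartbeat.items()
                  if now - seen > timeout)


# Hypothetical example: esx02 was last heard from 16 seconds ago.
last_seen = {"esx01": 100.0, "esx02": 84.0}
print(hosts_declared_failed(last_seen, now=100.0))
```

A host that is silent for exactly 15 seconds is still treated as alive here, since the rule says "more than 15 secs".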
 
Apologies if my answers are silly.
 
Thanks
 
 
TPolk (Author) Commented:
Okay, here is the official answer from VMware: there is a bug in the software, and HA will not work if you have it on an internal network that uses public addresses. These devices are on a 9.19.x.x network (sorry, don't ask, I didn't design it).
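
For anyone wanting to check whether their own hosts are in the same situation, here is a minimal Python sketch that tests whether an IPv4 address falls inside the RFC 1918 private ranges. It assumes that "public" here means "outside RFC 1918"; the thread does not say exactly which condition triggers the bug. The helper name `is_rfc1918` and the sample address `9.19.0.1` are illustrative only.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if `address` lies in one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)


# 9.19.0.1 is a made-up address in the 9.19.x.x range mentioned above.
for addr in ("9.19.0.1", "10.1.2.3", "192.168.1.10"):
    print(addr, "private" if is_rfc1918(addr) else "public")
```

An address reported as public would match the situation described in this answer.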