The High Availability (HA) feature in vSphere 4.1 allows a group of ESX/ESXi hosts in a cluster to identify individual host failures and thereby provide higher availability for hosted VMs. HA restarts VMs that were running on a failed host; it is a high-availability solution, not a zero-downtime solution such as application clustering or VMware Fault Tolerance. There will be a period of time when VMs are offline following a physical host failure. This is important to understand, and you should make sure your customers and management are aware of it. HA is a complex topic, but setting it up and using it are fairly straightforward. This article is meant as a quick technical overview of HA to give administrators an understanding of its components and functionality; it by no means covers every detail.
In vSphere 4.1, HA can work with and utilize Distributed Resource Scheduler (DRS) if it is also enabled on the cluster, so it is important to understand what DRS is, though a full description of DRS is outside the scope of this article. DRS continuously monitors the resource usage of hosts within a cluster and can suggest or automatically perform a migration (vMotion) of a VM from one host to another to balance resource usage across the cluster as a whole and prevent any single host from becoming over-utilized. HA is based on Legato’s Automated Availability Manager, and as such you will see some HA-related files and logs on an ESX host labeled with “AAM”. HA requires vCenter for initial configuration, but unlike DRS it does not require vCenter to function after it is up and running.
Primary and secondary nodes:
HA will elect up to five hosts to become primary HA nodes; all other nodes in the cluster are secondary nodes, up to a maximum of 32 nodes in total. (Note: host and node are used interchangeably.) By default, the first five nodes to join the HA cluster will be the primary nodes. If a primary node fails, is removed from the cluster, is placed in Maintenance Mode, or an administrator initiates the “Reconfigure for HA” command, HA will initiate the re-election process to randomly elect five primary nodes. The purpose of a primary node is to maintain node state data, which is sent by all nodes every 10 seconds by default in vSphere 4.1. One primary node must be online at all times for HA to function. For that reason, it is recommended to physically separate primary nodes across multiple racks or enclosures where possible, so that at least one remains online if a rack or enclosure goes down. With a limit of five primary nodes, the maximum allowable number of host failures for a single HA cluster is four. One of the five primary nodes will automatically be designated as the active primary (also called the Failover Coordinator), and it will be responsible for keeping track of restart attempts and deciding where to restart VMs. It is possible to determine which nodes are currently the primary nodes from the ESX console by launching the AAM CLI using the following syntax:
From the AAM CLI, enter the ln command:
From the AAM CLI you can also promote and demote primary nodes manually using the promoteNode and demoteNode commands, respectively, though this is not generally recommended.
How HA determines a host has failed:
Now that we understand the basic layout of HA, let’s talk about how HA determines host failures. This happens in two ways: a host can determine that it is isolated from all other hosts and initiate its configured isolation response, and other nodes can determine that a host has failed and attempt to restart that host's VMs elsewhere. By default, all nodes send heartbeats to other nodes every second across the management network. Primary nodes send heartbeats to all other nodes, and secondary nodes send heartbeats to primary nodes only.
The isolation response setting determines what action a host will take when it determines that it is isolated from all other nodes in the HA cluster. When configuring HA for a cluster, you have three options for the isolation response: Power Off, Leave Powered On, and Shutdown. The options are largely self-explanatory. The main thing to know is that Power Off is equivalent to pulling the power on a physical server; it is not a clean shutdown. In vSphere 4.1, the default isolation response is Shutdown.
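The heartbeat rule in the last two sentences can be sketched in a few lines of Python. This is only an illustration of who sends heartbeats to whom; the function name and the host names are hypothetical and are not part of HA or the AAM CLI.

```python
def heartbeat_targets(node, primaries, all_nodes):
    """Return the nodes that `node` sends heartbeats to.

    Primary nodes send heartbeats to every other node; secondary nodes
    send heartbeats to primary nodes only.
    """
    if node in primaries:
        return sorted(n for n in all_nodes if n != node)
    return sorted(primaries)


# Hypothetical six-host cluster with two primaries, for illustration only.
nodes = ["esx1", "esx2", "esx3", "esx4", "esx5", "esx6"]
primaries = {"esx1", "esx2"}

print(heartbeat_targets("esx1", primaries, nodes))  # a primary
print(heartbeat_targets("esx5", primaries, nodes))  # a secondary
```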
When a host determines that it is no longer receiving heartbeats from any other hosts, it will attempt to ping its isolation address, which by default is the default gateway of the management network. If this fails, the isolation response is triggered. Additional isolation addresses can be configured using the advanced setting das.isolationaddressX, where X is a number starting with 2 and incrementing upwards for each additional address. This is useful to detect a situation where the management network may have failed while the VM networks are still operational. The isolation detection timeline is 16 seconds, with an additional second added for each additional isolation address. The timeline breaks down as follows: the failure occurs at 0 seconds; at 13 seconds without receiving a heartbeat, the isolation address is pinged; if this fails, the host triggers its isolation response at 14 seconds. At 15 seconds the host is declared failed by other hosts in the cluster, and finally at 16 seconds with no heartbeats received the failover coordinator attempts to restart the failed host’s VMs on other nodes. Should the initial restart fail, HA will attempt to restart the VM five more times before abandoning the restart attempt.
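As a quick reference, the default timeline and the "one extra second per additional isolation address" rule can be written down as a small Python sketch. The names here are hypothetical and the numbers come straight from the paragraph above. How the intermediate steps shift when extra addresses are configured is not covered, so only the total is computed.

```python
# Default timeline from the text, as (seconds after failure, event) pairs.
DEFAULT_TIMELINE = [
    (0, "failure occurs"),
    (13, "no heartbeat received; host pings its isolation address"),
    (14, "ping failed; host triggers its isolation response"),
    (15, "other hosts declare the host failed"),
    (16, "failover coordinator attempts to restart the host's VMs"),
]

BASE_DETECTION_SECONDS = 16
# One initial restart attempt plus five more before HA gives up.
MAX_RESTART_ATTEMPTS = 1 + 5


def detection_seconds(additional_isolation_addresses=0):
    """Total detection time: 16 s plus 1 s per additional isolation address."""
    if additional_isolation_addresses < 0:
        raise ValueError("additional_isolation_addresses cannot be negative")
    return BASE_DETECTION_SECONDS + additional_isolation_addresses


for seconds, event in DEFAULT_TIMELINE:
    print(f"{seconds:>2}s  {event}")
print("With two extra isolation addresses:", detection_seconds(2), "seconds")
```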
There is some planning to be done when configuring the isolation response. If you use the default isolation address and isolation response settings (management default gateway and Shutdown, respectively), it is possible for the management network of the host to become disconnected while the VM networks are still online. In this situation, the isolation response would be triggered and your VMs would be shut down even though they are still online and functioning normally. Alternatively, if the isolation response is set to Leave Powered On and a node suffers a complete network failure, your VMs will not be restarted on a functioning host, effectively taking them offline until an administrator intervenes.
Other nodes have no way of knowing whether the failed node is network isolated or has actually crashed. VMware handles this by using locks on the files on the shared storage that make up the VM itself. If a host is network isolated and the VMs are still running, other nodes will not be able to lock the files and power on the VM. This prevents a split-brain scenario where a VM is being run on two hosts at the same time. However, if the isolation response is either Shutdown or Power Off, or if the node has actually failed, the file locks will be released (or time out in the case of a failed node), and the VMs will be available to be powered on by other nodes.
Admission control specifies whether or not VMs can be powered on in the cluster when doing so means there are insufficient resources available to provide failover protection and/or ensure resource reservations are met. Enabling it means that you might be prohibited from powering on VMs in the cluster; disabling it means that all VMs will be allowed to power on even if doing so violates resource or failover limits. When enabled, admission control has three available policies for calculating resource usage: number of host failures tolerated, resource percentage, and a specified failover host. It should be noted that admission control is not respected during an HA-initiated failover, since admission control is enforced via vCenter while an HA failover is initiated by the HA agent on the ESX hosts.
If you choose to specify the number of host failures allowed (up to a maximum of four; remember, there are only five primaries!), HA uses a slot system to determine available resources. A slot is a logical representation of the resources needed to power on any VM in the cluster, taking CPU and memory reservations into account. The size of a slot will be the size of the largest CPU and memory reservation in the cluster, or if no reservations are configured it will be set to 256 MHz CPU and 0 MB plus overhead for RAM. Take this into account when assigning CPU and memory reservations to the VMs in your cluster: large reservations force HA to calculate larger slot sizes, which can lead to poor consolidation ratios and to admission control refusing to power on VMs even though more than adequate resources are available! HA then determines the number of slots available on each host, then subtracts the slots on the number of hosts specified by this setting, starting with the LARGEST host. In other words, if you have a host with a larger amount of RAM than the others in your cluster, it is taken out of the equation when determining the number of slots available in the cluster. This can also lead to resource fragmentation should a VM with a large reservation be powered on when no host has the required number of slots available. In this situation, HA will request DRS to migrate VMs to free up additional slots on a host and allow the VM to power on.
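The slot arithmetic is easier to follow with numbers, so here is a minimal Python sketch of it. It is a simplification built on some assumptions: the class and function names are hypothetical, the per-VM memory overhead values are made up, and a host's slot count is taken as the smaller of its CPU slots and memory slots. It is not how the HA agent is actually implemented.

```python
from dataclasses import dataclass

DEFAULT_CPU_SLOT_MHZ = 256  # used for VMs with no CPU reservation


@dataclass
class VM:
    cpu_reservation_mhz: int = 0
    mem_reservation_mb: int = 0
    mem_overhead_mb: int = 100  # illustrative value, not a real default


@dataclass
class Host:
    cpu_mhz: int
    mem_mb: int


def slot_size(vms):
    """Slot size = largest CPU reservation and largest memory reservation + overhead."""
    if not vms:
        raise ValueError("need at least one VM to size a slot")
    cpu = max(vm.cpu_reservation_mhz or DEFAULT_CPU_SLOT_MHZ for vm in vms)
    mem = max(vm.mem_reservation_mb + vm.mem_overhead_mb for vm in vms)
    return cpu, mem


def slots_on_host(host, cpu_slot, mem_slot):
    # Assumption: a host is limited by whichever resource runs out first.
    return min(host.cpu_mhz // cpu_slot, host.mem_mb // mem_slot)


def slots_after_failures(hosts, vms, host_failures_tolerated):
    """Slots left once the largest hosts are removed from the calculation."""
    cpu_slot, mem_slot = slot_size(vms)
    per_host = sorted(
        (slots_on_host(h, cpu_slot, mem_slot) for h in hosts), reverse=True
    )
    return sum(per_host[host_failures_tolerated:])


vms = [VM(), VM(1000, 2048, 150), VM(0, 4096, 200)]
hosts = [Host(10000, 32768), Host(10000, 32768), Host(10000, 65536)]

print("slot size (MHz, MB):", slot_size(vms))
print("usable slots with 1 host failure tolerated:", slots_after_failures(hosts, vms, 1))
```

In this example a single VM with a 4096 MB memory reservation sets the memory side of the slot for the entire cluster, and the host with the most RAM is the one removed from the total.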
Specifying a percentage of total resources for your admission control policy will cause HA to add up the total resources available, add up the total reserved resources (again using 256 MHz and 0 MB plus overhead for VMs with no reservations), then do the math to determine the total available resource percentage. If the total available percentage is equal to or less than the percentage specified, admission control will prevent more VMs from being powered on in the cluster.
Specifying a failover host creates a hot spare. This is good in that you always have a host standing by in case of a failure, but the downside is that you are not able to use that host for normal operations.
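The percentage math can be sketched as follows. This is an illustration with hypothetical function names and made-up numbers. It handles a single resource at a time; whether the check runs separately for CPU and memory is not covered here, so treat that part as an open assumption.

```python
def available_percent(total, reserved):
    """Percentage of a resource that is not yet reserved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - reserved) / total * 100.0


def power_on_allowed(total, reserved, failover_percent):
    """Power-on is blocked once available <= the configured percentage."""
    return available_percent(total, reserved) > failover_percent


# Hypothetical cluster: 30,000 MHz total CPU, 25% set aside for failover.
print(available_percent(30000, 22500))     # 25.0
print(power_on_allowed(30000, 22500, 25))  # False: available equals the limit
print(power_on_allowed(30000, 15000, 25))  # True: 50% is still available
```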
Individual VM Options:
HA allows administrators to set specific options on a per-VM level as well. VM restart priority can be set to Disabled, Low, Medium or High to specify the order in which VMs should be restarted after a host failure. The Disabled setting prevents HA from restarting a VM. Each isolation response detailed earlier can also be configured at the per-VM level. Settings at this level override settings specified at the cluster level.
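To illustrate how per-VM settings interact with the cluster setting, here is a short Python sketch. The function name, the VM names, the cluster default of Medium, and the alphabetical tie-break between VMs of equal priority are all assumptions made for the example.

```python
PRIORITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


def restart_order(vm_priorities, cluster_default="Medium"):
    """Return VM names in restart order.

    vm_priorities maps a VM name to its per-VM priority, or to None when
    the VM inherits the cluster-level setting. Disabled VMs are skipped.
    """
    effective = {
        name: (prio if prio is not None else cluster_default)
        for name, prio in vm_priorities.items()
    }
    candidates = [name for name, prio in effective.items() if prio != "Disabled"]
    # Ties are broken by name only to keep the example deterministic.
    return sorted(candidates, key=lambda name: (PRIORITY_RANK[effective[name]], name))


vms = {"db": "High", "web": None, "test": "Disabled", "batch": "Low"}
print(restart_order(vms))  # ['db', 'web', 'batch']
```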
VM and Application Monitoring:
HA can monitor VMs and applications as well. VM monitoring watches the heartbeat received from VMware Tools within the guest OS, as well as the I/O usage of the VM itself. If the VMware Tools heartbeat is not received, HA looks to see whether any I/O has been generated by the VM in the last 120 seconds (by default). If no I/O has been seen, the VM is determined to have failed and is reset. Application monitoring functions similarly, though it requires the appropriate SDKs or applications that support it.
Many HA parameters are alterable via the use of advanced options. The list is fairly long, so rather than copy and paste from a VMware document, I’ll just provide a link. The advanced options table is on page 27 of the vSphere Availability Guide:
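The VM monitoring decision boils down to two checks, sketched here in Python. The function and parameter names are hypothetical, and treating exactly 120 seconds of silence as a failure is an assumption; only the 120-second default window comes from the text.

```python
DEFAULT_IO_WINDOW_SECONDS = 120


def vm_should_be_reset(heartbeat_received, seconds_since_last_io,
                       io_window=DEFAULT_IO_WINDOW_SECONDS):
    """Reset only if the Tools heartbeat is missing AND the VM has been I/O silent."""
    if heartbeat_received:
        return False  # guest is still heartbeating, nothing to do
    # Heartbeat missing: use recent I/O activity as a second opinion.
    return seconds_since_last_io >= io_window


print(vm_should_be_reset(True, 500))   # False: heartbeat is fine
print(vm_should_be_reset(False, 10))   # False: VM is still generating I/O
print(vm_should_be_reset(False, 300))  # True: no heartbeat and no recent I/O
```

The I/O check exists so that a VM whose heartbeat has stopped but which is still doing work is not reset unnecessarily.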