vCenter 5.1 HA

Here is our setup. In one cluster we have 4 IBM Flex blades and 4 new Cisco UCS blades. EVC is enabled. The hosts running on the UCS blades boot from SAN. As a test today we dropped both sides of the network fabric to simulate a host failure. vCenter tried to move the running VM to another random host but failed. Now I have noticed that I can vMotion between Flex blades but not from Flex to UCS. I can only do it if the VM is off. I believe this has to do with the difference in the CPUs even though UCS is enabled. Also, the VMs in question have extremely high reservations, and the UCS hosts are the only ones that do not throw resource errors when I try to move the VMs.

1) With boot from SAN, if both sides of the network fail, would the VMs keep running? I believe the default isolation response is to leave them powered on.

2) When HA selects a host to fail over to, how does it select the host? Least utilized?

Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
"vCenter tried to move the running VM to another random host but failed."

Did it?

Because VMware HA does not move running VMs via vMotion.

VMware HA restarts VMs on other available hosts after a host failure. No vMotion is involved with VMware HA. (This is a cold start, like a cold migration, so EVC does not apply!)

I also think you mean EVC is enabled, BUT have you made sure that the EVC baseline is set to the lowest CPU generation present in the cluster? If you can vMotion from one host to another but not in the reverse direction, that would suggest it is not.
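
To make the baseline point concrete, here is a minimal sketch in Python. It is a simplified model, not VMware's actual compatibility check: the ordered list of Intel generations, the host names, and the generation assigned to each host are all assumptions for illustration (the thread does not say which CPUs the Flex and UCS blades have).

```python
# Oldest to newest. Assumed ordering of Intel CPU generations, for illustration only.
INTEL_EVC_ORDER = ["merom", "penryn", "nehalem", "westmere", "sandybridge", "ivybridge"]


def lowest_common_baseline(host_generations):
    """Return the oldest CPU generation found among the hosts in the cluster."""
    if not host_generations:
        raise ValueError("need at least one host")
    return min(host_generations.values(), key=INTEL_EVC_ORDER.index)


def can_live_migrate(vm_baseline, target_generation):
    """A running VM can land on a host only if that host is at or above the VM's baseline."""
    return INTEL_EVC_ORDER.index(target_generation) >= INTEL_EVC_ORDER.index(vm_baseline)


if __name__ == "__main__":
    # Hypothetical cluster: the generations below are made up.
    cluster = {"flex-01": "westmere", "ucs-01": "sandybridge"}
    baseline = lowest_common_baseline(cluster)
    print("baseline to configure:", baseline)
    for host, generation in cluster.items():
        print(host, "accepts VMs at this baseline:", can_live_migrate(baseline, generation))
```

In this model a VM powered on at the oldest generation can move to every host, while a VM powered on at a newer baseline can only move to the newer hosts, which is the one-way behaviour described above.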

So I think your opening post is a bit confused as to what VMware HA, vMotion, and DRS do!

1. Correct - the default, if not changed, is Leave powered on.

2. VMware HA's priority is to restart VMs on other available hosts, fast! So it's not least utilized; you can end up with heavily loaded hosts after an HA event!

That is why it's important, if you have HA and DRS (licensed), that you enable DRS, because it will kick in and rebalance the load afterwards. CPU and memory reservations are checked to see if a host has the resources for the VM to be started on it!
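
The reservation check can be pictured with a small Python sketch. This is only a toy model of "does the host have enough unreserved capacity for this VM's reservation", not the real HA placement logic; the class names, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_mhz_unreserved: int  # CPU capacity not yet promised to other VMs
    mem_mb_unreserved: int   # memory not yet promised to other VMs


@dataclass
class VM:
    name: str
    cpu_mhz_reserved: int
    mem_mb_reserved: int


def hosts_that_fit(vm, hosts):
    """Return the names of hosts whose unreserved CPU and memory cover the VM's reservations."""
    return [
        h.name
        for h in hosts
        if h.cpu_mhz_unreserved >= vm.cpu_mhz_reserved
        and h.mem_mb_unreserved >= vm.mem_mb_reserved
    ]


if __name__ == "__main__":
    # Made-up numbers: a VM with a large memory reservation only fits on the bigger host.
    vm = VM("big-vm", cpu_mhz_reserved=4000, mem_mb_reserved=65536)
    hosts = [Host("flex-01", 8000, 32768), Host("ucs-01", 8000, 131072)]
    print(hosts_that_fit(vm, hosts))
```

A VM with very high reservations can therefore have very few valid restart targets, whatever the current utilisation of the hosts looks like.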

For testing HA, we much prefer a real, live test: just power off, reset, or pull the power cable out of a host!
compdigit44 (Author) commented:
Hancock, good to hear from you. You are correct. The VMs that vCenter did try to move were powered off, but the powered-on VMs have the following message listed under events.

"vSphere HA unsuccessfully failed over this virtual machine. VSphere HA will retry if the max number of attempts has not been reached..."

Also, regarding HA: HA does not care about host resource usage, which is where DRS comes in. But what happens when DRS is in Partial or Manual mode?
Justin Paul, Solutions Architect / Blogger, commented:
You dropped both sides of what fabric? The FIs on UCS?

It sounds to me like you pulled too much stuff at once and HA couldn't move something because it wasn't there anymore.

Also, just as a side note, vCenter is not involved at all in the failover itself... it's a function of the ESXi hosts (the FDM agents). So even if vCenter goes down, HA still occurs.

If you want to simulate a host failure you are better off just hard powering down the blade... this is much more realistic... Also, if you want to simulate a network failure, only pull one side at a time; pulling two is unrealistic in 99% of cases. At that point you should just move your stuff to a better datacenter or get newer hardware. :)
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
If your DRS is in Partial or Manual mode, your hosts will stay heavily loaded until such time as you do something about it!

The answer for the HA failover will be in the logs: fdm.log on the hosts.
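
If the log is large, a small filter helps narrow it down. The Python sketch below is just a generic case-insensitive line filter, run against a copy of the log taken off the host. The file name in the usage comment is an assumption, and the sample lines are placeholders, not real FDM output.

```python
def matching_lines(lines, *keywords):
    """Return the lines that contain every keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    return [ln.rstrip("\n") for ln in lines if all(k in ln.lower() for k in wanted)]


if __name__ == "__main__":
    # Placeholder lines for illustration only; real fdm.log entries look different.
    sample = [
        "placeholder: failover attempt for vm app01 failed\n",
        "placeholder: heartbeat received\n",
        "placeholder: Failover attempt for vm db02 started\n",
    ]
    for line in matching_lines(sample, "failover", "app01"):
        print(line)

    # Against a copied log file (path is an assumption):
    # with open("fdm.log", errors="replace") as f:
    #     for line in matching_lines(f, "failover", "my-vm-name"):
    #         print(line)
```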
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
As an aside, if you do not have the following volumes on your bookshelf, I would *HIGHLY RECOMMEND* them. These are the best source of information on VMware vSphere HA and DRS in the world!

VMware vSphere HA and DRS Technical deepdive

By Duncan Epping and Frank Denneman


Written by Duncan Epping and Frank Denneman, both of whom are Consulting Architects at VMware and are perceived by the industry as Subject Matter experts on VMware High Availability and VMware Distributed Resource Scheduler.

This book zooms in on two key components of every VMware based infrastructure. It covers the basic steps needed to create a VMware HA and DRS cluster, and goes on to explain the concepts and mechanisms behind HA and DRS which will enable you to make well educated decisions. You will get the tools to understand and implement e.g. HA admission control policies, DRS resource pools and resource allocation settings and more.

VMware vSphere 5.1 Clustering Deepdive on Amazon
compdigit44 (Author) commented:
All, thank you so much for the great advice. My colleague was only supposed to pull one side of the Cisco FI, but I think they did both at once... which makes sense of why HA would fail, because all paths to connect to the other hosts were down.

On a side note, since I am still new to UCS: what would cause an event where both sides of the fabric go down? Also, if one side goes down, how quickly does VMware pick up the fact that the path is active again?
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Power failures are common; it depends on whether you have a UPS or generator.

The FDM agents (master and slaves) and the heartbeats between them act very quickly. (Heartbeats are sent every second!)
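
As a rough picture of heartbeat-based detection, here is a minimal Python sketch. It is not how FDM is implemented: the function name, the host names, and the timeout value are all assumptions, and the timeout is a parameter precisely because the real thresholds are not stated in this thread.

```python
def silent_hosts(last_heartbeat, now, timeout_s):
    """Return the hosts whose most recent heartbeat is older than timeout_s seconds."""
    return sorted(name for name, ts in last_heartbeat.items() if now - ts > timeout_s)


if __name__ == "__main__":
    # Timestamps in seconds; values are made up for illustration.
    last_seen = {"flex-01": 100.0, "ucs-01": 88.0}
    print(silent_hosts(last_seen, now=101.0, timeout_s=10))
```

With heartbeats arriving every second, a host that goes quiet stands out within a few missed beats, and one that comes back is seen again as soon as its next heartbeat lands.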
compdigit44 (Author) commented:
Hancock, I went back and read the links you posted earlier and they were very good.

So, in the situation I posted to begin with, a boot-from-SAN UCS blade that loses both sides of the fabric would be considered isolated / failed. Since it is boot from SAN, what would happen to the VMs? I would assume they would lose their network connection, since the CNA card on the UCS uses QoS to split the network traffic between management and VM traffic.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Hosts that boot from SAN are in the same position as hosts that boot from USB/SD cards: remove the device the operating system lives on (e.g. the USB/SD card, the boot LUN, or both disks of a local mirrored pair) and the OS loses its boot device.

All will behave outside design specifications because the OS is dead. They will not have access to configuration information and will not be able to write. To be honest, anything could happen, because the OS boot drive disappearing is a CRITICAL failure!

We would really be guessing... they would network isolate at least, but I do not think it would be a stable production environment.
