VMware vSphere lost connectivity, all guests powered off

I have 2 hosts with HA and FT; each has a fiber link to an iSCSI SAN.  Each also has a separate vswitch for guests (vswitch1) and for console/vmkernel (vswitch0).

I lost physical connectivity to the switch (it was power cycled).  I got an alert about vswitch0 losing connectivity, but no alert about vswitch1.

Here's the weird bit - when the switch came back online, ALL of the guests on both hosts were powered off.

As far as I can tell the fiber switch did not lose power - so the connection to the SAN *should* have been good the entire time.

My theory is that both hosts tried to vMotion their guests to the other host, which was also down, and the net result is that everything powers off.

Question 1: would loss of L1 connection for console/kernel cause all guests to end up powered off?

Question 2: if I lost connection from hosts to SAN, shouldn't I see an alert related to the storage adapters as well?
Q1: Check the settings for your HA cluster...does it say to power down the VMs, or to vMotion them?
Q2: Not necessarily; there may not even be anything in the logs (a similar situation happened to me a few months back...I lost connection to a host, don't know why, and nothing was in the logs).

Paul Solovyovsky, Senior IT Advisor, commented:
Check the isolation response on your cluster.  If the isolation response is set to power off, then HA did exactly what it was configured to do.  Basically, if an ESX host can't communicate with the default gateway, it powers off its VMs to avoid both ESX hosts bringing up the same VM at the same time.

This will explain:

This is due to the "isolation response" setting in the HA cluster settings. When a host cannot communicate with the other hosts in the cluster, it tries to ping the default gateway; if that also fails, the host concludes that it is isolated from the network. The host then acts according to your "isolation response" setting. I believe you have it set to "power off VMs".
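
To make the sequence described above concrete, here is a minimal Python sketch of the decision as explained in this thread. It is a simplified model, not VMware's actual implementation: the names `HostView` and `isolation_action` are hypothetical, and the real HA agent involves heartbeat timing and other details that are left out here.

```python
from dataclasses import dataclass


@dataclass
class HostView:
    """What one host can see over its management network (hypothetical model)."""
    peers_reachable: bool    # can it still hear the other cluster hosts?
    gateway_reachable: bool  # does the default gateway answer a ping?


def isolation_action(view: HostView, isolation_response: str = "power off") -> str:
    """Return what the host does with its VMs in this simplified model."""
    # A host only considers itself isolated when it can reach neither
    # its peers nor the default gateway.
    if view.peers_reachable or view.gateway_reachable:
        return "leave VMs running"
    # Isolated: apply whatever the cluster's isolation response says.
    return isolation_response


# Management switch is power cycled: both hosts lose peers and gateway at once.
for name in ("host1", "host2"):
    print(name, "->", isolation_action(HostView(False, False)))
```

With the switch down, both hosts see the same thing (no peers, no gateway), so in this model both apply the isolation response and every guest ends up powered off, which matches what was observed.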

snowdog_2112 (Author) commented:
You are correct: the isolation response was set to "power off", and the failuredetectiontime was the default (15s).  The switch was offline for about 45 minutes.
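
As a quick sanity check on those numbers, here is a small Python sketch comparing the outage length with the detection window. It assumes a simplified rule (the host declares isolation once the outage lasts at least the failure detection time); the function name `isolation_declared` is hypothetical, and the real HA timing has more nuance than this.

```python
def isolation_declared(outage_seconds: float,
                       failure_detection_seconds: float = 15.0) -> bool:
    """Simplified rule: isolation is declared once the outage reaches the window."""
    return outage_seconds >= failure_detection_seconds


outage = 45 * 60  # the switch was offline for about 45 minutes
print(outage, "s outage vs 15 s window ->", isolation_declared(outage))
```

A 2700-second outage is far beyond a 15-second window, so under this assumption the isolation response was bound to fire on both hosts.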

This is good, however...the VM guys (keep in mind, these are the knuckleheads who set this up - I'm just cleaning up the mess) said "the switch rebooted, so start a case with Cisco to look at the switch logs".

Whoo...that's funny stuff...

Um...yeah, I know the switch freaked out, but why did all the VMs POWER OFF?!  Wow.

Thanks for the links!  Very useful stuff!
snowdog_2112 (Author) commented:
I split points because coolsport was first, paulsolov led me to the info, and rvivek provided some good background/foundation info.  THANKS A TON!!!
Paul Solovyovsky, Senior IT Advisor, commented:
Just a quick question.  You said that each host has a fiber link to an iSCSI SAN.  Is the link to the SAN fiber or copper?  iSCSI usually uses either a hardware or a software initiator, but most of the time even the hardware initiator (HBA) is still copper (Cat5/RJ45).
Question has a verified solution.
