IKtech

asked on

VMware isolation

I have read a little about isolation response and I was hoping to get some suggestions on a few things.

First, I have a cluster with two hosts in it, with HA turned on.  Second, one of the hosts will stop responding, and the only thing I can do is physically reboot the host to get it back.  It will not respond to a ping or anything else once this has happened.

I have read about adding a secondary isolation IP address for the cluster and also increasing the failure detection time.  Are these two things pretty safe to implement without any bad side effects?

Also, could there be a false positive that is causing an isolation of one of my VMware hosts?

I have had this happen in the past and it seemed to be related to our shared storage devices.  Please help!!  I have been struggling with this same problem for a while now and nothing seems to fix it.  Thanks experts!!!
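
For reference, the two settings mentioned in the question are normally entered as vSphere HA advanced options on the cluster. The sketch below only illustrates the key/value pairs involved. It assumes the option names das.isolationaddressX, das.usedefaultisolationaddress and das.failuredetectiontime; as far as I know the last one only applies to older releases, and the numbering of X differs between versions, so check the availability guide for your release. The helper name, the addresses and the 30000 ms value are made up for the example.

```python
def build_ha_advanced_options(extra_isolation_addresses,
                              start_index=1,
                              failure_detection_ms=None,
                              use_default_gateway=True):
    """Return HA advanced options as a dict of option name -> string value.

    Hypothetical helper for illustration only: it builds the pairs you
    would type into the cluster's HA advanced options dialog and does
    not talk to vCenter.
    """
    options = {}
    # One das.isolationaddressX entry per extra address.
    # ASSUMPTION: whether X starts at 0 or 1 depends on the vSphere version.
    for offset, address in enumerate(extra_isolation_addresses):
        options[f"das.isolationaddress{start_index + offset}"] = address
    # Optionally stop HA from also using the default gateway as an isolation address.
    if not use_default_gateway:
        options["das.usedefaultisolationaddress"] = "false"
    # ASSUMPTION: das.failuredetectiontime (milliseconds) exists only on older releases.
    if failure_detection_ms is not None:
        options["das.failuredetectiontime"] = str(failure_detection_ms)
    return options


if __name__ == "__main__":
    example = build_ha_advanced_options(["192.168.1.10"], failure_detection_ms=30000)
    for name, value in example.items():
        print(f"{name} = {value}")
```

Whatever address is used should be something that is always up and reachable from the management network, otherwise the extra address can itself become a source of false isolation.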
gheist

I am sorry, but HA just stops when hosts are isolated; the VMs continue running on them.
IKtech

ASKER

Can I adjust any settings to keep the host isolation from happening?
Are you isolating the VM network and the VMkernel network on different NICs, as per best practices?

ASKER

Yes, I believe so.
Can I see a picture of how the vSwitches look in your ESXi (with anything that identifies your organisation, like network numbers or names, blurred)?

Also, what are the make and model of your network switches? Some need their cache time and forward delay configured to keep vMotion and HA happy. For example, if you have 3 switches between your ESXi hosts, a gratuitous ARP message could get lost, leaving one of them believing that the VM is still on the other ESXi host.

ASKER

Both hosts are set up identically as far as the vSwitches go, just with some different IP addresses.  I can upload the second host's config if you like.
vmware-host-not-responding.PNG
vMotion and storage traffic are heavy. Which VMkernel ports does storage use? Does the host go offline when you copy a huge file?

ASKER

It does seem to happen when shared storage is working hard.  The last few times I have had trouble with this, I had a bad hard drive in one of my storage devices.  I am using QNAP devices with 4 drives each in RAID 10.  I am running block scans and SMART tests on the drives to see if I can find an issue with a drive.  I am also looking into getting new drives that are 6 Gb/s and 10k rpm vs. what I have now, 3 Gb/s and 7200 rpm.
3 Gb/s could very well saturate a 1 Gb/s FT logging interface.

ASKER

As a troubleshooting step, I am currently not using FT at all, and it still happens.
Try switching between fixed media speed and auto-select...

Do you have two isolation hosts, and are they up all the time?
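
One rough way to check whether the isolation addresses really are up all the time is to ping them periodically from a machine on the same management network. The following is a minimal sketch, not what HA itself does internally. The function names and the addresses are hypothetical, and it assumes a system ping command that accepts -c (or -n on Windows).

```python
import platform
import subprocess


def ping_once(address, timeout_s=2):
    """Return True if a single ping to `address` succeeds, else False."""
    # Windows ping uses -n for the packet count; most other systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or no ping command available: treat as unreachable.
        return False
    return result.returncode == 0


def unreachable_addresses(addresses, probe=ping_once):
    """Return the addresses for which the probe reports no reply."""
    return [address for address in addresses if not probe(address)]


if __name__ == "__main__":
    # Hypothetical isolation addresses; replace with your own.
    down = unreachable_addresses(["192.168.1.1", "192.168.1.10"])
    print("Not answering:", down if down else "none")
```

Running this from a scheduled job and logging the result gives a hint as to whether an isolation address drops out around the time a host goes unresponsive. A result from a workstation is only an indication, since the hosts test reachability from their own management interfaces.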

ASKER

Are you talking about the switch ports the VM host is connected to for the fixed media speed and auto-select?

I just have two hosts and yes, they are up all the time.  Are there some settings I can check to verify whether one or both are isolation hosts?

Thanks!
You need to check all interfaces on the storage switch. If one of them goes down, at least one host is isolated.
ASKER CERTIFIED SOLUTION
IKtech


ASKER

Replacing a bad drive seems to have fixed it for now.