Fault tolerance

sara2000 used Ask the Experts™
Experts out there, as I understand it, we can only implement a VMware fault tolerance system with shared storage or vSAN.
This can be expensive for a small network with five VMs.
Is there any other way in which I can implement a fault tolerance system?
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

It really depends on how quickly you need to recover: how long can your business be without its services?

0 minutes
5 minutes
60 minutes
2 hours
4 hours

The more you pay, the more resilience and availability you get...

You could, for instance, have two ESXi hosts...

linked back to back with a 10GbE interface, and replicate VMs from A to B every 15 minutes!

In the event Host A fails, switch on Host B and power on all VMs - how long would that take you?

and you've got 15 minutes of data loss...
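
To put rough numbers on that setup, here is a minimal sketch that estimates worst-case data loss and downtime for interval-based replication. Only the 15-minute interval comes from the example above; the copy, detection, and power-on times are made-up placeholders, so substitute your own measurements.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPlan:
    interval_min: float   # how often a replication pass starts
    copy_min: float       # assumed duration of one replication pass
    detect_min: float     # assumed time to notice that Host A is down
    power_on_min: float   # assumed time to power on all VMs on Host B

    def worst_case_data_loss(self) -> float:
        # Host A can fail just before a pass completes, so the newest usable
        # copy on Host B is one interval plus one copy time old.
        return self.interval_min + self.copy_min

    def worst_case_downtime(self) -> float:
        # Manual failover: notice the outage, then start the VMs on Host B.
        return self.detect_min + self.power_on_min


# 15 is from the example above; the other three values are placeholders.
plan = ReplicationPlan(interval_min=15, copy_min=5, detect_min=10, power_on_min=10)
print(f"worst-case data loss: {plan.worst_case_data_loss():.0f} min")
print(f"worst-case downtime:  {plan.worst_case_downtime():.0f} min")
```

If a replication pass takes only a few seconds, the data-loss figure collapses to roughly the interval itself, which is where the 15 minutes comes from.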

Other FT technologies exist, e.g. Double Take HA (which does not require shared storage).
David Favor, Fractional CTO
Distinguished Expert 2018

As Andrew mentioned, the only real consideration for low tech + cheap failover systems is the amount of data loss that can be tolerated.

Once you determine this, you have your update frequency.

Also, how you replicate your data is a determining factor.

Here is how I do this for LAMP stacks running WordPress:

1) Set up the site in an LXD container.

2) Clone the container onto another machine as the spare.

3) Set up an rsync job to run every 15 minutes to sync all files related to the app/site. This job also does a mysqldump/sync/drop/reload in the spare container.

4) Each container has its own public IP, with a 5 minute TTL on the DNS record pointing at the IP.

5) If the production container (#1) crashes, then failover requires a simple IP change to the spare container, which anyone can do, so no calls in the middle of the night.

In many cases, this is all that's required.
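
As a rough illustration of the sync job in step 3, here is a minimal sketch. The host name, site path, and database name are hypothetical placeholders, and it assumes SSH key access to the spare plus MySQL credentials in the usual client config on both sides. It is one possible shape for such a job, not the exact script described above.

```python
import subprocess

# Hypothetical placeholders; none of these names come from the post.
SPARE = "spare.example.com"
SITE_DIR = "/var/www/site/"
DB = "wordpress"


def rsync_cmd(spare=SPARE, site_dir=SITE_DIR):
    # -a preserves permissions and times, -z compresses in transit,
    # --delete removes files on the spare that no longer exist in production.
    return ["rsync", "-az", "--delete", site_dir, f"{spare}:{site_dir}"]


def dump_cmd(db=DB):
    # --single-transaction takes a consistent dump of InnoDB tables without
    # locking them. By default the dump contains DROP TABLE IF EXISTS
    # statements, which handles the drop/reload on the spare.
    return ["mysqldump", "--single-transaction", db]


def load_cmd(spare=SPARE, db=DB):
    # Replays the dump inside the spare over SSH.
    return ["ssh", spare, "mysql", db]


def run_sync():
    """One sync pass; schedule it every 15 minutes, e.g. from cron."""
    subprocess.run(rsync_cmd(), check=True)
    # The whole dump is held in memory, which only suits small databases.
    dump = subprocess.run(dump_cmd(), check=True, capture_output=True)
    subprocess.run(load_cmd(), input=dump.stdout, check=True)


# Dry run: show what a pass would execute without touching either machine.
for cmd in (rsync_cmd(), dump_cmd(), load_cmd()):
    print(" ".join(cmd))
```

Calling `run_sync()` from a scheduled job performs the real pass. With `check=True`, a failed step raises an error instead of silently leaving the spare half-updated.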

Note: The data sync step (#3) determines real sync frequency.

For example, if you clone an entire VM container, this may take a very long time.

It is much faster to just clone the actual files + databases related to the app/site.

Also, if your database is very large, then your database backup/sync/drop/reload step may take a long while.

You'll have to experiment to determine what frequency works best.


We can tolerate 60 minutes of downtime. I only know VMware technology. Andrew, you mentioned back-to-back links without shared storage.
I would appreciate it if you could shed some light on replication.
David, I have no knowledge of containers, and I want something simple that I can implement and manage myself.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017
VMware vSphere Replication, Veeam Backup and Replication, Nakivo Backup and Replication, or Zerto!

Four products to choose from, and you may already be licensed for vSphere Replication!

Have you looked at a Synology NAS?

They are inexpensive for a few VMs! 10GbE, NFS, jumbo frames, SSD support!
kevinhsieh, Network Engineer

If you can handle 60 minutes of downtime, that RTO (recovery time objective) puts you in the realm of recovery from backup. What is the recovery point objective (RPO), AKA how much data can you lose?

If your storage is reasonably fast, you can recover VMs in less than an hour, and lose only maybe 15 minutes of data or less. Backup is the least expensive option, and the most important one. Even if you have an HA or FT system, you should always have a backup.

A highly available system with shared storage would have an RPO of essentially zero to a few seconds, as uncommitted transactions would be rolled back. This is a classic cluster where, if something happens to a host, the VMs reboot on another host, typically within a minute or so. RTO is a few minutes.

A fault tolerant system (FT) runs the VMs on 2 hosts in parallel. The RPO and RTO are zero. This is the most expensive form of availability to have, and is for when even a few minutes of downtime is too much. RTO may actually be a few seconds, as the system detects the failure and switches networking to the other system. I have two FT systems from Stratus Technologies (stratus.com). They use specialized hardware to be able to run what is essentially 2 motherboards in lock step with each other. Not cheap.
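
The three tiers above can be summed up as a small decision helper. This is only a sketch: the function name is invented for illustration, and the cut-offs (60 and 15 minutes for backup, 5 minutes for "a few minutes" of HA restart time) are rough assumptions read off the figures above, not vendor guidance.

```python
def suggest_tier(rto_min: float, rpo_min: float) -> str:
    """Map tolerated downtime (RTO) and data loss (RPO), in minutes, to a tier."""
    # Backup restore: roughly an hour to recover, about 15 minutes of data lost.
    if rto_min >= 60 and rpo_min >= 15:
        return "restore from backup"
    # HA cluster: VMs reboot on another host within minutes, and only
    # uncommitted transactions (seconds of work) are lost.
    if rto_min >= 5 and rpo_min > 0:
        return "HA cluster with shared storage"
    # Anything tighter needs VMs running in parallel on two hosts.
    return "fault tolerant (FT) system"


for rto, rpo in [(60, 15), (10, 1), (0, 0)]:
    print(f"RTO {rto} min, RPO {rpo} min -> {suggest_tier(rto, rpo)}")
```

Whichever tier the numbers point to, a backup still belongs underneath it.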
Philip Elder, Technical Architect - HA/Compute/Storage

Azure Stack HCI, hyper-converged. The cost is very attractive. Make sure to obtain the Datacenter license via an SPLA (Service Provider Licensing Agreement) arrangement with the vendor. That makes it palatable.


It seems to me that VR/Veeam fits a reasonable budget and also lets me meet the RPO. If I go in that direction, I will install VR/Veeam on the source ESXi host. Everything will be fine as long as the ESXi hosts are alive. But suppose, for example, the source ESXi host gets a PSOD: how do I recover the VMs at the destination without VR/Veeam?
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

Log in to the destination ESXi host and just start the VMs!
