Various types of replication and HA: disaster scenarios

Can anyone give a real beginner's guide to the types of scenario (e.g. physical disaster, user error, hardware failure) that the various replication/DR features at both hardware and software level actually save you from?

For argument's sake, say your setup consists of 5 ESXi hosts in an HA cluster, 2 SANs at different physical offices with hardware-level replication between them, and backup software that also does some form of replication.

What kind of failure or issue does each of the 3 levels actually save you from? I need to get clear on why you need all 3 and which types of disaster each one covers: the ESXi HA cluster of hosts, SAN-to-SAN replication, and the software-level replication feature in your backup solution.

David Favor (Linux/LXD/WordPress/Hosting Savant) commented:
This is a large number of scenarios.

I worked in IBM HACMP for a decade + the set of disaster possibilities tended to be unique for each client.

Better to start by hiring someone who's familiar with many scenarios, to audit your specific situation + assist you in setting up your HA environment.

I take a fairly simple approach.

3x+ site instances. Located at least 100 feet above sea level. Located in at least 2x countries.

Use DNS round robin for simple HA + if one site stops responding, either to packets (ping) or at the service level (Apache/MariaDB), then the IP for that instance is pulled out of DNS rotation. Set TTLs to 600 seconds (10 minutes).

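This means that when one of 3x instances fails, roughly one visitor in three may hold cached DNS data pointing at the failed instance + be unable to reach the site until the TTL expires (10 minutes at most). Averaged over all visitors, that is about TTL/number of site instances, so 10 minutes/3x instances == 3ish minutes. After that, all visitors round robin between 2x instances, rather than 3x.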

The greater the number of instances, the shorter the outage time.
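The rotation rule and the TTL arithmetic above can be sketched in Python. This is only an illustration: the `Site` class, the function names, and the documentation-range IP addresses are all hypothetical, and a real setup would run actual ping and service checks and then update DNS through whatever provider is in use.

```python
from dataclasses import dataclass


@dataclass
class Site:
    ip: str            # hypothetical address of one site instance
    ping_ok: bool      # result of a packet-level check (ping)
    services_ok: bool  # result of a service-level check (e.g. Apache/MariaDB)


def healthy_rotation(sites):
    """Return the IPs that should stay in DNS round robin rotation."""
    return [s.ip for s in sites if s.ping_ok and s.services_ok]


def outage_estimate(ttl_seconds, instance_count):
    """Return (worst_case, rule_of_thumb) outage in seconds.

    worst_case: a visitor with a freshly cached record for the failed
    instance waits up to the full TTL.
    rule_of_thumb: the TTL / number-of-instances figure used in the post.
    """
    return ttl_seconds, ttl_seconds / instance_count


sites = [
    Site("192.0.2.10", ping_ok=True, services_ok=True),
    Site("198.51.100.10", ping_ok=True, services_ok=False),  # service down
    Site("203.0.113.10", ping_ok=True, services_ok=True),
]

rotation = healthy_rotation(sites)
worst, rough = outage_estimate(ttl_seconds=600, instance_count=len(sites))
print(rotation)
print(f"worst case {worst}s, rule of thumb {rough:.0f}s")
```

Adding instances lowers the rule-of-thumb figure, but the worst case stays at one full TTL, so a shorter TTL is the lever for the worst case.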

Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant, commented:
"For arguments sake, say your setup consists of 5 ESXi hosts in a HA cluster - 2 SAN's at different physical offices which do a HW level replication between each, and you also have a backup software that also does some form of replication?"

Okay, so you have all the data and VMs at a different office. What configures them and starts them up?

These features are about physical failures and reducing single points of failure.

VMware vSphere HA - covers failure of a host or hosts; the VMs will restart on other hosts in the cluster.

SAN-to-SAN replication - covers failure of the storage and all VMs, but you still need some sort of hosts at the DR site, with a SAN, to start the VMs up, which may require additional configuration unless you have Site Recovery Manager.

Replication from backup software - gives you the ability to start VMs from standby on other equipment, in the state they were in when they were last replicated.
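As a rough summary of the three layers described above, here is a small Python lookup table. The failure labels are paraphrases rather than VMware terminology, and treating backup-software replication as an answer to storage failure is an assumption made for illustration, since the description above only says it lets you start standby VMs as of the last replication.

```python
# Layer -> what it covers and what it depends on, paraphrased from the
# descriptions above. Labels and structure are illustrative only.
LAYERS = {
    "vSphere HA": {
        "covers": ["host failure"],
        "note": "VMs restart on other hosts in the cluster",
    },
    "SAN-to-SAN replication": {
        "covers": ["storage failure"],
        "note": "needs hosts and a SAN at the DR site; may need extra "
                "configuration unless Site Recovery Manager is used",
    },
    "backup software replication": {
        # Assumption: standby VMs are used when primary storage is lost.
        "covers": ["storage failure"],
        "note": "standby VMs start in the state of the last replication",
    },
}


def layers_covering(failure):
    """Return the names of the layers that list the given failure."""
    return [name for name, info in LAYERS.items() if failure in info["covers"]]


for failure in ("host failure", "storage failure", "user error"):
    print(failure, "->", layers_covering(failure) or "no layer listed")
```

A failure type that returns no layer is a gap to raise in planning, not proof that nothing covers it.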

The more money you have to budget, the less downtime you'll get. The question is how much downtime the business can afford: 60 seconds, 1 hour, 4 hours, 8 hours, 24 hours, 1 week?
Gerald Connolly commented:
Hopefully you already have a Business Continuity Plan in place, and this will obviously include a Recovery Point Objective (RPO) and a Recovery Time Objective (RTO)! And as Andrew said, the smaller the RTO, the more expensive it gets!
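To make RPO and RTO concrete, here is a minimal Python sketch that compares a replication schedule against the two objectives. It assumes scheduled (asynchronous) replication, where the worst-case data loss is roughly one replication interval; the function name and the example figures are hypothetical.

```python
from datetime import timedelta


def meets_objectives(replication_interval, estimated_recovery, rpo, rto):
    """Check a replication setup against RPO and RTO.

    Assumption: with scheduled replication, a failure just before the next
    run loses up to one full interval of changes.
    """
    worst_case_data_loss = replication_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": estimated_recovery <= rto,
    }


# Example: replicate every 4 hours, expect 1 hour to bring VMs up at DR.
result = meets_objectives(
    replication_interval=timedelta(hours=4),
    estimated_recovery=timedelta(hours=1),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)
```

In this example the recovery time fits the RTO, but a 4-hour replication interval cannot satisfy a 1-hour RPO, so replication would need to run more often.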

So disaster scenarios could include:
User error: i.e. deleting or corrupting data. If your data is being replicated, then the copy at the destination is also deleted/corrupted, so you need to be able to recover from this type of error, e.g. by being able to back out transactions
Hardware failures: i.e. disk, server, comms. How do you recover, and how do you handle split-brain issues?
Building failure: power, aircon etc., plus denial of access: riots, fire, flood, earthquake
Cross-site link failures/ISP failures/ISPs going out of business/man with a backhoe etc.
Backups failing
Being Hacked
Malicious Staff
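The user error item in the list above is worth a small illustration. The Python sketch below is a toy model, not any vendor's replication engine: `replicate` and `take_snapshot` are hypothetical helpers, and the "storage" is just a dictionary. It shows a deletion being copied faithfully to the replica while an earlier point-in-time copy still holds the file.

```python
import copy

# Toy "primary storage": file name -> contents.
primary = {"orders.db": "v1", "report.doc": "v1"}
snapshots = []  # point-in-time copies kept by the backup layer


def replicate(source):
    """Mirror the current state, including any mistakes in it."""
    return copy.deepcopy(source)


def take_snapshot(source, history):
    """Keep an independent point-in-time copy."""
    history.append(copy.deepcopy(source))


take_snapshot(primary, snapshots)   # backup taken before the mistake
del primary["report.doc"]           # user error: file deleted
replica = replicate(primary)        # replication copies the deletion too

print("report.doc" in replica)          # the replica has lost the file
print(snapshots[-1]["report.doc"])      # the snapshot still has it
```

This is why replication alone is not a substitute for backups with retained restore points.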

Then there is the initial seeding of replication, how you fail over (and back), how to restore after a failure, how to handle split-brain issues, and what happens to inter-site transactions in flight at the point of failure.

Plus lots of other stuff

And as I said earlier, your business continuity plan is the key, not just for IT, but for how the whole company will survive a disaster.