infrastructure failure

Despite having a failover/cluster design for your VMware environment and hosts, can major infrastructure failures still lead to loss of services, or can almost everything be protected against in terms of hardware failure? Can anyone provide examples, perhaps at levels below the hosts, of what could still lead to service outages? Our tech team seem adamant that their infrastructure is almost faultless and that there is failover for almost any type of failure on the network/infrastructure, but I am keen to understand whether that's wishful thinking, and to hear examples of issues that can take everything offline.
pma111 Asked:

Brian BEE Topic Advisor, Independent Technology Professional Commented:
I've never had anything like that happen yet, although there is always a chance that something might not migrate or fail over. That's why you make sure you have a failure test plan that is approved by management and run on a regular schedule.

The biggest cause of problems can be not allowing for things outside your control (extended power failure, disaster, etc.).

Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant Commented:
I can give you an example: a flood in a city that hit a very large international insurance company. The datacentre was on the ground floor of a building, the flood defences failed, and the datacentre was wiped out!

However, fear not, they failed over to the DR site, only to find that it also failed, because all the fibre and telecoms links had failed due to water damage!

So they were out for 24-36 hours! To resolve the issues, they shipped new equipment (SANs and hosts) into the DR site, with data replicated from off-site storage.

Since then, this company Enacts DR Program every MONTH to TEST!
noci, Software Engineer Commented:
The best-solved problem in clustering is failure of equipment (mostly computing power).
The real trouble is communication failure, especially when neither site fails but each goes its own way (split brain). I only know of one system that can guard you against it: OpenVMS-based clusters, which include automatic recovery.

Also, the best setups are done in triangles, not as two-site setups. The problem is that not a lot of software can actually handle that.
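To illustrate why a triangle helps against split brain, here is a minimal sketch of a strict-majority quorum rule, assuming one vote per site. The function names and the site labels are hypothetical, and this is not meant to describe how OpenVMS or any particular cluster product implements quorum.

```python
from __future__ import annotations


def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """Strict majority: a partition needs more than half of all configured votes."""
    return votes_seen > total_votes // 2


def surviving_partitions(
    partitions: list[list[str]], votes: dict[str, int]
) -> list[list[str]]:
    """Return the partitions that are still allowed to run services."""
    total = sum(votes.values())
    return [
        p for p in partitions
        if has_quorum(sum(votes[site] for site in p), total)
    ]


# Two sites, link between them fails: neither side has a majority,
# so neither may continue on its own (no split brain, but also no service).
two_site = surviving_partitions([["A"], ["B"]], {"A": 1, "B": 1})

# Three sites, site A is cut off: B and C still see 2 of 3 votes and carry on.
three_site = surviving_partitions(
    [["A"], ["B", "C"]], {"A": 1, "B": 1, "C": 1}
)
```

With two sites a communication failure leaves no partition with a majority, whereas with three sites the two that can still see each other keep running and the isolated one stops.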

serialband Commented:
"Since then, this company Enacts DR Program every MONTH to TEST!"
The major failure in having redundancy is never testing it to be sure that it works.
Bryant Schaper Commented:
Besides your traditional hardware failures, you can also experience outages from configuration errors, especially in routing protocols and routing in general. Sometimes these are on your end, other times on the carrier's.

Speaking of carriers, they have their outages as well. Recently we had a section of the city go dark when a large backbone fiber was damaged. It impacted most carriers, as they all tend to eventually rely on a major national company for service.
noci, Software Engineer Commented:
If your requirements are serious, you may have multiple geographically separated connections between sites. Be sure they STAY separated.
I heard of a case where a company contracted two different cable providers to keep the connections separate (about 500 km between sites). The cable routes were known and validated.
After a merger between the two cable companies, the new company decided to merge operations and decommission one cable, transparently migrating the connections to the remaining cable without warning the customer.
Then the shit did hit the fan: there was no connection anymore. After the technical fallout, it would have been cheaper for the cable company to keep the old cable, given the damages it had to pay. (There was a contractual obligation to keep the connections separated in case one company was absorbed by any other entity.)
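A periodic check that the two routes really share nothing can be sketched as a simple set intersection. This assumes you can obtain, from each provider, a list of identifiers for the cable segments or ducts a circuit currently uses; the segment names below are made up for illustration.

```python
def shared_segments(route_a, route_b):
    """Return the segment IDs that appear in both routes, sorted for stable output."""
    return sorted(set(route_a) & set(route_b))


# Hypothetical segment lists as originally contracted: fully diverse.
primary = ["north-duct-1", "north-duct-2", "north-landing"]
secondary = ["south-duct-1", "south-duct-2", "south-landing"]
before_merger = shared_segments(primary, secondary)

# Hypothetical lists after a silent migration onto the same cable.
secondary_migrated = ["north-duct-1", "north-duct-2", "south-landing"]
after_merger = shared_segments(primary, secondary_migrated)

if after_merger:
    print("Routes are no longer diverse; shared segments:", after_merger)
```

An empty result means the routes are still diverse. Any non-empty result is a shared point of failure worth raising with the carrier.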
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant Commented:
Is there a bug here? #a42486502 is not green as an Assisted Answer, and neither is noci's, #a42486913.

Very odd.
Question has a verified solution.