Dead_Eyes asked:

Hyper-V QUORUM

Hi all, I am setting up a new 6-node cluster as pictured below and plan on enabling replication between SAN1 & SAN2 (the SANs are a single Dell EqualLogic group). The cluster is split between two rooms, and I am having trouble figuring out what would happen if one room lost power or one of the switches failed, as it's quite easy to see a situation where the 3 nodes on either side can still see the SAN due to the replication. Can anyone shed some light on whether this setup looks particularly prone to split brain? Thanks in advance.
User generated image
Shabarinath TR

In my view, a power failure or a switch failure in one room is fine. It will take down the servers in that room, while the cluster continues to run from the other room.

More importantly, you should be worried about the case where the two rooms become isolated from each other.

I don't have much experience with Hyper-V clusters using storage replication, though I do have experience with Hyper-V and clusters in general. So let's wait for other experts to give more clarity.

Good luck !
That architecture is indeed problematic. Technically, it would not even be a valid cluster configuration, as all cluster nodes *should* be writing to the same storage, not replicated storage.

Now some SANs (including some Equallogics) do support true simultaneous writes and if the two units lose contact with each other, writes are stopped. That would result in the cluster going down, but it would *prevent* the two SANs from getting out of sync. In that type of configuration though, "replication" is not quite the accurate terminology. And what is presented to the cluster appears as a single instance of shared storage, so the cluster is also valid.

In such a configuration, even losing half your cluster, as long as you've configured a quorum disk, you'll still stay up as one half will achieve quorum.  But if you try to shortcut it and let your SANs lazily replicate, you'd actually have to bend over backwards to get the cluster to validate, and should you lose the storage switch and the core switch, you would indeed have an issue as both could theoretically reach quorum using their "half" of the replicated quorum disk.  

I suppose you could, in theory, not replicate the quorum disk and still replicate everything else, so that even in the event of a full "cut the baby in half" scenario, only one side has the quorum disk...but that still seems like it is jumping through a lot of hoops to make a topology work that the OS and SAN designers never intended, and could still go wrong in a myriad of other very bad, very data corrupting, very unsupportable ways.   Best avoided.
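The vote arithmetic behind the two scenarios above can be sketched in a few lines of Python. This is only a simplified model, assuming six nodes with one vote each plus a single disk witness vote and a strict-majority rule; the function names and room labels are hypothetical, and it deliberately ignores how a real failover cluster arbitrates ownership of the witness disk or adjusts votes dynamically.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_held > total_votes // 2


def partitions_with_quorum(node_counts, witness_visible_to, total_nodes=6):
    """Return the partitions that would claim quorum in this simplified model.

    node_counts: mapping of partition name -> number of surviving nodes.
    witness_visible_to: partitions that can reach a copy of the quorum disk
    (assumption: each such partition counts the witness vote for itself).
    """
    total_votes = total_nodes + 1  # six node votes plus one disk witness vote
    winners = []
    for name, nodes in node_counts.items():
        votes = nodes + (1 if name in witness_visible_to else 0)
        if has_quorum(votes, total_votes):
            winners.append(name)
    return winners


# The two rooms are cut off from each other, three nodes on each side.
rooms = {"room_a": 3, "room_b": 3}

# True shared storage: only one side can hold the quorum disk.
single = partitions_with_quorum(rooms, {"room_a"})

# Lazily replicated quorum disk: each side sees its own copy.
both = partitions_with_quorum(rooms, {"room_a", "room_b"})

# Quorum disk unreachable from either side.
none = partitions_with_quorum(rooms, set())

print(single, both, none)
```

With one copy of the quorum disk, only one room reaches the 4 of 7 votes needed. With a replicated copy visible on both sides, both rooms reach 4 in this model, which is the split-brain case described above.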
Dead_Eyes

ASKER

Hi Shabarinath, Thanks for your input.

Hi Cliff, with the EqualLogic's replication the cluster sees a single IP while the LUNs sync with each other in the background; it's not like one half of the cluster connects to one SAN and the other half the other. What I am worried about is that if communication were lost, for whatever reason, between nodes 1-3 & 4-6, all servers would still see the quorum disk, so I am unsure what would happen.
Like I said, some EqualLogics support this, some don't.

"its not like one half of the cluster connects to one SAN and the other half the other"

If that is indeed how you have your SANs configured, how... even theoretically... would all servers see the quorum disk in the event of a split? You just said they *don't* connect that way...
Hi Cliff,

Both halves would see the quorum disk, as both SANs would still respond on the shared IP if separated. If communication were lost between nodes 1-3 & 4-6 as described previously, all nodes would still be able to see the SAN and the quorum disk.
If both SANs can respond on the shared IP, then the earlier statement from you that I quoted, "it's not like one half of the cluster connects to one SAN and the other half the other", is false. That's *exactly* what it's like. And that's the failure in the architecture.
ASKER CERTIFIED SOLUTION
Cliff Galiher

Thanks, I get the "big picture" now. I will have a talk with Dell support and see if we can configure an active/passive solution that will protect against this.