Dead_Eyes
asked on
Hyper-V QUORUM
Hi all, I am setting up a new 6 node cluster as pictured below and plan on enabling replication between SAN1 & SAN2 (the SANs are a single Dell EqualLogic group). The cluster is split between two rooms, and I am having trouble figuring out what would happen if one room lost power or one of the switches failed, as it's quite easy to imagine a situation where the three nodes on either side can still see the SAN due to the replication. Can anyone shed some light on whether this setup looks particularly prone to split brain? Thanks in advance
That architecture is indeed problematic, and technically it would not even be a valid cluster configuration, as all cluster nodes *should* be writing to the same storage, not replicated storage.
Now some SANs (including some Equallogics) do support true simultaneous writes and if the two units lose contact with each other, writes are stopped. That would result in the cluster going down, but it would *prevent* the two SANs from getting out of sync. In that type of configuration though, "replication" is not quite the accurate terminology. And what is presented to the cluster appears as a single instance of shared storage, so the cluster is also valid.
In such a configuration, even losing half your cluster, as long as you've configured a quorum disk, you'll still stay up as one half will achieve quorum. But if you try to shortcut it and let your SANs lazily replicate, you'd actually have to bend over backwards to get the cluster to validate, and should you lose the storage switch and the core switch, you would indeed have an issue as both could theoretically reach quorum using their "half" of the replicated quorum disk.
I suppose you could, in theory, not replicate the quorum disk and still replicate everything else, so that even in the event of a full "cut the baby in half" scenario, only one side has the quorum disk...but that still seems like it is jumping through a lot of hoops to make a topology work that the OS and SAN designers never intended, and could still go wrong in a myriad of other very bad, very data corrupting, very unsupportable ways. Best avoided.
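The quorum arithmetic described above can be sketched with a toy model (a hypothetical illustration, not Microsoft's actual cluster vote logic): with six node votes plus one witness-disk vote, a partition stays up only if it holds a strict majority of the seven total votes, which is why a clean 3/3 split is decided by whichever side owns the witness disk.

```python
# Hypothetical sketch of failover-cluster quorum arithmetic in a
# "node majority + disk witness" model; not the real cluster code.

def has_quorum(partition_nodes: int, total_nodes: int, holds_witness: bool) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    total_votes = total_nodes + 1                        # every node votes, plus the witness disk
    partition_votes = partition_nodes + (1 if holds_witness else 0)
    return partition_votes > total_votes // 2

# 6-node cluster split 3/3: only the side that owns the witness disk survives.
print(has_quorum(3, 6, holds_witness=True))   # True  (4 of 7 votes)
print(has_quorum(3, 6, holds_witness=False))  # False (3 of 7 votes)
```

With a single shared witness disk only one partition can ever hold that extra vote, so at most one side survives a split; the danger discussed below is what happens when replication hands *each* side a copy of it.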
ASKER
Hi Shabarinath, Thanks for your input.
Hi Cliff, the EqualLogic's replication means the cluster sees a single IP while the LUNs sync with each other in the background; it's not like one half of the cluster connects to one SAN and the other half to the other. What I am worried about is that if communication was lost between nodes 1-3 and 4-6 for whatever reason, all servers would still see the quorum disk, so I am unsure what would happen
Like I said, some EqualLogics support this, some don't.
"its not like one half of the cluster connects to one SAN and the other half the other"
If that is indeed how you have your SANs configured, how... even theoretically... would all servers see the quorum disk in the event of a split? You just said they *don't* connect that way...
ASKER
Hi Cliff,
They would both see the quorum disk, as both SANs would still respond on the shared IP if separated. So if communication was lost between nodes 1-3 and 4-6 as described previously, all nodes would still be able to see the SAN and the quorum disk
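The failure mode being described can be shown with the same kind of toy vote count (hypothetical illustration only): if lazy replication leaves each half of the cluster able to reach its own copy of the quorum disk, both partitions count the witness vote, both reach a majority, and both stay up writing to diverging storage.

```python
# Toy illustration (not real cluster logic): when replication presents each
# partition with its own copy of the quorum disk, both sides count the
# witness vote, so both claim quorum -- the classic split-brain scenario.

def votes(nodes: int, sees_witness: bool) -> int:
    return nodes + (1 if sees_witness else 0)

TOTAL_VOTES = 6 + 1                  # six nodes plus one witness disk
MAJORITY = TOTAL_VOTES // 2 + 1      # 4 votes needed

side_a = votes(3, sees_witness=True)   # nodes 1-3 reach SAN1's copy of the disk
side_b = votes(3, sees_witness=True)   # nodes 4-6 reach SAN2's copy of the disk

print(side_a >= MAJORITY, side_b >= MAJORITY)  # True True: both halves stay up
```

Contrast this with a genuinely shared witness disk, where only one side could ever pass `sees_witness=True` and the other would shut down cleanly.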
If both SANs can respond on the shared IP, then the earlier statement from you that I quoted, "it's not like one half of the cluster connects to one SAN and the other half the other", is false. That's *exactly* what it's like. And that's the failure in the architecture.
ASKER CERTIFIED SOLUTION
ASKER
Thanks I get the "big picture" now. I will have a talk with Dell support and see if we can configure an active / passive solution that will protect against this.
But more importantly - you should be worried if the two rooms get isolated from each other.
I don't have much experience with Hyper-V clusters using storage replication, but I do have experience with Hyper-V and clusters. So let's wait for other experts to give more clarity.
Good luck!