vBR_IT_Masters


Exchange 2010 DAG Network Issues

I have a 3 mailbox server DAG configured with 2 servers in one datacenter and the 3rd member of the DAG located at a DR site. The DR mailbox server has all databases lagged 24 hours and activation is explicitly prohibited. The datacenter has the mailbox databases split, with 5 active on MBX01 and 5 active on MBX02. So, the active/passive load is split. We can fail all databases (total of 10) to one server or the other.

My issue arises when the P2P connection between the datacenter and the DR site goes up and down briefly. Exchange takes the databases that are online at the datacenter site and moves them all to either MBX01 or MBX02, whichever it chooses. This shouldn't be happening, because both MBX01 and MBX02 are online, uninterrupted, on a local subnet. The file share witness is also located on the same local subnet as MBX01 and MBX02.

The primary datacenter and DR site are connected via a single P2P connection. There is direct routing from subnet 192.168.100.x (the datacenter) to subnet 192.168.101.x (the DR site). Two separate DAG networks are configured in Exchange, one for each subnet.

Any ideas why Exchange would fail the DBs between the local servers when the P2P from the datacenter to the DR site goes down? Thanks!
Member_2_4940386

I had a similar problem and it was solved by increasing the delay and threshold on the cluster service failover.  

http://technet.microsoft.com/en-us/library/dd197562(WS.10).aspx

Here are the commands I used:

cluster.exe /prop SameSubnetDelay=2000:DWORD
cluster.exe /prop CrossSubnetDelay=4000:DWORD
cluster.exe /prop CrossSubnetThreshold=10:DWORD
cluster.exe /prop SameSubnetThreshold=10:DWORD
Akhater
That should not be happening at all. What errors show in the event log?
vBR_IT_Masters

ASKER

Thanks, I've already used the cluster settings you mentioned but I believe they are too narrow to help. When the P2P connection goes down it's usually for 30-60 seconds (or more) and that's well outside the maximum configurable cluster settings.

The event logs reference loss of communication between the two primary MBX servers on the local subnet. Which I know is not the case. Nothing is occurring that would disrupt their communication on the same subnet directly connected to the same switch. It's always our P2P provider that drops the connection between the datacenter and DR.
Are they always failing to the same node?

Can you test by actually shutting off the connection to the DR site and seeing what happens? What I mean is, what is the status of the server database copies? Do they continue to replicate locally or not?
Good questions, I will bring down the P2P this evening and test. They do continue to replicate locally when the P2P connection goes down. However, I haven't captured the error logs in detail.
What I saw when going through my logs was when the connection was lost or lagging to the remote datacenter all of the servers would go into kind of an "unknown" state for some reason where they were not able to talk to each other or the FSW.  Increasing the threshold and delay allowed them more time to settle down before failing over and the problem decreased.
Just to point out, the FSW is unused in a 3-server DAG.
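To illustrate the point about the FSW, here is a rough sketch of the vote counting. It assumes the usual DAG behavior: with an odd number of members the cluster uses Node Majority and the witness has no vote, while with an even number it uses Node and File Share Majority. The function names are made up for this example and the model ignores everything except vote counts.

```python
def uses_witness(members: int) -> bool:
    # Assumption: the witness only votes when the member count is even.
    return members % 2 == 0


def has_quorum(members: int, reachable: int, witness_reachable: bool = False) -> bool:
    # Total votes in the cluster: one per member, plus the witness if it is in use.
    total_votes = members + (1 if uses_witness(members) else 0)
    # Votes the surviving partition can count on.
    votes = reachable + (1 if uses_witness(members) and witness_reachable else 0)
    # Quorum requires a strict majority of all votes.
    return votes > total_votes // 2


# 3-member DAG that loses the DR node: MBX01 and MBX02 still hold 2 of 3 votes.
print(has_quorum(members=3, reachable=2))  # True
```

By this count, losing only the link to the DR node should leave the two datacenter servers with quorum, which is why the failovers described here look unexpected.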
I think I would need a delay of a minute (or more) but that's not possible, correct?
A CrossSubnetDelay of 4000 and a CrossSubnetThreshold of 20 will give you a total of 80 seconds.
Thanks. This could possibly be my problem: http://3cvguy.com/?p=101
I think the threshold max is 10.

Just to be clear, I don't think you need to increase the threshold to incorporate the entire time that the remote node is down, just long enough so that the cluster is able to "calm down" and shake out exactly what has happened.  

By default, the delay is 1000 ms and the threshold is 5, giving you 5 seconds of no response before the cluster fails over. In my experience, when my remote node lagged or was disconnected, it took the cluster more than 5 seconds to regain quorum, or realize it already had quorum, or whatever it was doing, and it was just panicking and failing over at the primary site. By increasing the delay to 2000 ms and the threshold to 10, it then had 20 seconds to determine the state of the cluster instead of 5, and my issue went away.
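The arithmetic behind the numbers quoted in this thread can be sketched as below. It assumes the tolerated window is simply the heartbeat delay (in milliseconds) multiplied by the threshold of missed heartbeats; the function name is made up for this example. Whether a threshold of 20 is actually accepted on a given Windows version is disputed in this thread, so the last line only shows what the product would be.

```python
def detection_window_seconds(delay_ms: int, threshold: int) -> float:
    # Heartbeat interval times the number of missed heartbeats tolerated,
    # converted from milliseconds to seconds.
    return delay_ms * threshold / 1000


print(detection_window_seconds(1000, 5))   # 5.0  (defaults quoted above)
print(detection_window_seconds(2000, 10))  # 20.0 (delay 2000, threshold 10)
print(detection_window_seconds(4000, 20))  # 80.0 (delay 4000, threshold 20)
```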
ASKER CERTIFIED SOLUTION
Member_2_4940386

This solution is only available to members.
CrossSubnetDelay is 4000 max and CrossSubnetThreshold is 20 max.
The article you pointed to is very interesting. Are you running on Rollup4 V2?
No, actually I'm still on the base SP1 with no rollups. I know, I need to update.
But this issue is more Windows-related than Exchange-related. Are you on 2008 R2 or 2008 SP2?
2008 R2.