Exchange Site Resilience Procedure

The context:

One of the problems I ran into when deploying an Exchange 2010 DAG was how to get site resilience with only two nodes. I looked through the Microsoft documentation and the TechNet forums for such a procedure, but none of the ones I found worked for my customer's case. So I wrote my own, based on my understanding of how a DAG works and on the benefit of enabling DAC mode.

The project was to set up a backup site for Exchange. My customer had a limited budget, so buying another server for site resilience was not an option and I had to work with the existing resources. Having accepted that challenge, it was up to me to find a way to do it at no additional cost.

At first I thought it was not possible to set up a DAG with site resilience in this case (two nodes), because the documentation indicated that a minimum of three nodes was required to enable DAC (Datacenter Activation Coordination) mode. With Exchange 2010 SP1 the minimum was lowered to two nodes, so it can be done!

The Principle:

Two sites

Two Exchange servers with all roles installed (MBX, HT, CAS)

Two witness servers, one in each location (one primary, the other the alternate used in the activation procedure)

Both Exchange servers are also domain controllers (even though Microsoft TechNet states that this configuration is not supported)

Installation of Exchange (refer to the earlier post here)

Setting up DAG (refer to the earlier post here)
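For reference, DAC mode and the alternate witness are both set on the DAG object. Here is a minimal sketch, assuming the DAG is simply named DAG (as in the commands used in this post) and WS2 is the alternate witness. The witness directory path is a placeholder of mine, not a value from this setup.

```powershell
# Enable DAC mode on the DAG (possible on a two-member DAG from Exchange 2010 SP1).
Set-DatabaseAvailabilityGroup -Identity DAG -DatacenterActivationMode DagOnly

# Pre-configure the alternate witness located in the backup DC.
# The directory below is a hypothetical path; use one that suits your witness server.
Set-DatabaseAvailabilityGroup -Identity DAG -AlternateWitnessServer WS2 -AlternateWitnessDirectory "C:\DAGFileShareWitness"

# Review the resulting settings.
Get-DatabaseAvailabilityGroup -Identity DAG | fl Name,DatacenterActivationMode,WitnessServer,AlternateWitnessServer
```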

Site Resilience Procedure

Scenario:

SRV1 and WS1 in the primary DC; SRV2 and WS2 in the backup DC.

A leased line (LL) with 1 Mb of bandwidth between the sites

SRV1 holds the active DB; SRV2 holds the DB copy in a Standby/Healthy state.

Users connect to SRV1 CAS

Manual activation of the copy works.
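The starting state described above can be checked from the Exchange Management Shell. This is only a sketch: DB is the placeholder database name used in this post, and SRV1/SRV2 stand for the two mailbox servers.

```powershell
# Show every copy of the database with its status and queue lengths.
Get-MailboxDatabaseCopyStatus -Identity DB | ft Name,Status,CopyQueueLength,ReplayQueueLength

# Test manual activation: move the active copy to SRV2, then back to SRV1.
# Each move asks for confirmation before it runs.
Move-ActiveMailboxDatabase DB -ActivateOnServer SRV2
Move-ActiveMailboxDatabase DB -ActivateOnServer SRV1
```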

If the leased line goes down, users still have access to their mail, because SRV1 can still reach quorum and keep its DB mounted as long as WS1 is accessible. If WS1 also becomes unreachable, SRV1 dismounts its DB within a minute and shows it in a Disconnected/Healthy state.

This is normal DAG behavior to protect against DB corruption: only a server that holds quorum has the right to mount its DB.

On the other side (backup DC), SRV2, isolated while the LL is down, has no access to WS1 and cannot mount its DB (it stays in a Dismounted/Healthy state).
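The quorum reasoning is just a majority count. The sketch below assumes the usual arrangement for a two-member DAG, where each member and the witness carry one vote (three votes in total). Test-HasQuorum is a hypothetical helper written for illustration, not an Exchange cmdlet.

```powershell
# Hypothetical helper: quorum means strictly more than half of all votes are reachable.
function Test-HasQuorum {
    param(
        [int]$TotalVotes,
        [int]$ReachableVotes
    )
    return (($ReachableVotes * 2) -gt $TotalVotes)
}

# LL down, SRV1 still sees WS1: 2 of 3 votes, so SRV1 keeps its DB mounted.
Test-HasQuorum -TotalVotes 3 -ReachableVotes 2   # True

# SRV2 is isolated and sees only itself: 1 of 3 votes, so it cannot mount.
Test-HasQuorum -TotalVotes 3 -ReachableVotes 1   # False
```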

Once the LL is re-established, the only noticeable change is that SRV2 (backup DC) resynchronizes its DB copy with the DB on SRV1. If SRV1 goes down for any reason while the LL is up and WS1 is reachable, the DAG automatic failover simply activates the copy on SRV2. In that case, though, users will not have access to their mail, because they were using the CAS role on SRV1, which is no longer available. We have two alternatives here:

The first is to change the RpcClientAccessServer attribute on the database so that it points to the CAS on SRV2, using the cmdlet: Set-MailboxDatabase -Identity DB -RpcClientAccessServer FQDN-SRV2. The second is to change the DNS record for CAS1 to point to CAS2. With the first one, the Outlook profile has to be repaired so that it picks up the new configuration. I would recommend the DNS change; it is the simplest way.
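For the attribute-based alternative, it helps to look at the current value before changing it. A short sketch, with DB and FQDN-SRV2 as the same placeholders used above:

```powershell
# Show which CAS endpoint is currently stamped on the database.
Get-MailboxDatabase -Identity DB | fl Name,RpcClientAccessServer

# Point the database at the CAS role on SRV2.
Set-MailboxDatabase -Identity DB -RpcClientAccessServer FQDN-SRV2
```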

Site Resilience

Now let's consider that the primary DC suffers a disaster and is no longer available.

SRV1 and WS1 are down, users have lost their connectivity to Exchange, and in the backup DC SRV2 cannot mount its DB because it has no access to the witness server and cannot reach quorum. The following procedure uses the alternate witness server WS2 and forces quorum, so that SRV2 is able to mount its DB copy.

First we need to stop SRV1 in the DAG so that it can be excluded later, using the command: Stop-DatabaseAvailabilityGroup -Identity DAG -MailboxServer FQDN-SRV1

It will take a while trying to contact SRV1, which is unreachable, and will end with an error saying that it could not update the configuration on SRV1. Don't worry, that is not important in this case; only the configuration on SRV2 has to be updated.

We can verify that SRV1 is stopped in the DAG with the cmdlet: Get-DatabaseAvailabilityGroup -Identity DAG | fl name,*server*

The result lists the names and state of the DAG members. If the previous cmdlet worked, SRV1 should be in the StoppedMailboxServers list and SRV2 in the StartedMailboxServers list. The next step is to stop the Cluster service on SRV2 so that it can be restarted with quorum forced.
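Put together, the stop phase can be sketched as follows, run from SRV2. The -ConfigurationOnly switch is an optional variation I am adding here: it updates only Active Directory instead of trying to reach the failed server. DAG and FQDN-SRV1 are the placeholders used in this post.

```powershell
# Mark SRV1 as stopped in the DAG without trying to contact it.
Stop-DatabaseAvailabilityGroup -Identity DAG -MailboxServer FQDN-SRV1 -ConfigurationOnly

# Confirm which members are listed as started and which as stopped.
Get-DatabaseAvailabilityGroup -Identity DAG | fl Name,StartedMailboxServers,StoppedMailboxServers

# Stop the Cluster service on SRV2 so the restore step can bring it back with quorum forced.
Stop-Service ClusSvc
```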

The most important step comes next; after it, SRV2 should mount its DB copy. The Restore-DatabaseAvailabilityGroup command performs a sequence of actions:

Excluding the failed member from the DAG

Using the preconfigured alternate witness server to provide quorum

Starting the Cluster service with quorum forced

Rebuilding the DAG with WS2 and SRV2

The syntax is: Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Default-First-Site-Name

We could add other switches to this command, but this simple form should do the job in our case. Once the command has run, it takes SRV2 a few seconds to mount its database copy automatically.
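If the alternate witness was not preconfigured on the DAG, it can be passed to the restore command directly. The sketch below assumes WS2 as the alternate witness and uses a placeholder directory path of mine; the final command simply checks where the database ended up.

```powershell
# Restore the DAG in the surviving site, naming the alternate witness explicitly.
# The directory is a hypothetical path, not a value from this setup.
Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Default-First-Site-Name -AlternateWitnessServer WS2 -AlternateWitnessDirectory "C:\DAGFileShareWitness"

# Check that the database is mounted and on which server.
Get-MailboxDatabase -Identity DB -Status | fl Name,Mounted,MountedOnServer
```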

At this point users still cannot access their mail, and we need to point them to the SRV2 CAS, either by changing the DNS record for the CAS to point to SRV2 or by changing the RpcClientAccessServer attribute on the DB: Set-MailboxDatabase DB -RpcClientAccessServer FQDN-SRV2

Once that is done, you need to repair the Outlook profile so that it picks up the new configuration.

That's it: you now have a fully working backup DC until the primary DC comes back. The remaining question is what procedure to apply when the primary DC is back online, after you have restored your servers from tape or any other backup media.

Well, it's simple. Here's the answer:

By now SRV2 in the backup DC holds the active DB copy, and a lot of mail will have flowed while the primary DC was offline. This is where DAC mode is useful: it ensures that when the primary DC comes back online, SRV1 will not take over and mount its DB!

First, start SRV1 in the DAG using: Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer FQDN-SRV1

Once that is done, SRV1 shows its DB as synchronizing (from the SRV2 copy). After the synchronization finishes, the DB on SRV1 will be Healthy. You can monitor the copy queue and the replay queue of the SRV1 copy using Get-MailboxDatabaseCopyStatus -Identity DB\SRV1

Now you can manually activate the DB on SRV1 and get things back to normal. Finally, the last thing to do is to point the DNS record for the CAS back to SRV1, or to revert the RpcClientAccessServer attribute the same way we did before.
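The whole failback can be sketched as one sequence, using the same placeholders (DAG, DB, SRV1, FQDN-SRV1). The bare Set-DatabaseAvailabilityGroup call is an extra step that is not part of the procedure above; as I understand it, it makes the DAG re-apply its quorum and witness settings once both members are back.

```powershell
# Bring SRV1 back into the DAG.
Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer FQDN-SRV1

# Extra step (my assumption): re-apply the DAG settings so the quorum model is correct again.
Set-DatabaseAvailabilityGroup -Identity DAG

# Watch the SRV1 copy until it is Healthy and both queues have drained.
Get-MailboxDatabaseCopyStatus -Identity DB\SRV1 | fl Name,Status,CopyQueueLength,ReplayQueueLength

# Activate the database on SRV1 again (asks for confirmation).
Move-ActiveMailboxDatabase DB -ActivateOnServer SRV1

# Point clients back to the CAS on SRV1 if the attribute was changed earlier.
Set-MailboxDatabase -Identity DB -RpcClientAccessServer FQDN-SRV1
```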

Hope this helps you.

Comments (1)

Hardik Desai

IT Architect and Trainer

Commented: 2015-08-29

If you are aware that DAG member is not supported on DC because failover cluster service is not supported, you should have not done this. You are putting your customer at risk and losing trust from your customer and other potential customers.