hongedit asked:

Maintaining DAG and CAS

In my environment, what is the correct way to maintain my Exchange Servers - either scheduled reboots, power downs, patching, etc? Is there anything I need to do at all?

2 Exchange Servers - All roles installed on each
Both in DAG
Both in CAS Array with the IP pointing to DAG IP (No NLB but Automatic failover)

Exchange Server 1 is to all intents and purposes the "main" server. Exchange Server 2 is there for Mailbox or CAS failover.

If I were to do anything to Exchange Server 2, I'd think there isn't much I need to do except suspend the copy of the Exchange Server 1 mailbox database that it is holding? Do I even need to do that?

If I were to do anything to Exchange Server 1, CAS Array should automatically route MAPI requests to Exchange Server 2, and DAG should as well right?

What happens when Exchange 1 comes back up? Any manual failback or all automatic?
TheGeezer2010

How is the failover between your CAS servers in the array working if not NLB ?
hongedit (ASKER)

The CAS DNS entry is pointing to the DAG Cluster IP.
This will not work.

CAS servers in an array need to be load balanced, either by WNLB (in which case you CANNOT have a DAG on the same server, as the two types of clustering are incompatible) or by a hardware LB (this is Microsoft's preference).

Once you have this set up correctly, depending on the type of LB used :-

1. WNLB - if the CAS server fails, clients will be redirected by WNLB to the other CAS server. Note that if a SERVICE fails, this will not happen and clients will have to be restarted to move to the other CAS server; this is why Microsoft no longer recommends WNLB.
2. Mailbox database failover will happen automatically according to the activation preferences set, as long as the DAG has quorum. With an odd number of DAG members, quorum is by node majority. With an even number (as with your 2 servers), the file share witness (FSW) must be available to maintain quorum.
3. Public Folder failover will always be manual - you will have to point the mailbox database at a different PF database if its current PF database is not available.

HTH
Hardware LB for the CAS array is an option (as already stated). I have never tried pointing the CAS array DNS entry to the DAG cluster IP, but I do not believe it will give you failover, for the following reason.

A DAG cluster does not work in the same way as an old-style cluster. It uses Active Manager (in Primary and Standby roles) to move the current live mailbox database from one server in the DAG to another. There is no concept of moving ALL databases from one server to another. In fact you could have three different databases which have the current "live" copy on three different servers. With DAGs, the link between server and database is removed: it is the mailbox databases which are the portable objects, NOT the servers.

The new RPCClientAccess service is used by the clients to point to the current Live copy of each database. The client is provided with a list of relevant (for the site) CAS servers via AD, and selects one. If this CAS server knows where the target Mailboxdatabase is, it will set up RPC connections for the purpose of communication. If it does not, it will try to find a CAS server which does know and this will become its MAPI endpoint.

I hope this goes some way to explaining why what is being proposed here will not work to provide the automatic failover of the CAS server in the event that one fails. You have two options :-

1. Deploy two separate CAS/HT servers and create the array from them.
2. Use HW LB to create an array from the two current servers.
So I have 2 experts with differing opinions...

I guess we could find out by turning off my Exchange Server 1 and see what happens but this is my Live system, I would need to schedule downtime for that.

Those 2 options you listed are exactly what I cannot do, which is why it was suggested that I set things up the way I have, if that makes sense.
OK so you have two options really :-

1. Do the test you suggest but you will have to test scenarios such as downing one CAS server, dismounting a database etc.

2. Read carefully about how DAGs and the new RPCClientAccess services work. I am not going to patronize by sending any links on this, but I would strongly advise you do this anyway. At the end of that you may see why I don't believe what you are proposing will work.

If it does, happy days and I am happy to stand corrected !!
Ok. I'm not disagreeing with you by the way, but at the end of the day I need to know what works and what doesn't...

What would be the best way to test this?

Just shut down Exchange 1?
Stop the IS on Exchange 1?

You will need to test both CAS and MBX failover. You will need to set up a client with access to one of the CAS servers, and show the connectivity status (rt-click the OL icon in systray).

1. Down server - if your mailbox is on this server, it should fail over to the other server. If the CAS failover is working, you should very temporarily lose connectivity whilst your client contacts the other CAS server in the array. You will be able to tell whether the active mailbox database copy has correctly moved either by running Get-MailboxDatabaseCopyStatus or by viewing the status in EMC (you need to refresh manually). If it has moved and you still have no connectivity from the client, the CAS failover has not occurred. If it has moved and your client reconnects automatically after a short time, the CAS failover has worked.

2. It is possible, that your client is already using the second CAS server. If this is the case, mailboxdatabase failover should occur and there will be NO loss of connectivity. In this case, to test truly all possibilities, you will have to then follow the above but this time with your second server.

I advise you to carefully document all of this, as you may need it to persuade Management that there are no other options than the ones outlined !!

Best of luck !!
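As a rough sketch of the check described in step 1 above, the copy status can be listed from the Exchange Management Shell. MBDB01 is used here only as an example database name, and the column selection is just one reasonable choice, not the only way to read the output.

```powershell
# List every copy of one database with its state. The active copy shows
# Status "Mounted"; a passive copy in good shape shows "Healthy".
# MBDB01 is an example name; substitute your own database.
Get-MailboxDatabaseCopyStatus -Identity MBDB01 |
    Format-Table Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState -AutoSize
```

Running this before and after shutting a server down shows whether the Mounted copy has moved to the other DAG member.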
ASKER CERTIFIED SOLUTION
Akhater
(This solution is only available to Experts Exchange members.)
To TheGeezer2010: this will work just fine. What you said about DAG is true, but it doesn't mean the configuration will not work. The DAG IP will be owned by one node at a time (thus no LB), but it will automatically fail over from one server to another and can be used for the CAS array IP.
>Suspending it if you are extra careful not really required

Great thanks.
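For anyone who does want to be extra careful before working on the passive server, a minimal sketch of suspending and resuming a copy might look like the following. It assumes a database called MBDB01 with a passive copy on a server called Svr2; both names are placeholders, and as quoted above the suspend step is optional.

```powershell
# Suspend replication to the passive copy on Svr2 before maintenance
# (optional; MBDB01 and Svr2 are placeholder names).
Suspend-MailboxDatabaseCopy -Identity "MBDB01\Svr2" -SuspendComment "Maintenance on Svr2" -Confirm:$false

# ... patch / reboot Svr2 ...

# Resume replication afterwards and confirm the copy returns to Healthy.
Resume-MailboxDatabaseCopy -Identity "MBDB01\Svr2"
Get-MailboxDatabaseCopyStatus -Identity "MBDB01\Svr2"
```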

>If you created the CAS array and pointed the IP of the CAS array FQDN to the DAG IP and changed the RpcClientAccessServer on the DBs, that's all you have to do

That's exactly what I have. I created a DNS entry of cas01.domain.local -> DAG IP and changed the RpcClientAccessServer parameter on the DBs.
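For readers following along, the setup described here would look roughly like this in the Exchange Management Shell. This is a sketch under assumptions: the array name and AD site name are placeholders, and the DNS A record for cas01.domain.local pointing at the DAG IP is created separately in DNS.

```powershell
# Create the CAS array object for the AD site
# (the -Name and -Site values are placeholders).
New-ClientAccessArray -Name "cas01" -Fqdn "cas01.domain.local" -Site "Default-First-Site-Name"

# Point every mailbox database at the array FQDN.
Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer "cas01.domain.local"

# Confirm the setting on each database.
Get-MailboxDatabase | Format-List Name,RpcClientAccessServer
```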

>no the failback is not automatic
How do I force the connections back to a preferred server after it fails over?
I will still test as per Geezer2010 by the way, but I have to wait a few days for the opportune moment for least disruption.
You could test CAS failover and MBX failover in two different controlled scenarios, but I strongly recommend you verify your RpcClientAccessServer settings first:

1) Run "Get-MailboxDatabase | fl Name,RpcClientAccessServer" and see what the RpcClientAccessServer is set to
2) Would the RpcClientAccessServer setting still be valid if you shut either of the systems down?

If you want to test just one role at a time in a controlled manner, you could try the following:

For CAS, stop the Microsoft Exchange RPC Client Access service
For MBX, manually fail over one of your Exchange databases from one system to another -
      - Open your EMC, select Mailbox within the Organization Configuration hierarchy
      - Highlight the database you'd like to fail over in the top right pane
      - Verify the bottom pane shows more than one database copy (one should say Mounted and the rest should say Healthy).
      - Right click on a "Healthy" database copy, select Activate Database Copy, and select None for the Override Automatic database mount setting for this test.
      - Refresh the bottom right pane and you should see the copy status change from Mounted to Healthy and vice-versa for the given database.
      - You can fail back that specific database following the same procedure.
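The EMC steps above have a shell equivalent. As a sketch, assuming a database named MBDB01 whose healthy passive copy lives on a server named Svr2 (both placeholder names), the switchover could be done like this:

```powershell
# Activate the passive copy of MBDB01 on Svr2. -MountDialOverride None
# corresponds to choosing "None" in the EMC dialog.
Move-ActiveMailboxDatabase MBDB01 -ActivateOnServer Svr2 -MountDialOverride None

# Check that the copies have swapped roles (Mounted on Svr2, Healthy on the other server).
Get-MailboxDatabaseCopyStatus -Identity MBDB01 | Format-Table Name,Status -AutoSize
```

The cmdlet asks for confirmation before moving the database unless -Confirm:$false is added.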
SOLUTION
(This solution is only available to Experts Exchange members.)
Thanks guys. Good constructive information.

Just to run through the motions:

[PS] C:\Windows\system32>Get-MailboxDatabase | Select Name,RPCClientAccessServer | fl


Name                  : MBDB01
RpcClientAccessServer : CAS01.evo.local

Name                  : MBDB02
RpcClientAccessServer : CAS01.evo.local

So yep, the rpcclientaccessserver setting is ok.

Let me ring round and double check with people we are good for a quick test.

Akhater,

"2) Would the RPCClientAccessServer setting still be valid if you shut either of the systems down?
you can shut down both servers if you want; the RpcClientAccessServer won't change on the DB settings"

I am aware the RPCClientAccessServer will not change by shutting systems down, I am merely asking him to validate if the DNS/IP it is pointing to would still be a valid one after shutting any of the systems down.

"For CAS, stop the Microsoft Exchange RPC Client Access
this will NOT work even with Windows NLB; it is not service-aware, so as long as the system is running a service failure won't be detected"
This will work with most if not all HW load balancers, but it needs to be properly set up.

Given the number of unknowns, I think it is best to take small steps in testing failover components before moving to a full shutdown failover test.

No go for now, will test though as soon as possible.
ctc1900, I am pointing out that this test will not work. Stopping the RPC Client Access service while the server is online will lead to people being disconnected; failover will NOT happen.
>to failback from a mailbox server run "cluster group" and you will see the cluster group is now owned by the other server, to move back run

cluster group "cluster group" /move:Node1

Could you please expand on this, I have no idea what it means! Where do I run that command?
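As a hedged pointer for readers with the same question: the quoted command is the Windows failover clustering command-line tool, cluster.exe. The sketch below assumes it is run from an elevated prompt on one of the DAG members, that the core cluster group still has its default name "Cluster Group", and that Node1 stands for the hostname of the server that should own the DAG IP again.

```powershell
# Show the cluster groups and which node currently owns each one.
cluster.exe group

# Move the core cluster group (which holds the DAG IP address resource)
# to the preferred node. "Node1" is a placeholder hostname.
cluster.exe group "Cluster Group" /move:Node1
```

This moves only the cluster's core resources. Mailbox database copies are still moved with the Exchange tools, not with cluster.exe.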
SOLUTION
(This solution is only available to Experts Exchange members.)
Stopping the RPC client access services would not trigger an MBX failover but would trigger a CAS failover if properly setup.

Author, I am signing off from this question as it seems you have all the help you need, good luck!
"Stopping the RPC client access services would not trigger an MBX failover but would trigger a CAS failover if properly setup."

CAS failover will NOT happen if you stop the service unless you are running a hardware load balancer.
Couple of things

1. The client will connect to ONE of the CAS servers in the array. It will not use the other CAS server for that same connection socket. This is why you MAY need to test shutting down both CAS servers if your client was already pointing to the "failover" CAS server. Confirm from testing that the client WILL failover eventually if you stop the RPCClientAccess service, but only because the socket has timed out without response. The only surefire way of testing auto-failover is to down the CAS server.

2. My understanding of DAGs is that they apply to individual databases, NOT a cluster group. When the Active Manager moves the live copy of a database from one server to another, there is no concept of a cluster group unless the entire server is unavailable. This means that if the underlying storage (for example) of one database fails, ONLY that database copy will be moved; any other databases NOT affected by the storage issue will remain on the original server. To move the live copy back, you first need to ascertain its state (Get-MailboxDatabaseCopyStatus - http://technet.microsoft.com/en-us/library/dd298044.aspx), and IF this is shown as healthy for both database and index, it can be moved using the Move-ActiveMailboxDatabase cmdlet or through EMC.

You CAN move all databases on a server to either a specified target or any suitable available passive copy using the same cmdlet shown above or through EMC.

This link explains about failover and switchover

http://technet.microsoft.com/en-us/library/dd298067.aspx
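A minimal sketch of the move-back procedure described in point 2, assuming two DAG members called Svr1 and Svr2 and a database called MBDB01 (all placeholder names):

```powershell
# Check the copies hosted on Svr1: both Status and ContentIndexState
# should read Healthy before a copy there is activated.
Get-MailboxDatabaseCopyStatus -Server Svr1 |
    Format-Table Name,Status,ContentIndexState -AutoSize

# Move a single database back to Svr1 ...
Move-ActiveMailboxDatabase MBDB01 -ActivateOnServer Svr1

# ... or move every database currently active on Svr2 over to Svr1.
Move-ActiveMailboxDatabase -Server Svr2 -ActivateOnServer Svr1
```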
@TheGeezer2010  what you said is totally correct however

As long as the "main" Exchange 2010 server (the one owning the DAG IP) is online, we don't care where the mailbox databases are mounted. They can all be mounted on srv1, all on srv2, or spread across the servers; as far as CAS is concerned it won't matter.

The only case where CAS is concerned, in the case at hand, is when the "main" server fails; in that case the IP will fail over to the other server.
I don't think it matters which server holds the cluster IP, as this is used only during the creation of the Windows Failover Cluster and is assigned by DHCP by default. The cluster IP has no bearing on how failover/switchover in a DAG works. This is why I do not believe the solution proposed will work.

You DO need to test shutting down both CAS servers IF the client is initially using the "failover" CAS server in the array. If this is the case and you shut down the "main" server, the client will continue to use the same CAS server, and the target mailbox database for the client will be automatically updated to point at the "failover" server. So yes, you will have tested mailbox database failover, but you will NOT have tested CAS array failover. If you now shut down the "failover" CAS server, the mailbox database will be moved to the "main" server (provided it is healthy), and if this solution works, the client will now fail over to the "main" CAS server. The net effect will be a slight interruption whilst the failover takes place. If the solution does NOT work, at this point the client will simply display "Connecting..." and will be unable to reconnect without both repointing the mailbox database to the CAS server (not the array) and restarting Outlook.

I will be interested to see the results.
"I don't think it matters which server holds the cluster IP, as this is used only during the creation of the Windows Failover Cluster and is assigned by DHCP by default. The cluster IP has no bearing on how failover/switchover in a DAG works. This is why I do not believe the solution proposed will work."

But the cluster IP is owned by only one node at a time, will fail over with the cluster, will remain active and will reply to all traffic. It will work; I have it working or I wouldn't have suggested it.


You DO need to test shutting down of both CAS servers


Of course but each at a time :)


 IF the client is initially using the "failover" CAS server in the array. If this is the case and you shut down the "main" server, the client will continue to use the same CAS server,

Technically clients cannot be connected to the "failover" server, since we defined "main" as the one holding the DAG IP, but you are perfectly right with the idea.

the target mailboxdatabase for the client will be automatically updated to point at the "failover" server - so yes you will have tested mailboxdatabase failover, but you will NOT have tested CAS array failover.

Agreed

 If you now shut down the "failover" CAS server, the mailbox database will be moved to the "main" server (provided it is healthy), and if this solution works, the client will now fail over to the "main" CAS server. The net effect will be a slight interruption whilst the failover takes place. If the solution does NOT work, at this point the client will simply display "Connecting..." and will be unable to reconnect without both repointing the mailbox database to the CAS server (not the array) and restarting Outlook.

also agreed



Yep, understood.

I will test by first shutting down the main server and noting where the mailbox database ends up. It should be mounted on Svr2. I will see if connectivity remains.

I will then start Svr1 back up, and check the copy status is Healthy.

Next I will shut down Svr2 and again note that the mailbox database is changed back to Svr1, also note what happens to Outlook connectivity.

I will test Outlook from 2 machines to be sure. One is on the domain, the other isn't (using outlook anywhere).
Mmm - Akhater, you are failing over the cluster using clussvc.exe? So not failing over the databases within the Exchange DAG at all, is that right? When I read the bible according to Redmond, it clearly states NEVER do anything to a DAG which is not via the Primary and Standby Active Manager components, i.e. never use any native clustering tools to manage a DAG, as it can have unpredictable consequences - unless I am not understanding you correctly?
Should not matter if machine on domain or not except for possible extra latency as they will both be pointing at CAS array ultimately but yes, this sounds like a good plan.
No I am not failing over the cluster at all I am just moving the IP dag resource from one node to another, it has (as you stated) no effect whatsoever on the DAG nor the databases.
OK so I am very interested to see if this works !!
Results are in. Tested Outlook connectivity from 2 PC's:

Office PC
Home PC - on an IPsec VPN with host entries for autodiscover and remote, amongst others for file access. I changed all Exchange-related entries to point to the DAG IP.

I also changed the firewall smtp rule to point to the DAG IP for mail flow.

1. I shut down Svr2 first, and nothing changed at all. This told me that the CAS resource was pointing to Svr1.

2. I booted Svr2 back up and checked the DAG status was Healthy

3. I shut down Svr1 and 2 things happened.

Office PC: Outlook froze for about 45s, then came back to life. I could send and receive external mail as normal.

Svr2: Refreshed EMC and confirmed MBX has shifted, and detected Svr1 was "ServiceDown" state.

Home PC: No connectivity.

4. Booted Svr1 back up.

Office PC remained operational, no lag/freeze at all.
Svr2 - Manually moved Active Database back to Svr1.
Home PC: Still no connectivity even after restarting Outlook.

5. Ran the cluster group command.

Office PC: No change. All stayed up.
Home PC: Outlook online after restarting

So yes - it does work. However, I am curious why Outlook on the Home PC did not respond.

The Home PC, although connected via a VPN, is configured to use Outlook Anywhere as per any other external machine, and could resolve all the same addresses both internally and externally as the Office PC.


Did you try to ping cas.domain.com from your home PC while it wasn't connecting? Maybe you didn't configure Outlook Anywhere on the second server?
Hmm, I didnt try.

Possibly that would have been it:

My home PC is connected via a VPN and I have static host entries for remote and autodiscover, forcing it over the LAN.

Without a cas01 dns entry it would not be able to resolve that right? So makes sense.

Still, it does work, I just need to re-test with a completely external client to replicate my external workers.
Your home PC is running Outlook Anywhere, right?

did you enable outlook anywhere on both servers?
No! I had not!

Will retest now it has been enabled...
it needs 15 min to be active
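As a sketch of what enabling Outlook Anywhere on the second server involves in the Exchange 2010 shell: the server name Svr2, the external hostname and the authentication method below are all placeholders (assumptions, not values from this thread), and the RPC over HTTP Proxy Windows feature needs to be installed on that server first.

```powershell
# See which CAS servers already have Outlook Anywhere enabled.
Get-OutlookAnywhere | Format-List Identity,ExternalHostname

# Enable it on the second server (hostname and auth method are placeholders;
# match them to what the first server uses).
Enable-OutlookAnywhere -Server Svr2 -ExternalHostname "remote.example.com" -DefaultAuthenticationMethod Basic -SSLOffloading:$false
```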
Interesting results indeed - the same as I would expect if the client had failed over to the other CAS server in a WNLB CAS array. I am going to dig more carefully into how the CAS array works now !!
I am testing it again tonight, primarily to ensure Outlook Anywhere works on the second server once it fails over.

Will report back.
External Clients - Outlook 2007 - Few mins swapover - The administrator has made a change which requires you to restart Outlook

OK, results are in and very pleasing.

1. Shut down Svr1.

>3 x External Clients (Outlook 2007, completely offsite) - Lost connectivity (Outlook hung) for about 3-4 mins each, then came up with a message "The administrator has made a change which requires you to restart Outlook" - I did NOT restart Outlook, and everything kept working anyway.

To be honest after the first sign of hanging my clients would have just restarted Outlook anyway.

>Home PC - after adding the cas01 host entry it just regained connectivity by itself after a min or 2.

2. Booted up Svr1, changed cluster resource back to Svr1

>External clients - lost connectivity again, but came back up within a minute. No message. Lot smoother.

>Home PC. Same thing. Less noticeable.

So fail back seems to be a lot smoother, but all in all it works!