Remote Exchange 2010 DAG node: failover cluster service fails often

Alright here we go.

We have an Exchange 2010 environment with four servers: a CAS, two local Mailbox servers, and a remote Mailbox server.

The three mailbox servers are in a DAG. One of the two local mailbox servers had a mailbox database copy, but that copy is in a failed and suspended state. Long story short, each server has its own database that it uses for storage. The failover clustering node at the remote site keeps failing about once a week. I think it's from latency and other miscellaneous network problems.

My first option is to juggle the servers to get that remote server back online: restarting services all over until the remote node finally comes up. There's no rhyme or reason to why it starts, other than the Exchange and cluster services having been restarted, but that usually restores it.

My second option is to use "Manage Database Availability Group" and remove this remote server from the DAG. (Also, can I do this while the node is down?)

What would you recommend I do in this situation? There are no database copies to speak of so that shouldn't be an issue.

Thank you
EricDaRed Asked:

systechadmin (Consultant) Commented:
When the node is down, it's difficult to remove the remote server from the DAG. Refer to the link below for a detailed explanation.

http://blogs.technet.com/b/timmcmic/archive/2013/09/23/exchange-2010-remove-databaseavailabilitygroupserver-configurationonly-does-not-evict-the-member-from-the-cluster.aspx
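For reference, the approach the article's title refers to can be sketched roughly as below. This is only an outline under assumptions: DAG1 and MBX3 are placeholder names, and it assumes the Exchange Management Shell and the FailoverClusters module are available on a DAG member that is still up. Any database copies still hosted on the member being removed would need to be removed first.

```powershell
# Placeholder names (not from this thread): DAG1 = the DAG, MBX3 = the remote member that is down.
# Run from the Exchange Management Shell on a DAG member that is still online.

# Remove the offline server from the DAG object in Active Directory only.
# Per the linked article's title, this step does not evict it from the cluster.
Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX3 -ConfigurationOnly

# Evict the node from the underlying failover cluster as a separate step.
Import-Module FailoverClusters
Remove-ClusterNode -Name MBX3 -Force

# Confirm what the DAG and the cluster now list as members.
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name, Servers
Get-ClusterNode | Format-Table Name, State
```

Read the linked article before trying this on a production DAG; the order of steps and any cleanup needed on the removed server depend on the state the node is in.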

Amit Kumar Commented:
I hope you have an FSW (file share witness) in place; it all depends on the cluster quorum. If you are using plain node majority, this can be a big issue.

So please explain your DAG infrastructure in detail.

However, if you end up in this situation again, there is an option to start the cluster service with forced quorum.

Stop the cluster service on the remote node, which leaves two nodes in the local site. Together they have quorum, so they should be able to bring the cluster online. If they can't, shut down one local node and stop the cluster service on the remaining node, then open a command prompt and run the command below:
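One way to gather those details is sketched below. It assumes the Exchange Management Shell on a DAG member with the FailoverClusters module installed; DAG1 is a placeholder for the real DAG name.

```powershell
# DAG1 is a placeholder for the DAG name.
# -Status fills in the live properties such as OperationalServers.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, Servers, OperationalServers, WitnessServer, WitnessDirectory, DatacenterActivationMode

# Show the quorum model the underlying cluster is actually using.
Import-Module FailoverClusters
Get-ClusterQuorum | Format-List Cluster, QuorumType, QuorumResource
```

With an odd number of DAG members the cluster normally runs in Node Majority and the witness is not used, so a configured witness server does not by itself mean the witness has a vote.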

net start clussvc /fq

It will run with forced quorum.

Once it is up, you can start the remaining servers and let the cluster service start on them; they should rejoin the cluster.

The maximum round-trip latency recommended by Microsoft is 500 ms, though it tends to hold up until about 600 ms. If your latency is higher than this, I would suggest checking your network/ISP.

For more analysis, please explain your DAG architecture.

There is an article covering the cluster properties that control the heartbeat delay and threshold in a multi-site cluster configuration:
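To confirm that the nodes actually came back after a forced start, a quick check along these lines may help. It is a sketch that assumes it is run in the Exchange Management Shell on one of the local DAG members.

```powershell
Import-Module FailoverClusters

# Is the cluster service running on this server?
Get-Service -Name ClusSvc

# Which nodes does the cluster currently see as Up, Down, or Joining?
Get-ClusterNode | Format-Table Name, State -AutoSize

# Exchange's own health checks for the local DAG member.
Test-ReplicationHealth
```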
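A rough way to put a number on the latency is sketched below, assuming Windows PowerShell on a local DAG member and ICMP allowed between the sites. REMOTE-MBX is a placeholder for the remote server's name.

```powershell
# REMOTE-MBX is a placeholder for the remote DAG member.
# Failed pings are skipped, so also compare Count against the 20 requested.
$pings = Test-Connection -ComputerName REMOTE-MBX -Count 20 -ErrorAction SilentlyContinue

# ResponseTime is reported in milliseconds.
$pings | Measure-Object -Property ResponseTime -Average -Maximum
```

A sample taken during the day says little about the failure window; running it on a schedule during the nightly backups would be more telling.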

cluster /cluster:<ClusterName> /prop SameSubnetDelay=<value>
cluster /cluster:<ClusterName> /prop SameSubnetThreshold=<value>
cluster /cluster:<ClusterName> /prop CrossSubnetDelay=<value>
cluster /cluster:<ClusterName> /prop CrossSubnetThreshold=<value>

Please go through the article before changing these settings. Note that they are cluster properties set with cluster.exe, not registry values you edit directly.
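Before changing anything, it may be worth recording the current values. The sketch below assumes the FailoverClusters module on a cluster node. Delay is in milliseconds and threshold is a count of missed heartbeats, so their product is roughly how long a node can go silent before it is treated as down.

```powershell
Import-Module FailoverClusters

# With no arguments, Get-Cluster returns the cluster the local node belongs to.
$c = Get-Cluster
$c | Format-List Name, SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

# Approximate tolerance between sites: delay (ms) x threshold (missed heartbeats).
"Cross-subnet tolerance: {0} ms" -f ($c.CrossSubnetDelay * $c.CrossSubnetThreshold)
```

For example, a delay of 1000 ms with a threshold of 5 gives about 5 seconds of tolerance, which a saturated WAN link during backups could plausibly exceed.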
EricDaRed (Author) Commented:
Thanks for the feedback. I had already found most of what you suggested; I was hoping someone had experience with this particular issue. The quorum is configured as a majority model, I believe, and it does use a witness server. Snooping through the log files, I am finding errors. I suspect that the latency between the two sites gets too high on some nights (backups and such) and causes a problem that is only really resolved by restarting all of the clustering and Exchange services on the upstream servers. I also suspect that these servers have not been patched in a long time and that there are other issues there.
mohammad bazzari (Microsoft Infrastructure Expert) Commented:
In your case it seems that you have a second Active Directory site with the same namespace. If so, make sure Datacenter Activation Coordination (DAC) mode is set to DagOnly, because you may be running into a split-brain condition. For more information on how DAC works, review the link below:

http://exchangeserverpro.com/datacenter-activation-coordination-mode/

To configure DAC, run the command below in the Exchange Management Shell:

Set-DatabaseAvailabilityGroup -Identity "DAG-Name" -DatacenterActivationMode DagOnly
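To see the current setting before or after the change, something like the following should work in the Exchange Management Shell ("DAG-Name" is the same placeholder as above):

```powershell
# Shows Off or DagOnly for the named DAG.
Get-DatabaseAvailabilityGroup -Identity "DAG-Name" | Format-List Name, DatacenterActivationMode
```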