Exchange 2013 - DAG issue, one of 2 members always fails over

When I activate the DB copy on one of 2 members of a DAG, the next day it's failed over.
Looking at the Event Viewer HighAvailability\Operational log, it shows:

Moving all active databases failed for server 's922-exch-mb1.xxx.com (MoveComment: Managed availability system failover initiated by Responder=OutlookMapiHttpDeepTestFailover Component=Outlook., Error: Some (1) active databases could not be successfully moved.).

Get-MailboxDatabaseCopyStatus shows Status as Healthy

Since this problem was noticed I've installed security updates and restarted the server.
Actually, I've restarted the problem server a couple of times now, but the behavior is the same.

I'm probably out of my depth on this.
Hoping someone here can help.


When I run  Get-ServerHealth -Identity "s922-exch-mb1.xxx.com" -HealthSet "Outlook.Protocol" | ft Server,State,Name,AlertValue -AutoSize

Server                 State  Name                                               AlertValue
------                 -----  ----                                               ----------
s922-exch-mb1.xxx.com         OutlookRpcDeepTestMonitor                          Healthy
s922-exch-mb1.xxx.com         OutlookMapiHttpSelfTestMonitor                     Unhealthy
s922-exch-mb1.xxx.com         OutlookRpcSelfTestMonitor                          Healthy
s922-exch-mb1.xxx.com         OutlookMapiHttpDeepTestMonitor                     Healthy
s922-exch-mb1.xxx.com         PrivateWorkingSetWarning....cclientaccess.service  Healthy
s922-exch-mb1.xxx.com         PrivateWorkingSetError....rpcclientaccess.service  Healthy
s922-exch-mb1.xxx.com         ProcessProcessorTimeWarning....ientaccess.service  Healthy
s922-exch-mb1.xxx.com         ProcessProcessorTimeError....clientaccess.service  Healthy
s922-exch-mb1.xxx.com         ExchangeCrashEventError....pcclientaccess.service  Healthy
s922-exch-mb1.xxx.com         LongRunningWatsonWarning....cclientaccess.service  Healthy
s922-exch-mb1.xxx.com         LongRunningWerMgrWarning....cclientaccess.service  Healthy
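
With this many monitors, it can be easier to list only the ones that are not healthy. A minimal sketch, assuming it is run from the Exchange Management Shell and reusing the server name and health set from the command above:

```powershell
# List only the monitors in the Outlook.Protocol health set whose
# AlertValue is something other than Healthy.
Get-ServerHealth -Identity "s922-exch-mb1.xxx.com" -HealthSet "Outlook.Protocol" |
    Where-Object { $_.AlertValue -ne "Healthy" } |
    Format-Table Server, State, Name, AlertValue -AutoSize
```

Against the output above, this should leave just the OutlookMapiHttpSelfTestMonitor row.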
jman0 war Asked:
Jeff Glover (Sr. Systems Administrator) Commented:
I assume both servers are running all roles. Run Get-MapiVirtualDirectory | fl on each one and compare the output to see if anything stands out. As long as you set up MAPI over HTTP correctly and set the URIs, it should work. To check the MAPI service, go to https://(your URL)/MAPI/Healthcheck.htm. You should get a 200 OK page.

You might also try the Managed Availability Troubleshooter. You can get it here
https://gallery.technet.microsoft.com/MATS-bc0d200d
jman0 war (Author) Commented:
Thanks for replying.

On the problem server I also couldn't pull up the EAC.
I ended up having to type     https://s922-exch-cas/ecp/?ExchClientVer=15

I also found another article that seemed to match:
https://social.technet.microsoft.com/Forums/exchange/en-US/7b1ebadd-cb10-48fc-b3f3-0c7a449183c4/exchange-2013-cu3-databases-only-activate-on-one-mailbox-server?forum=exchangesvrgeneral

I have now disabled the responder mentioned in that thread.
I'm going to activate the DB on the problem server tonight and see what happens.


I tried    https://s922-exch-mb1/mapi/healthcheck.htm but get a 404 Not Found error.
jman0 war (Author) Commented:
The 404 error shows:
Physical Path   C:\inetpub\wwwroot\mapi\healthcheck.htm

But there is no mapi folder at that location.
Jeff Glover (Sr. Systems Administrator) Commented:
Look in IIS Manager on that server and make sure the MAPI virtual directory exists in the Web Front End website.
jman0 war (Author) Commented:
It works on the good server  s922-exch-mb2
I get the "200 OK"
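
The same check can be scripted so both servers are probed the same way. A rough sketch, assuming Windows PowerShell 3.0 or later and that the short host names used in this thread resolve from wherever it is run; a certificate name mismatch on those short names would also land in the catch block:

```powershell
# Request the MAPI health-check page on each mailbox server and print the result.
# Server names are the ones mentioned in this thread; adjust as needed.
foreach ($server in "s922-exch-mb1", "s922-exch-mb2") {
    $uri = "https://$server/mapi/healthcheck.htm"
    try {
        $response = Invoke-WebRequest -Uri $uri -UseBasicParsing
        "{0}: HTTP {1}" -f $server, $response.StatusCode
    }
    catch {
        # Non-success responses (such as the 404 seen on MB1) and TLS errors end up here.
        "{0}: request failed - {1}" -f $server, $_.Exception.Message
    }
}
```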
jman0 war (Author) Commented:
In IIS Manager under Sites I see:

Default Web Site
Exchange Back End

I don't know what or where I'm supposed to see Web Front End.
Jeff Glover (Sr. Systems Administrator) Commented:
The path should be %ProgramFiles%\Microsoft\Exchange Server\V15\FrontEnd\HttpProxy\mapi to see the directory. In IIS Manager, check the MAPI virtual directory settings for the path. What CU are you running?
jman0 war (Author) Commented:
On the good server, there is a FrontEnd\HttpProxy\Mapi directory.
On the problem server, there is no FrontEnd directory.

Version 15 Build 775.38
Jeff Glover (Sr. Systems Administrator) Commented:
Same build on both? You are only running CU3, and MAPI over HTTP should not even be there until CU4 or later. It sounds like you have one server at CU3 and another at a later CU build.
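
One way to compare builds and installed roles across every Exchange server in a single view, sketched on the assumption that it is run from the Exchange Management Shell:

```powershell
# Show each Exchange server with its installed roles and build number,
# so mismatched CU levels are visible side by side.
Get-ExchangeServer |
    Sort-Object Name |
    Format-Table Name, ServerRole, AdminDisplayVersion -AutoSize
```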
jman0 war (Author) Commented:
The good MB2 server is also on Version 15 Build 775.38.

The CAS server is also on the same build.

Then there's an Edge Transport server that's Version 14.3 Build 123.4.
Jeff Glover (Sr. Systems Administrator) Commented:
Hmmm. When you run Get-OrganizationConfig | fl, what do you see for MapiHttpEnabled? Or do you even see it?
Amit (IT Architect) Commented:
@Joseph

For DAG issues, you need to focus on the cluster logs. Open the Failover Cluster Manager snap-in, check the cluster logs, and post them here.
jman0 war (Author) Commented:
I don't see "mapihttpenabled"
Jeff Glover (Sr. Systems Administrator) Commented:
Then I have no idea why that health monitor is firing. You could try upgrading both servers to at least CU4 and see if it helps the issue. The current CU for Exchange is 10.
jman0 war (Author) Commented:
Amit, can you give me more specifics about what Cluster Logs?
I can open Failover Cluster Manager.

I go to Nodes and then this server.
Critical Events show:

ID 1135
Cluster node 'S922-EXCH-MB1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

ID 1561
The cluster service has determined that this node does not have the latest copy of cluster configuration data. Therefore, the cluster service has prevented itself from starting on this node.
Try starting the cluster service on all nodes in the cluster. If the cluster service can be started on other nodes with the latest copy of the cluster configuration data, this node will be able to subsequently join the started cluster successfully.

If there are no nodes available with the latest copy of the cluster configuration data, please consult the documentation for 'Force Cluster Start' in the failover cluster manager snapin, or the 'forcequorum' startup option. Note that this action of forcing quorum should be considered a last resort, since some cluster configuration changes may well be lost.

ID 1177
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
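
To line these events up against the time of each database failover, they can also be pulled from the System log on both nodes. A sketch only, assuming remote event log access is allowed and that the second node is named S922-EXCH-MB2 as elsewhere in this thread:

```powershell
# Collect Failover Clustering events 1135, 1177 and 1561 from each DAG member.
# Event 1135 is normally recorded by the surviving node, so both nodes are queried.
foreach ($node in "S922-EXCH-MB1", "S922-EXCH-MB2") {
    Get-WinEvent -ComputerName $node -ErrorAction SilentlyContinue -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Id           = 1135, 1177, 1561
    } |
        Sort-Object TimeCreated |
        Format-Table TimeCreated, Id, MachineName -AutoSize
}
```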
Amit (IT Architect) Commented:
From the cluster errors above, it looks like you have a network issue. Are these VMs or physical servers?

jman0 war (Author) Commented:
VMs
Amit (IT Architect) Commented:
Are these on the same ESX host or different ones?
jman0 war (Author) Commented:
I'm not sure, but I think it's the same ESX.
Amit (IT Architect) Commented:
OK, you basically need to focus on the network issue. The next time the issue occurs, run cluster validation; it will show you the problem and suggest what you need to do.
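
Validation and log collection can also be run from PowerShell instead of the wizard. A hedged sketch, assuming the FailoverClusters module is installed on the node, the session is elevated, and limiting validation to the network tests is acceptable; the destination folder is a made-up path that must already exist:

```powershell
Import-Module FailoverClusters

# Run only the network category of validation tests against both DAG members.
Test-Cluster -Node "s922-exch-mb1", "s922-exch-mb2" -Include "Network"

# Generate the cluster log for every node, covering the last 24 hours
# (TimeSpan is in minutes), into a hypothetical folder for review.
Get-ClusterLog -Destination "C:\Temp\ClusterLogs" -TimeSpan 1440 -UseLocalTime
```

Test-Cluster writes its report file to disk and returns it, which makes the results easy to post back here.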
Jeff Glover (Sr. Systems Administrator) Commented:
Since they are virtual machines, have you had any vMotion incidents? And sorry about earlier. I did not realize you had a separate CAS server. If you only have the Mailbox role installed, you will not have the MAPI stuff.
jman0 war (Author) Commented:
Sorry, but I am now informed that they are not on the same ESX.
They are in different datacenters.

I'll try the validation and do the network bit later.

Thanks for the help so far.
jman0 war (Author) Commented:
OK, I ended up contacting the VM guys who would have set up this server.
They did some work on their backup solution and found some additional issues.
I'm not sure of the details.

But I do know that they did not run the cluster validation tool.

I activated the DB copy on MB1 and so far it has stayed active there.
(less than 24 hours)

So maybe the VM guys fixed something, or maybe my turning off the Outlook responder did.
Amit (IT Architect) Commented:
From your details above, it is a network issue. You might need to ask them what caused this issue or what they did to fix it.
jman0 war (Author) Commented:
They replied that they fixed something with the backup solution, Avamar. They said the install wizard goes through and creates items under the Exchange Failover Cluster Manager. Under Roles it has the 'Avamar Backup Client Role', which is assigned an IP address strictly for Avamar processes.

It seems this was not behaving correctly, so it was torn down and rebuilt using the wizard, and then they worked with EMC support.