We have a fresh install of Exchange 2013 on Windows Server 2012 R2, and a separate, freshly installed domain controller on Windows Server 2012 R2. Both servers are VMs with fixed IPs on the same VirtualBox host using bridged networking. There is another, older domain controller on the same subnet, but it serves a different domain. We are in the midst of "upgrading" from the old domain, old domain controller, old Exchange, etc. to the new setup.
** Note: We are not Windows networking or Exchange experts. We run our own Exchange server because we want to keep our customer and client data secure within our own network. Once we got Exchange 2003 set up properly, we left it running for 10+ years with only security updates, and we were perfectly happy with that setup until Outlook 2013 dropped support for Exchange 2003. **
Our problem is Exchange 2013... it is failing intermittently (2-3 times per day). When it fails we get a cascade of error messages in the application log and system log on the Exchange server, and our Outlook clients are not able to connect to Exchange. We have not seen application/system log errors on the domain controller.
There seems to be a pattern to the failures:
1) The first or nearly-first application log error on the Exchange server is "SACL Watcher servicelet encountered an error while monitoring SACL change. Got error 1721 opening group policy on system xxxx in domain yyyy" where xxxx is the domain controller and yyyy is our domain.
2) Then we see one or more MSExchange ADAccess errors "Process Microsoft.Exchange.Directory.TopologyService.exe (PID=2728). Exchange Active Directory Provider could not find an available domain controller in domain yyyy. This event may be caused by network connectivity issues or configured incorrectly DNS server."
3) Then we see a cascade of MSExchangeADTopology errors "Process Microsoft.Exchange.Directory.TopologyService.exe (PID=2728) Forest xxxx. Topology discovery failed, error details
No suitable domain controller was found in domain 'yyyy'."
4) From the cmd prompt on the Exchange server we can ping the domain controller. But "net user xxxx /domain" (where xxxx is a valid domain user account) fails with an error saying it cannot reach the domain controller.
** This data point seems to indicate a general failure in domain-type communication between the Exchange server and the domain controller. But how do we troubleshoot this? **
5) Sometimes we are able to restart the "Microsoft Exchange Active Directory Topology" service and have everything come back up. Other times the restart fails because a service fails to shut down in time. Recently we have started just rebooting the Exchange server each time it happens because this is a more reliable way of getting it back online.
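Because the failures are intermittent, it would help to know exactly when the domain controller stops answering, so the time can be matched against the event log entries above. The following is only a minimal sketch of such a watcher, not something we have running: the function names, the 30-second interval, and the choice of LDAP port 389 as the probe target are all our own assumptions.

```python
import socket
import time
from datetime import datetime


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def transitions(samples):
    """Given (timestamp, ok) samples in order, return the samples where ok flips."""
    changes = []
    previous = None
    for stamp, ok in samples:
        if previous is not None and ok != previous:
            changes.append((stamp, ok))
        previous = ok
    return changes


def watch(host, port=389, interval=30.0, rounds=10):
    """Probe host:port `rounds` times and return the up/down transitions seen."""
    samples = []
    for _ in range(rounds):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, port_open(host, port)))
        time.sleep(interval)
    return transitions(samples)
```

Run from the Exchange server against the domain controller's address, `watch` would return a list of `(timestamp, state)` pairs marking each moment the port went from reachable to unreachable or back.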
Things we have tried that didn't help:
1) Simplifying Exchange by setting non-essential services to manual: Exchange Diagnostics, Exchange Health Manager, IMAP4, IMAP4 Backend, Exchange Server Extension for Windows Server Backup, Exchange Throttling, Unified Messaging, Unified Messaging Call Router.
2) Contacting Microsoft to open a case. They were singularly unhelpful. They will only help if the server is actively failing, and when the server is actively failing their suggestion was to troubleshoot the database because it was in a "dirty shutdown" state. We pointed out that this problem is fixed when you restart the services (and demonstrated by restarting the services), and at that point the conversation spiraled into a debate about what the scope of the ticket should be. After 5 hours on the phone and gathering data they had not provided any useful suggestions, so we gave up.
3) Upgrading to the latest Exchange 2013 CU -- we went from CU5 (or 6) to CU11.
4) Disabling the firewall on the domain controller and on the Exchange server.
Our questions:
1) Has anyone else experienced and fixed this problem?
We see other solutions online that relate to the SACL Watcher Servicelet error, but they seem related to retiring a domain controller and having Exchange look for the retired domain controller. In our case the SACL error refers to our new domain controller.
2) What is the best way to troubleshoot the problem once it is failing?
If "net user xxxx /domain" is failing but we can ping the domain controller, it seems like we should be able to nail it down to a communications failure, but we don't know the best tool for this.
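To make the question more concrete: ping only exercises ICMP, so a successful ping says nothing about whether the TCP services a domain member depends on are answering. The sketch below is the kind of check we have in mind, assuming Python is available on the Exchange server. The port list covers the well-known TCP ports for DNS, Kerberos, the RPC endpoint mapper, LDAP, SMB, and the Global Catalog; the function names are our own, and the check does not cover the dynamic RPC ports that are assigned after the endpoint mapper is contacted.

```python
import socket

# Well-known TCP ports a domain member uses to reach a domain controller.
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def check_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port; return {port: True if it connected}."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, timed out, or the host name could not be resolved.
            results[port] = False
    return results


def format_report(results, names=AD_TCP_PORTS):
    """Turn check_ports() output into one readable line per port."""
    lines = []
    for port in sorted(results):
        state = "open" if results[port] else "unreachable"
        lines.append(f"{port:>5} {names.get(port, 'unknown'):<20} {state}")
    return lines
```

Calling `check_ports("dc01.example.local", AD_TCP_PORTS)` (the host name is a placeholder) once while everything works and once during a failure would show whether a specific service stops accepting connections or whether all TCP traffic to the domain controller is affected.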
Thanks for any help!