FDC2005

Exchange 2013 intermittent failures - looks like failure to reach domain controller

We have a fresh install of Exchange 2013 on Windows Server 2012R2, and a separate fresh install domain controller on Windows Server 2012R2. Both servers are VMs with fixed IPs on the same VirtualBox host using bridged networking. There is another older domain controller on the same subnet but it is for a different domain. We are in the midst of "upgrading" from the old domain, old domain controller, old Exchange, etc. to the new setup.

** Note: We are not Windows networking or Exchange experts. We run our own Exchange server because we want to keep our customer data secure within our own network. Once we got Exchange 2003 set up properly, we left it running for 10+ years with only security updates, and we were perfectly happy with that setup until Outlook 2013 dropped support for Exchange 2003. **

Our problem is Exchange 2013... it is failing intermittently (2-3 times per day). When it fails we get a cascade of error messages in the application log and system log on the Exchange server, and our Outlook clients are not able to connect to Exchange. We have not seen application/system log errors on the domain controller.

There seems to be a pattern to the failures:

1) The first or nearly-first application log error on the Exchange server is "SACL Watcher servicelet encountered an error while monitoring SACL change. Got error 1721 opening group policy on system xxxx in domain yyyy" where xxxx is the domain controller and yyyy is our domain.

2) Then we see one or more MSExchange ADAccess errors "Process Microsoft.Exchange.Directory.TopologyService.exe (PID=2728). Exchange Active Directory Provider could not find an available domain controller in domain yyyy. This event may be caused by network connectivity issues or configured incorrectly DNS server."

3) Then we see a cascade of MSExchangeADTopology errors "Process Microsoft.Exchange.Directory.TopologyService.exe (PID=2728) Forest xxxx. Topology discovery failed, error details
No suitable domain controller was found in domain 'yyyy'."

4) From the cmd prompt on the Exchange server we can ping the domain controller. But "net user xxxx /domain" (where xxxx is a valid domain user account) fails with an error saying it cannot reach the domain controller.
  ** This data point seems to indicate a general failure in domain-type communication between the Exchange server and the domain controller. But how do we troubleshoot this? **

5) Sometimes we are able to restart the "Microsoft Exchange Active Directory Topology" service and have everything come back up. Other times the restart fails because a service fails to shut down in time. Recently we have started just rebooting the Exchange server each time it happens because this is a more reliable way of getting it back online.

Things we have tried that didn't help:

1) Simplifying Exchange by setting non-essential services to manual: Exchange Diagnostics, Exchange Health Managers, IMAP4, IMAP4 Backend, Exchange Server Extension for Windows Server Backup, Exchange Throttling, Unified Messaging, Unified Messaging Call Router.

2) Contacting Microsoft to open a case. They were singularly unhelpful. They will only help if the server is actively failing, and when the server is actively failing their suggestion was to troubleshoot the database because it was in a "dirty shutdown" state. We pointed out that this problem is fixed when you restart the services (and demonstrated by restarting the services), and at that point the conversation spiraled into a debate about what the scope of the ticket should be. After 5 hours on the phone and gathering data they had not provided any useful suggestions, so we gave up.

3) Upgrading to the latest Exchange 2013 CU -- we went from CU5 (or 6) to CU11.

4) Disabling the firewall on the domain controller and on the Exchange server.

--------------------------------------

Questions:

1) Has anyone else experienced and fixed this problem?

We see other solutions online that relate to the SACL Watcher Servicelet error, but they seem related to retiring a domain controller and having Exchange look for the retired domain controller. In our case the SACL error refers to our new domain controller.

2) What is the best way to troubleshoot the problem once it is failing?

If "net user xxxx /domain" is failing but we can ping the domain controller, it seems like we should be able to nail it down to a communications failure, but we don't know the best tool for this.

Thanks for any help!
-Frank.
Mahesh

VirtualBox is not a virtualization platform that Microsoft officially supports for Exchange 2013 or for Windows Server 2012 / 2012 R2.

Check the Server Virtualization Validation Program.

I have never seen a production Exchange system on VirtualBox.

You can use the Hyper-V that ships with 2012 R2 Standard and deploy Exchange 2013 as a guest there.
You can install one copy of 2012 R2 Standard edition on a physical server and run TWO VMs on top of it free of charge.

OR

You can deploy it on VMware ESXi, which is also fully supported.

I don't think you can get support from Microsoft, since you are running an unsupported scenario.

Check the Exchange link:
https://technet.microsoft.com/en-in/library/jj619301(v=exchg.150).aspx

Your description clearly says that Exchange is having difficulty connecting to the domain controller, so you need to work out why Exchange cannot reach the DC reliably.

Some commands can help.
You can use these to check whether Exchange can communicate with the DC:
nltest /SC_Query:domain.com
nltest /SC_verify:domain.com
nltest /dsgetdc:domain.com
nltest /dclist:domain.com

replace domain.com with your AD domain
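
Since the failure is intermittent, it may help to capture all four checks with a timestamp at the moment things break. The following is only a rough sketch: `build_nltest_commands` and `run_checks` are hypothetical helper names, `domain.com` is still a placeholder, and it assumes Python is installed on the Exchange server and `nltest` is on the PATH.

```python
import subprocess
from datetime import datetime

# The four nltest switches listed above, in the same order.
NLTEST_SWITCHES = ["/SC_Query", "/SC_verify", "/dsgetdc", "/dclist"]


def build_nltest_commands(domain):
    """Return one argument list per nltest check for the given domain."""
    return [["nltest", f"{switch}:{domain}"] for switch in NLTEST_SWITCHES]


def run_checks(domain, runner=subprocess.run):
    """Run each check and return (timestamp, command, exit code, output) tuples.

    `runner` is injectable so the function can be exercised without nltest.
    """
    results = []
    for cmd in build_nltest_commands(domain):
        started = datetime.now().isoformat(timespec="seconds")
        try:
            proc = runner(cmd, capture_output=True, text=True, timeout=60)
            results.append((started, " ".join(cmd), proc.returncode, proc.stdout.strip()))
        except (OSError, subprocess.TimeoutExpired) as exc:
            # nltest missing or hung: record the error instead of crashing.
            results.append((started, " ".join(cmd), None, str(exc)))
    return results
```

Calling `run_checks("domain.com")` once while Exchange is healthy and again during an outage gives two sets of output that can be compared side by side.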
Also check the nslookup output; it should point to the DC if a reverse lookup zone is configured.

Download the PortQryUI tool and check whether the Exchange server can connect to the DC on the well-known AD authentication ports (you need to select "Domains and Trusts" in the app).

Check whether you are able to ping the DC from the Exchange server.
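
Keep in mind that ping only proves ICMP reachability, while domain traffic runs over TCP/UDP services such as LDAP, Kerberos and RPC. If you would rather script the port check than use a GUI tool, something like the sketch below could work. The port list is an assumption based on commonly documented AD ports, not something taken from this thread, and it only tests TCP connects. It does not cover UDP or the dynamic RPC port range. `check_tcp_port` and `check_dc` are hypothetical helper names.

```python
import socket

# Commonly documented TCP ports a domain member uses to reach a DC (assumed list).
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    636: "LDAPS",
    3268: "Global Catalog",
}


def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_dc(host, ports=AD_TCP_PORTS, timeout=3.0):
    """Return {port: reachable} for every port in `ports`."""
    return {port: check_tcp_port(host, port, timeout) for port in ports}
```

Running `check_dc` against the DC's IP during normal operation and again during a failure should show whether specific services stop answering while ping still works.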
FDC2005

ASKER

Hi Mahesh -

Thank you for the tips.

What's funny is the Microsoft tech didn't blink at the fact we're using VirtualBox. He was happy to spend hours talking about the scope of the ticket, but he had no complaints about VirtualBox.

We standardized on VirtualBox for other business reasons: we run VMs with VPN software that cuts off all local network access once connected to remote customer sites, and VirtualBox has the best out-of-band RDP access to the VM console. We spent many months investigating Hyper-V and VMware, and the solutions from both were not sufficient for our needs.

We are a small company, so if it requires a second virtualization solution just to support Exchange, we'll give up on hosting our own Exchange, even if it means leaving our customer data on someone else's server. We have to pick our battles.

Now back to troubleshooting...

> check if you are able to ping DC from exchange server

Yes - we already did this, as I mentioned above. Ping works fine, even when Exchange is failing, but "net user _username_ /domain" fails.

> also check nslookup output, it should point to DC if reverse lookup zone is configured

Confirmed - nslookup points to DC.

> nltest /SC_Query:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> nltest /SC_verify:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> nltest /dsgetdc:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> nltest /dclist:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> download PortQryUI tool and see if from exchange server you can telnet to DC on well known AD auth ports (you need to select domains and trust in app)

Good tip... I will try this.

thanks!
-Frank.
Mahesh

You could ask MS support about VirtualBox supportability for Windows Server 2012 R2 and Exchange 2013 specifically, and whether they have any known compatibility issues and fixes for them.

You could also post in the VirtualBox forums; you might get a fix there.

As Exchange server requires at least 16GB of memory and a 2012 R2 Standard edition license, I believe you would need a dedicated physical machine to host the Exchange and DC VMs.
You could try giving the Exchange VM 4 vCPUs and 16GB and see if that resolves your issue.
Instead of VirtualBox, you could install 2012 R2 Standard edition on a physical box and still have the right to run TWO VMs in Hyper-V for free (one 2012 R2 Standard server license allows you to run two VM instances).

Finally, VirtualBox is a Type-2 hypervisor and Hyper-V is a Type-1 hypervisor, which can manage resources better.
https://social.technet.microsoft.com/Forums/exchange/en-US/10faab94-d635-42ac-8b16-0714f8a7747b/installing-exchange-2013-in-virtualbox-vm-unbearably-slow?forum=exchangesvrgeneral

You might go for Office 365 if you don't want to manage Exchange on premises; you can check with MS.

ASKER

Hi Mahesh -

> As Exchange server requires at least 16GB memory and 2012 R2 standard edition license, I believe you would require dedicated physical machine to host exchange and DC VMs You could try providing 4 vcpu and 16GB to exchange VM and see if resolves your issue

We have provided 16GB of RAM to the Exchange VM (according to Task Manager it typically uses ~8GB) and 12 logical processors. The Exchange VM and DC VM are running on an R610 with dual 6-core Xeons (hyperthreading enabled). This R610 has 80GB of physical RAM and 3TB of solid-state disk. The Exchange VM, the DC VM and several other small, lightweight VMs are the only VMs running on the R610. Looking at htop on the R610, it is only using 45GB of its available 80GB of RAM, and the load average is ~1.25. So overall it seems to be a very lightly loaded VM host.

> You might go for Office 365 if you don't want to manage exchange on premise, you can check with MS

We are evaluating Office 365. Our preference is to keep our customer data internal, if possible. Our experience with Exchange 2003 was that once we got it set up, it was very reliable: 10+ years with the only downtime being reboots for critical updates. We'd just like to get back to that point with Exchange 2013.

I am going to wait for the next downtime and see what I can learn from troubleshooting it with Wireshark / PortQryUI / etc.

Thanks!
-Frank.

ASKER

Two Updates -

1) Running netstat -a during a crash yesterday uncovered LDAP connections to the *old* domain controller. I'm not sure why the new Exchange server would ever need to communicate with the old LDAP server, so this makes me suspicious. We are done with items from the old domain controller, so we shut it down today. After the next Exchange crash we'll reboot both the new domain controller and the new Exchange server, to ensure there are no hanging references to the old DC.

2) Running Wireshark during a crash this morning showed interesting results that we are looking into, including occasional "spurious retransmission", "duplicate ack" and "destination unreachable" entries. We are comparing against Wireshark traffic captured during normal operation.
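
For anyone wanting to repeat the netstat check in item 1, here is a rough sketch of how the output could be filtered for LDAP peers. It assumes numeric output from `netstat -an` (plain `netstat -a` may print host and service names instead), and the function name `ldap_peers` and the port set are my own assumptions rather than anything from a Microsoft tool.

```python
from collections import defaultdict

# LDAP, LDAPS, Global Catalog and Global Catalog over SSL (assumed port set).
LDAP_PORTS = {389, 636, 3268, 3269}


def ldap_peers(netstat_output):
    """Group TCP connections to LDAP-style ports by remote address.

    Returns {remote_host: [(remote_port, state), ...]} in input order.
    """
    peers = defaultdict(list)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State.
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        remote, state = fields[2], fields[3]
        host, _, port = remote.rpartition(":")
        if port.isdigit() and int(port) in LDAP_PORTS:
            peers[host].append((int(port), state))
    return dict(peers)
```

Any remote address in the result that is not the new DC deserves a closer look.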

-Frank.
SOLUTION
Mahesh
ASKER CERTIFIED SOLUTION

ASKER

The problem wasn't related to VirtualBox or to VM hosting at all. The root cause was that the old DC and the new DC were both listening on the same RRAS base IP.
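
A conflict like this is easy to miss because each server looks fine on its own. As a hedged sketch of one way to spot it, the helper below takes the addresses configured on each server (for example, copied by hand from `ipconfig /all`, RRAS interfaces included) and reports any address claimed by more than one host. The function name `shared_addresses`, the host names and the addresses in the usage note are all made up for illustration.

```python
from collections import defaultdict


def shared_addresses(host_addresses):
    """Return {address: [hosts]} for every address configured on 2+ hosts.

    `host_addresses` maps a host name to the list of IPs configured on it.
    """
    owners = defaultdict(set)
    for host, addresses in host_addresses.items():
        for addr in addresses:
            owners[addr].add(host)
    return {addr: sorted(hosts) for addr, hosts in owners.items() if len(hosts) > 1}
```

For example, `shared_addresses({"old-dc": ["10.0.0.5", "10.10.0.1"], "new-dc": ["10.0.0.10", "10.10.0.1"]})` would flag `10.10.0.1` as being present on both servers.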