Exchange 2013 intermittent failures - looks like failure to reach domain controller

We have a fresh install of Exchange 2013 on Windows Server 2012R2, and a separate fresh install domain controller on Windows Server 2012R2. Both servers are VMs with fixed IPs on the same VirtualBox host using bridged networking. There is another older domain controller on the same subnet but it is for a different domain. We are in the midst of "upgrading" from the old domain, old domain controller, old Exchange, etc. to the new setup.

** Note: We are not Windows Networking or Exchange experts. We run our own Exchange server because we want to keep our customer client data secure within our own network. Once we got Exchange 2003 set up properly we left it running for 10+ years with only security updates, and we were perfectly happy with that setup until Outlook 2013 dropped support for Exchange 2003. **

Our problem is Exchange 2013... it is failing intermittently (2-3 times per day). When it fails we get a cascade of error messages in the application log and system log on the Exchange server, and our Outlook clients are not able to connect to Exchange. We have not seen application/system log errors on the domain controller.

There seems to be a pattern to the failures:

1) The first or nearly-first application log error on the Exchange server is "SACL Watcher servicelet encountered an error while monitoring SACL change. Got error 1721 opening group policy on system xxxx in domain yyyy" where xxxx is the domain controller and yyyy is our domain.

2) Then we see one or more MSExchange ADAccess errors "Process Microsoft.Exchange.Directory.TopologyService.exe (PID=2728). Exchange Active Directory Provider could not find an available domain controller in domain yyyy. This event may be caused by network connectivity issues or configured incorrectly DNS server."

3) Then we see a cascade of MSExchangeADTopology errors "Process Microsoft.Exchange.Directory.TopologyService.exe (PID=2728) Forest xxxx. Topology discovery failed, error details
No suitable domain controller was found in domain 'yyyy'."

4) From the cmd prompt on the Exchange server we can ping the domain controller. But "net user xxxx /domain" (where xxxx is a valid domain user account) fails with an error saying it cannot reach the domain controller.
  ** This data point seems to indicate a general failure in domain-type communication between the Exchange server and the domain controller. But how do we troubleshoot this? **

5) Sometimes we are able to restart the "Microsoft Exchange Active Directory Topology" service and have everything come back up. Other times the restart fails because a service fails to shut down in time. Recently we have started just rebooting the Exchange server each time it happens because this is a more reliable way of getting it back online.

Things we have tried that didn't help:

1) Simplifying Exchange by setting non-essential services to manual: Exchange Diagnostics, Exchange Health Managers, IMAP4, IMAP4 Backend, Exchange Server Extension for Windows Server Backup, Exchange Throttling, Unified Messaging, Unified Messaging Call Router.

2) Contacting Microsoft to open a case. They were singularly unhelpful. They will only help if the server is actively failing, and when the server is actively failing their suggestion was to troubleshoot the database because it was in a "dirty shutdown" state. We pointed out that this problem is fixed when you restart the services (and demonstrated by restarting the services), and at that point the conversation spiraled into a debate about what the scope of the ticket should be. After 5 hours on the phone and gathering data they had not provided any useful suggestions, so we gave up.

3) Upgrading to the latest Exchange 2013 CU -- we went from CU5 (or 6) to CU11.

4) Disabling the firewall on the domain controller and on the Exchange server.

--------------------------------------

Questions:

1) Has anyone else experienced and fixed this problem?

We see other solutions online that relate to the SACL Watcher Servicelet error, but they seem related to retiring a domain controller and having Exchange look for the retired domain controller. In our case the SACL error refers to our new domain controller.

2) What is the best way to troubleshoot the problem once it is failing?

If "net user xxxx /domain" is failing but we can ping the domain controller, it seems like we should be able to nail it down to a communications failure, but we don't know the best tool for this.

Thanks for any help!
-Frank.
FDC2005 Asked:
Mahesh (Architect) Commented:
VirtualBox is not a virtualization platform that Microsoft officially supports for Exchange 2013 or for Windows Server 2012 / 2012 R2.

Check the Server Virtualization Validation Program (SVVP).

I have never seen a production Exchange system on VirtualBox.

You can use the Hyper-V role that ships with 2012 R2 Standard and deploy Exchange 2013 as a guest there.
You can install one 2012 R2 Standard edition on a physical server and run TWO VMs on top of it free of charge.

OR

You can deploy it on VMware ESXi, which is also fully supported.

I don't think you can get support from Microsoft, since you are running an unsupported scenario.

Check exchange link
https://technet.microsoft.com/en-in/library/jj619301(v=exchg.150).aspx

Your description clearly shows that Exchange is having difficulty connecting to the domain controller, so you need to check whether Exchange can reach the DC correctly.

Some commands can help you check whether Exchange can communicate with the DC:
nltest /SC_Query:domain.com
nltest /SC_verify:domain.com
nltest /dsgetdc:domain.com
nltest /dclist:domain.com

Replace domain.com with your AD domain.
Also check the nslookup output; it should point to the DC if a reverse lookup zone is configured.

Download the PortQryUI tool and check whether the Exchange server can connect to the DC on the well-known AD authentication ports (select "Domains and Trusts" in the app).

Check whether you are able to ping the DC from the Exchange server.
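
For anyone who wants to script the port check instead of using a GUI tool, here is a minimal Python sketch. It is only an illustration: the port list is the commonly documented set of AD-related TCP ports rather than anything specific to this environment, and the names `probe_tcp` and `probe_dc` are made up for this example. It tests TCP only, so a pass here does not prove that UDP or the dynamic RPC ports are healthy.

```python
import socket

# Commonly documented AD-related TCP ports (an assumption; adjust as needed).
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    464: "Kerberos password change",
    636: "LDAPS",
    3268: "Global Catalog",
}


def probe_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_dc(host, ports=AD_TCP_PORTS, timeout=2.0):
    """Probe each port in turn and return a {port: reachable} dict."""
    return {port: probe_tcp(host, port, timeout) for port in ports}


# Example use (hypothetical DC name, replace with your own):
#   for port, ok in probe_dc("dc.example.local").items():
#       print(port, AD_TCP_PORTS[port], "open" if ok else "NO ANSWER")
```

Running it once while Exchange is healthy and again during a failure gives two results to compare, which is more useful than a single snapshot.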
FDC2005 (Author) Commented:
Hi Mahesh -

Thank you for the tips.

What's funny is the Microsoft tech didn't blink at the fact we're using VirtualBox. He was happy to spend hours talking about the scope of the ticket, but he had no complaints about VirtualBox.

We standardized on VirtualBox for other business reasons - we run VMs with VPN software that cuts off all local network access once connected to remote customer sites, and VirtualBox has the best overhead RDP access to the VM console. We spent many months investigating Hyper-V and VMware, and the solutions from both were not sufficient for our needs.

We are a small company, so if it requires a second virtualization solution just to support Exchange, we'll give up on hosting our own Exchange, even if it means leaving our customer data on someone else's server. We have to pick our battles.

Now back to troubleshooting...

> check if you are able to ping DC from exchange server

Yes - we already did this, as I mentioned above. Ping works fine, even when Exchange is failing, but "net user _username_ /domain" fails.

> also check nslookup output, it should point to DC if reverse lookup zone is configured

Confirmed - nslookup points to DC.

> nltest /SC_Query:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> nltest /SC_verify:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> nltest /dsgetdc:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> nltest /dclist:domain.com

Confirmed (When Exchange is up). Next time it goes down I will try this.

> download PortQryUI tool and see if from exchange server you can telnet to DC on well known AD auth ports (you need to select domains and trust in app)

Good tip... I will try this.

thanks!
-Frank.
Mahesh (Architect) Commented:
You could ask MS support whether VirtualBox is supported with 2012 R2 and Exchange 2013 specifically, and whether they have any known compatibility issues and fixes for them.

You could also post in the VirtualBox forums; you might get a fix there.

As Exchange server requires at least 16GB of memory and a 2012 R2 Standard edition license, I believe you would need a dedicated physical machine to host the Exchange and DC VMs.
You could try giving the Exchange VM 4 vCPUs and 16GB and see if that resolves your issue.
Instead of VirtualBox, you could install 2012 R2 Standard edition on the physical box and still have the right to run TWO VMs in Hyper-V for free (one 2012 R2 Standard server license allows two VM instances).

Finally, VirtualBox is a Type-2 hypervisor and Hyper-V is a Type-1 hypervisor, which can manage resources better.
https://social.technet.microsoft.com/Forums/exchange/en-US/10faab94-d635-42ac-8b16-0714f8a7747b/installing-exchange-2013-in-virtualbox-vm-unbearably-slow?forum=exchangesvrgeneral

You might go for Office 365 if you don't want to manage Exchange on premises; you can check with MS.
FDC2005 (Author) Commented:
Hi Mahesh -

> As Exchange server requires at least 16GB memory and 2012 R2 standard edition license, I believe you would require dedicated physical machine to host exchange and DC VMs You could try providing 4 vcpu and 16GB to exchange VM and see if resolves your issue

We have provided 16GB of RAM to the Exchange VM (according to task manager it is typically using ~8GB) and 12 logical processors. The Exchange VM and DC VM are running on an R610 with dual 6-core Xeons (hyperthreading enabled). This R610 has 80GB of physical RAM and 3TB of solid-state disk. The Exchange VM and DC VM and several other small light-VMs are the only VMs running on the R610. Looking at htop on the R610, it is only using 45GB of its available 80GB of RAM, and the load average is ~1.25. So overall it seems to be a very lightly loaded VM host.

> You might go for Office 365 if you don't want to manage exchange on premise, you can check with MS

We are evaluating Office 365. Our preference is to keep our customer data internal, if possible. Our experience with Exchange 2003 was that once we got it set up, it was very reliable -- 10+ years with the only downtime being reboots for critical updates. We'd just like to get back to that point with Exchange 2013.

I am going to wait for the next downtime and see what I can learn from troubleshooting it with Wireshark / PortQryUI / etc.

Thanks!
-Frank.
FDC2005 (Author) Commented:
Two Updates -

1) Running netstat -a during a crash yesterday uncovered ldap connections to the *old* domain controller. I'm not sure why there would ever be a need/reason for the new exchange server to communicate with the old ldap server, so this makes me suspicious. We are done with items from the old domain controller so we shut it down today. After the next exchange crash we'll reboot both the new domain controller and the new exchange server, to ensure there are no hanging references to the old DC.

2) Running wireshark during a crash this morning displayed interesting results that we are looking into, including occasional "spurious retransmission", "duplicate ack" and "destination unreachable". We are comparing against wireshark traffic during normal operation.
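
For reference, the netstat check in item 1 can be automated. The following is a rough Python sketch, assuming the numeric output of `netstat -n` on Windows (columns: Proto, Local Address, Foreign Address, State). The function names and the sample addresses used in the comments are hypothetical, not taken from our servers.

```python
# LDAP, LDAPS, Global Catalog, Global Catalog over SSL.
LDAP_PORTS = {389, 636, 3268, 3269}


def ldap_peers(netstat_text, ports=LDAP_PORTS):
    """Return the set of remote addresses with TCP connections to LDAP ports.

    Connections in any state are included, not only ESTABLISHED ones.
    """
    peers = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected shape: TCP <local addr:port> <remote addr:port> <state>
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        remote_host, _, remote_port = fields[2].rpartition(":")
        if remote_port.isdigit() and int(remote_port) in ports:
            peers.add(remote_host)
    return peers


def unexpected_ldap_peers(netstat_text, expected_dcs):
    """Return LDAP peers that are not in the set of DCs we expect to talk to."""
    return ldap_peers(netstat_text) - set(expected_dcs)


# Example use (addresses are placeholders):
#   import subprocess
#   out = subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout
#   print(unexpected_ldap_peers(out, {"192.168.1.14"}))
```

A non-empty result means the Exchange server is holding LDAP connections to something other than the expected DC, which is what tipped us off here.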

-Frank.
Mahesh (Architect) Commented:
You cannot restrict Exchange from communicating with any specific domain controller in the same site.
I believe both the old and new DCs are in the same site.

You need to ensure that no stale DC entry exists in Active Directory. If you find one, you need to remove it from the Domain Controllers OU, Sites and Services, and DNS (NS records, host (A) records, etc.).
https://support.microsoft.com/en-us/kb/216498
FDC2005 (Author) Commented:
Update -

We found the problem... at least we think we did. Exchange has been up for 72+ hours since we made our change, where previously it was crashing 2-3 times per day. So we are happy.

In a nutshell: Both our new domain controller and our old domain controller were listening on the same IP due to both being configured as VPN servers and using the same base IP as part of the RRAS/VPN setup. The servers had different fixed IPs, but the problem was using the same base IP in the RRAS/VPN setup. When Exchange communicated with the new domain controller's fixed IP, everything was fine. But when Exchange tried to communicate with the new domain controller via the base IP used in RRAS/VPN, that request sometimes went to the new DC and sometimes to the old DC. The result: confusion, problems, and a cascade of failures in Exchange.

Lessons learned:

1) The new DC should not have been set up for RRAS/VPN with the same base IP as the old DC.
2) Netstat and Wireshark make a good troubleshooting duo for problems like this.
3) A server set up for RRAS/VPN listens on the RRAS base IP internally, not just across VPN connections.
4) The name server on the new domain controller likely shouldn't publish its address as both its fixed IP and the RRAS base IP. We are looking into this.

Details:

1) Server2: Old domain controller.
2) Server3: New domain controller.
3) Server3e: New Exchange server.
4) During crash events, on Server3e, cmd> netstat -a listed LDAP (Active Directory) connections from Server3e to both Server3 (expected) and Server2 (unexpected and likely bad).
5) After the reboot and during uptime, I confirmed netstat reported no connections to Server2.
6) During the next crash (Friday AM), we again had LDAP connections to Server2. During the crash I made several Wireshark captures of raw network traffic to analyze.
7) We shut down Server2 Friday morning.
8) No crashes since.
9) The Wireshark captures from Friday morning's crash showed connections to Server2 and problems due to that (negative replies to authentication requests).
10) Root cause:
     a. Server2 was our old VPN server. When we connected to it, it handed out dynamic IP addresses in the 192.168.1.1xx range. It listened on 192.168.1.100.
     b. Server3 is our new VPN server. When we connect to it, it hands out dynamic IP addresses in the 192.168.1.1xx range. It listens on 192.168.1.100.

Thus we had two servers listening for traffic on 192.168.1.100. And Server3 was publishing (via DNS) its address as both 192.168.1.14 and 192.168.1.100.

Sometimes when Exchange on Server3e needed something from Server3, it sent the request to 192.168.1.14. These requests worked.
And sometimes when Server3e needed something from Server3, it sent the request to 192.168.1.100. If Server3 answered, it worked. If Server2 answered, the request failed. This confused Exchange and led to a cascade of failures.
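
In hindsight, a quick check of what the DC's name resolves to would have flagged the extra address early. Here is a minimal Python sketch of that idea. The helper names are invented for this example, and it only sees what the local resolver returns, so treat it as a first sanity check rather than a full DNS audit.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted list of IPv4 addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def extra_addresses(resolved, expected):
    """Return the resolved addresses that are not in the expected set."""
    return sorted(set(resolved) - set(expected))


# Example use (the host name is a placeholder for the DC's DNS name):
#   found = resolve_ipv4("server3.example.local")
#   print(extra_addresses(found, ["192.168.1.14"]))
```

With the setup described above, the expected list would be just 192.168.1.14, and 192.168.1.100 would show up as an extra address.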

-Frank.

FDC2005 (Author) Commented:
The problem wasn't related to VirtualBox or to VM hosting in general. The root cause was the old DC and the new DC both listening on the same RRAS base IP.