We recently installed updates on two domain controllers at what I will call SITE-1, which is part of a four-DC Sites and Services configuration. After the updates completed we started to see some strange behavior (it did not appear immediately). Currently we receive the attached error message (error_sqlconnect) any time we try to connect via SSMS to a Failover Cluster SQL instance from a Failover Cluster node, at both SITE-1 and SITE-2. One node member at SITE-2 currently holds our 'Production' SQL instance and is working, so luckily we are not hard down because of this, but all replication and mirroring has stopped because of, we think, whatever is causing this error. We have no fault tolerance: if this single node member goes down, we are down, and my hair would be in hot lava (it is already on fire).
We use Sites and Services in AD to control which DCs can be contacted by which servers, and as far as logging shows, those rules are applying properly for SITE-1 and SITE-2. I should mention that the Failover Clusters at both sites are affected. Interestingly, when we try to start the mirroring and replication services, the service accounts are immediately locked out and SSPI failure messages are logged in our SQL logs. I also want to point out that ALL other functions of
We have an ongoing case open with Microsoft (Sev A), but so far they have been unresponsive and unable to resolve it. The case was opened 50 hours ago; we have had 4 hours of engineer time, and the rest of the time we have simply been waiting. Please ask any questions you'd like and I'll try to answer what I can (within reason).
Things we have done:
- Ran dcdiag and repadmin /showrepl with no errors.
- Verified DNS at both SITES and made sure that ALL node members and instances resolve properly.
- Validated SQL connections and checked the SQL logs
- Validated SQL connections from other servers that were not cluster members
- Validated that SQL can authenticate fine to DCs
- Confirmed that any authentication from the cluster nodes causes accounts to lock out, returns the SSPI error in SQL, and produces the attached error.
- Used the SSPI context tool to verify that, when accounts don't lock out, we can get a successful connection
- Restarted cluster nodes (did not help)
- Used Sites and Services to replicate now between all DCs (both SITE-1 and SITE-2)
- Created new service accounts (these are locking out as well)
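
For anyone who wants to repeat the DNS check from the list above on their own nodes, here is a minimal sketch of that kind of check. It is not the exact method we used, it assumes Python is available on the machine you test from, and the function names (`resolve_all`, `unresolved`) are made up for illustration. It only confirms that each name resolves from that machine; it does not verify SPNs or Kerberos.

```python
import socket


def resolve_all(hostnames):
    """Map each hostname to the sorted list of addresses it resolves to.

    A name that fails to resolve maps to an empty list instead of raising.
    """
    results = {}
    for name in hostnames:
        try:
            infos = socket.getaddrinfo(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string for both IPv4 and IPv6.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[name] = []
    return results


def unresolved(results):
    """Return the hostnames that did not resolve to any address."""
    return [name for name, addresses in results.items() if not addresses]
```

Run it from each cluster node with the node names and SQL instance network names substituted in, for example `unresolved(resolve_all(["node-a.example.internal", "sqlinstance.example.internal"]))` (placeholder names); an empty list means every name resolved from that node.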
Things we have not done:
- Restarted SITE-2 DCs
- Restarted active cluster node
- Destroyed clusters
- Installed fresh
As I try other things and find additional information, I will add it to our ASK. Thanks in advance for any assistance you smart masses can provide.