Windows 2003 SP2 Domain Controllers become unresponsive until reboot

Background:
While installing a new DC (because the SA I was replacing was on the wrong path), we discovered that our DNS zones were not Active Directory Integrated. We changed the zones to ADI and, after discovering other issues, demoted the new DC and unpublished its DC root cert.

Our network consists of the following:
DC1 - Windows 2003 Server Enterprise w/SP2
DC2 / Exchange Server - Windows 2003 Server Enterprise w/SP2 / Exchange - Exchange 2003 w/SP3 (please stop laughing... it's not MY choice).
Both DCs have DNS installed.
Bluecoat Proxy
Users authenticate by CAC using Valicert Desktop Validator. All certs are downloaded and cached at 24 hour intervals.

Problem:
The network runs fine for 24 to 36 hours with no errors reported. Then, out of nowhere, one or both DCs become completely unresponsive. After a reboot, everything runs fine again for another 24 to 36 hours. In the course of troubleshooting, I increased the size of my security logs and now have them backed up and cleared well before they fill up, in accordance with Microsoft KB316685. The issue initially occurred every 24 hours or so. After I increased the event log size, the uptime seemed to increase by about 12 hours (this may be coincidental).

DC1 appears to become unable to find itself, at which point DC2 is usually the first to become unresponsive.

I've attached events (in chronological order) from the point where the issues seem to start, prior to the lockup. We are a military network, so for security reasons I have replaced the actual FQDN with <FQDN> and altered the actual usernames and IP info.

Any and all help is greatly appreciated.

Event Type:	Error
Event Source:	Userenv
Event Category:	None
Event ID:	1006
Date:		10/8/2008
Time:		9:09:15 AM
User:		NT AUTHORITY\SYSTEM
Computer:	TACMDC1
Description:
Windows cannot bind to <FQDN> domain. (Timeout). Group Policy processing aborted. 
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
-----------------------------------------------------------
Event Type:	Error
Event Source:	Userenv
Event Category:	None
Event ID:	1030
Date:		10/8/2008
Time:		9:09:15 AM
User:		NT AUTHORITY\SYSTEM
Computer:	TACMDC1
Description:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
------------------------------------------------------------------------
Event Type:	Error
Event Source:	BCAAA
Event Category:	(1)
Event ID:	2200
Date:		10/8/2008
Time:		9:10:29 AM
User:		N/A
Computer:	TACMDC1
Description:
[1692:1992] Cannot query domain controller 137.12.5.1; status=64:0x40:The specified network name is no longer available.
-----------------------------------------------------------------------
Event Type:	Error
Event Source:	DNS
Event Category:	None
Event ID:	4016
Date:		10/8/2008
Time:		9:12:08 AM
User:		N/A
Computer:	TACMDC1
Description:
The DNS server timed out attempting an Active Directory service operation on DC=103,DC=5.12.137.in-addr.arpa,cn=MicrosoftDNS,cn=System,DC=DOMAIN,DC=IRAQ,DC=PARENTDOMAIN1,DC=PARENTDOMAIN2,DC=MIL.  Check Active Directory to see that it is functioning properly. The event data contains the error.
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 55 00 00 00               U...    
-----------------------------------------------------------------------
Event Type:	Error
Event Source:	DNS
Event Category:	None
Event ID:	4016
Date:		10/8/2008
Time:		9:12:47 AM
User:		N/A
Computer:	TACMDC1
Description:
The DNS server timed out attempting an Active Directory service operation on ---.  Check Active Directory to see that it is functioning properly. The event data contains the error.
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 55 00 00 00               U...    
-----------------------------------------------------------------------
Event Type:	Error
Event Source:	Userenv
Event Category:	None
Event ID:	1006
Date:		10/8/2008
Time:		9:14:15 AM
User:		NT AUTHORITY\SYSTEM
Computer:	TACMDC1
Description:
Windows cannot bind to FQDN domain. (Server Down). Group Policy processing aborted. 
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
----------------------------------------------------------------------
Event Type:	Error
Event Source:	Userenv
Event Category:	None
Event ID:	1030
Date:		10/8/2008
Time:		9:14:15 AM
User:		NT AUTHORITY\SYSTEM
Computer:	TACMDC1
Description:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
---------------------------------------------------------------------
Event Type:	Error
Event Source:	BCAAA
Event Category:	(1)
Event ID:	2200
Date:		10/8/2008
Time:		9:14:28 AM
User:		N/A
Computer:	TACMDC1
Description:
[1692:1992] Cannot query domain controller 137.12.5.1; status=64:0x40:The specified network name is no longer available.
--------------------------------------------------------------------
 
Event Type:	Warning
Event Source:	KDC
Event Category:	None
Event ID:	21
Date:		10/8/2008
Time:		9:14:39 AM
User:		N/A
Computer:	TACMDC1
Description:
The client certificate for the user DOMAIN\DOEJ is not valid, and resulted in a failed smartcard logon.  Please contact the user for more information about the certificate they're attempting to use for smartcard logon. The chain status was : The revocation function was unable to check revocation because the revocation server was offline.
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 14 00 00 00 13 20 09 80   ..... .€
0008: 00 00 00 00 00 00 00 00   ........
---------------------------------------------------------------------
Event Type:	Error
Event Source:	Valicert Desktop Validator
Event Category:	None
Event ID:	1
Date:		10/8/2008
Time:		9:14:36 AM
User:		N/A
Computer:	TACMDC1
Description:
Certificate Revocation Status
Calling Application: lsass.exe
Certificate Name: /C=US/O=U.S. Government/OU=DoD/OU=PKI/OU=USA/CN=DOE.JOHN.David.123456789
Certificate Issuer: /C=US/O=U.S. Government/OU=DoD/OU=PKI/CN=DOD EMAIL CA-16
Certificate Serial Number: 1B8CC0
Revocation Status: Unable to verify
Validation Url: file://\\tacmdc1\crls$\emailca16.crl
Error: Memory allocation failure
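
A note on reading the "Data:" fields in the events above: the bytes look like little-endian DWORDs, so they can be decoded into the numeric codes that header files and error lookups use. The sketch below is only an illustration, not part of the original logs. The function names and the two-entry lookup table are my own. The table assumes that 0x55 is the LDAP timeout code and that 0x80092013 is the CryptoAPI "revocation server offline" code, which matches the wording of the DNS 4016 and KDC 21 descriptions. The meaning of the leading 0x14 in the KDC event is not known here and is left undecoded.

```python
import struct

# Small illustrative lookup table (an assumption, not exhaustive).
KNOWN_CODES = {
    0x00000055: "LDAP_TIMEOUT",
    0x80092013: "CRYPT_E_REVOCATION_OFFLINE",
}


def decode_event_data(hex_bytes):
    """Turn a string such as '55 00 00 00' into a list of 32-bit values.

    The bytes are interpreted as little-endian unsigned DWORDs; any
    trailing bytes that do not fill a whole DWORD are ignored.
    """
    raw = bytes.fromhex(hex_bytes)
    usable = len(raw) - (len(raw) % 4)
    count = usable // 4
    return list(struct.unpack("<%dI" % count, raw[:usable]))


def describe(value):
    """Format a DWORD as hex plus a name if it is in the table."""
    return "0x%08X %s" % (value, KNOWN_CODES.get(value, "(not in table)"))


if __name__ == "__main__":
    # Data bytes copied from the DNS 4016 and KDC 21 events above.
    for data in ("55 00 00 00", "14 00 00 00 13 20 09 80"):
        for dword in decode_event_data(data):
            print(describe(dword))
```

Read this way, the DNS events record a timeout talking to the directory, and the KDC warning records a revocation check that could not reach its source. Both fit the picture of the DC losing contact with itself rather than a problem with the individual certificate.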

sstone55423

Where are the FSMO roles and the GC located? Are they all operating?

ChiefIT

Let's ask a few questions:

Are either of these servers multihomed domain controllers? Multihomed is defined as having two or more IP addresses. That could mean two IPs on the same NIC or two+ NICs.
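
If it helps, here is a rough way to get a first answer to the multihomed question from the server itself. This is a minimal sketch, assuming Python is available on the box. The helper names are made up for illustration. It only reports the IPv4 addresses that the resolver returns for the machine's own host name, which is not necessarily every address bound to every NIC, so treat `ipconfig /all` as the authoritative check.

```python
import socket


def ipv4_addresses(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the resolver gives for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return sorted(set(addresses))


def is_multihomed(addresses):
    """True if two or more non-loopback addresses are present."""
    real = [a for a in addresses if not a.startswith("127.")]
    return len(real) >= 2


if __name__ == "__main__":
    host = socket.gethostname()
    try:
        found = ipv4_addresses(host)
    except OSError as exc:
        # Name resolution of the local host name failed.
        print("Could not resolve %s: %s" % (host, exc))
    else:
        print("%s resolves to: %s" % (host, ", ".join(found)))
        print("Multihomed by this test: %s" % is_multihomed(found))
```

If the DC's own name resolves to more than one address, that is worth comparing against the A records registered in DNS for the same name.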

Look in the FRS event logs for any errors in the 13000s. Are there any?

Have you noticed any DNS problems or intermittent internet connectivity during the "up time"?

Are you using imaged/cloned servers? This could break the trust or cause major problems if the clones ended up with the same SID.

From what I am seeing, this looks like a multihomed domain controller problem.