MGio4
asked on
Windows 2003 SP2 Domain Controllers become unresponsive until reboot
Background:
While installing a new DC, because the SA I was replacing was on the wrong path, we discovered that DNS zones were not Active Directory Integrated. We changed zones to ADI and after discovering other issues, DEMOTED the new DC and unpublished the DC root cert for it.
Our network consists of the following:
DC1 - Windows 2003 Server Enterprise w/SP2
DC2 / Exchange Server - Windows 2003 Server Enterprise w/SP2 / Exchange - Exchange 2003 w/SP3 (please stop laughing... it's not MY choice).
Both DC's have DNS installed.
Bluecoat Proxy
Users authenticate by CAC using Valicert Desktop Validator. All certs are downloaded and cached at 24 hour intervals.
Problem:
Network will run fine for several hours (24 - 36) with no errors being reported. Out of nowhere, one or both DC's will become completely unresponsive. Upon reboot, everything begins to run fine again for another 24-36 hours. In the course of troubleshooting, I've increased the size of my security logs and have them backed up and cleared well before they fill up in accordance with Microsoft kb316685. The issue was initially occurring every 24 hours or so. After increasing event log size, the uptime seemed to increase by 12 hours or so (this may be coincidental).
DC1 appears to become unable to find itself, at which point DC2 is usually the first to become unresponsive.
I've attached events (in chron. order) from when the issues seem to start (prior to lockup). We are a military network, so for security reasons, I have replaced the actual FQDN with <FQDN> and altered actual usernames and IP info.
Any and all help is greatly appreciated.
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1006
Date: 10/8/2008
Time: 9:09:15 AM
User: NT AUTHORITY\SYSTEM
Computer: TACMDC1
Description:
Windows cannot bind to <FQDN> domain. (Timeout). Group Policy processing aborted.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
-----------------------------------------------------------
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1030
Date: 10/8/2008
Time: 9:09:15 AM
User: NT AUTHORITY\SYSTEM
Computer: TACMDC1
Description:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
------------------------------------------------------------------------
Event Type: Error
Event Source: BCAAA
Event Category: (1)
Event ID: 2200
Date: 10/8/2008
Time: 9:10:29 AM
User: N/A
Computer: TACMDC1
Description:
[1692:1992] Cannot query domain controller 137.12.5.1; status=64:0x40:The specified network name is no longer available.
-----------------------------------------------------------------------
Event Type: Error
Event Source: DNS
Event Category: None
Event ID: 4016
Date: 10/8/2008
Time: 9:12:08 AM
User: N/A
Computer: TACMDC1
Description:
The DNS server timed out attempting an Active Directory service operation on DC=103,DC=5.12.137.in-addr.arpa,cn=MicrosoftDNS,cn=System,DC=DOMAIN,DC=IRAQ,DC=PARENTDOMAIN1,DC=PARENTDOMAIN2,DC=MIL. Check Active Directory to see that it is functioning properly. The event data contains the error.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 55 00 00 00 U...
-----------------------------------------------------------------------
Event Type: Error
Event Source: DNS
Event Category: None
Event ID: 4016
Date: 10/8/2008
Time: 9:12:47 AM
User: N/A
Computer: TACMDC1
Description:
The DNS server timed out attempting an Active Directory service operation on ---. Check Active Directory to see that it is functioning properly. The event data contains the error.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 55 00 00 00 U...
-----------------------------------------------------------------------
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1006
Date: 10/8/2008
Time: 9:14:15 AM
User: NT AUTHORITY\SYSTEM
Computer: TACMDC1
Description:
Windows cannot bind to FQDN domain. (Server Down). Group Policy processing aborted.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
----------------------------------------------------------------------
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1030
Date: 10/8/2008
Time: 9:14:15 AM
User: NT AUTHORITY\SYSTEM
Computer: TACMDC1
Description:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
---------------------------------------------------------------------
Event Type: Error
Event Source: BCAAA
Event Category: (1)
Event ID: 2200
Date: 10/8/2008
Time: 9:14:28 AM
User: N/A
Computer: TACMDC1
Description:
[1692:1992] Cannot query domain controller 137.12.5.1; status=64:0x40:The specified network name is no longer available.
--------------------------------------------------------------------
Event Type: Warning
Event Source: KDC
Event Category: None
Event ID: 21
Date: 10/8/2008
Time: 9:14:39 AM
User: N/A
Computer: TACMDC1
Description:
The client certificate for the user DOMAIN\DOEJ is not valid, and resulted in a failed smartcard logon. Please contact the user for more information about the certificate they're attempting to use for smartcard logon. The chain status was : The revocation function was unable to check revocation because the revocation server was offline.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 14 00 00 00 13 20 09 80 ..... .€
0008: 00 00 00 00 00 00 00 00 ........
---------------------------------------------------------------------
Event Type: Error
Event Source: Valicert Desktop Validator
Event Category: None
Event ID: 1
Date: 10/8/2008
Time: 9:14:36 AM
User: N/A
Computer: TACMDC1
Description:
Certificate Revocation Status
Calling Application: lsass.exe
Certificate Name: /C=US/O=U.S. Government/OU=DoD/OU=PKI/OU=USA/CN=DOE.JOHN.David.123456789
Certificate Issuer: /C=US/O=U.S. Government/OU=DoD/OU=PKI/CN=DOD EMAIL CA-16
Certificate Serial Number: 1B8CC0
Revocation Status: Unable to verify
Validation Url: file://\\tacmdc1\crls$\emailca16.crl
Error: Memory allocation failure
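A side note on the "Data:" lines in the events above: the bytes are stored little-endian, so each group of four can be read as a 32-bit code. The Python sketch below is only an illustration (Python is not used anywhere in this thread, and the small lookup table is my own assumption, not something taken from the logs). Read this way, the DNS 4016 data `55 00 00 00` is 0x55 (decimal 85, which as far as I know is LDAP_TIMEOUT), and the second value in the KDC 21 data is 0x80092013, which I believe is CRYPT_E_REVOCATION_OFFLINE and agrees with the chain status text of that event.

```python
import struct

# Hand-picked table for the two codes seen in this post (assumption: not exhaustive,
# names given to the best of my knowledge).
KNOWN_CODES = {
    0x00000055: "LDAP_TIMEOUT (LDAP result 85)",
    0x80092013: "CRYPT_E_REVOCATION_OFFLINE",
}


def decode_event_data(hex_bytes):
    """Split a space-separated hex dump into little-endian 32-bit values."""
    raw = bytes.fromhex(hex_bytes.replace(" ", ""))
    usable = len(raw) - len(raw) % 4  # ignore a trailing partial DWORD
    return [struct.unpack_from("<I", raw, off)[0] for off in range(0, usable, 4)]


def describe(hex_bytes):
    """Pair each decoded value with a name from KNOWN_CODES, or 'unknown'."""
    return [(f"0x{v:08X}", KNOWN_CODES.get(v, "unknown"))
            for v in decode_event_data(hex_bytes)]
```

For example, `describe("14 00 00 00 13 20 09 80")` gives one entry for 0x00000014 (not in the table) and one for 0x80092013.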
Where are the FSMO roles and GC located? Are they all operating?
Let's ask a few questions:
Are either of these servers multihomed domain controllers? Multihomed is defined as having two or more IP addresses. That could mean two IPs on the same NIC or two+ NICs.
Look in FRS event logs for any errors that are in the 13000's. Any there?
Have you noticed any DNS problems or intermittent internet connectivity during the "up time"?
Are you using imaged/cloned servers? This could break the trust or cause major problems if the servers ended up with the same SID.
From what I am seeing, this looks like a multihomed Domain server problem.
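One quick way to check the multihomed question is to ask the resolver which IPv4 addresses the DC's own name maps to. This is a rough Python sketch under my own assumptions (the function names are made up for illustration, and it only sees what DNS/hosts resolution returns, so a disabled NIC with a stale DNS record would still show up, which is arguably the case you want to catch).

```python
import socket


def is_multihomed(addresses):
    """True when more than one distinct non-loopback IPv4 address is present."""
    distinct = {a for a in addresses if not a.startswith("127.")}
    return len(distinct) > 1


def resolved_ipv4_addresses(hostname=None):
    """Return the IPv4 addresses the given host name (default: this machine) resolves to."""
    name = hostname or socket.gethostname()
    _, _, addrs = socket.gethostbyname_ex(name)
    return addrs
```

On the DC itself, `is_multihomed(resolved_ipv4_addresses())` returning True would suggest a second address is registered for the server's name.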
ASKER
No. There are two NICs on each DC, but one is disabled.
The only 13000-series messages I have are for the FRS service starting and telling me that FRS is no longer preventing the machine from becoming a DC (AFTER REBOOT).
Have not noticed any DNS issues or irregularities with internet connectivity.
We are not using cloned servers. All server images are built from scratch.
There are other servers on the network that use teamed NICs.
ASKER
All FSMO roles are hosted on DC1 (TACMDC1), with the exception of Infrastructure. GC is on DC1. All appear to be operating.
Do you have Symantec AV installed?
ASKER
Yes... DC1 has 10.1.4. Second DC has 10.1.5 and Symantec Mail Security (5.0) For Exchange. These two DC's have had Symantec AV for at least 18 months.
SOLUTION
ASKER
I'll try that..... I'll keep you posted. It's odd because everything's been working fine until recently. Maybe ADI triggered it.
ASKER
It'll take a day or so to know if upgrading to 10.1.5 fixes it.
I wonder if daylight savings time is putting you too far out of synch to authenticate with the server?
ASKER
I'm not sure what server you mean... All servers, including the AV server are on Arabic Standard Time.
I updated Symantec AV to 10.1.5 a few hours ago. I'm going to keep my fingers crossed for a couple of days and see if that does the trick. I am still open to suggestions though. I have a window of opportunity to take some leave in a couple of weeks. If I don't get this fixed beforehand, there'll be no leave. Next opportunity for leave will be in February or March.
WOW, this is an authentication NIGHTMARE:
__________________________
Event Type: Error
Event Source: BCAAA
Event Category: (1)
Event ID: 2200
---Can not query Domain controller:
http://www.bluecoat.com/doc/direct/607
Blue Coat is used to authenticate with NTLMhash and be granted an NTLMhash access token from the domain controller. Fortunately, we are out of the stone ages and are currently using Kerberos authentication. NTLMhash has some very serious vulnerabilities that can be exploited by an inexperienced hacker. It was used on pre-Windows 2000 PCs. If everything you have on the domain is 2000 Pro or newer, you should NOT be authenticating to the DC using NTLMhash. In fact, the DC should be throwing this back at you as it will not grant you access. 2003 server SP2 shut the door to backwards authentication to NTLMhash.
For a description of LMHash, NTLMhash and Kerberos please see the following link:
https://www.experts-exchange.com/questions/23132123/Computer-failed-to-join-or-logon-to-domain-days-later-after-reboot.html
With that said, things like Malware and Skype can use NTLM. They often don't resort to Kerberos because of the increased security:
http://forums.bluecoat.com/viewtopic.php?p=9499&sid=54d63d665cc6b94ef7df9c643e64da23
__________________________
Event Type: Error
Event Source: DNS
Event Category: None
Event ID: 4016
--The DNS server timed out attempting an Active Directory service operation on ---. Check Active Directory to see that it is functioning properly.
I am assuming you have AD integrated DNS and that is good. AD will not work if you are trying to authenticate using NTLMHash authentication for the above given reasons.
__________________________
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1006
Date: 10/8/2008
Time: 9:14:15 AM
User: NT AUTHORITY\SYSTEM
Computer: TACMDC1
---Windows cannot bind to FQDN domain.
UserNV means User not valid: So, the remote procedure call (RPC) you are using to access domain services will throw you a bone saying you are not valid because you are trying to authenticate using NTLMhash.
__________________________
Event Type: Warning
Event Source: KDC
Event Category: None
Event ID: 21
Date: 10/8/2008
Time: 9:14:39 AM
User: N/A
--The client certificate for the user DOMAIN\DOEJ is not valid, and resulted in a failed smartcard logon. Please contact the user for more information about the certificate they're attempting to use for smartcard logon. The chain status was : The revocation function was unable to check revocation because the revocation server was offline.
Domain\DoeJ is trying to contact the KDC (Key Distribution Center) for verification. Kerberos will not validate this request, I believe, because it is using NTLMhash to try and authenticate with the domain controller.
__________________________
Event Type: Error
Event Source: Valicert Desktop Validator
Event Category: None
Event ID: 1
Date: 10/8/2008
Time: 9:14:36 AM
User: N/A
Computer: TACMDC1
Description:
Certificate Revocation Status
Calling Application: lsass.exe
http://www.tumbleweed.com/news/press_releases/2005/2005-02-07.html
Tumbleweed is an encrypted protocol that uses X.509 PKI certs to validate your computer prior to one computer communicating with another. So, every 24 to 36 hours your computer is trying to communicate with another computer and probably replicate data between two DCs. It appears this is trying to replicate DNS zones and validate the Kerberos tickets. The cert cannot be verified to the remote computer, therefore you can't get a KDC ticket and the smart card logon is knocked down.
__________________________
Conclusion:
It is my guess that you need to rid yourself of Blue Coat. In a Kerberos domain, it is not going to work.
Then, disable the domain controller's ability to be backwards compatible to NTLMhash for security reasons. You don't want your DC to be handing out access tokens to anything using NTLM authentication.
https://www.experts-exchange.com/questions/23132123/Computer-failed-to-join-or-logon-to-domain-days-later-after-reboot.html
Furthermore, you need to update your PKI certs to the domain controller you are trying to replicate with on the remote site. This may require a call to your Tumbleweed vendor.
http://www.tumbleweed.com/news/press_releases/2005/2005-02-07.html
It is also my guess that you need to get ahold of that computer that Domain\DoeJ is on and find out what in the world is using NTLM authentication. This looks like a backdoor attack using NTLM.
ASKER
Chief, I don't think the unit was experiencing these issues until they went AD Integrated. None of the errors I sent you appear until the DCs start to act up.
ChiefIT: It is my guess that you need to rid yourself of Blue coat. In a kerberos domain, it is not going to work.
I agree with you about BlueCoat. Unfortunately, the military customer that I support seems to think it's the greatest Proxy Appliance since sliced bread, although no one here knows a damned thing about it. The wizard that installed it left 8 or 9 months ago. Ridding us of BlueCoat is going to take a few months even if I can talk them into it.
ChiefIT: Then, disable the domain controller's ability to be backwards compatible to NTLMhash for security reasons. You don't want your DC to be handing out access tokens to anything using NTLM authentication.
Agreed, but negated by the fact that the BlueCoat appliance will be here for a bit.
ChiefIT: Furthermore, you need to update your PKI certs to the domain controller you are trying to replicate with on the remote site. This may require a call to your Tumbleweed vendor.
PKI certs are downloaded to the DC every night. We don't start experiencing the KDC and Valicert problems until we start losing the DC. Once we reboot, all issues are resolved.
ChiefIT: It is also my guess that you need to get ahold of that computer that Domain\DoeJ is on and find out what in the world is using NTLM authentication. This looks like a backdoor attack using NTLM.
This actually pertains to every user/machine that tries to log in while we're experiencing our issues. I only sent one error of each type. There were actually several.
When we start losing the DCs, LSASS.EXE pegs out at 99%. Tumbleweed uses LSASS.
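Since LSASS pegs the CPU and Valicert reports "Memory allocation failure", it may be worth recording how lsass.exe memory grows between reboots. Below is a hedged Python sketch that shells out to the built-in `tasklist` command and parses its CSV output. It assumes a Python interpreter is available on the DC (nothing in this thread says one is), and the function names are hypothetical; the same `tasklist` line could just as well be appended to a text file from a scheduled task.

```python
import csv
import subprocess


def parse_tasklist_csv(line):
    """Parse one line of `tasklist /FO CSV /NH` output into (image, pid, mem_kb)."""
    image, pid, _session_name, _session_num, mem = next(csv.reader([line]))
    # Mem Usage looks like "12,345 K"; strip the separators and the unit.
    mem_kb = int(mem.replace(",", "").replace(".", "").replace("K", "").strip())
    return image, int(pid), mem_kb


def sample_lsass():
    """Run tasklist for lsass.exe (Windows only) and return the parsed sample."""
    out = subprocess.check_output(
        ["tasklist", "/FI", "IMAGENAME eq lsass.exe", "/FO", "CSV", "/NH"],
        universal_newlines=True,
    )
    return parse_tasklist_csv(out.strip().splitlines()[0])


def growth_kb(samples):
    """Difference between the last and first memory readings, in KB."""
    return samples[-1] - samples[0] if samples else 0
```

Calling `sample_lsass()` every few minutes and logging the third value would show whether memory climbs steadily over the 24 to 36 hours before the lockup.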
ASKER
UPDATE: Yesterday morning (9:50 AST), I upgraded Symantec AV to 10.1.5 and rebooted DC1. Everything ran flawlessly (no error messages in the Event Viewer on anything) until 1:36 p.m. today. At 1:36, a handful of the previously mentioned KDC and Valicert error messages showed up for various users (no other errors) and users were able to log in only sporadically. After two minutes or so, everything returned to normal. At 3:19 p.m., the errors returned for a few minutes and disappeared again until 4:37. This cleared up again, and the cycle repeated itself at 5 to 10 minute intervals until it became persistent and I rebooted at 5:20 p.m. The second DC had very few similar errors until the reboot of DC1, at which point DC2 became extremely slow and was rebooted as well. When DC1 rebooted, I received:
Event Type: Warning
Event Source: LSASRV
Event Category: SPNEGO (Negotiator)
Event ID: 40960
Date: 10/10/2008
Time: 5:24:05 PM
User: N/A
Computer: DC1
Description:
The Security System detected an authentication error for the server cifs/DC2.<FQDN>. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
(0xc000005e)".
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0 ^..À
__________________________
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1058
Date: 10/10/2008
Time: 5:24:49 PM
User: DOMAINNAME\username
Computer: DC1
Description:
Windows cannot access the file gpt.ini for GPO cn={2910DB65-ED86-477E-9081-9E7A8A62E414},cn=policies,cn=system,DC=DOMAIN,DC=IRAQ,DC=PARENTDOMAIN,DC=PARENTDOMAIN,DC=MIL. The file must be present at the location <\\<FQDN>\SysVol\<FQDN>\Policies\{2910DB65-ED86-477E-9081-9E7A8A62E414}\gpt.ini>. (Configuration information could not be read from the domain controller, either because the machine is unavailable, or access has been denied. ). Group Policy processing aborted.
__________________________
Event Type: Warning
Event Source: Server
Event Category: None
Event ID: 2510
Date: 10/10/2008
Time: 5:25:15 PM
User: N/A
Computer: DC1
Description:
The server service was unable to map error code 998.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
--------------------------
Now that everything is rebooted, there are no issues. This will last until another 30 hours or so and repeat.
PKI certs are stored on both DCs. I'm totally stumped and trying to avoid building a new DC.
I do get two time errors: one telling me that the machine is configured to use the domain hierarchy to determine its time source, but it is the PDC emulator for the domain at the root of the forest, and the following one. But I wouldn't think they'd make a difference in such a short amount of time.
__________________
Event Type: Warning
Event Source: W32Time
Event Category: None
Event ID: 36
Date: 10/10/2008
Time: 9:17:23 AM
User: N/A
Computer: DC1
Description:
The time service has not synchronized the system time for 86400 seconds because none of the time service providers provided a usable time stamp. The time service is no longer synchronized and cannot provide the time to other clients or update the system clock. Monitor the system events displayed in the Event Viewer to make sure that a more serious problem does not exist.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
__________________________
Again, I'm stumped... any help is appreciated. The only errors I'm seeing are KDC and Valicert.
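To put numbers on the uptime in this update, here is a small Python sketch of the arithmetic. The reboot date of 10/9/2008 is an assumption on my part ("yesterday" relative to the 10/10/2008 events); the times are the ones given in the update.

```python
from datetime import datetime


def hours_between(start, end, fmt="%m/%d/%Y %I:%M %p"):
    """Elapsed hours between two timestamps written in the event log style."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


reboot = "10/9/2008 9:50 AM"         # assumed date, see note above
first_errors = "10/10/2008 1:36 PM"  # first KDC/Valicert errors after the upgrade
second_reboot = "10/10/2008 5:20 PM" # errors had become persistent

clean_hours = hours_between(reboot, first_errors)    # about 27.8 hours
total_hours = hours_between(reboot, second_reboot)   # 31.5 hours
```

Both figures fall inside the 24 to 36 hour window described in the original question, so the Symantec upgrade does not appear to have changed the cycle.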
Are the times of the DCs and clients exactly in sync? Do you have your DC configured to use an external time source?
ASKER
Times are exact... We are not pointing to an external time source.
SOLUTION
ASKER
40960 error only occurs upon reboot. Prior to that all errors are KDC and Valicert (with occasional w32 error).
I've run the reg file...
ASKER
By the way... Dc2 synchs with DC1
Make sure to run w32tm /resync /rediscover on your other DC too.
Check to see if this pertains to you at all.
http://support.microsoft.com/kb/822219
Look at this hotfix.
http://support.microsoft.com/kb/833620/
ASKER
I'm running 32 Bit windows. The hotfix doesn't appear to apply. I've looked at it previously.
Here is one more link.
http://www.eventid.net/display.asp?eventid=2510&eventno=559&source=Server&phase=1
ASKER
SAV is enabled. The 998 error only occurs on reboot as well.
MGio4: As I have re-read the post I have to agree with Chief's post. Are you sure that the KDC are the first errors? If you look at your question the Event ID: 1006 was listed first.
ASKER
This last time I didn't get the 1006. I don't think it happens unless I let the servers get totally unresponsive. I'm wondering if I might have a DNS issue.
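On the question of which error really comes first: the pasted events are not always in strict time order (in the original post the KDC warning at 9:14:39 is listed before the Valicert error at 9:14:36). A rough Python sketch for sorting pasted Event Viewer text by timestamp follows; the parsing rules are my own assumption based on the layout of the events in this thread, and blocks without both a Date and a Time line would need extra handling.

```python
import re
from datetime import datetime

# Only the fields needed for ordering and identification are captured.
FIELD = re.compile(r"^(Event Source|Event ID|Date|Time):\s*(.+)$")


def parse_events(text):
    """Turn pasted Event Viewer text into dicts, one per 'Event Type:' block."""
    events, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Event Type:"):
            current = {"Event Type": line.split(":", 1)[1].strip()}
            events.append(current)
            continue
        match = FIELD.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2).strip()
    return events


def when(event):
    """Combine the Date and Time fields of one event into a datetime."""
    stamp = event["Date"] + " " + event["Time"]
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")


def in_time_order(events):
    """Return the events sorted by their logged timestamp."""
    return sorted(events, key=when)
```

Running `in_time_order(parse_events(pasted_text))` on the events from the original question would put the Userenv 1006/1030 pair at 9:09:15 ahead of everything else.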
Do a netdiag and post the results.
ASKER
Chief - I think I have an ally over here in getting rid of BlueCoat... I'm working on it... all things in time (hopefully quick time).
Looking at DNS, I also found the DC we had added and removed in Name Servers (Address Unknown). I've since removed it.
I'm also considering setting up a scheduled task on one DC to net stop DNS and setting up a task on the other to Net Start DNS a while later to see if I can detect a memory leak. I've been here 18 hours though and need to think that over when I'm awake.
At any rate, here's my net diag:
C:\Documents and Settings\taclan>netdiag
.....................................
Computer Name: DC1
DNS Host Name: DC1.<FQDN>
System info : Microsoft Windows Server 2003 (Build 3790)
Processor : x86 Family 6 Model 15 Stepping 7, GenuineIntel
List of installed hotfixes :
KB911564
KB921503
KB924667-v2
KB925398_WMP64
KB925876
KB925902
KB926122
KB927891
KB929123
KB930178
KB931784
KB932168
KB933360
KB933729
KB933854
KB935839
KB935840
KB935966
KB936021
KB936357
KB936782
KB937143
KB937143-IE7
KB938127
KB938127-IE7
KB938464
KB939653-IE7
KB941202
KB941568
KB941569
KB941644
KB941672
KB941693
KB942615-IE7
KB942763
KB943055
KB943460
KB943484
KB943485
KB944533-IE7
KB944653
KB945553
KB946026
KB947864-IE7
KB948496
KB948590
KB948881
KB949014
KB950759-IE7
KB950760
KB950762
KB950974
KB951066
KB951698
KB951746
KB951748
KB952954
KB953838-IE7
Q147222
Netcard queries test . . . . . . . : Passed
Per interface results:
Adapter : Local Area Connection 1
Netcard queries test . . . : Passed
Host Name. . . . . . . . . : DC1.<FQDN>
IP Address . . . . . . . . : IPADDRESS OF DC1
Subnet Mask. . . . . . . . : 255.255.255.0
Default Gateway. . . . . . : GW IP
Dns Servers. . . . . . . . : DC1 IP
DC2 IP
AutoConfiguration results. . . . . . : Passed
Default gateway test . . . : Passed
NetBT name test. . . . . . : Passed
[WARNING] At least one of the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names is missing.
WINS service test. . . . . : Skipped
There are no WINS servers configured for this interface.
Global results:
Domain membership test . . . . . . : Passed
NetBT transports test. . . . . . . : Passed
List of NetBt transports currently configured:
NetBT_Tcpip_{D9459CB6-3577-40DD-8567-CBD24A49C656}
1 NetBt transport currently configured.
Autonet address test . . . . . . . : Passed
IP loopback ping test. . . . . . . : Passed
Default gateway test . . . . . . . : Passed
NetBT name test. . . . . . . . . . : Passed
[WARNING] You don't have a single interface with the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names defined.
Winsock test . . . . . . . . . . . : Passed
DNS test . . . . . . . . . . . . . : Passed
PASS - All the DNS entries for DC are registered on DNS server 'DC1 IP
and other DCs also have some of the names registered.
PASS - All the DNS entries for DC are registered on DNS server 'DC2 IP
and other DCs also have some of the names registered.
Redir and Browser test . . . . . . : Passed
List of NetBt transports currently bound to the Redir
NetBT_Tcpip_{D9459CB6-3577-40DD-8567-CBD24A49C656}
The redir is bound to 1 NetBt transport.
List of NetBt transports currently bound to the browser
NetBT_Tcpip_{D9459CB6-3577-40DD-8567-CBD24A49C656}
The browser is bound to 1 NetBt transport.
DC discovery test. . . . . . . . . : Passed
DC list test . . . . . . . . . . . : Passed
Trust relationship test. . . . . . : Skipped
Kerberos test. . . . . . . . . . . : Passed
LDAP test. . . . . . . . . . . . . : Passed
Bindings test. . . . . . . . . . . : Passed
WAN configuration test . . . . . . : Skipped
No active remote access connections.
Modem diagnostics test . . . . . . : Passed
IP Security test . . . . . . . . . : Skipped
Note: run "netsh ipsec dynamic show /?" for more detailed information
The command completed successfully
C:\Documents and Settings\taclan>
--------------------------
I'm going to bed and keeping my fingers crossed.
Thanks much for the help.
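For anyone comparing several netdiag runs, the following is a minimal Python sketch that pulls out only the tests that did not report "Passed" plus any [WARNING] lines. It assumes the output was saved to text and that result lines keep the "name . . . : Status" shape seen above; the function name summarize_netdiag and the SAMPLE text are illustrative, not part of netdiag itself.

```python
import re

# Matches netdiag result lines such as "DNS test . . . . . . : Passed".
# The dotted leader between the test name and the colon is skipped.
RESULT_LINE = re.compile(
    r"^\s*(?P<name>.+?)[ .]*:\s*(?P<status>Passed|Failed|Skipped)\s*$"
)


def summarize_netdiag(text):
    """Return (non_passing, warnings) from saved netdiag output."""
    non_passing = []
    warnings = []
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            if match.group("status") != "Passed":
                non_passing.append((match.group("name"), match.group("status")))
        elif "[WARNING]" in line:
            warnings.append(line.strip())
    return non_passing, warnings


# A few lines copied from the output above (hypothetical saved excerpt).
SAMPLE = """\
Netcard queries test . . . . . . . : Passed
NetBT name test. . . . . . : Passed
[WARNING] At least one of the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names is missing.
WINS service test. . . . . : Skipped
DNS test . . . . . . . . . . . . . : Passed
Trust relationship test. . . . . . : Skipped
Kerberos test. . . . . . . . . . . : Passed
"""

non_passing, warnings = summarize_netdiag(SAMPLE)
for name, status in non_passing:
    print(f"{status}: {name}")
for warning in warnings:
    print(warning)
```

On the output posted here, nothing is reported as Failed; only the skipped tests and the NetBT name warnings stand out.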
ASKER
Over the weekend, I dug around in DNS a bit and found that the DC we had added and removed was still listed under the Name Servers tab (Address Unknown). I removed it, then stopped and restarted DNS. I'm not sure whether that will fix anything, but it sure couldn't hurt.
All FSMO roles except Infrastructure are on DC1; Infrastructure is on DC2. DC1 was the only GC, so I made DC2 a GC as well. I also configured time on the PDC, although it doesn't appear to be working... DC2 and all clients are syncing time with DC1.
I've had REPLMON running on DC1 since yesterday morning and it is reporting successful replication at regular intervals. I plan on leaving it running for a couple of days.
On Saturday, I decided to point the two domain controllers to the same DNS server temporarily and then restart the Net Logon service on both servers, thinking that should re-register the domain controller DNS entries. I was going to use REPLMON to determine whether replication is really happening. For whatever reason, our BlueCoat Proxy appliance (which I'm trying to get rid of due to Chief's excellent advice) freaked out. It shouldn't have, as it has both DNS addresses listed. We couldn't get it back up until we did a hard reset of the BlueCoat. I rebooted the DCs while troubleshooting that.
As for the Godforsaken BlueCoat, they'll eventually let me get rid of it. As with anything involving the government, it's going to take a while though.
I rebooted DC1 again yesterday morning to make sure that if the 32 to 36 hour issue occurred, it would be in the middle of the day, while I was here, and would not interfere with any night missions. If it's going to crap out again, it should be sometime this afternoon.
We have two new issues now that may or may not be related. Bluecoat goes crazy after about 24 hours and we get hammered with the following message until we reset the device (TWICE):
Event Type: Warning
Event Source: BCAAA
Event Category: (1)
Event ID: 300
Date: 10/13/2008
Time: 3:06:42 AM
User: N/A
Computer: DC2
Description:
[5756:5432] Connection attempt from forbidden IP address: xxx.xxx.xx.xx
--------------------------
The other thing I noticed in looking through last night's logs on DC2 was the following three events:
Event Type: Error
Event Source: smtpsvc
Event Category: None
Event ID: 2013
Date: 10/13/2008
Time: 12:09:55 AM
User: N/A
Computer: DC2
Description:
SMTP could not connect to any DNS server. Either none are configured, or all are down.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 7c 26 00 00 |&..
--------------------------
Event Type: Warning
Event Source: smtpsvc
Event Category: None
Event ID: 2012
Date: 10/13/2008
Time: 12:09:55 AM
User: N/A
Computer: DC2
Description:
SMTP could not connect to the DNS server 'DC1 IP ADDRESS'. The protocol used was 'UDP'. It may be down or inaccessible.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: d5 04 00 00 Õ...
--------------------------
Event Type: Warning
Event Source: smtpsvc
Event Category: None
Event ID: 2012
Date: 10/13/2008
Time: 12:02:55 AM
User: N/A
Computer: DC2
Description:
SMTP could not connect to the DNS server 'DC2 IP ADDRESS'. The protocol used was 'UDP'. It may be down or inaccessible.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: d5 04 00 00 Õ...
--------------------------
These errors only occurred once and I see no issues (email or otherwise).
I'm also going to become familiar with Kiwi Syslog today and see if I can figure out how to configure it.
I appreciate the help... Enjoy what's left of your weekend. If this issue is resolved, I'll have my first weekend in 10 months in a few weeks. :) I look forward to hearing from you.
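The "Data:" lines in the smtpsvc events above can be turned into numeric codes for lookup. The sketch below is Python and assumes the first four bytes of the dump are a little-endian 32-bit value; that reading is consistent with a later LSASRV event in this thread, where the data "5e 00 00 c0" appears alongside the failure code 0xc000005e. The function name decode_event_data is illustrative only.

```python
def decode_event_data(dump_line):
    """Decode an Event Viewer dump line such as '0000: 5e 00 00 c0 ^..'.

    Assumption: the first four bytes form a little-endian 32-bit value.
    """
    _offset, _sep, rest = dump_line.partition(":")
    byte_tokens = rest.split()[:4]  # ignore the trailing ASCII rendering
    raw = bytes(int(token, 16) for token in byte_tokens)
    return int.from_bytes(raw, "little")


# Data lines copied from the smtpsvc events above.
for line in ("0000: 7c 26 00 00 |&..", "0000: d5 04 00 00 ...."):
    value = decode_event_data(line)
    print(f"{line[:17]} -> {value} (0x{value:08x})")
```

The resulting numbers can then be checked against a Windows error code reference; the sketch itself makes no claim about what each code means.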
I was looking at your new errors:
Looking at EventID, I see references on how to control SPAM and NDRs.
Please check out the last comments, from "Gordon", on this post:
http://www.eventid.net/display.asp?eventid=2012&eventno=3165&source=smtpsvc&phase=1
__________________________
Event Type: Warning
Event Source: BCAAA
Event Category: (1)
Event ID: 300
Date: 10/13/2008
Time: 3:06:42 AM
User: N/A
Computer: DC2
Description:
[5756:5432] Connection attempt from forbidden IP address: xxx.xxx.xx.xx
This could be one of several things:
Your NTLM hash authentication was refused by a Kerberos LDAP server.
or
This was once the IP address of someone attempting to hack in; the attempt was caught and the IP was designated unsafe.
or
Someone sees this connection and is trying a brute-force attack.
or
Someone has the wrong logon credentials and was locked out.
ASKER
Thanks for that one, Chief... The box was checked for recipient filtering, but it had never been enabled under virtual SMTP for either Exchange server. I did notice that one Exchange server had the IP assigned in Virtual SMTP and the other does not. I'm going to research that a little now. So far my logs look good regarding my original problem, but it's only been 29 hours since the last reboot, so I'm not going to get real excited just yet, as that's not yet in the 32 to 36 hour trend window I had previously noticed. I have seen it go 40. Keep your fingers crossed for me. Meanwhile, I'm still digging...
ASKER
Update: After 31 hours (pretty much to the minute), the servers became unresponsive again. This time I have additional errors in chronological order:
FROM DC1: The first two appear at 5 minute intervals
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1006
Date: 10/13/2008
Time: 2:47:37 PM
User: NT AUTHORITY\SYSTEM
Computer: DC1
Description:
Windows cannot bind to <FQDN> domain. (Timeout). Group Policy processing aborted.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
__________________________
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1030
Date: 10/13/2008
Time: 2:47:37 PM
User: NT AUTHORITY\SYSTEM
Computer: DC1
Description:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Warning
Event Source: W32Time
Event Category: None
Event ID: 36
Date: 10/13/2008
Time: 2:48:04 PM
User: N/A
Computer: DC1
Description:
The time service has not synchronized the system time for 86400 seconds because none of the time service providers provided a usable time stamp. The time service is no longer synchronized and cannot provide the time to other clients or update the system clock. Monitor the system events displayed in the Event Viewer to make sure that a more serious problem does not exist.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
__________________________
Event Type: Error
Event Source: BCAAA
Event Category: (1)
Event ID: 2200
Date: 10/13/2008
Time: 2:56:26 PM
User: N/A
Computer: DC1
Description:
[1672:1876] Cannot query domain controller <IP ADDRESS for DC1>; status=64:0x40:The specified network name is no longer available.
etc.
__________________________
At 2:54 p.m. I begin to get the following DNS errors:
Event Type: Error
Event Source: DNS
Event Category: None
Event ID: 4016
Date: 10/13/2008
Time: 2:54:42 PM
User: N/A
Computer: DC1
Description:
The DNS server timed out attempting an Active Directory service operation on DC=205,DC=15.21.140.in-addr.arpa,cn=MicrosoftDNS,cn=System,DC=Domain,DC=ParentDomain,DC=ParentDomain,DC=ParentDomain,DC=MIL. Check Active Directory to see that it is functioning properly. The event data contains the error.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 55 00 00 00 U...
On DC2, I begin to get the KDC and Valicert errors I'd mentioned previously at 2:51 p.m. DNS errors start at 2:58 p.m.
I rebooted DC1 at 3:01 p.m. and DC2 immediately after as it was completely unresponsive.
In going back and looking at REPLMON logs for DC2, everything appears to be replicating with DC1 with the exception of the Schema which did not attempt to replicate with DC2 for almost the last two hours. Config was due to replicate at 2:44 and did not as well.
The DC partition was due to replicate @ 2:59. By that time, everything had fallen apart.
All partitions on DC1 were due to replicate at 2:58. Again, that's about the time everything froze.
Upon rebooting the DCs, the only errors I got were on DC2:
Event Type: Warning
Event Source: NETLOGON
Event Category: None
Event ID: 3096
Date: 10/13/2008
Time: 3:29:35 PM
User: N/A
Computer: DC2
Description:
The primary Domain Controller for this domain could not be located.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
__________________________
Event Type: Warning
Event Source: LSASRV
Event Category: SPNEGO (Negotiator)
Event ID: 40960
Date: 10/13/2008
Time: 3:29:49 PM
User: N/A
Computer: DC2
Description:
The Security System detected an authentication error for the server cifs/<DC2 IP address>. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
(0xc000005e)".
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0 ^..À
__________________________
Event Type: Warning
Event Source: LSASRV
Event Category: SPNEGO (Negotiator)
Event ID: 40960
Date: 10/13/2008
Time: 3:29:51 PM
User: N/A
Computer: DC2
Description:
The Security System detected an authentication error for the server ldap/DC2.FQDN. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
(0xc000005e)".
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0 ^..À
__________________________
Event Type: Warning
Event Source: LSASRV
Event Category: SPNEGO (Negotiator)
Event ID: 40960
Date: 10/13/2008
Time: 3:29:52 PM
User: N/A
Computer: DC2
Description:
The Security System detected an authentication error for the server LDAP/DC2. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
(0xc000005e)".
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0 ^..À
__________________________
REPLMON shows everything authenticating properly at the moment, which puts me back at square 1.
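Because the events above come from two servers and several logs, it can help to merge them into one timeline. This is a small Python sketch that sorts a handful of them by timestamp; the EVENTS tuples are copied by hand from the events posted above, and the tuple layout and the build_timeline name are assumptions made for illustration.

```python
from datetime import datetime

# (computer, source, event_id, date, time), copied from the events above.
EVENTS = [
    ("DC1", "DNS", 4016, "10/13/2008", "2:54:42 PM"),
    ("DC1", "Userenv", 1006, "10/13/2008", "2:47:37 PM"),
    ("DC1", "BCAAA", 2200, "10/13/2008", "2:56:26 PM"),
    ("DC1", "W32Time", 36, "10/13/2008", "2:48:04 PM"),
    ("DC2", "NETLOGON", 3096, "10/13/2008", "3:29:35 PM"),
]


def build_timeline(events):
    """Return (timestamp, computer, source, event_id) rows, oldest first."""
    stamped = []
    for computer, source, event_id, date, time in events:
        # Event Viewer text uses month/day/year and a 12-hour clock.
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")
        stamped.append((when, computer, source, event_id))
    return sorted(stamped, key=lambda row: row[0])


timeline = build_timeline(EVENTS)
for when, computer, source, event_id in timeline:
    print(f"{when:%H:%M:%S}  {computer}  {source} {event_id}")
```

Sorted this way, the Userenv 1006 bind timeout on DC1 comes first, ahead of the DNS 4016 and BCAAA 2200 errors, which matches the order described in the post.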
Consider this:
You may have a multihomed domain controller. A multihomed domain controller is simply defined as a domain controller with multiple IPs. This could mean two or more IPs on the same NIC, Two or more NICs, or a NIC with VPN connection to the outside world.
So, some services may bind or be redirected to the wrong network binding. In fact, they could be bound or redirected to the outside network binding. That binding may then produce the errors you are seeing, because the outside binding may not know how to get back to the client.
__________________________
I am in the process of bringing together advice from others on how to configure a multihomed domain controller so there is NO error in the path of communications: (So far, this is what I have come up with)
There are a couple of "transports" or "protocols" or whatever you want to call them, that the DC uses to communicate with other machines on the domain and to the outside world:
1) DNS
2) DHCP
3) Netbios
(((DNS)))
To prevent DNS from binding to the outside NIC or IP address, there are three things you will need to do. First, prevent it from registering the SRV records in DNS. Second, clean out of DNS any SRV records for the outside NIC. Third, stop that outside NIC from registering with DNS.
Step 1) To resolve these issues, follow this link: (NOTE: By default, 2003 server registers both NICs' SRV records in DNS)
-- http://support.microsoft.com/?id=832478
Step 2) Once you prevent both SRV records from registering in DNS when the netlogon service restarts, you need to prevent the NIC from registering its DNS records in DNS. To do this, go to the NIC configuration>>TCP/IP properties>>Advanced button>>DNS tab and disable the ability of the NIC to register its DNS settings in DNS.
Step 3) Once you have disabled the ability to register that outside NIC's DNS address, you must remove all HOST A, SRV, and cached records of that outside NIC. I assume you already know how to remove HOST A records. To remove the DNS cache, go to the command prompt and type ipconfig /flushdns. To remove the SRV records, please follow the advice on this link:
http://support.microsoft.com/kb/241515
(((DHCP:)))
DHCP may try to provide DHCP to all network bindings. This could be a VPN or second NIC to the outside world. You can prevent it from providing DHCP to any binding by following these simple steps:
DHCP snapin>>right click the server in question>>Select properties>>select the Advanced tab>>select binding
You can disable any binding from providing DHCP
(((NETBIOS)))
Preventing NetBIOS is a little more difficult on some types of multihomed domain controllers. A DC does not always use WINS when dealing with NetBIOS, so this is a bit more involved.
To prevent NetBIOS from binding to the outside binding or VPN connection binding, you must go to that binding and remove its ability to do "Netbios over TCP/IP" or "Netbios over DHCP".
For a VPN connection and Dual NICs:
Right click "My network Places">>select "properties">>right click "VPN connection" or the Second NIC>>Select "Properties" >>Select "TCP/IP">> Go to Properties>>Go to the "WINS" Tab>> and prevent it from providing "Netbios over TCP/IP" and also prevent it from performing "Netbios over DHCP"
Disabling File and Print sharing:
You may also wish to disable your outside NIC from broadcasting out your files and printers to the outside world. To do this, disable File and print sharing.
(((Default Gateway)))
Other things to look out for:
You should have one single gateway for your multihomed NICs. If you are routing through your server, the outside NIC should be the one with a gateway configured. If the second NIC only exists to communicate with a few nodes on the network, your domain-side NIC should have the gateway configured. So, this is domain specific.
__________________________
With that said, the problems you are seeing:
(40960: SPNEGO)
This comes from the inability to propagate the SRV records in DNS. In fact, all of your errors come from the inability to do a DNS resolution to the logon server.
https://www.experts-exchange.com/questions/23356031/There-are-currently-no-logon-servers-available-to-service-the-logon-request.html
ASKER
Belay my last while I put my DUMBASS hat on. While doing some troubleshooting, I at least ADDED to any DNS issue I might have had. While undoing some changes and attempting to point the Alternate DNS IP address on DC2 to DC1, I fat fingered part of the IP address... Dammit... Dammit... Dammit... (this was Friday).
Now it's time to go to work on my TIME issue and see what happens next and see if I still have the server locking up issue...
God... I can't wait until leave... I need a break.
I'll keep you posted...
The fat finger syndrome is common.
ASKER
I've got to talk to our firewall folks and see if port 123 is blocked. I can't sync time right now and it's a gov't system so I'm not at liberty to put a 3rd party product on it. That being said, as far as I know, time sync has never been configured on DC1. I believe everything syncs with DC1 okay though...
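For anyone following along who wants to confirm whether NTP traffic (UDP port 123) is really being dropped: the built-in route on a 2003 box is w32tm /stripchart /computer:<server>. Where running a script is permitted, the same kind of probe can be sketched in Python. This is only an illustration under assumptions: the function names are made up for the example, and the address in the usage note is a placeholder, not anything from this network.

```python
import socket
import struct
from typing import Optional

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2208988800


def build_sntp_request() -> bytes:
    """Return a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(reply: bytes) -> float:
    """Extract the server's transmit timestamp as Unix time."""
    if len(reply) < 48:
        raise ValueError("reply is shorter than an NTP header")
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_UNIX_OFFSET + fraction / 2**32


def query_time(server: str, timeout: float = 3.0) -> Optional[float]:
    """Ask `server` for the time; return None if nothing comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_sntp_request(), (server, 123))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and unreachable hosts both land here.
            return None
    return parse_transmit_time(reply)
```

Calling query_time("192.0.2.10") against the router or an upstream NTP server would return a timestamp if port 123 is open along the path, and None if the request times out. A None result cannot tell a blocked port apart from a server that simply does not answer NTP.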
ASKER
I'm pretty sure my problem is TIME. Port 123 is blocked. Once I corrected some other errors, I noticed that I started getting KDC and Valicert errors within 2 minutes after I get:
Event Type: Warning
Event Source: W32Time
Event Category: None
Event ID: 36
Date: 10/15/2008
Time: 12:09:12 PM
User: N/A
Computer: TACMDC1
Description:
The time service has not synchronized the system time for 86400 seconds because none of the time service providers provided a usable time stamp. The time service is no longer synchronized and cannot provide the time to other clients or update the system clock. Monitor the system events displayed in the Event Viewer to make sure that a more serious problem does not exist.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
__________________________
I didn't stop to think that I start seeing these errors 5 or 6 hours before the DCs bug out entirely.
How do I go about pointing DC1 to the router to get time? Do I simply set the NTP server to the router's IP address? Or do I need them to open port 123 back up and go out the conventional way to an NTP server?
ASKER
I think it's fixed.... There was a gpo preventing the time from synching. I'm not sure when or why... Then again, there's way too many fingers in this pie sometimes.... I'm going to leave this open for 48 hours or so to know for sure it's working.
ASKER
Final question....hopefully.... DC2 is synching time with DC1... Do I need to do a GPO for the client workstations to synch time as well?
Do I need to do a GPO for the client workstations to synch time as well?
No, as stated above. By default, domain members synchronize their time through the domain hierarchy, which leads back to the 2003 server holding the PDC emulator (PDCe) role. GPOs override these default settings. So, it is best to disable the GPOs for all time synchronization. Then, it will happen naturally.
ASKER
Looks like you may be on to something Chief.... Now DC1 totally freezes up and LSASS maxes out the CPU.... After 10 minutes or so, DC2 does the same thing.
ASKER
The time sync issue is fixed and DNS is unhosed....I'm finding Poolmon confusing though. Am I reading the following right in assuming that SavE may be my culprit in Paged?
Memory: 4193264K Avail: 3575340K PageFlts: 138 InRam Krnl: 3360K P:88620K
Commit: 469436K Limit:6117048K Peak: 672028K Pool N:42932K P:89500K
System pool information
Tag Type Allocs Frees Diff Bytes Per Alloc
SavE Paged 665625 ( 0) 665004 ( 0) 621 55099936 ( 0) 88727
MmSt Paged 5585 ( 0) 1299 ( 0) 4286 8144392 ( 0) 1900
R100 Paged 47 ( 0) 2 ( 0) 45 5461800 ( 0) 121373
CM35 Paged 50 ( 0) 8 ( 0) 42 2818048 ( 0) 67096
Ntff Paged 2200 ( 0) 382 ( 0) 1818 1483488 ( 0) 816
NtfF Paged 1998 ( 0) 677 ( 0) 1321 1236456 ( 0) 936
SACC Paged 250 ( 0) 0 ( 0) 250 1008968 ( 0) 4035
TSdd Paged 1218 ( 0) 1193 ( 0) 25 897632 ( 0) 35905
AfdX Paged 10004 ( 3) 7139 ( 3) 2865 802200 ( 0) 280
Gh15 Paged 17162 ( 20) 17025 ( 20) 137 750032 ( 0) 5474
CMAl Paged 381 ( 0) 209 ( 0) 172 704512 ( 0) 4096
Ttfd Paged 1411 ( 0) 659 ( 0) 752 678984 ( 0) 902
Wmit Paged 13 ( 0) 2 ( 0) 11 655688 ( 0) 59608
Gh05 Paged 6882 ( 0) 6796 ( 0) 86 642896 ( 0) 7475
IoNm Paged 366568 ( 9) 360577 ( 9) 5991 614080 ( 0) 102
Gla1 Paged 481 ( 0) 200 ( 0) 281 579984 ( 0) 2064
TSwd Paged 18 ( 0) 8 ( 0) 10 425800 ( 0) 42580
Obtb Paged 319 ( 0) 158 ( 0) 161 414480 ( 0) 2574
CM16 Paged 82 ( 0) 1 ( 0) 81 344064 ( 0) 4247
SAV Paged 327848 ( 6) 327292 ( 6) 556 320848 ( 0) 577
FSim Paged 2503 ( 0) 246 ( 0) 2257 288896 ( 0) 128
Gcac Paged 53 ( 0) 4 ( 0) 49 268952 ( 0) 5488
ArbA Paged 60 ( 0) 0 ( 0) 60 245760 ( 0) 4096
FSrm Paged 512 ( 0) 363 ( 0) 149 221560 ( 0) 1486
CMVa Paged 87742 ( 0) 84143 ( 0) 3599 217072 ( 0) 60
NtFs Paged 32866 ( 0) 29346 ( 0) 3520 198872 ( 0) 56
CM25 Paged 729 ( 0) 717 ( 0) 12 180224 ( 0) 15018
NtFB Paged 185 ( 0) 170 ( 0) 15 179624 ( 0) 11974
CMDa Paged 19117 ( 0) 17583 ( 0) 1534 167288 ( 0) 109
Toke Paged 50097 ( 49) 49860 ( 51) 237 165000 ( -1392) 696
MmSm Paged 2880 ( 0) 403 ( 0) 2477 158528 ( 0) 64
Ntfo Paged 6049 ( 0) 4752 ( 0) 1297 155816 ( 0) 120
CM39 Paged 648 ( 0) 144 ( 0) 504 145344 ( 0) 288
NtFS Paged 2781 ( 0) 2231 ( 0) 550 142936 ( 0) 259
NtFf Paged 18 ( 0) 8 ( 0) 10 131360 ( 0) 13136
LfsI Paged 2 ( 0) 0 ( 0) 2 131072 ( 0) 65536
CM29 Paged 15 ( 0) 0 ( 0) 15 122880 ( 0) 8192
Gla5 Paged 671 ( 0) 358 ( 0) 313 122696 ( 0) 392
Key Paged 295026 ( 13) 293873 ( 13) 1153 119864 ( 0) 103
WmIS Paged 1 ( 0) 0 ( 0) 1 118784 ( 0) 118784
Bmfd Paged 65 ( 0) 0 ( 0) 65 116624 ( 0) 1794
Ntfc Paged 2366 ( 0) 791 ( 0) 1575 113400 ( 0) 72
Gla: Paged 343 ( 0) 186 ( 0) 157 102992 ( 0) 656
Port Paged 2005 ( 0) 1480 ( 0) 525 97512 ( 0) 185
Ntf0 Paged 9293 ( 0) 6260 ( 0) 3033 97224 ( 0) 32
ObHd Paged 10346 ( 3) 7657 ( 3) 2689 87232 ( 0) 32
Ghab Paged 770 ( 0) 0 ( 0) 770 86240 ( 0) 112
CM17 Paged 10 ( 0) 0 ( 0) 10 81920 ( 0) 8192
Or is it MmCm under NonPaged?
Memory: 4193264K Avail: 3574880K PageFlts: 198 InRam Krnl: 3360K P:88592K
Commit: 469348K Limit:6117048K Peak: 672028K Pool N:43044K P:89476K
System pool information
Tag Type Allocs Frees Diff Bytes Per Alloc
MmCm Nonp 1913 ( 0) 802 ( 0) 1111 16760984 ( 0) 15086
Irp Nonp 919429 ( 35) 908414 ( 31) 11015 4816872 ( 3000) 437
Mdl Nonp 41375 ( 5) 7837 ( 9) 33538 4354096 ( -512) 129
LSwi Nonp 1 ( 0) 0 ( 0) 1 2576384 ( 0) 2576384
File Nonp 361118 ( 169) 350142 ( 180) 10976 1677488 ( -1672) 152
TCPt Nonp 5358 ( 0) 5327 ( 0) 31 1458096 ( 0) 47035
TPLA Nonp 256 ( 0) 0 ( 0) 256 1048576 ( 0) 4096
TCPA Nonp 3309 ( 1) 752 ( 3) 2557 940976 ( -736) 368
AfdE Nonp 10297 ( 10) 7428 ( 16) 2869 803320 ( -1680) 280
Thre Nonp 4930 ( 4) 4206 ( 5) 724 451776 ( -624) 624
brcm Nonp 15 ( 0) 2 ( 0) 13 434176 ( 0) 33398
LSwr Nonp 128 ( 0) 0 ( 0) 128 416768 ( 0) 3256
Mdl Nonp 41375 ( 5) 7837 ( 9) **33538** 4354096 ( -512) 129 <<<---What you are looking for:
LSwi Nonp 1 ( 0) 0 ( 0) 1 2576384 ( 0) 2576384
File Nonp 361118 ( 169) 350142 ( 180) 10976 1677488 ( -1672) 152<<<--possible memory leak.
See how the difference between allocations and frees is 33538. When thinking of this, think of a memory block. That block is allocated to hold data while it is being worked on, and it is supposed to be freed once the work is done. So, if blocks are not being freed as often as they are allocated, the pool fills up and becomes unusable for more data after a while (let's say 24-36 hours in this case).
In the first example, the memory pool is allocated 41375 times, but only freed 7837 times. If that keeps up, usage will grow past the size of the pool and you will get a STOP error.
The second one, with a difference of 10976, is certainly something to watch out for. If the difference between allocations and frees continues to grow, this is a second memory leak.
NOW, what to do:
This tag (((((((Mdl Nonp)))))) represents a program that is struggling with nonpaged pool memory. It is not freeing the memory block as often as it is being used. In the link I provided above, there is a command line that you can type at the command prompt to associate the TAG with the program. Do you see it? Run that command line and find out what program it is. Post your results on this page.
You might consider running that for the other tag (((( File Nonp ))))) while we are doing this. It might be a related process, and one memory leak fix may fix the second issue.
For clarification, you could sort your POOLMON results by the difference between allocations and frees. The tags whose differences grow and grow are the memory leaks.
I hope this makes sense.
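The sort-by-difference idea can be sketched in a few lines of Python when working from a saved text snapshot. This is only a rough illustration, not a replacement for Poolmon's own sort keys. It assumes the column layout of the snapshots pasted in this thread, and the function names are made up for the example.

```python
import re

# One Poolmon row, for example:
# "Mdl Nonp 41375 ( 5) 7837 ( 9) 33538 4354096 ( -512) 129"
ROW = re.compile(
    r"^(?P<tag>\S+)\s+(?P<type>Nonp|Paged)\s+"
    r"(?P<allocs>\d+)\s+\(\s*-?\d+\)\s+"
    r"(?P<frees>\d+)\s+\(\s*-?\d+\)\s+"
    r"(?P<diff>\d+)\s+(?P<bytes>\d+)\s+\(\s*-?\d+\)\s+\d+\s*$"
)


def parse_rows(text):
    """Turn pasted Poolmon text into a list of dicts, skipping header lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        rows.append({
            "tag": match.group("tag"),
            "type": match.group("type"),
            "allocs": int(match.group("allocs")),
            "frees": int(match.group("frees")),
            "diff": int(match.group("diff")),
            "bytes": int(match.group("bytes")),
        })
    return rows


def sort_by_diff(rows):
    """Largest number of outstanding allocations first."""
    return sorted(rows, key=lambda row: row["diff"], reverse=True)


# Rows copied from the NonPaged snapshot posted earlier in the thread.
SAMPLE = """\
Tag Type Allocs Frees Diff Bytes Per Alloc
MmCm Nonp 1913 ( 0) 802 ( 0) 1111 16760984 ( 0) 15086
Irp Nonp 919429 ( 35) 908414 ( 31) 11015 4816872 ( 3000) 437
Mdl Nonp 41375 ( 5) 7837 ( 9) 33538 4354096 ( -512) 129
LSwi Nonp 1 ( 0) 0 ( 0) 1 2576384 ( 0) 2576384
File Nonp 361118 ( 169) 350142 ( 180) 10976 1677488 ( -1672) 152
"""

if __name__ == "__main__":
    for row in sort_by_diff(parse_rows(SAMPLE)):
        print(row["tag"], row["type"], row["diff"], row["bytes"])
```

With the sample rows, Mdl comes out on top, followed by Irp and File. A large difference in a single snapshot is only a hint; what marks a leak is a difference that keeps growing from one snapshot to the next.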
ASKER
I seem to be getting my butt kicked on the cmd line: if I type findstr /l /m Mdl *.sys, I get "FINDSTR: Cannot open c:\pagefile.sys". If I type findstr /s /m Mdl c:\*.sys, I get what looks like every .sys file on my system.
Same thing goes for File.
I've attached the output txt files for your perusal.
Memory: 4193264K Avail: 3548052K PageFlts: 78 InRam Krnl: 3396K P:91076K
Commit: 479312K Limit:6117048K Peak: 925376K Pool N:45404K P:91992K
System pool information
Tag Type Allocs Frees Diff Bytes Per Alloc
MmCm Nonp 2558 ( 0) 1447 ( 0) 1111 16760984 ( 0) 15086
Irp Nonp 937682 ( 4) 926692 ( 2) 10990 5185784 ( 1024) 471
Mdl Nonp 53102 ( 18) 19488 ( 14) 33614 4363824 ( 552) 129
LSwi Nonp 1 ( 0) 0 ( 0) 1 2576384 ( 0) 2576384
File Nonp 1289013 ( 53) 1277122 ( 44) 11891 1819752 ( 1368) 153
TCPt Nonp 9522 ( 6) 9491 ( 6) 31 1458096 ( 0) 47035
TPLA Nonp 256 ( 0) 0 ( 0) 256 1048576 ( 0) 4096
TCPA Nonp 5040 ( 2) 2480 ( 1) 2560 942080 ( 368) 368
AfdE Nonp 18737 ( 5) 15877 ( 2) 2860 800800 ( 840) 280
Thre Nonp 14644 ( 5) 13926 ( 4) 718 448032 ( 624) 624
Ntfr Nonp 7214 ( 0) 430 ( 0) 6784 435144 ( 0) 64
brcm Nonp 15 ( 0) 2 ( 0) 13 434176 ( 0) 33398
MmCa Nonp 27893 ( 0) 24085 ( 0) 3808 420192 ( 0) 110
Paged:
Memory: 4193264K Avail: 3558504K PageFlts: 418 InRam Krnl: 3396K P:91288K
Commit: 481024K Limit:6117048K Peak: 925376K Pool N:45428K P:92184K
System pool information
Tag Type Allocs Frees Diff Bytes Per Alloc
SavE Paged 1130416 ( 0) 1129795 ( 0) 621 55099936 ( 0) 88727
MmSt Paged 7420 ( 1) 2159 ( 1) 5261 9370592 ( 0) 1781
R100 Paged 47 ( 0) 2 ( 0) 45 5461800 ( 0) 121373
CM35 Paged 50 ( 0) 8 ( 0) 42 2818048 ( 0) 67096
Ntff Paged 3494 ( 0) 920 ( 0) 2574 2100384 ( 0) 816
NtfF Paged 22356 ( 0) 20773 ( 0) 1583 1481688 ( 0) 936
SACC Paged 250 ( 0) 0 ( 0) 250 1008968 ( 0) 4035
TSdd Paged 2732 ( 14) 2708 ( 14) 24 897584 ( 0) 37399
Gh15 Paged 50802 ( 100) 50629 ( 100) 173 842656 ( 0) 4870
AfdX Paged 20123 ( 5) 17272 ( 10) 2851 798280 ( -1400) 280
IoNm Paged 1808353 ( 43) 1801629 ( 49) 6724 711424 ( -448) 105
CMAl Paged 468 ( 0) 302 ( 0) 166 679936 ( 0) 4096
Ttfd Paged 2134 ( 0) 1381 ( 0) 753 679392 ( 0) 902
Wmit Paged 13 ( 0) 2 ( 0) 11 655688 ( 0) 59608
Gh05 Paged 6882 ( 0) 6796 ( 0) 86 642896 ( 0) 7475
Gla1 Paged 795 ( 0) 499 ( 0) 296 610944 ( 0) 2064
TSwd Paged 35 ( 0) 25 ( 0) 10 425800 ( 0) 42580
Obtb Paged 471 ( 0) 310 ( 0) 161 414480 ( 0) 2574
FSim Paged 3340 ( 0) 246 ( 0) 3094 396032 ( 0) 128
CM16 Paged 82 ( 0) 1 ( 0) 81 344064 ( 0) 4247
Mdl.txt
file.txt
ASKER
Chief - I'm sure you're right. After not having any luck determining where Mdl is coming from, I've started anew. I increased the size of my paging file by reducing it to 65MB on the C: partition (for dump) and putting a 6GB file on the F: partition (I'm running Enterprise 32-bit with 4GB of RAM). When rebooting, I noticed that I went from getting TWO of the following errors to ONE (not sure if it's just a fluke):
Event Type: Warning
Event Source: Server
Event Category: None
Event ID: 2510
Date: 10/10/2008
Time: 5:25:15 PM
User: N/A
Computer: DC1
Description:
The server service was unable to map error code 998.
(MICROSOFT has a hot fix for this that does NOT apply to me because I'm on a 32-bit OS.)
__________________________
After 5 hours sleep last night, I spent the morning trying to figure out PsList to no avail. I've come back to Poolmon and am doing the following in accordance with http://technet.microsoft.com/en-us/library/cc736362.aspx:
This example outlines a procedure for using Poolmon to detect a memory leak.
1) Start Poolmon in default mode (no additional parameters).
2) Press P twice to display allocations from only the paged pool. (The P key toggles the display between paged, non-paged, and both.)
3) Press B to sort the Bytes column in descending order.
4) Let Poolmon run for a few hours. Because starting Poolmon changes the data, you must let it run until it reaches a steady state before the data is reliable.
5) Save the information generated by Poolmon, either as a screenshot, or by copying it from the command window and pasting it into Notepad.
6) Returning to Poolmon, press P twice again, this time to display only allocations from the non-paged pool.
7) Repeat steps 3, 5 and 6 approximately every half-hour for at least two hours.
8) When data collection is complete, examine the Diff (allocations minus frees) and Bytes (number of bytes allocated minus number of bytes freed) values for each tag, and note any that continually increase. Next, stop Poolmon, wait for a few hours, and then restart Poolmon. Examine the allocations that were increasing, and determine whether the bytes are now freed. Allocations that have still not been freed, or have continued to increase in size are the likely culprits.
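The last step of that procedure, noting which tags continually increase, can be sketched in Python once the Diff values have been copied out of each saved snapshot. This is a minimal illustration with a made-up function name. The numbers are the NonPaged Diff values from the two snapshots posted earlier in this thread, assuming both were taken during the same uptime.

```python
def growing_tags(snapshots):
    """Return (tag, growth) pairs for tags whose Diff rose in every snapshot.

    `snapshots` is a list of dicts mapping pool tag to its Diff value,
    ordered from oldest to newest.
    """
    if len(snapshots) < 2:
        return []
    # Only tags present in every snapshot can be compared.
    common = set(snapshots[0])
    for snap in snapshots[1:]:
        common &= set(snap)
    result = []
    for tag in common:
        series = [snap[tag] for snap in snapshots]
        if all(later > earlier for earlier, later in zip(series, series[1:])):
            result.append((tag, series[-1] - series[0]))
    # Biggest overall growth first.
    return sorted(result, key=lambda item: item[1], reverse=True)


# Diff column, NonPaged pool, from the two snapshots posted in this thread.
EARLIER = {"MmCm": 1111, "Irp": 11015, "Mdl": 33538, "File": 10976}
LATER = {"MmCm": 1111, "Irp": 10990, "Mdl": 33614, "File": 11891}

if __name__ == "__main__":
    for tag, growth in growing_tags([EARLIER, LATER]):
        print(tag, growth)
```

Two snapshots are far too few to call anything a leak. The procedure above asks for readings every half hour over at least two hours, and the same function accepts a longer list.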
__________________________
Like you, I'm already pretty sure that Mdl is my culprit. Unfortunately, when I try C:\findstr /l /m Mdl *.sys, the only response I get is "findstr: cannot open c:\*.sys".
I'm at my wit's end trying to find out what Mdl belongs to.
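For reference, findstr /m prints only the names of files that contain a match, /l treats the search string as literal text, and /s recurses into subdirectories. As far as I know, the usual approach is to run the search from the drivers folder (%SystemRoot%\System32\drivers) rather than from the root of C:, where pagefile.sys is locked. The same search can be sketched in Python, skipping files that cannot be opened. The function name is made up for the example.

```python
import os


def drivers_containing_tag(directory, tag):
    """List the .sys files in `directory` whose bytes contain `tag`."""
    needle = tag.encode("ascii")
    hits = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".sys"):
            continue
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as handle:
                data = handle.read()
        except OSError:
            # Locked or unreadable files (pagefile.sys, for instance) are skipped.
            continue
        if needle in data:
            hits.append(name)
    return hits


if __name__ == "__main__":
    root = os.environ.get("SystemRoot", r"C:\Windows")
    drivers = os.path.join(root, "System32", "drivers")
    if os.path.isdir(drivers):
        for name in drivers_containing_tag(drivers, "Mdl"):
            print(name)
```

A short tag such as Mdl will probably also match ordinary strings inside many drivers, for example imported function names like IoAllocateMdl, which may be why the recursive search appeared to list every .sys file. A hit therefore only narrows the field; it does not prove which driver owns the allocations.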
ASKER
Okay...I ran 4 Poolmon reports this afternoon and two this evening (attached txt files). If I'm looking at them right, Mdl is actually staying steady although File has increased a little bit. Under Paged, MmSt, NtfF, and IoNm appear to be increasing pretty steadily. NtfF is part of NTFS.SYS. I think MmSt is part of the Memory Manager that tries to trim allocated paged pool memory when the system reaches 80 percent of the total paged pool (I may be wrong on that). Please look the txt files over and tell me if I'm on the right track.
On a side note, does increasing the Virtual Memory (Page File) have any effect on memory leaks? I'm thinking not, but haven't had much sleep lately.
Thanks again for everything.
2pmPaged.txt
230pm-NonPaged.txt
3pmPaged.txt
330pm-NonPaged.txt
730pm-NonPaged.txt
730pm-Paged.txt
MGio4:
The only memory leak I am really good at has nothing to do with computers (LOL). I think this is going to require expertise beyond the scope of my abilities. So, I am requesting a little bit of help. I think, for sure, you are onto the memory leaks. I am going to get someone to help us knock this puppy out.
From what I have seen, there is one expert who is exceptional at this. He goes by the screen name of Placebo and I think recently changed the screen name to placebo69. That doesn't mean others can't provide you with the knowledge to fix it. So, let's see who can pop by and help out. I'll see if I can hunt down placebo.
ASKER
Much appreciated.... I've got 2 weeks leave depending on this. I'm sorry the replies are sometimes slow.... I'm GMT +3. I'm here every day though, and I'm here for a few more hours tonight. Unless somebody with a plan pops in, and then I'm here as long as I need to be.
ASKER
Would adding the /3GB switch to boot.ini help?
As I understand that switch:
On 32-bit Windows, the 4 GB virtual address space is divided in half: 2 GB for user-mode processes and 2 GB for the kernel. With the /3GB switch, user mode gets 3 GB and the kernel is cut down to 1 GB. So I can't see how this would help, since your kernel memory is the one having the issue. If anything, the switch leaves the kernel with less room.
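To make the arithmetic concrete, here is a minimal sketch in Python. It is purely illustrative and not something to run against the DCs; the function name `address_space_split` is made up for this example, and it assumes a 32-bit system where the /3GB switch moves exactly 1 GB from the kernel side to the user side.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def address_space_split(use_3gb_switch: bool) -> dict:
    """Return the user/kernel split of a 32-bit virtual address space, in GiB.

    Hypothetical helper for illustration only: it models the default
    2 GB / 2 GB split and the 3 GB / 1 GB split described above.
    """
    total = 2 ** 32  # a 32-bit pointer can address 4 GiB
    user = 3 * GIB if use_3gb_switch else 2 * GIB
    kernel = total - user  # whatever user mode does not get is left for the kernel
    return {"user_gib": user // GIB, "kernel_gib": kernel // GIB}


if __name__ == "__main__":
    print("default:", address_space_split(False))
    print("/3GB   :", address_space_split(True))
```

The point of the sketch is only that the switch takes address space away from the kernel, which is the side that appears to be running short here.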
ASKER
I've got a ticket in with MS at this point. I'm going to keep this open for a few days. If they fix it, I'll post the results. In the meantime, I'm still digging. I can't talk to MS until Monday (afternoon for me).
ASKER
MS thinks it's a known bug, but is going to finish verifying the user dumps and other data I've sent before they give me a hotfix. I'll repost and close when I know.
ASKER
Microsoft determined that Tumbleweed was the culprit. I contacted Tumbleweed and they said that it's a known issue with DV 4.9.0 and 4.91. Apparently, something within the OS triggers this at random (although lately it hasn't been at random). Anyway, supposedly DV 4.9.2 resolves the issue. I'm downloading it now and will configure it tomorrow. I knew that Tumbleweed was involved, because I could disable it for a moment and LSASS memory usage would decline. I was confused because it had worked for so long and we didn't have any issues until we went to AD-integrated DNS.
Other issues got fixed in the process though, so it's a good thing.
Thanks again for the insight.
This is because Tumbleweed uses NTLM hashes, and SP2 denies saving and authenticating with NTLM hash authentication.
We have been right all along. LOL
ASKER
Looks that way.... Thanks again!
ASKER
Looks that way.... It just chose to rear its ugly head while I was working on other things. Isn't that usually the case though?
Take care!