Solved

Windows 2003 SP2 Domain Controllers become unresponsive until reboot

Posted on 2008-10-08
Last Modified: 2012-08-13
Background:
While installing a new DC, because the SA I was replacing was on the wrong path, we discovered that DNS zones were not Active Directory Integrated. We changed zones to ADI and after discovering other issues, DEMOTED the new DC and unpublished the DC root cert for it.  

Our network consists of the following:
DC1 - Windows 2003 Server Enterprise w/SP2
DC2 / Exchange Server - Windows 2003 Server Enterprise w/SP2 / Exchange - Exchange 2003 w/SP3 (please stop laughing... it's not MY choice).
Both DC's have DNS installed.
Bluecoat Proxy
Users authenticate by CAC using Valicert Desktop Validator. All certs are downloaded and cached at 24 hour intervals.

Problem:
Network will run fine for several hours (24 - 36) with no errors being reported. Out of nowhere, one or both DC's will become completely unresponsive. Upon reboot, everything begins to run fine again for another 24-36 hours. In the course of troubleshooting, I've increased the size of my security logs and have them backed up and cleared well before they fill up in accordance with Microsoft kb316685. The issue was initially occurring every 24 hours or so. After increasing event log size, the uptime seemed to increase by 12 hours or so (this may be coincidental).

DC1 appears to become unable to find itself, at which point DC2 is usually the first to become unresponsive.

I've attached events (in chron. order) from when the issues seem to start (prior to lockup). We are a military network, so for security reasons, I have replaced the actual FQDN with <FQDN> and altered actual usernames and IP info.

Any and all help is greatly appreciated.

Event Type:	Error

Event Source:	Userenv

Event Category:	None

Event ID:	1006

Date:		10/8/2008

Time:		9:09:15 AM

User:		NT AUTHORITY\SYSTEM

Computer:	TACMDC1

Description:

Windows cannot bind to <FQDN> domain. (Timeout). Group Policy processing aborted. 
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

-----------------------------------------------------------

Event Type:	Error

Event Source:	Userenv

Event Category:	None

Event ID:	1030

Date:		10/8/2008

Time:		9:09:15 AM

User:		NT AUTHORITY\SYSTEM

Computer:	TACMDC1

Description:

Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

------------------------------------------------------------------------

Event Type:	Error

Event Source:	BCAAA

Event Category:	(1)

Event ID:	2200

Date:		10/8/2008

Time:		9:10:29 AM

User:		N/A

Computer:	TACMDC1

Description:

[1692:1992] Cannot query domain controller 137.12.5.1; status=64:0x40:The specified network name is no longer available.

-----------------------------------------------------------------------

Event Type:	Error

Event Source:	DNS

Event Category:	None

Event ID:	4016

Date:		10/8/2008

Time:		9:12:08 AM

User:		N/A

Computer:	TACMDC1

Description:

The DNS server timed out attempting an Active Directory service operation on DC=103,DC=5.12.137.in-addr.arpa,cn=MicrosoftDNS,cn=System,DC=DOMAIN,DC=IRAQ,DC=PARENTDOMAIN1,DC=PARENTDOMAIN2,DC=MIL.  Check Active Directory to see that it is functioning properly. The event data contains the error.
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Data:

0000: 55 00 00 00               U...    

-----------------------------------------------------------------------

Event Type:	Error

Event Source:	DNS

Event Category:	None

Event ID:	4016

Date:		10/8/2008

Time:		9:12:47 AM

User:		N/A

Computer:	TACMDC1

Description:

The DNS server timed out attempting an Active Directory service operation on ---.  Check Active Directory to see that it is functioning properly. The event data contains the error.
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Data:

0000: 55 00 00 00               U...    

-----------------------------------------------------------------------

Event Type:	Error

Event Source:	Userenv

Event Category:	None

Event ID:	1006

Date:		10/8/2008

Time:		9:14:15 AM

User:		NT AUTHORITY\SYSTEM

Computer:	TACMDC1

Description:

Windows cannot bind to FQDN domain. (Server Down). Group Policy processing aborted. 
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

----------------------------------------------------------------------

Event Type:	Error

Event Source:	Userenv

Event Category:	None

Event ID:	1030

Date:		10/8/2008

Time:		9:14:15 AM

User:		NT AUTHORITY\SYSTEM

Computer:	TACMDC1

Description:

Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

---------------------------------------------------------------------

Event Type:	Error

Event Source:	BCAAA

Event Category:	(1)

Event ID:	2200

Date:		10/8/2008

Time:		9:14:28 AM

User:		N/A

Computer:	TACMDC1

Description:

[1692:1992] Cannot query domain controller 137.12.5.1; status=64:0x40:The specified network name is no longer available.

--------------------------------------------------------------------
 

Event Type:	Warning

Event Source:	KDC

Event Category:	None

Event ID:	21

Date:		10/8/2008

Time:		9:14:39 AM

User:		N/A

Computer:	TACMDC1

Description:

The client certificate for the user DOMAIN\DOEJ is not valid, and resulted in a failed smartcard logon.  Please contact the user for more information about the certificate they're attempting to use for smartcard logon. The chain status was : The revocation function was unable to check revocation because the revocation server was offline.
 

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Data:

0000: 14 00 00 00 13 20 09 80   ..... .€

0008: 00 00 00 00 00 00 00 00   ........

---------------------------------------------------------------------

Event Type:	Error

Event Source:	Valicert Desktop Validator

Event Category:	None

Event ID:	1

Date:		10/8/2008

Time:		9:14:36 AM

User:		N/A

Computer:	TACMDC1

Description:

Certificate Revocation Status

Calling Application: lsass.exe

Certificate Name: /C=US/O=U.S. Government/OU=DoD/OU=PKI/OU=USA/CN=DOE.JOHN.David.123456789

Certificate Issuer: /C=US/O=U.S. Government/OU=DoD/OU=PKI/CN=DOD EMAIL CA-16

Certificate Serial Number: 1B8CC0

Revocation Status: Unable to verify

Validation Url: file://\\tacmdc1\crls$\emailca16.crl

Error: Memory allocation failure

Question by:MGio4
61 Comments
 
LVL 8

Expert Comment

by:sstone55423
ID: 22667140
Where are the FSMO roles and GC located? Are they all operating?
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22667309
Let's ask a few questions:

Are either of these servers multihomed domain controllers? Multihomed is defined as having two or more IP addresses. That could mean two IPs on the same NIC or two+ NICs.

Look in FRS event logs for any errors that are in the 13000's. Any there?

Have you noticed any DNS problems or intermittent internet connectivity during the "up time"?

Are you using imaged/cloned servers? This could break the trust or cause major problems if the servers ended up with the same SID.

From what I am seeing, this looks like a multihomed Domain server problem.
0
 

Author Comment

by:MGio4
ID: 22667578
No. There are two NICs on each DC, but one is disabled.
The only 13000-series messages I have are for the FRS service starting and telling me that FRS is no longer preventing the machine from becoming a DC (after reboot).
Have not noticed any DNS issues or irregularities with internet connectivity.
We are not using cloned servers. All server images are built from scratch.
There are other servers on the network that use teamed NICs.
0
 

Author Comment

by:MGio4
ID: 22667676
All FSMO roles are hosted on DC1 (TACMDC1), with the exception of Infrastructure. The GC is on DC1. All appear to be operating.
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22668917
Do you have Symantec AV installed?
0
 

Author Comment

by:MGio4
ID: 22669354
Yes... DC1 has 10.1.4. Second DC has 10.1.5 and Symantec Mail Security (5.0) For Exchange. These two DC's have had Symantec AV for at least 18 months.
0
 
LVL 59

Assisted Solution

by:Darius Ghassem
Darius Ghassem earned 60 total points
ID: 22669614
10.1.4 had a memory leak issue that was fixed in 10.1.5 if I remember right.
0
 

Author Comment

by:MGio4
ID: 22669779
I'll try that..... I'll keep you posted. It's odd because everything's been working fine until recently. Maybe ADI triggered it.
0
 

Author Comment

by:MGio4
ID: 22669802
It'll take a day or so to know if upgrading to 10.1.5 fixes it.
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22675453
I wonder if daylight savings time is putting you too far out of synch to authenticate with the server?
0
 

Author Comment

by:MGio4
ID: 22676449
I'm not sure what server you mean... All servers, including the AV server are on Arabic Standard Time.
I updated Symantec AV to 10.1.5 a few hours ago. I'm going to keep my fingers crossed for a couple of days and see if that does the trick. I am still open to suggestions though. I have a window of opportunity to take some leave in a couple of weeks. If I don't get this fixed beforehand, there'll be no leave. The next opportunity for leave will be in February or March.
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22677184
WOW, this is an authentication NIGHTMARE:
_______________________________________________________________________________
Event Type:      Error
Event Source:      BCAAA
Event Category:      (1)
Event ID:      2200
---Can not query Domain controller:

http://www.bluecoat.com/doc/direct/607

Blue Coat is used to authenticate with NTLMhash and be granted an NTLMhash access token from the domain controller. Fortunately, we are out of the stone ages and are currently using Kerberos authentication. NTLMhash has some very serious vulnerabilities that can be exploited by even an inexperienced hacker. It was used on pre-Windows 2000 PCs. If everything you have on the domain is 2000 Pro or newer, you should NOT be authenticating to the DC using NTLMhash. In fact, the DC should be throwing this back at you, as it will not grant you access. 2003 Server SP2 shut the door on backwards-compatible NTLMhash authentication.

For a description of LMHash, NTLMhash and Kerberos please see the following link:
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Windows_2003_Active_Directory/Q_23132123.html

With that said, things like Malware and Skype can use NTLM. They often don't resort to Kerberos because of the increased security:
http://forums.bluecoat.com/viewtopic.php?p=9499&sid=54d63d665cc6b94ef7df9c643e64da23
________________________________________________________________________________
Event Type:      Error
Event Source:      DNS
Event Category:      None
Event ID:      4016
--The DNS server timed out attempting an Active Directory service operation on ---.  Check Active Directory to see that it is functioning properly.

I am assuming you have AD integrated DNS and that is good. AD will not work if you are trying to authenticate using NTLMHash authentication for the above given reasons.
______________________________________________________________________________
Event Type:      Error
Event Source:      Userenv
Event Category:      None
Event ID:      1006
Date:            10/8/2008
Time:            9:14:15 AM
User:            NT AUTHORITY\SYSTEM
Computer:      TACMDC1
---Windows cannot bind to FQDN domain.

UserNV means User not valid: So, the remote procedure call (RPC) you are using to access domain services will throw you a bone saying you are not valid because you are trying to authenticate using NTLMhash.
____________________________________________________________________________
  Event Type:      Warning
Event Source:      KDC
Event Category:      None
Event ID:      21
Date:            10/8/2008
Time:            9:14:39 AM
User:            N/A

--The client certificate for the user DOMAIN\DOEJ is not valid, and resulted in a failed smartcard logon.  Please contact the user for more information about the certificate they're attempting to use for smartcard logon. The chain status was : The revocation function was unable to check revocation because the revocation server was offline.

Domain\DoeJ is trying to contact the KDC (Key Distribution Center) for verification. Kerberos will not validate this request, I believe, because it is using NTLMhash to try and authenticate with the domain controller.

____________________________________________________________________________
Event Type:      Error
Event Source:      Valicert Desktop Validator
Event Category:      None
Event ID:      1
Date:            10/8/2008
Time:            9:14:36 AM
User:            N/A
Computer:      TACMDC1
Description:
Certificate Revocation Status
Calling Application: lsass.exe

http://www.tumbleweed.com/news/press_releases/2005/2005-02-07.html

Tumbleweed is an encrypted protocol that uses X.509 PKI certs to validate your computer prior to communication between one computer and another. So, every 24 to 36 hours your computer is trying to communicate with another computer and probably replicate data between two DCs. It appears this is trying to replicate DNS zones and validate the Kerberos tickets. The cert cannot be verified to the remote computer, therefore you can't get a KDC ticket, and the smart card logon is knocked down.

______________________________________________________________________________
Conclusion:
It is my guess that you need to rid yourself of Blue Coat. In a Kerberos domain, it is not going to work.

Then, disable the domain controller's ability to be backwards compatible with NTLMhash for security reasons. You don't want your DC to be handing out access tokens to anything using NTLM authentication.
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Windows_2003_Active_Directory/Q_23132123.html

Furthermore, you need to update your PKI certs to the domain controller you are trying to replicate with on the remote site. This may require a call to your Tumbleweed vendor.
http://www.tumbleweed.com/news/press_releases/2005/2005-02-07.html

It is also my guess that you need to get hold of the computer that Domain\DoeJ is on and find out what in the world is using NTLM authentication. This looks like a backdoor attack using NTLM.
0
 

Author Comment

by:MGio4
ID: 22686029
Chief, I don't think the unit was experiencing these issues until they went AD Integrated. None of the errors I sent you appear until the DCs start to act up.

ChiefIT:  It is my guess that you need to rid yourself of Blue coat. In a kerberos domain, it is not going to work.

I agree with you about BlueCoat. Unfortunately, the military customer that I support seems to think it's the greatest proxy appliance since sliced bread, although no one here knows a damned thing about it. The wizard that installed it left 8 or 9 months ago. Ridding us of BlueCoat is going to take a few months, even if I can talk them into it.

ChiefIT:  Then, disable the domain controller's ability to be backwards compatible to NTLMhash for security reasons. You don't want your DC to be handing out access tokens to anything using NTLM authentication.

Agreed, but negated by the fact that the BlueCoat appliance will be here for a bit.

ChiefIT:  Furthermore, you need to update your PKI certs to the domain controller you are trying to replicate with on the remote site. This may require a call to your Tumbleweed vendor.

PKI certs are downloaded to the DC every night. We don't start experiencing the KDC and Valicert problems until we start losing the DC. Once we reboot, all issues are resolved.

ChiefIT: It is also my guess that you need to get ahold of that computer that Domain\DoeJ is on and find out what in the world is using NTLM authentication. This looks like a backdoor attack using NTLM.

This actually pertains to every user/machine that tries to log in while we're experiencing our issues. I only sent one error of each type. There were actually several.

When we start losing the DCs, LSASS.EXE pegs out at 99%. Tumbleweed uses LSASS.
0
 

Author Comment

by:MGio4
ID: 22688196
UPDATE: Yesterday morning (9:50 AST), I upgraded Symantec AV to 10.1.5 and rebooted DC1. Everything ran flawlessly (no error messages in the Event Viewer on anything) until 1:36 p.m. today. At 1:36, a handful of the previously mentioned KDC and Valicert error messages showed up for various users (no other errors), and users were able to log in only sporadically. After two minutes or so, everything returned to normal. At 3:19 p.m., the errors returned for a few minutes and disappeared again until 4:37. That cleared up too, and the cycle then repeated at 5 to 10 minute intervals until the errors became persistent and I rebooted at 5:20 p.m. The second DC had very few similar errors until the reboot of DC1, at which point DC2 became extremely slow and was rebooted as well. When DC1 rebooted, I received:

Event Type:      Warning
Event Source:      LSASRV
Event Category:      SPNEGO (Negotiator)
Event ID:      40960
Date:            10/10/2008
Time:            5:24:05 PM
User:            N/A
Computer:      DC1
Description:
The Security System detected an authentication error for the server cifs/DC2.<FQDN>.  The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
 (0xc000005e)".

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0               ^..À    
_________________________________________________
Event Type:      Error
Event Source:      Userenv
Event Category:      None
Event ID:      1058
Date:            10/10/2008
Time:            5:24:49 PM
User:            DOMAINNAME\username
Computer:      DC1
Description:
Windows cannot access the file gpt.ini for GPO cn={2910DB65-ED86-477E-9081-9E7A8A62E414},cn=policies,cn=system,DC=DOMAIN,DC=IRAQ,DC=PARENTDOMAIN,DC=PARENTDOMAIN,DC=MIL. The file must be present at the location <\\<FQDN>\SysVol\<FQDN>\Policies\{2910DB65-ED86-477E-9081-9E7A8A62E414}\gpt.ini>. (Configuration information could not be read from the domain controller, either because the machine is unavailable, or access has been denied. ). Group Policy processing aborted.
_______________________________________________________________________________
Event Type:      Warning
Event Source:      Server
Event Category:      None
Event ID:      2510
Date:            10/10/2008
Time:            5:25:15 PM
User:            N/A
Computer:      DC1
Description:
The server service was unable to map error code 998.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
---------------------------------------
Now that everything is rebooted, there are no issues. This will last until another 30 hours or so and repeat.

PKI certs are stored on both DCs. I'm totally stumped and trying to avoid building a new DC.

I do get two time errors: one telling me that the machine is configured to use the domain hierarchy to determine its time source, but it is the PDC emulator for the domain at the root of the forest, and the one shown below. I wouldn't think they'd make a difference in such a short amount of time.
__________________
Event Type:      Warning
Event Source:      W32Time
Event Category:      None
Event ID:      36
Date:            10/10/2008
Time:            9:17:23 AM
User:            N/A
Computer:      DC1
Description:
The time service has not synchronized the system time for 86400 seconds  because none of the time service providers provided a usable time  stamp. The time service is no longer synchronized and cannot provide  the time to other clients or update the system clock. Monitor the  system events displayed in the Event  Viewer to make sure that a more  serious problem does not exist.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
___________________________

Again, I'm stumped... any help is appreciated. The only errors I'm seeing are KDC and Valicert.
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22688410
Are the times of the DCs and clients exactly in sync? Do you have your DC configured to use an external time source?
0
 

Author Comment

by:MGio4
ID: 22688462
Times are exact... We are not pointing to an external time source.
0
 
LVL 59

Assisted Solution

by:Darius Ghassem
Darius Ghassem earned 60 total points
ID: 22688490
I would recommend pointing to an external time source. There is a reg file on this post that will import the settings for you on your PDC. Also, look over this to see if the 40960 errors are causing the problems. Which errors are coming up first?

http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23630502.html

http://www.eventid.net/display.asp?eventid=40960&eventno=787&source=LsaSrv&phase=1
0
 

Author Comment

by:MGio4
ID: 22688770
40960 error only occurs upon reboot. Prior to that all errors are KDC and Valicert (with occasional w32 error).

I've run the reg file...
0
 

Author Comment

by:MGio4
ID: 22688800
By the way... Dc2 synchs with DC1
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22688807
Make sure to run w32tm /resync /rediscover on your other DC too.

Check to see if this pertains to you at all.
http://support.microsoft.com/kb/822219
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22688836
0
 

Author Comment

by:MGio4
ID: 22688878
I'm running 32 Bit windows. The hotfix doesn't appear to apply. I've looked at it previously.
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22688971
0
 

Author Comment

by:MGio4
ID: 22689143
SAV is enabled. The 998 error only occurs on reboot as well.
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22689279
MGio4: Having re-read the post, I have to agree with Chief's post. Are you sure that the KDC errors are the first to appear? If you look at your question, Event ID 1006 was listed first.
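
One way to settle which error really comes first is to sort the pasted events by timestamp instead of reading them in the order they were posted. The following is a minimal sketch, assuming the events are pasted as plain text in the same "Field: value" layout used in this thread; the function names `parse_events` and `first_seen` are made up for illustration and are not part of any Windows tool.

```python
import re
from datetime import datetime

# Field lines as they appear in the pasted events, e.g. "Event ID:      1006".
FIELD = re.compile(r"^\s*(Event Type|Event Source|Event ID|Date|Time):\s*(.+?)\s*$")

def parse_events(text):
    """Split pasted Event Viewer text into one dict per fully dated event."""
    events, current = [], None
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Event Type":  # each pasted event starts with this field
            current = {}
            events.append(current)
        if current is not None:
            current[key] = value
    dated = []
    for ev in events:
        # Skip fragments that were quoted without a source, ID, date or time.
        if all(k in ev for k in ("Event Source", "Event ID", "Date", "Time")):
            stamp = ev["Date"] + " " + ev["Time"]
            ev["when"] = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
            dated.append(ev)
    return dated

def first_seen(events):
    """Return (source, event_id, time) tuples ordered by first occurrence."""
    earliest = {}
    for ev in events:
        key = (ev["Event Source"], ev["Event ID"])
        if key not in earliest or ev["when"] < earliest[key]:
            earliest[key] = ev["when"]
    return sorted(((s, i, t) for (s, i), t in earliest.items()), key=lambda x: x[2])

# Three of the events from the original question, trimmed to the parsed fields.
SAMPLE = """
Event Type: Error
Event Source: Userenv
Event ID: 1006
Date: 10/8/2008
Time: 9:09:15 AM

Event Type: Warning
Event Source: KDC
Event ID: 21
Date: 10/8/2008
Time: 9:14:39 AM

Event Type: Error
Event Source: Valicert Desktop Validator
Event ID: 1
Date: 10/8/2008
Time: 9:14:36 AM
"""

for source, event_id, when in first_seen(parse_events(SAMPLE)):
    print(when.strftime("%H:%M:%S"), source, event_id)
```

With the three sample events, the Userenv 1006 error sorts first, and the Valicert error (9:14:36) sorts ahead of the KDC warning (9:14:39) even though it was posted after it.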
0
 

Author Comment

by:MGio4
ID: 22689351
This last time I didn't get the 1006. I don't think it happens unless I let the servers get totally unresponsive. I'm wondering if I might have a DNS issue.
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22689370
Do a netdiag and post the results.
0
 
LVL 38

Assisted Solution

by:ChiefIT
ChiefIT earned 440 total points
ID: 22689445
MGio4:

OK let's look at this from a different point of view:

My point on Blue Coat is that SP2 rolled up a security update that refuses storage of NTLM and LM hashes on the server (I believe). Those hashes were considered a serious vulnerability. As a result, they were refused storage on the AD server, along with the ability to be granted a Kerberos access ticket. So, there is nothing to authenticate with when trying to open up that link between the two machines. You simply can't ask for an NTLMhash token and get a Kerberos ticket.

You can make the servers backwards compatible with NTLM again. But do you really want to, considering the security exposure of NTLM authentication? As mentioned above, NTLM is an easy front-door (brute-force) hack.

I think RIGHT NOW is a good time to evaluate the need of Blue Coat. Take NTLM_has-->>(been) out of the picture unless you have machines prior to WIN 2000 that need to authenticate with the DC.

If you really need to use NTLM, you can make it backwards compatible by reversing the steps in the link below. Some of the comments there tell you how to disable NTLMhash. Then again, you risk the problems this person was having with NTLMhash authentication:
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Windows_2003_Active_Directory/Q_23132123.html

Caution:: Backup prior to monkeying around with authentication!! It is really easy to blue screen, (BSOD), when playing with these settings!!!!  

GOING TO AD AND USING KERBEROS IS MUCH MORE SECURE AND IS BETTER SUPPORTED BY IT SUPPORT, IF THE NEED ARISES.
0
 

Author Comment

by:MGio4
ID: 22689868
Chief - I think I have an ally over here in getting rid of BlueCoat... I'm working on it... all things in time (hopefully quicktime).

Looking at DNS, I also found the DC we had added and removed in Name Servers (Address Unknown). I've since removed it.
I'm also considering setting up a scheduled task on one DC to net stop DNS and setting up a task on the other to Net Start DNS a while later to see if I can detect a memory leak. I've been here 18 hours though and need to think that over when I'm awake.
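
A leak usually shows up as a steady upward trend in a process's memory use between reboots, so one way to test the idea is to log memory for the suspect processes (DNS, LSASS) at intervals and fit a line through the samples. The sketch below is only an illustration under stated assumptions: the samples are (hours since boot, megabytes) pairs collected by whatever means is available, the numbers are made up, and `slope_per_hour` and `hours_until` are hypothetical helper names.

```python
def slope_per_hour(samples):
    """Least-squares growth rate, in MB per hour, of (hours, megabytes) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den

def hours_until(samples, limit_mb):
    """Hours after the last sample until limit_mb is reached at the fitted rate.

    Returns None when memory is flat or shrinking, i.e. no leak-like trend.
    """
    rate = slope_per_hour(samples)
    if rate <= 0:
        return None
    last_mb = samples[-1][1]
    return max(0.0, (limit_mb - last_mb) / rate)

# Made-up readings for one process: (hours since boot, working set in MB).
readings = [(0, 100), (1, 110), (2, 120), (3, 130)]
print("growth MB/hour:", slope_per_hour(readings))
print("hours until 430 MB:", hours_until(readings, 430))
```

If the projected time to exhaustion for one process lines up with the interval between lockups, that process is the first place to look. A flat trend for DNS would argue against the DNS service being the source.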

At any rate, here's my netdiag output:

C:\Documents and Settings\taclan>netdiag
.....................................
    Computer Name: DC1
    DNS Host Name: DC1.<FQDN>
    System info : Microsoft Windows Server 2003 (Build 3790)
    Processor : x86 Family 6 Model 15 Stepping 7, GenuineIntel
    List of installed hotfixes :
        KB911564
        KB921503
        KB924667-v2
        KB925398_WMP64
        KB925876
        KB925902
        KB926122
        KB927891
        KB929123
        KB930178
        KB931784
        KB932168
        KB933360
        KB933729
        KB933854
        KB935839
        KB935840
        KB935966
        KB936021
        KB936357
        KB936782
        KB937143
        KB937143-IE7
        KB938127
        KB938127-IE7
        KB938464
        KB939653-IE7
        KB941202
        KB941568
        KB941569
        KB941644
        KB941672
        KB941693
        KB942615-IE7
        KB942763
        KB943055
        KB943460
        KB943484
        KB943485
        KB944533-IE7
        KB944653
        KB945553
        KB946026
        KB947864-IE7
        KB948496
        KB948590
        KB948881
        KB949014
        KB950759-IE7
        KB950760
        KB950762
        KB950974
        KB951066
        KB951698
        KB951746
        KB951748
        KB952954
        KB953838-IE7
        Q147222

Netcard queries test . . . . . . . : Passed

Per interface results:

    Adapter : Local Area Connection 1

        Netcard queries test . . . : Passed

        Host Name. . . . . . . . . : DC1.<FQDN>
        IP Address . . . . . . . . : IPADDRESS OF DC1
        Subnet Mask. . . . . . . . : 255.255.255.0
        Default Gateway. . . . . . : GW IP
        Dns Servers. . . . . . . . : DC1 IP
                                     DC2 IP

        AutoConfiguration results. . . . . . : Passed

        Default gateway test . . . : Passed

        NetBT name test. . . . . . : Passed
        [WARNING] At least one of the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names is missing.

        WINS service test. . . . . : Skipped
            There are no WINS servers configured for this interface.

Global results:

Domain membership test . . . . . . : Passed

NetBT transports test. . . . . . . : Passed
    List of NetBt transports currently configured:
        NetBT_Tcpip_{D9459CB6-3577-40DD-8567-CBD24A49C656}
    1 NetBt transport currently configured.

Autonet address test . . . . . . . : Passed

IP loopback ping test. . . . . . . : Passed

Default gateway test . . . . . . . : Passed

NetBT name test. . . . . . . . . . : Passed
    [WARNING] You don't have a single interface with the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names defined.

Winsock test . . . . . . . . . . . : Passed

DNS test . . . . . . . . . . . . . : Passed
    PASS - All the DNS entries for DC are registered on DNS server 'DC1 IP' and other DCs also have some of the names registered.
    PASS - All the DNS entries for DC are registered on DNS server 'DC2 IP' and other DCs also have some of the names registered.

Redir and Browser test . . . . . . : Passed
    List of NetBt transports currently bound to the Redir
        NetBT_Tcpip_{D9459CB6-3577-40DD-8567-CBD24A49C656}
    The redir is bound to 1 NetBt transport.

    List of NetBt transports currently bound to the browser
        NetBT_Tcpip_{D9459CB6-3577-40DD-8567-CBD24A49C656}
    The browser is bound to 1 NetBt transport.

DC discovery test. . . . . . . . . : Passed

DC list test . . . . . . . . . . . : Passed

Trust relationship test. . . . . . : Skipped

Kerberos test. . . . . . . . . . . : Passed

LDAP test. . . . . . . . . . . . . : Passed

Bindings test. . . . . . . . . . . : Passed

WAN configuration test . . . . . . : Skipped
    No active remote access connections.

Modem diagnostics test . . . . . . : Passed

IP Security test . . . . . . . . . : Skipped

    Note: run "netsh ipsec dynamic show /?" for more detailed information

The command completed successfully

C:\Documents and Settings\taclan>
----------------------------------
I'm going to bed and keeping my fingers crossed.
Thanks much for the help.
0
 

Author Comment

by:MGio4
ID: 22700758
Over the weekend, I dug around in DNS a bit and found the DC that we had added and removed was still listed under the Name Servers tab (Address Unknown). I removed it and stopped and restarted DNS. I'm not sure whether or not that would fix anything, but it sure couldn't hurt.

All FSMO roles, except for Infrastructure, are on DC1. Infrastructure is on DC2. DC1 was the only GC, so I made DC2 a GC as well. I also configured time on the PDC, although it doesn't appear to be working... DC2 and all clients are synching time with DC1.

I've had REPLMON running on DC1 since yesterday morning and it is reporting successful replication at regular intervals. I plan on leaving it running for a couple of days.

Saturday, I decided to point the two domain controllers to the same DNS server temporarily and then restart the Net Logon service on both servers, thinking that should re-register the domain controller DNS entries. I was going to use REPLMON to determine if replication is really happening. For whatever reason, our BlueCoat Proxy appliance (which I'm trying to get rid of due to Chief's excellent advice) freaked out. It shouldn't have, as it has both DNS addresses listed. We couldn't get that back up until we did a hard reset of the BlueCoat. I rebooted the DCs while troubleshooting that.

As for the Godforsaken BlueCoat, they'll eventually let me get rid of it. As with anything involving the government, it's going to take a while though.

I rebooted DC1 again yesterday morning to make sure that if the 32 to 36 hour issue occurred, it would be in the middle of the day, while I was here, and would not interfere with any night missions. If it's going to crap out again, it should be sometime this afternoon.

We have two new issues now that may or may not be related. Bluecoat goes crazy after about 24 hours and we get hammered with the following message until we reset the device (TWICE):

Event Type:      Warning
Event Source:      BCAAA
Event Category:      (1)
Event ID:      300
Date:            10/13/2008
Time:            3:06:42 AM
User:            N/A
Computer:      DC2
Description:
[5756:5432] Connection attempt from forbidden IP address: xxx.xxx.xx.xx
-------------------------------------------

The other thing I noticed in looking through last night's logs on DC2 was the following three events:
Event Type:      Error
Event Source:      smtpsvc
Event Category:      None
Event ID:      2013
Date:            10/13/2008
Time:            12:09:55 AM
User:            N/A
Computer:      DC2
Description:
SMTP could not connect to any DNS server. Either none are configured, or all are down.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 7c 26 00 00               |&..    
------------------------------
Event Type:      Warning
Event Source:      smtpsvc
Event Category:      None
Event ID:      2012
Date:            10/13/2008
Time:            12:09:55 AM
User:            N/A
Computer:      DC2
Description:
SMTP could not connect to the DNS server 'DC1 IP ADDRESS'. The protocol used was 'UDP'. It may be down or inaccessible.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: d5 04 00 00               Õ...    
------------------------------------------------
Event Type:      Warning
Event Source:      smtpsvc
Event Category:      None
Event ID:      2012
Date:            10/13/2008
Time:            12:02:55 AM
User:            N/A
Computer:      DC2
Description:
SMTP could not connect to the DNS server 'DC2 IP ADDRESS'. The protocol used was 'UDP'. It may be down or inaccessible.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: d5 04 00 00               Õ...
------------------------------------------------------------
These errors only occurred once, and I see no issues (email or otherwise).

I'm also going to become familiar with Kiwi Syslog today and see if I can figure out how to configure it.

I appreciate the help... Enjoy what's left of your weekend. If this issue is resolved, I'll have my first weekend in 10 months in a few weeks. :) I look forward to hearing from you.
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22700907
I was looking at your new errors:

Looking at EventID, I see references on how to control SPAM and NDRs.

Please check out the last comments, from "Gordon", on this post:
http://www.eventid.net/display.asp?eventid=2012&eventno=3165&source=smtpsvc&phase=1

_________________________________________________________
Event Type:      Warning
Event Source:      BCAAA
Event Category:      (1)
Event ID:      300
Date:            10/13/2008
Time:            3:06:42 AM
User:            N/A
Computer:      DC2
Description:
[5756:5432] Connection attempt from forbidden IP address: xxx.xxx.xx.xx

Could be one of a few things:
Either your NTLM hash authentication was refused by a Kerberos LDAP server,
or
this IP address once belonged to someone who was caught sending hacking attempts, and the IP was designated unsafe,
or
someone sees this connection and is trying a brute force attack,
or
someone has the wrong logon credentials and was locked out.
0
 

Author Comment

by:MGio4
ID: 22701181
Thanks for that one, Chief... The box was checked for recipient filtering, but it had never been enabled under virtual SMTP for either Exchange server. I did notice that one Exchange server had the IP assigned in Virtual SMTP and the other does not. I'm going to research that a little now. So far my logs look good regarding my original problem, but it's only been 29 hours since the last reboot, so I'm not going to get real excited just yet, as it's not in the trend window of 32 to 36 hours I had previously noticed. I have seen it go 40. Keep your fingers crossed for me. Meanwhile, I'm still digging...
0
 

Author Comment

by:MGio4
ID: 22703691
Update: After 31 hours (pretty much to the minute), the servers became unresponsive again. This time I have additional errors in chronological order:

FROM DC1:  The first two appear at 5 minute intervals

Event Type:      Error
Event Source:      Userenv
Event Category:      None
Event ID:      1006
Date:            10/13/2008
Time:            2:47:37 PM
User:            NT AUTHORITY\SYSTEM
Computer:      DC1
Description:
Windows cannot bind to <FQDN> domain. (Timeout). Group Policy processing aborted.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
_____________________________________________________

Event Type:      Error
Event Source:      Userenv
Event Category:      None
Event ID:      1030
Date:            10/13/2008
Time:            2:47:37 PM
User:            NT AUTHORITY\SYSTEM
Computer:      DC1
Description:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Event Type:      Warning
Event Source:      W32Time
Event Category:      None
Event ID:      36
Date:            10/13/2008
Time:            2:48:04 PM
User:            N/A
Computer:      DC1
Description:
The time service has not synchronized the system time for 86400 seconds  because none of the time service providers provided a usable time  stamp. The time service is no longer synchronized and cannot provide  the time to other clients or update the system clock. Monitor the  system events displayed in the Event  Viewer to make sure that a more  serious problem does not exist.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
___________________________________
Event Type:      Error
Event Source:      BCAAA
Event Category:      (1)
Event ID:      2200
Date:            10/13/2008
Time:            2:56:26 PM
User:            N/A
Computer:      DC1
Description:
[1672:1876] Cannot query domain controller <IP ADDRESS for DC1>; status=64:0x40:The specified network name is no longer available.
etc.
_________________________________________________
At 2:54 p.m. I begin to get the following DNS errors:

Event Type:      Error
Event Source:      DNS
Event Category:      None
Event ID:      4016
Date:            10/13/2008
Time:            2:54:42 PM
User:            N/A
Computer:      DC1
Description:
The DNS server timed out attempting an Active Directory service operation on DC=205,DC=15.21.140.in-addr.arpa,cn=MicrosoftDNS,cn=System,DC=Domain,DC=ParentDomain,DC=ParentDomain,DC=ParentDomain,DC=MIL.  Check Active Directory to see that it is functioning properly. The event data contains the error.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 55 00 00 00               U...    

On DC2, I begin to get the KDC and Valicert errors I'd mentioned previously at 2:51 p.m. DNS errors start at 2:58 p.m.

I rebooted DC1 at 3:01 p.m. and DC2 immediately after as it was completely unresponsive.

In going back and looking at REPLMON logs for DC2, everything appears to be replicating with DC1 with the exception of the Schema which did not attempt to replicate with DC2 for almost the last two hours. Config was due to replicate at 2:44 and did not as well.

The DC partition was due to replicate @ 2:59. By that time, everything had fallen apart.

All partitions on DC1 were due to replicate at 2:58. Again, that's about the time everything froze.

Upon rebooting the DCs, the only errors I got were on DC2:

Event Type:      Warning
Event Source:      NETLOGON
Event Category:      None
Event ID:      3096
Date:            10/13/2008
Time:            3:29:35 PM
User:            N/A
Computer:      DC2
Description:
The primary Domain Controller for this domain could not be located.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
__________________________________________
Event Type:      Warning
Event Source:      LSASRV
Event Category:      SPNEGO (Negotiator)
Event ID:      40960
Date:            10/13/2008
Time:            3:29:49 PM
User:            N/A
Computer:      DC2
Description:
The Security System detected an authentication error for the server cifs/<DC2 IP address>.  The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
 (0xc000005e)".

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0               ^..À    
___________________________________________
Event Type:      Warning
Event Source:      LSASRV
Event Category:      SPNEGO (Negotiator)
Event ID:      40960
Date:            10/13/2008
Time:            3:29:51 PM
User:            N/A
Computer:      DC2
Description:
The Security System detected an authentication error for the server ldap/DC2.FQDN.  The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
 (0xc000005e)".

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0               ^..À    
_________________________________
Event Type:      Warning
Event Source:      LSASRV
Event Category:      SPNEGO (Negotiator)
Event ID:      40960
Date:            10/13/2008
Time:            3:29:52 PM
User:            N/A
Computer:      DC2
Description:
The Security System detected an authentication error for the server LDAP/DC2.  The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
 (0xc000005e)".

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0               ^..À    
______________________________________________

REPLMON shows everything authenticating properly at the moment, which puts me back at square 1.
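
Since the events from DC1 and DC2 are pasted above in the same "Field: value" layout, here is a rough sketch of how they could be merged into one timeline so the order of failures is easier to see. It assumes the text is copied straight out of Event Viewer in US date format; the function name and the field list are my own choices, not anything Windows provides.

```python
from datetime import datetime

# Header fields worth keeping; description text and hex data are ignored.
FIELDS = {"Event Type", "Event Source", "Event ID", "Date", "Time", "Computer"}

def build_timeline(text):
    """Parse pasted Event Viewer blocks and return them oldest first."""
    events, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Every event starts with "Event Type", so that closes the previous one.
        if key == "Event Type" and current:
            events.append(current)
            current = {}
        if sep and key in FIELDS:
            current[key] = value
    if current:
        events.append(current)

    dated = []
    for event in events:
        if "Date" in event and "Time" in event:
            stamp = event["Date"] + " " + event["Time"]
            event["When"] = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
            dated.append(event)
    return sorted(dated, key=lambda event: event["When"])
```

Feeding it the combined logs from both DCs and printing `When`, `Computer`, `Event Source` and `Event ID` for each entry gives a single list to read top to bottom.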
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22705536
Consider this:

You may have a multihomed domain controller. A multihomed domain controller is simply defined as a domain controller with multiple IPs. This could mean two or more IPs on the same NIC, Two or more NICs, or a NIC with VPN connection to the outside world.

So, some services may bind or be redirected to the wrong network binding. In fact, they could be bound or redirected to the outside network binding. That binding may give you the errors you are seeing because the outside binding may not know how to get back to the client.

________________________________________________________________________
I am in the process of bringing together advice from others on how to configure a multihomed domain controller so there is NO error in the path of communications: (So far, this is what I have come up with)

There are a couple of "transports" or "protocols" or whatever you want to call them, that the DC uses to communicate with other machines on the domain and to the outside world:

1) DNS
2) DHCP
3) Netbios

(((DNS)))
To prevent DNS from binding to the outside NIC or IP address, there are a few things you will need to do. First, you need to prevent the outside NIC from registering its SRV records in DNS. Second, you need to prevent the outside NIC from registering itself with DNS. Third, you need to clean out of DNS any records that point to the outside NIC.

Step 1) To resolve these issues, Follow this link: (NOTE: By default, 2003 server registers both NICs SRV records in DNS)
 -- http://support.microsoft.com/?id=832478
Step 2) Once you prevent both SRV records from registering in DNS when the netlogon service restarts, you need to prevent the outside NIC from registering its DNS records in DNS. To do this, go to the NIC configuration>>TCP/IP properties>>Advanced button>>DNS tab and disable the ability of the NIC to register its DNS settings in DNS.
Step 3) Once you have disabled the ability to register that outside NIC's DNS address, you must remove all HOST A, SRV, and cached records of that outside NIC. I assume you already know how to remove HOST A records. To remove the DNS cache, go to the command prompt and type ipconfig /flushdns. To remove the SRV records, please follow the advice on this link:

http://support.microsoft.com/kb/241515
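
To make the cleanup in Step 3 a bit more concrete, here is a small sketch for picking out the HOST A records that point at the outside NIC. It assumes you have already copied the records out of the DNS console by hand into (name, address) pairs; the host names, addresses, and subnet below are made up for illustration only.

```python
import ipaddress

def outside_records(records, internal_subnet):
    """Return the (name, ip) pairs whose address is NOT on the internal subnet."""
    network = ipaddress.ip_network(internal_subnet)
    return [
        (name, ip)
        for name, ip in records
        if ipaddress.ip_address(ip) not in network
    ]

# Hypothetical export: DC1 registered both its inside and outside address.
example_records = [
    ("dc1", "10.0.0.5"),
    ("dc1", "192.0.2.10"),
    ("dc2", "10.0.0.6"),
]

stale = outside_records(example_records, "10.0.0.0/24")
```

Anything that ends up in `stale` is a candidate for manual removal in the DNS console. The sketch only reads a list; it does not touch DNS itself.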


(((DHCP:)))
DHCP may try to provide DHCP to all network bindings. This could be a VPN or second NIC to the outside world. You can prevent it from providing DHCP to any binding by following these simple steps:

DHCP snapin>>right click the server in question>>Select properties>>select the Advanced tab>>select binding

You can disable any binding from providing DHCP

(((NETBIOS)))
Preventing Netbios is a little more difficult to do on various types of Multihomed domain controllers. Not always does a DC use WINS when dealing with netbios. So, this is a bit more involved.

To prevent Netbios from binding to the outside binding or VPN connection binding, you must go to that binding and remove the ability of it to do ""Netbios over TCP/IP"" or ""Netbios over DHCP"".
For a VPN connection and Dual NICs:
Right click "My network Places">>select "properties">>right click "VPN connection" or the Second NIC>>Select "Properties" >>Select "TCP/IP">> Go to Properties>>Go to the "WINS" Tab>> and prevent it from providing "Netbios over TCP/IP" and also prevent it from performing "Netbios over DHCP"

Disabling File and Print sharing:
You may also wish to disable your outside NIC from broadcasting out your files and printers to the outside world. To do this, disable File and print sharing.

(((Default Gateway)))
Other things to look out for:
You should have one single gateway for your multihomed NICs. If you are routing over your server, it should be the outside NIC that has a gateway configured. If the second NIC is only there to communicate with a few nodes on the network, your domain-side NIC should have the gateway configured. So, this is domain specific.
_______________________________________________________________________
With that said, the problems you are seeing:

(40960: SPNEGO)
Comes from the inability to propagate the SRV records in DNS. In fact, all of your errors come from the inability to do a DNS resolution to the logon server.
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23356031.html
0
 
LVL 38

Assisted Solution

by:ChiefIT
ChiefIT earned 440 total points
ID: 22705678
Another thing you should look at is Time Synchronization:

I see that time synchronization is off.

The PDCe, by default, synchs domain clients and servers to it with synchronization flags. However, Group policy overrides the synchronization flags. So, let's say, you have a group policy to synch with xxxcomputer, when you really want to synch with DC1. When you first start up the computer, it will synch with DC1 and eventually see the group policy, then change your synch flags to xxxcomputer.

To change this, prevent group policy from administering who you synch to, go to the command prompt and type GPupdate /force, and then set up an authoritative time server as the PDCe. To set up an authoritative time server follow this link:
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Q_22799695.html

On the above site, (left hand side), there are two utilities that make setting up time easy. One is called Symmtime. The second is called Domain Time II. They come from symetricom, who makes time servers.

Symmtime is free and will synchronize your server with outside time servers. Once done you can have all clients and servers synchronize up to that. Domain Time II is an audit software that will look at all of your Domain PCs (XP, 2003 servers, NT, 2000) and it will audit the time for you. So, you don't have to do it manually.
0
 
LVL 38

Accepted Solution

by:
ChiefIT earned 440 total points
ID: 22705799
Now that you have unhosed DNS and Time services on the domain, let's find out why you are getting knocked down every 24-36 hours. This symptom is not indicative of DNS or Time. It sounds more like a memory leak or NIC flood.

We can check for memory leaks using Poolmon. Poolmon monitors your kernel paged pool and non-paged pool memory and shows how often each block is allocated and how often it is freed. What you are looking for is the difference between the two. If usage grows and grows while freeing of those blocks doesn't, that means the packets are just filling up your memory to the point that it will slow down the machine and eventually freeze it. If freeing of the memory block grows and grows, that means you have a couple of processes fighting for the same memory block, and they are both trying to free that block for use. Either way, it's a memory leak.

The attached document below is an example of two memory leaks. The services I would look into have the tags of "SevI" and "Usqm".

Don't forget to check the Paged pool as well as the Non-Paged pool blocks for discrepancies. It is easiest to sort by differences.

HOW TO USE POOLMON TO HELP DISCOVER AND FIX YOUR MEMORY LEAKS:
http://www.adopenstatic.com/cs/blogs/ken/archive/2006/07/10/Using-PoolMon-_2800_Pool-Monitor_2900_-to-debug-kernel-memory-leaks.aspx
 
memory-leakage.txt
0
 

Author Comment

by:MGio4
ID: 22709395
Belay my last while I put my DUMBASS hat on. While doing some troubleshooting, I at least ADDED to any DNS issue I might have had. While undoing some changes and attempting to point the Alternate DNS IP address on DC2 to DC1, I fat fingered part of the IP address... Dammit... Dammit... Dammit... (this was Friday).

Now it's time to go to work on my TIME issue, see what happens next, and see if I still have the server locking up issue...

God... I can't wait until leave... I need a break.

I'll keep you posted...
0
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 22711339
The fat finger syndrome is common.
0
 

Author Comment

by:MGio4
ID: 22711452
I've got to talk to our firewall folks and see if 123 is blocked. I can't sync time right now, and it's a gov't system, so I'm not at liberty to put a 3rd party product on it. That being said, as far as I know, time sync has never been configured on DC1. I believe everything syncs with DC1 okay though...
0
 

Author Comment

by:MGio4
ID: 22719383
I'm pretty sure my problem is TIME. Port 123 is blocked. Once I corrected some other errors, I noticed that I started getting KDC and Valicert errors within 2 minutes after I get:

Event Type:      Warning
Event Source:      W32Time
Event Category:      None
Event ID:      36
Date:            10/15/2008
Time:            12:09:12 PM
User:            N/A
Computer:      TACMDC1
Description:
The time service has not synchronized the system time for 86400 seconds  because none of the time service providers provided a usable time  stamp. The time service is no longer synchronized and cannot provide  the time to other clients or update the system clock. Monitor the  system events displayed in the Event  Viewer to make sure that a more  serious problem does not exist.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
______________________________________________________________________________
I didn't stop to think that I start seeing these errors 5 or 6 hours before the DC's bug out entirely.

How do I go about pointing DC1 to the router to get time? Do I simply adjust the ntp server for the IP of the address? Or, do I need them to open 123 back up and go out the conventional way to an NTP server?
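
For anyone wanting to check whether UDP 123 actually gets through before chasing the firewall team, here is a minimal SNTP probe sketch. It sends a plain client request and reads back the server's transmit timestamp. The function names are my own, and the server address you pass in (router or otherwise) is an assumption on your part: the router only answers if it is actually running an NTP service.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX = 2208988800

def build_request():
    """48-byte SNTP request: LI=0, version 3, mode 3 (client) in the first byte."""
    return b"\x1b" + 47 * b"\x00"

def parse_transmit_time(packet):
    """Return the server transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("reply too short to be an NTP packet")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_TO_UNIX + fraction / 2**32

def query(server, timeout=3.0):
    """Ask one server for the time over UDP 123.

    A socket timeout here means no reply came back, which is what you
    would see if the port is blocked or nothing is listening.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (server, 123))
        reply, _ = sock.recvfrom(512)
    return parse_transmit_time(reply)
```

Calling `query()` with the router's address and then with an outside NTP server would show which of the two paths, if either, is open.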
0
 

Author Comment

by:MGio4
ID: 22723351
I think it's fixed.... There was a gpo preventing the time from synching. I'm not sure when or why... Then again, there's way too many fingers in this pie sometimes.... I'm going to leave this open for 48 hours or so to know for sure it's working.
0
 

Author Comment

by:MGio4
ID: 22724210
Final question....hopefully.... DC2 is synching time with DC1... Do I need to do a GPO for the client workstations to synch time as well?
0
 
LVL 38

Assisted Solution

by:ChiefIT
ChiefIT earned 440 total points
ID: 22725522
I am still skeptical of the time service knocking you totally down every 24-36 hours. I still think this sounds like memory leakage. So, it is wise to monitor this issue. Using poolmon would be a good idea to see if any paged pool or non-paged pool is growing in difference between freeing and using the memory.
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22725564
Do I need to do a GPO for the client workstations to synch time as well?

No, as stated above. Time synchronizes by default to the 2003 server PDCe. GPOs override these default settings. So, it is best to disable the GPOs for all time synchronization. Then it will synchronize naturally.

0
 

Author Comment

by:MGio4
ID: 22734542
Looks like you may be on to something Chief.... Now DC1 totally freezes up and LSASS maxes out the CPU.... After 10 minutes or so, DC2 does the same thing.
0
 

Author Comment

by:MGio4
ID: 22734918
The time sync issue is fixed and DNS is unhosed....I'm finding Poolmon confusing though. Am I reading the following right in assuming that SavE may be my culprit in Paged?

Memory: 4193264K Avail: 3575340K  PageFlts:   138   InRam Krnl: 3360K P:88620
 Commit: 469436K Limit:6117048K Peak: 672028K            Pool N:42932K P:89500
 System pool information
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 SavE Paged    665625 (   0)    665004 (   0)      621 55099936 (     0)  8872
 MmSt Paged      5585 (   0)      1299 (   0)     4286 8144392 (     0)   1900
 R100 Paged        47 (   0)         2 (   0)       45 5461800 (     0) 121373
 CM35 Paged        50 (   0)         8 (   0)       42 2818048 (     0)  67096
 Ntff Paged      2200 (   0)       382 (   0)     1818 1483488 (     0)    816
 NtfF Paged      1998 (   0)       677 (   0)     1321 1236456 (     0)    936
 SACC Paged       250 (   0)         0 (   0)      250 1008968 (     0)   4035
 TSdd Paged      1218 (   0)      1193 (   0)       25  897632 (     0)  35905
 AfdX Paged     10004 (   3)      7139 (   3)     2865  802200 (     0)    280
 Gh15 Paged     17162 (  20)     17025 (  20)      137  750032 (     0)   5474
 CMAl Paged       381 (   0)       209 (   0)      172  704512 (     0)   4096
 Ttfd Paged      1411 (   0)       659 (   0)      752  678984 (     0)    902
 Wmit Paged        13 (   0)         2 (   0)       11  655688 (     0)  59608
 Gh05 Paged      6882 (   0)      6796 (   0)       86  642896 (     0)   7475
 IoNm Paged    366568 (   9)    360577 (   9)     5991  614080 (     0)    102
 Gla1 Paged       481 (   0)       200 (   0)      281  579984 (     0)   2064
 TSwd Paged        18 (   0)         8 (   0)       10  425800 (     0)  42580
 Obtb Paged       319 (   0)       158 (   0)      161  414480 (     0)   2574
 CM16 Paged        82 (   0)         1 (   0)       81  344064 (     0)   4247
 SAV  Paged    327848 (   6)    327292 (   6)      556  320848 (     0)    577
 FSim Paged      2503 (   0)       246 (   0)     2257  288896 (     0)    128
 Gcac Paged        53 (   0)         4 (   0)       49  268952 (     0)   5488
 ArbA Paged        60 (   0)         0 (   0)       60  245760 (     0)   4096
 FSrm Paged       512 (   0)       363 (   0)      149  221560 (     0)   1486
 CMVa Paged     87742 (   0)     84143 (   0)     3599  217072 (     0)     60
 NtFs Paged     32866 (   0)     29346 (   0)     3520  198872 (     0)     56
 CM25 Paged       729 (   0)       717 (   0)       12  180224 (     0)  15018
 NtFB Paged       185 (   0)       170 (   0)       15  179624 (     0)  11974
 CMDa Paged     19117 (   0)     17583 (   0)     1534  167288 (     0)    109
 Toke Paged     50097 (  49)     49860 (  51)      237  165000 ( -1392)    696
 MmSm Paged      2880 (   0)       403 (   0)     2477  158528 (     0)     64
 Ntfo Paged      6049 (   0)      4752 (   0)     1297  155816 (     0)    120
 CM39 Paged       648 (   0)       144 (   0)      504  145344 (     0)    288
 NtFS Paged      2781 (   0)      2231 (   0)      550  142936 (     0)    259
 NtFf Paged        18 (   0)         8 (   0)       10  131360 (     0)  13136
 LfsI Paged         2 (   0)         0 (   0)        2  131072 (     0)  65536
 CM29 Paged        15 (   0)         0 (   0)       15  122880 (     0)   8192
 Gla5 Paged       671 (   0)       358 (   0)      313  122696 (     0)    392
 Key  Paged    295026 (  13)    293873 (  13)     1153  119864 (     0)    103
 WmIS Paged         1 (   0)         0 (   0)        1  118784 (     0) 118784
 Bmfd Paged        65 (   0)         0 (   0)       65  116624 (     0)   1794
 Ntfc Paged      2366 (   0)       791 (   0)     1575  113400 (     0)     72
 Gla: Paged       343 (   0)       186 (   0)      157  102992 (     0)    656
 Port Paged      2005 (   0)      1480 (   0)      525   97512 (     0)    185
 Ntf0 Paged      9293 (   0)      6260 (   0)     3033   97224 (     0)     32
 ObHd Paged     10346 (   3)      7657 (   3)     2689   87232 (     0)     32
 Ghab Paged       770 (   0)         0 (   0)      770   86240 (     0)    112
 CM17 Paged        10 (   0)         0 (   0)       10   81920 (     0)   8192

or: MmCm under NonPaged

Memory: 4193264K Avail: 3574880K  PageFlts:   198   InRam Krnl: 3360K P:88592K
 Commit: 469348K Limit:6117048K Peak: 672028K            Pool N:43044K P:89476K
 System pool information
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 MmCm Nonp       1913 (   0)       802 (   0)     1111 16760984 (     0)  15086
 Irp  Nonp     919429 (  35)    908414 (  31)    11015 4816872 (  3000)    437
 Mdl  Nonp      41375 (   5)      7837 (   9)    33538 4354096 (  -512)    129
 LSwi Nonp          1 (   0)         0 (   0)        1 2576384 (     0) 2576384
 File Nonp     361118 ( 169)    350142 ( 180)    10976 1677488 ( -1672)    152
 TCPt Nonp       5358 (   0)      5327 (   0)       31 1458096 (     0)  47035
 TPLA Nonp        256 (   0)         0 (   0)      256 1048576 (     0)   4096
 TCPA Nonp       3309 (   1)       752 (   3)     2557  940976 (  -736)    368
 AfdE Nonp      10297 (  10)      7428 (  16)     2869  803320 ( -1680)    280
 Thre Nonp       4930 (   4)      4206 (   5)      724  451776 (  -624)    624
 brcm Nonp         15 (   0)         2 (   0)       13  434176 (     0)  33398
 LSwr Nonp        128 (   0)         0 (   0)      128  416768 (     0)   3256
0
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22735247
Mdl  Nonp      41375 (   5)      7837 (   9)    **33538** 4354096 (  -512)    129  <<<---What you are looking for:
 LSwi Nonp          1 (   0)         0 (   0)        1 2576384 (     0) 2576384
 File Nonp     361118 ( 169)    350142 ( 180)    10976 1677488 ( -1672)    152<<<--possible memory leak.


Notice that the difference between allocations and frees is 33538. When thinking of this, think of a memory block. That memory block is used to extract data from the hard drive, get it ready for the processor, pass it to the processor, and then free itself for more data. So, if the block is not freed as often as it is allocated, it becomes full and unusable for more data after a while (let's say 24-36 hours in this case).

In the first example, the memory pool was allocated 41375 times but freed only 7837 times. If that continues, usage will grow to a point larger than the pool allocation and you will get a STOP error.

The second one, with a difference of 10976, is certainly something to watch out for. If the difference between allocations and frees continues to grow, this is a second memory leak.

NOW, what to do:
This tag (((((((Mdl  Nonp)))))) represents a program that is struggling with non-paged pool memory. It is not freeing the memory block as often as it is being used. In the link I provided above, there is a command line that you can run to associate the TAG with the program. Do you see it? Run that command line and find out what program it is. Post your results on this page.

You might consider running that for the other tag (((( File Nonp ))))) while we are doing this. It might be a related process, and fixing one memory leak may fix the second issue.

For clarification, you could sort your POOLMON results by Differences of allocation and frees. The tags that grow and grow in differences are the memory leaks.

I hope this makes sense.
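
As a rough illustration of the "sort by differences" idea, here is a sketch that reads Poolmon lines pasted as text and orders the tags by Diff (allocations minus frees). The regular expression is my own guess at the column layout based on the output posted in this thread, so treat it as an assumption rather than a documented format.

```python
import re

# One Poolmon row: tag, pool type, allocs (delta), frees (delta), diff, bytes (delta), per-alloc.
ROW = re.compile(
    r"^\s*(?P<tag>\S{1,4})\s+(?P<type>Nonp|Paged)\s+"
    r"(?P<allocs>\d+)\s*\(\s*-?\d+\)\s+"
    r"(?P<frees>\d+)\s*\(\s*-?\d+\)\s+"
    r"(?P<diff>-?\d+)\s+(?P<bytes>\d+)\s*\(\s*-?\d+\)\s+(?P<per_alloc>\d+)"
)

def parse_poolmon(text):
    """Return one dict per recognised row, largest Diff first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and anything else that is not a tag row
        row = match.groupdict()
        for column in ("allocs", "frees", "diff", "bytes", "per_alloc"):
            row[column] = int(row[column])
        rows.append(row)
    return sorted(rows, key=lambda row: row["diff"], reverse=True)

snapshot = """
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc
 Irp  Nonp     919429 (  35)    908414 (  31)    11015 4816872 (  3000)    437
 Mdl  Nonp      41375 (   5)      7837 (   9)    33538 4354096 (  -512)    129
 LSwi Nonp          1 (   0)         0 (   0)        1 2576384 (     0) 2576384
 File Nonp     361118 ( 169)    350142 ( 180)    10976 1677488 ( -1672)    152
"""

ranked = parse_poolmon(snapshot)
```

With the four rows above, `ranked` puts Mdl first, then Irp and File. A single snapshot only shows who holds the most outstanding allocations; whether a tag is leaking depends on how its Diff changes over time.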
0
 

Author Comment

by:MGio4
ID: 22736169
I seem to be getting my butt kicked on the cmd line: if I type findstr /l /m Mdl *.sys, I get: "FINDSTR: Cannot open c:\pagefile.sys". If I type findstr /s /m Mdl c:\*.sys, I get what looks like every .sys file on my system.

Same thing goes for File.
I've attached the output txt files for your perusal.
 Memory: 4193264K Avail: 3548052K  PageFlts:    78   InRam Krnl: 3396K P:91076K
 Commit: 479312K Limit:6117048K Peak: 925376K            Pool N:45404K P:91992K
 System pool information
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 MmCm Nonp       2558 (   0)      1447 (   0)     1111 16760984 (     0)  15086
 Irp  Nonp     937682 (   4)    926692 (   2)    10990 5185784 (  1024)    471
 Mdl  Nonp      53102 (  18)     19488 (  14)    33614 4363824 (   552)    129
 LSwi Nonp          1 (   0)         0 (   0)        1 2576384 (     0) 2576384
 File Nonp    1289013 (  53)   1277122 (  44)    11891 1819752 (  1368)    153
 TCPt Nonp       9522 (   6)      9491 (   6)       31 1458096 (     0)  47035
 TPLA Nonp        256 (   0)         0 (   0)      256 1048576 (     0)   4096
 TCPA Nonp       5040 (   2)      2480 (   1)     2560  942080 (   368)    368
 AfdE Nonp      18737 (   5)     15877 (   2)     2860  800800 (   840)    280
 Thre Nonp      14644 (   5)     13926 (   4)      718  448032 (   624)    624
 Ntfr Nonp       7214 (   0)       430 (   0)     6784  435144 (     0)     64
 brcm Nonp         15 (   0)         2 (   0)       13  434176 (     0)  33398
 MmCa Nonp      27893 (   0)     24085 (   0)     3808  420192 (     0)    110

Paged:
 Memory: 4193264K Avail: 3558504K  PageFlts:   418   InRam Krnl: 3396K P:91288K
 Commit: 481024K Limit:6117048K Peak: 925376K            Pool N:45428K P:92184K
 System pool information
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 SavE Paged   1130416 (   0)   1129795 (   0)      621 55099936 (     0)  88727
 MmSt Paged      7420 (   1)      2159 (   1)     5261 9370592 (     0)   1781
 R100 Paged        47 (   0)         2 (   0)       45 5461800 (     0) 121373
 CM35 Paged        50 (   0)         8 (   0)       42 2818048 (     0)  67096
 Ntff Paged      3494 (   0)       920 (   0)     2574 2100384 (     0)    816
 NtfF Paged     22356 (   0)     20773 (   0)     1583 1481688 (     0)    936
 SACC Paged       250 (   0)         0 (   0)      250 1008968 (     0)   4035
 TSdd Paged      2732 (  14)      2708 (  14)       24  897584 (     0)  37399
 Gh15 Paged     50802 ( 100)     50629 ( 100)      173  842656 (     0)   4870
 AfdX Paged     20123 (   5)     17272 (  10)     2851  798280 ( -1400)    280
 IoNm Paged   1808353 (  43)   1801629 (  49)     6724  711424 (  -448)    105
 CMAl Paged       468 (   0)       302 (   0)      166  679936 (     0)   4096
 Ttfd Paged      2134 (   0)      1381 (   0)      753  679392 (     0)    902
 Wmit Paged        13 (   0)         2 (   0)       11  655688 (     0)  59608
 Gh05 Paged      6882 (   0)      6796 (   0)       86  642896 (     0)   7475
 Gla1 Paged       795 (   0)       499 (   0)      296  610944 (     0)   2064
 TSwd Paged        35 (   0)        25 (   0)       10  425800 (     0)  42580
 Obtb Paged       471 (   0)       310 (   0)      161  414480 (     0)   2574
 FSim Paged      3340 (   0)       246 (   0)     3094  396032 (     0)    128
 CM16 Paged        82 (   0)         1 (   0)       81  344064 (     0)   4247
Mdl.txt
file.txt
0
 

Author Comment

by:MGio4
ID: 22739238
Chief - I'm sure you're right. After not having any luck determining where Mdl is coming from, I've started anew. I increased the size of my paging file by reducing it to 65MB on the C: partition (for dump) and putting a 6GB file on the F: partition (I'm running Enterprise 32-bit with 4GB of RAM). When rebooting, I noticed that I went from getting TWO of the following error to ONE (not sure if it's just a fluke):
Event Type:      Warning
Event Source:      Server
Event Category:      None
Event ID:      2510
Date:            10/10/2008
Time:            5:25:15 PM
User:            N/A
Computer:      DC1
Description:
The server service was unable to map error code 998.
(MICROSOFT has a hot fix for this that does NOT apply to me because I'm on a 32-bit OS.)
_____________________________________________
After 5 hours' sleep last night, I spent the morning trying to figure out PsList to no avail. I've come back to Poolmon and am doing the following in accordance with http://technet.microsoft.com/en-us/library/cc736362.aspx:

This example outlines a procedure for using Poolmon to detect a memory leak.
1. Start Poolmon in default mode (no additional parameters).
2. Press P twice to display allocations from only the paged pool. (The P key toggles the display between paged, non-paged, and both.)
3. Press B to sort the Bytes column in descending order.
4. Let Poolmon run for a few hours. Because starting Poolmon changes the data, you must let it run until it reaches a steady state before the data is reliable.
5. Save the information generated by Poolmon, either as a screenshot, or by copying it from the command window and pasting it into Notepad.
6. Returning to Poolmon, press P twice again, this time to display only allocations from the non-paged pool.
7. Repeat steps 3, 5 and 6 approximately every half-hour for at least two hours.
8. When data collection is complete, examine the Diff (allocations minus frees) and Bytes (number of bytes allocated minus number of bytes freed) values for each tag, and note any that continually increase. Next, stop Poolmon, wait for a few hours, and then restart Poolmon. Examine the allocations that were increasing, and determine whether the bytes are now freed. Allocations that have still not been freed, or have continued to increase in size are the likely culprits.
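
For anyone who would rather not compare the saved snapshots by eye, the comparison in the last part of the procedure can be roughed out in a short script. This is only a sketch: it assumes the snapshots were pasted into text as rows shaped like the Poolmon output earlier in this thread (Tag, Type, Allocs, Frees, Diff, Bytes, Per Alloc), and the names `parse_snapshot` and `growing_tags` as well as the sample numbers are made up for illustration.

```python
import re

# One Poolmon row: Tag Type Allocs (delta) Frees (delta) Diff Bytes (delta) PerAlloc
ROW = re.compile(
    r"^\s*(?P<tag>\S{1,4})\s+(?P<type>Paged|Nonp)\s+"
    r"(?P<allocs>\d+)\s*\(\s*-?\d+\)\s+"
    r"(?P<frees>\d+)\s*\(\s*-?\d+\)\s+"
    r"(?P<diff>-?\d+)\s+(?P<bytes>\d+)\s*\(\s*-?\d+\)\s+\d+\s*$"
)


def parse_snapshot(text):
    """Return {(tag, pool_type): bytes} for every line that looks like a Poolmon row."""
    usage = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            usage[(match["tag"], match["type"])] = int(match["bytes"])
    return usage


def growing_tags(snapshots):
    """Return (tag, pool_type, growth_in_bytes) for tags whose Bytes value
    rises in every consecutive snapshot, largest growth first."""
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots to compare")
    parsed = [parse_snapshot(s) for s in snapshots]
    common = set(parsed[0]).intersection(*parsed[1:])
    result = []
    for key in common:
        series = [p[key] for p in parsed]
        if all(earlier < later for earlier, later in zip(series, series[1:])):
            result.append((key[0], key[1], series[-1] - series[0]))
    return sorted(result, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample rows, NOT taken from the attached reports.
    first = " AAAA Paged 100 ( 0) 50 ( 0) 50 1000 ( 0) 20\n BBBB Nonp 10 ( 0) 5 ( 0) 5 500 ( 0) 100\n"
    second = " AAAA Paged 200 ( 0) 90 ( 0) 110 2200 ( 0) 20\n BBBB Nonp 10 ( 0) 5 ( 0) 5 500 ( 0) 100\n"
    for tag, pool, growth in growing_tags([first, second]):
        print(tag, pool, growth)
```

A tag that rises in every snapshot is only a candidate; as the procedure says, it still needs the stop, wait, and restart check to see whether the bytes are eventually freed.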
__________________________________________

Like you, I'm already pretty sure that Mdl is my culprit. Unfortunately, when I try C:\findstr /l /m Mdl *.sys, the only response I get is "findstr: cannot open c:\*.sys".
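
A note on that error: findstr looks for *.sys in the current directory, and the message suggests it was run from the root of C:, where there are no .sys files. The usual approach is to run it from the drivers folder (normally %SystemRoot%\System32\drivers). The same search can be sketched in Python as below; this is only an illustration, the function name `drivers_containing_tag` is made up, and the default path is an assumption about a standard install.

```python
import os
from pathlib import Path


def drivers_containing_tag(tag, directory):
    """Return the sorted names of .sys files in `directory` whose raw bytes
    contain `tag` as an ASCII string (roughly what findstr /l /m does)."""
    needle = tag.encode("ascii")
    folder = Path(directory)
    if not folder.is_dir():
        return []
    hits = []
    for path in folder.glob("*.sys"):
        try:
            data = path.read_bytes()
        except OSError:
            continue  # skip files that are locked or unreadable
        if needle in data:
            hits.append(path.name)
    return sorted(hits)


if __name__ == "__main__":
    # Assumed default location of the driver binaries on a standard install.
    system_root = os.environ.get("SystemRoot", r"C:\Windows")
    drivers_dir = os.path.join(system_root, "System32", "drivers")
    for name in drivers_containing_tag("Mdl", drivers_dir):
        print(name)
```

Keep in mind that a match only shows that the three characters appear somewhere in the binary. A short tag like Mdl may turn up in many drivers, so the list is a starting point rather than an answer.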

I'm at my wits' end trying to find out what Mdl belongs to.
 

Author Comment

by:MGio4
ID: 22742871
Okay... I ran 4 Poolmon reports this afternoon and two this evening (attached txt files). If I'm looking at them right, Mdl is actually staying steady, although File has increased a little bit. Under Paged, MmSt, NtfF, and IoNm appear to be increasing pretty steadily. NtfF is part of NTFS.SYS. I think MmSt is part of the Memory Manager that tries to trim allocated paged pool memory when the system reaches 80 percent of the total paged pool (I may be wrong on that). Please look the txt files over and tell me if I'm on the right track.

On a side note, does increasing the Virtual Memory (Page File) have any effect on memory leaks? I'm thinking not, but I haven't had much sleep lately.

Thanks again for everything.
2pmPaged.txt
230pm-NonPaged.txt
3pmPaged.txt
330pm-NonPaged.txt
730pm-NonPaged.txt
730pm-Paged.txt
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22743110
MGio4:

The only memory leak I am really good at has nothing to do with computers (LOL). I think this is going to require expertise beyond the scope of my abilities. So, I am requesting a little bit of help. I think, for sure, you are onto the memory leaks. I am going to get someone to help us knock this puppy out.

From what I have seen, there is one expert that is exceptional on this. He goes by the screen name of Placebo and I think recently changed the screen name to placebo69. That doesn't mean others can't provide you with the knowledge to fix it. So, let's see who can pop by and help out. I'll see if I can hunt down placebo.
 

Author Comment

by:MGio4
ID: 22743175
Much appreciated.... I've got 2 weeks' leave depending on this. I'm sorry the replies are sometimes slow.... I'm GMT +3. I'm here every day though, and I'm here for a few more hours tonight. Unless somebody with a plan pops in, and then I'm here as long as I need to be.
 

Author Comment

by:MGio4
ID: 22743994
Would adding the /3GB switch to boot.ini help?
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22747221
As I understand that switch:

On 32-bit Windows, the 4GB virtual address space is divided in half. Without the switch you have 2GB for user-mode processes and 2GB for the kernel. With the switch, user mode gets 3GB and the kernel is cut down to 1GB. So, I can't see how this would help, since your kernel memory is the one having the issue. If anything, it leaves less room for the kernel pools.

 
LVL 38

Assisted Solution

by:ChiefIT
ChiefIT earned 440 total points
ID: 22747262
MmSt paged pool tag: Windows Cache Manager uses this tag for file caching. Windows Cache Manager automatically reduces file caching to free paged pool memory if the pool becomes depleted.

Still looking for NtfF. I think you are right that it does have to do with the NTFS file system.

**You too can be a Poolmon pro with knowledge of all the switches and options of Poolmon.
http://technet.microsoft.com/en-us/library/cc736362.aspx

 

Author Comment

by:MGio4
ID: 22751975
I've got a ticket in with MS at this point... I'm going to keep this open for a few days. If they fix it, I'll post the results. In the meantime, I'm still digging. I can't talk to MS until Monday (afternoon for me).
 

Author Comment

by:MGio4
ID: 22775148
MS thinks it's a known bug, but is going to finish verifying the user dumps and other data I've sent before they give me a hotfix. I'll repost and close when I know.
 

Author Comment

by:MGio4
ID: 22780967
Microsoft determined that Tumbleweed was the culprit. I contacted Tumbleweed and they said that it's a known issue with DV 4.9.0 and 4.91. Apparently, something within the OS triggers this at random (although lately it hasn't been at random). Anyway, supposedly DV 4.9.2 resolves the issue. I'm downloading it now and will configure it tomorrow. I knew that Tumbleweed was involved, because I could disable it for a moment and LSASS would decline. I was confused because it had worked for so long and we didn't have any issues until we went to AD integrated DNS.

Other issues got fixed in the process though, so it's a good thing.

Thanks again for the insight.
 
LVL 38

Expert Comment

by:ChiefIT
ID: 22781065
This is because Tumbleweed uses NTLMhash and SP2 denies saving and authenticating with NTLMhash authentication.

We have been right all along. LOL
 

Author Comment

by:MGio4
ID: 22781078
Looks that way.... Thanks again!
 

Author Comment

by:MGio4
ID: 22781095
Looks that way.... It just chose to rear its ugly head while I was working on other things. Isn't that usually the case though?

Take care!