LittleJohn101 asked:

Active Directory Domain Services failures on DC - simply stops responding to service requests

We recently upgraded our single-site AD from Windows Server 2003 32-bit to Windows Server 2008 32-bit. We have two physical servers, separated by a 3 Mbit WAN link. The PDC’s five FSMO roles were transferred to the other physical server, which was still running 2003. We then properly demoted what had been the PDC server and disjoined it from the domain before blowing away the OS and installing Server 2008. During the rebuild we decided to try adding a third DC running 2008 on one of our VMware ESXi servers (on the same LAN as our PDC). It has been running fine since, and we have kept it as a backup.

Now both our physical servers are running 2008, and the DC on our ESXi server is 2008 as well. The FSMO roles were moved back to what we consider the PDC and things were fine for a while, but now we get a strange failure from each physical server about once a week. When this happens, our users no longer get a response from the server for authentication or for DFS folder enumeration, which happens to be hosted on our DCs (we are now considering moving DFS hosting to our file servers). Remote Desktop still works and you can still ping the DC, but it essentially stops responding to all service requests. Additionally, you can’t open the ADUC MMC on the server; it gives the error “Naming information cannot be located for the following reason: The server is not operational.” None of the other AD DS MMCs will open. This only happens on the physical servers, not on the server on our ESXi host.

Things I’ve tried:
• We corrected a few things in our DNS settings (can’t remember what exactly), which helped with a few other things but didn’t fix this particular problem.
• We deleted the old DC computer object from AD because I had forgotten to remove it. Since we used a new computer name, it wasn’t a problem when setting up the new 2008 build.
• Installed the absolute latest version of the network card drivers because we had read somewhere that old drivers could cause problems with TCP connectivity.
• Raised the domain functional level to 2008, but that had no effect on this problem either.

We’ve been chasing this problem for a while and have tried a few other fixes (can’t remember what they all were, fairly basic things) but can’t seem to fix it. What is particularly perplexing is that the DC on our ESXi host isn’t affected, and we didn’t configure it any differently. We’ve fixed a few other small problems and got rid of some events being logged, but I am still getting this event 1308, which seems to be the indicator that the DC has stopped working properly. I haven't spent too much time trying to resolve it; I've just been rebooting the DC, and then it is fine again for at least a few days before it happens again. I don't have much time to chase this thing down all day, and I'm hoping someone out there has already experienced this or has some good ideas I can try.

Log Name:      Directory Service
Source:        Microsoft-Windows-ActiveDirectory_DomainService
Date:          4/4/2011 6:09:54 PM
Event ID:      1308
Task Category: Knowledge Consistency Checker
Level:         Warning
Keywords:      Classic
User:          ANONYMOUS LOGON
Computer:      DJC-JCCN-DC-1.idjc.idaho.gov
Description:
The Knowledge Consistency Checker (KCC) has detected that successive attempts to replicate with the following directory service has consistently failed. 
 
Attempts:
93 
Directory service:
CN=NTDS Settings,CN=DJC-JCCS-DC,CN=Servers,CN=IDJC,CN=Sites,CN=Configuration,DC=idjc,DC=idaho,DC=gov 
Period of time (minutes):
134 
 
The Connection object for this directory service will be ignored, and a new temporary connection will be established to ensure that replication continues. Once replication with this directory service resumes, the temporary connection will be removed. 
 
Additional Data 
Error value:
1256 The remote system is not available. For information about network troubleshooting, see Windows Help.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-ActiveDirectory_DomainService" Guid="{0e8478c5-3605-4e8c-8497-1e730c959516}" EventSourceName="NTDS KCC" />
    <EventID Qualifiers="32768">1308</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8080000000000000</Keywords>
    <TimeCreated SystemTime="2011-04-05T00:09:54.388Z" />
    <EventRecordID>1168</EventRecordID>
    <Correlation />
    <Execution ProcessID="612" ThreadID="1512" />
    <Channel>Directory Service</Channel>
    <Computer>DJC-JCCN-DC-1.idjc.idaho.gov</Computer>
    <Security UserID="S-1-5-7" />
  </System>
  <EventData>
    <Data>93</Data>
    <Data>CN=NTDS Settings,CN=DJC-JCCS-DC,CN=Servers,CN=IDJC,CN=Sites,CN=Configuration,DC=idjc,DC=idaho,DC=gov</Data>
    <Data>134</Data>
    <Data>The remote system is not available. For information about network troubleshooting, see Windows Help.</Data>
    <Data>1256</Data>
  </EventData>
</Event>
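
For anyone collecting several of these events, the fields can be pulled out of the Event Xml programmatically. The following is only a minimal sketch in Python, assuming the event has been saved as text. The helper name `summarize_kcc_1308` is hypothetical, and the positional order of the `<Data>` elements is assumed from this one sample rather than from any documentation.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the posted event.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event posted above; only the fields read below are kept.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID Qualifiers="32768">1308</EventID>
    <TimeCreated SystemTime="2011-04-05T00:09:54.388Z" />
    <Computer>DJC-JCCN-DC-1.idjc.idaho.gov</Computer>
  </System>
  <EventData>
    <Data>93</Data>
    <Data>CN=NTDS Settings,CN=DJC-JCCS-DC,CN=Servers,CN=IDJC,CN=Sites,CN=Configuration,DC=idjc,DC=idaho,DC=gov</Data>
    <Data>134</Data>
    <Data>The remote system is not available. For information about network troubleshooting, see Windows Help.</Data>
    <Data>1256</Data>
  </EventData>
</Event>"""


def summarize_kcc_1308(xml_text):
    """Return the key fields of a 1308 event as a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    data = [d.text for d in root.findall("e:EventData/e:Data", NS)]
    # Order assumed from the sample: attempts, partner DN, minutes, message, code.
    attempts, partner_dn, minutes, message, code = data
    # The second RDN of the NTDS Settings DN is the partner server's name.
    partner = partner_dn.split(",")[1].split("=", 1)[1]
    return {
        "event_id": int(root.findtext("e:System/e:EventID", namespaces=NS)),
        "reporting_dc": root.findtext("e:System/e:Computer", namespaces=NS),
        "failing_partner": partner,
        "attempts": int(attempts),
        "minutes": int(minutes),
        "error_code": int(code),
        "error_text": message,
    }


summary = summarize_kcc_1308(EVENT_XML)
print(summary["reporting_dc"], "cannot replicate with", summary["failing_partner"])
```

Run over a batch of saved events, a summary like this makes it easier to see which DC is reporting the failure and which replication partner it could not reach.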


Darius Ghassem

First run metadata cleanup to remove any lingering objects.

http://www.petri.co.il/delete_failed_dcs_from_ad.htm

Make sure you are running only one NIC in each server. Disable any other NICs.

Make sure you are only pointing to internal DNS servers.

Update all network card drivers and firmware.

Remove any AV for testing.

Go to the properties of the network cards, click the Advanced tab, and disable any Offloads listed.
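
As a rough illustration of the DNS check above, here is a minimal Python sketch that flags configured DNS server addresses that are not in private address space. It assumes "internal" can be approximated by private (RFC 1918 or loopback) addresses, which will not hold for an organization that uses public address space internally. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def non_internal_dns(servers):
    """Return the entries that are not private or loopback addresses."""
    return [s for s in servers if not ipaddress.ip_address(s).is_private]


# Hypothetical DNS client settings copied from a DC's NIC properties.
configured = ["10.1.1.10", "10.2.1.10", "8.8.8.8"]
flagged = non_internal_dns(configured)
print("Suspect DNS entries:", flagged)
```

Any address that is flagged would be worth a second look in the NIC's DNS client settings.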
LittleJohn101

ASKER

Thanks very much for your response, dariusg. In response to your recommendations:


The old server name does not exist in the metadata, or in AD Sites and Services, Users and Computers, or in forward or reverse DNS zones.

I have disabled the second NIC on one of my DCs; the other's was already disabled.

I confirmed I am only pointing to local DNS servers.

I already have the latest NIC drivers, and there is no firmware update available.

By remove AV for testing I'm assuming you mean Anti-Virus software?

As for (what I'm assuming to be) TCP Checksum Offload: what is your reasoning for disabling this? A quick search on the subject found what it does, but no explanation of the problems it could cause other than heavy CPU consumption. I may try this as a next step if disabling the second NIC doesn't resolve the issue.
There have been multiple problems with TCP Offload just like the one you are describing. I can post a link if you would like, but if you search EE you will find tons of people having issues with Offloads.
Thanks, I'll give it a try and report back. It will take more than a week, as that's about how long they go between failures.
Well, I had the same failure again yesterday. However, there are two different but similar offload settings in the network driver. I didn't have the IPv4 Checksum Offload disabled, so I've disabled that one as well. My other domain controller still has the old NIC driver installed, and it doesn't have this option, just the TCP/UDP one. Somehow I doubt having both of these disabled will help, but I hope I'm wrong. I'll report back with the results in about a week.

ASKER CERTIFIED SOLUTION
LittleJohn101
That would do it. I recommend going to SP2, though.
I found this... explaining the hotfix is essentially included in SP2. I only have SP1 installed on my servers. This is my fault for not checking that, had I been at SP2 this problem never would have occurred in the first place.

So if anyone else is reading this, make sure SP2 is installed!

Thanks for your assistance and recommendations, dariusg.
Found a blog on the web describing my problem and recommending an MS hotfix as the fix.