Active Directory Domain Services failures on DC - simply stops responding to service requests

We recently upgraded our single-site AD from Windows Server 2003 32-bit to Windows Server 2008 32-bit. We have two physical servers, which are separated by a 3 Mbit WAN link. The PDC’s five FSMO roles were transferred to the other physical server, which was still running 2003. We then properly demoted what was the PDC server and disjoined it from the domain before blowing away the OS and installing Server 2008. During the rebuild process we decided to try adding a third DC running 2008 on one of our VMware ESXi servers (on the same LAN as our PDC). It has been running fine since and we have kept it as a backup.

Now both of our physical servers are running 2008, and the DC on our ESXi server is 2008 as well. The FSMO roles were moved back to what we consider the PDC and things were fine for a while, but now we get a strange failure on each physical server about once a week. When this happens, our users no longer get a response from the server for authentication or for DFS folder enumeration, which happens to be hosted on our DCs (we are now considering moving DFS hosting to our file servers). Remote Desktop still works and you can still ping the DC, but it essentially stops responding to all service requests. Additionally, you can’t open the ADUC MMC on the server; it gives the error “Naming information cannot be located for the following reason: The server is not operational.” None of the other AD DS MMCs will open either. This only happens on the physical servers, not on the server on our ESXi host.
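
For anyone trying to catch this state as it happens, here is a minimal Python sketch that probes the TCP ports behind the symptoms described above (authentication, directory lookups, DFS, Remote Desktop). The port numbers are the standard well-known ones; the host name in the usage comment is hypothetical. Note that a hung service may still accept a TCP connection, so an "open" result does not prove the service is healthy.

```python
import socket

# Standard well-known ports for the services mentioned in the post.
SERVICE_PORTS = {
    "Kerberos": 88,
    "LDAP": 389,
    "SMB (DFS)": 445,
    "Remote Desktop": 3389,
}

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def probe(host, ports=SERVICE_PORTS, timeout=3.0):
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}

# Usage (hypothetical host name):
#   for name, ok in probe("dc1.example.local").items():
#       print(name, "open" if ok else "NO RESPONSE")
```

Running this on a schedule from a workstation and logging the results would at least timestamp when the DC stops answering, which can then be lined up against the Directory Service log.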

Things I’ve tried:
•      We corrected a few things in our DNS settings (can’t remember what exactly) which helped a few other things but didn’t fix this particular problem.
•      We deleted the old DC computer object from AD because I forgot to remove it, but we used a new computer name so it wasn’t a problem when setting up the new 2008 build.
•      Installed the absolute latest version of network card drivers because we had read somewhere it could cause problems with TCP connectivity.
•      Raised the domain function level to 2008, but that had no effect on this problem either.

We’ve been chasing this problem for a while and have tried a few other fixes (can’t remember what they all were, fairly basic things) but can’t seem to fix it. What is particularly perplexing is that the DC on our ESXi host isn’t affected, and we didn’t configure it any differently. We’ve fixed a few other small problems and got rid of some events being logged, but I am still getting this event 1308, which seems to be the indicator that the DC has stopped working properly. I haven't spent too much time trying to resolve it; I've just been rebooting the DC, and then it is fine for at least a few days before it happens again. I don't have much time to chase this thing down all day and I'm hoping someone out there has already experienced this or has some good ideas I can try.

Log Name:      Directory Service
Source:        Microsoft-Windows-ActiveDirectory_DomainService
Date:          4/4/2011 6:09:54 PM
Event ID:      1308
Task Category: Knowledge Consistency Checker
Level:         Warning
Keywords:      Classic
User:          ANONYMOUS LOGON
The Knowledge Consistency Checker (KCC) has detected that successive attempts to replicate with the following directory service has consistently failed. 
Directory service:
CN=NTDS Settings,CN=DJC-JCCS-DC,CN=Servers,CN=IDJC,CN=Sites,CN=Configuration,DC=idjc,DC=idaho,DC=gov 
Period of time (minutes):
The Connection object for this directory service will be ignored, and a new temporary connection will be established to ensure that replication continues. Once replication with this directory service resumes, the temporary connection will be removed. 
Additional Data 
Error value:
1256 The remote system is not available. For information about network troubleshooting, see Windows Help.
Event Xml:
<Event xmlns="">
    <Provider Name="Microsoft-Windows-ActiveDirectory_DomainService" Guid="{0e8478c5-3605-4e8c-8497-1e730c959516}" EventSourceName="NTDS KCC" />
    <EventID Qualifiers="32768">1308</EventID>
    <TimeCreated SystemTime="2011-04-05T00:09:54.388Z" />
    <Correlation />
    <Execution ProcessID="612" ThreadID="1512" />
    <Channel>Directory Service</Channel>
    <Security UserID="S-1-5-7" />
    <Data>CN=NTDS Settings,CN=DJC-JCCS-DC,CN=Servers,CN=IDJC,CN=Sites,CN=Configuration,DC=idjc,DC=idaho,DC=gov</Data>
    <Data>The remote system is not available. For information about network troubleshooting, see Windows Help.</Data>

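When comparing several of these events across DCs, it can help to reduce each one to its key fields. The Python sketch below is one way to do that for event text pasted in the format shown above; the function name is made up for illustration, and the patterns assume the same field layout as this event.

```python
import re

def summarize_event(text):
    """Pull the event ID, replication partner, and error out of pasted event text."""
    event_id = re.search(r"^Event ID:\s*(\d+)", text, re.MULTILINE)
    # The partner server name is the CN that follows "CN=NTDS Settings".
    partner = re.search(r"CN=NTDS Settings,CN=([^,]+),CN=Servers", text)
    # The error code and message sit on the line after "Error value:".
    error = re.search(r"Error value:\s*(\d+)\s+(.*)", text)
    return {
        "event_id": int(event_id.group(1)) if event_id else None,
        "partner": partner.group(1) if partner else None,
        "error_code": int(error.group(1)) if error else None,
        "error_text": error.group(2).strip() if error else None,
    }
```

For the event above this yields event ID 1308, partner DJC-JCCS-DC, and error 1256 ("The remote system is not available").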
Darius Ghassem Commented:
First, run metadata cleanup to remove any lingering objects.

Make sure you are running only one NIC in each server. Disable any other NICs.

Make sure you are only pointing to internal DNS servers.

Update all network card drivers and firmware.

Remove any AV for testing.

Go to the properties of the network cards, click the Advanced tab, and disable any Offloads listed.
LittleJohn101 (Author) Commented:
Thanks very much for your response, dariusg. In response to your recommendations:

The old server name does not exist in the metadata, or in AD Sites and Services, Users and Computers, or in forward or reverse DNS zones.

I have disabled the second NIC on one of my DCs; the other's was already disabled.

I confirmed I am only pointing to local DNS servers.

I already have the latest NIC drivers, and there is no firmware update available.

By remove AV for testing I'm assuming you mean Anti-Virus software?

As for (what I'm assuming to be) TCP Checksum Offload, what is your reasoning for disabling this? A quick search on the subject found what it does, but no explanation of problems it could cause other than heavy CPU consumption. I may try this as a next step if disabling the second NIC doesn't resolve the issue.
Darius Ghassem Commented:
There have been multiple problems with TCP Offload just like the one you are describing. I can post a link if you would like, but if you search EE you will find tons of people having issues with Offloads.
LittleJohn101 (Author) Commented:
Thanks, I'll give it a try and report back. It will take more than a week, as that's about how long they go between failures.
LittleJohn101 (Author) Commented:
Well, I had the same failure again yesterday. However, there are two different but similar offload settings in the network driver. I didn't have the IPv4 Checksum Offload disabled, so I've disabled that one as well. My other domain controller still has the old NIC driver installed and it doesn't have this option, just the TCP/UDP one. Somehow I doubt having both of these disabled will help, but I hope I'm wrong. I'll report back with the results in about a week.

[Screenshot: NIC properties]
LittleJohn101 (Author) Commented:
Well, unfortunately the offload changes didn't have any effect on my problem. After more searching on the web I came across the following blog, which described my problem almost exactly. The fix is Microsoft hotfix 961775.

This appears to have resolved the issue as the server has been running over 2 weeks without a problem, as opposed to before where it usually would not make it over a week.

Anyhow thank you for the help but ultimately a MS hotfix was required.

Darius Ghassem Commented:
That would do it. I recommend going to SP2, though.
LittleJohn101 (Author) Commented:
I found this... explaining the hotfix is essentially included in SP2. I only have SP1 installed on my servers. This is my fault for not checking; had I been at SP2, this problem would never have occurred in the first place.

So if anyone else is reading this, make sure SP2 is installed!

Thanks for your assistance and recommendations, dariusg.
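
For readers who want to check the service pack level across several servers, here is a small Python sketch, assuming Python is available on the machine. It relies on `platform.win32_ver()`, which returns the service pack as a string such as 'SP1'; the helper names are made up for illustration, and the "below SP2" rule is only meaningful for Windows Server 2008 as discussed in this thread.

```python
import platform
import re

def service_pack_level(csd):
    """Turn a service pack string such as 'SP1' into an integer; '' means none."""
    match = re.search(r"(\d+)", csd or "")
    return int(match.group(1)) if match else 0

def needs_sp2(csd):
    """True when the reported service pack level is below 2."""
    return service_pack_level(csd) < 2

# platform.win32_ver() returns (release, version, csd, ptype); csd is the
# service pack string, and all fields are empty on non-Windows systems.
release, version, csd, ptype = platform.win32_ver()
if release:
    status = "install SP2" if needs_sp2(csd) else "SP2 or later present"
    print(f"Windows {release} {csd or '(no service pack)'}: {status}")
```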
LittleJohn101 (Author) Commented:
Found a blog on the web describing my problem and recommending an MS hotfix as the fix.