AD is completely down due to DNS Islanding Issue... I thought MS had engineered this out?

We experienced a power failure last night that caused all of the DCs in our forest root domain to go down. After booting everything back up, AD refuses to work correctly for the entire enterprise. The AD, DNS, DFSR, NETLOGON, Kerberos, Intersite Messaging, and related services all start and stay running.

However, DNS will not function because it cannot talk to AD, and AD won't function because neither server can contact a domain controller in the domain. Of note is DNS event id 4013 "The DNS server is waiting for Active Directory Domain Services (AD DS) to signal that the initial synchronization of the directory has been completed. The DNS server cannot start until the initial synchronization is complete because critical DNS data might not yet be replicated onto this domain controller..."

Server 2008 R2, all DCs in our enterprise are GCs.

I feel like I've dealt with this before, but cannot recall how to "trick" AD and DNS into starting up on either server without first talking to another DC in the domain. Any help would be awesome.
synaptix Asked:

Cliff Galiher Commented:
This usually occurs when there were already unresolved replication errors before a crash, or when a crash causes a local AD database to become corrupt. Although DCs won't start offering DNS services publicly, they are still aware of each other and compare USNs and similar metadata to determine whether each is healthy. So to get the results you are seeing, every DC would have to think another DC has changes it isn't aware of (USN mismatches), or the database would have to be corrupt enough to cause problems.

Usually the fix would be to restore an authoritative backup of one DC; because it is authoritative, the other DCs will sync to it. In rare cases you'll have to rebuild the other DCs. While workarounds may also work, given the lack of a known cause, I'd consider them a risk currently.
Blake Long (Engineer) Commented:
To me it sounds like, because of the power outage, the servers didn't come up in the right order. I've seen similar situations remedied by simply shutting down all DCs and powering them up in the proper order. That is, assuming you know the proper order for them to talk correctly. Which servers need to come up first will be dictated by which servers hold which FSMO roles for the specific domains.
synaptix (Author) Commented:
Issue resolved. See below for the solution I was able to work out.

The Problem:
The order in which the servers booted did not appear to matter. The problem lay in the fact that as each server came up, it searched for another DC in the domain with which to synchronize AD. Because this initial synchronization could not be performed, DNS would not start, and therefore the directory service would not finish starting either. Both [Active Directory Domain Services] and [DNS Server] (and all dependencies) reported that they were running. However, the event logs showed that AD DS was NOT functioning because it was stuck waiting for the initial sync, and DNS was hung because it was waiting on AD DS to finish starting up. Running DCDIAG revealed this among the first lines of output:

The directory service on server.domain.local has not finished initializing.
In order for the directory service to consider itself synchronized, it must attempt an initial synchronization with at least one replica of this server's writeable domain.  It must also obtain Rid information from the Rid FSMO holder.
The directory service has not signalled the event which lets other services know that it is ready to accept requests. Services such as the Key Distribution Center, Intersite Messaging Service, and NetLogon will not consider this system as an eligible domain controller.

This is what I was referring to when I mentioned the "DNS island" problem, as AD relies heavily on DNS in order to locate other DCs by means of service records and the like. Without DNS, AD would not function. And without AD, DNS was waiting to start. Chicken, meet egg.
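
The circular wait described above can be illustrated with a toy model. This is only a sketch of the reasoning, not how AD DS is implemented: the DC names and the `simulate_startup` function are hypothetical, and "ready" simply means a DC has finished initializing and can serve as a sync partner for the others.

```python
def simulate_startup(requires_sync):
    """Toy model of DC startup after every DC went down together.

    requires_sync maps a (hypothetical) DC name to True if that DC waits
    for an already-ready partner before finishing initialization.
    Returns the set of DCs that end up ready.
    """
    ready = set()
    changed = True
    while changed:
        changed = False
        for dc, needs_partner in requires_sync.items():
            if dc in ready:
                continue
            # A partner is any *other* DC that has already finished starting.
            has_partner = any(other in ready for other in requires_sync if other != dc)
            if not needs_partner or has_partner:
                ready.add(dc)
                changed = True
    return ready


# Every DC waits for a partner: nobody ever becomes ready.
print(simulate_startup({"dc1": True, "dc2": True}))
# One DC skips the initial sync: it comes up, and the other can then sync to it.
print(simulate_startup({"dc1": False, "dc2": True}))
```

In this model, lifting the requirement on a single DC is enough to break the cycle, which matches the behavior reported in the solution below.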

The Key Piece of Information:
We found that most of our AD problems pointed to DNS resolution being the hang-up (see DCDIAG output above). We also found that DNS would not fully start on any affected domain controller. This was recorded in the Applications and Services Logs -> DNS Server log, event ID 4013 (source: DNS-Server-Service):

The DNS server is waiting for Active Directory Domain Services (AD DS) to signal that the initial synchronization of the directory has been completed. The DNS server service cannot start until the initial synchronization is complete because critical DNS data might not yet be replicated onto this domain controller. If events in the AD DS event log indicate that there is a problem with DNS name resolution, consider adding the IP address of another DNS server for this domain to the DNS server list in the Internet Protocol properties of this computer. This event will be logged every two minutes until AD DS has signaled that the initial synchronization has successfully completed.
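
The two symptoms quoted in this post (the DCDIAG message and DNS event 4013) can be checked together once the output has been captured. The sketch below assumes the DCDIAG text and the list of DNS Server event IDs were already collected by some other means; the function name and its inputs are hypothetical, and it only does a string and ID match.

```python
DCDIAG_MARKER = "has not finished initializing"
DNS_WAIT_EVENT_ID = 4013


def looks_like_initial_sync_deadlock(dcdiag_output, dns_event_ids):
    """Return True when both symptoms from this thread are present.

    dcdiag_output: captured text output of DCDIAG (a string).
    dns_event_ids: iterable of event IDs seen in the DNS Server log.
    """
    dcdiag_stuck = DCDIAG_MARKER in dcdiag_output
    dns_waiting = DNS_WAIT_EVENT_ID in set(dns_event_ids)
    return dcdiag_stuck and dns_waiting


sample = "The directory service on server.domain.local has not finished initializing."
print(looks_like_initial_sync_deadlock(sample, [4013, 4013]))
```

A match only says the symptoms line up with this thread; it does not rule out the replication or database problems raised in the earlier comments.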

The Solution:
We resolved our problem with a registry change that removes the need for this initial synchronization before DNS can start. See the Microsoft Knowledge Base article below:

https://support.microsoft.com/en-us/kb/2001093

Adding this DWORD value and restarting "Active Directory Domain Services" corrected the problem immediately. Rebooting our other domain controllers that were affected by the electrical problems sped up the resolution process, but it would appear that all DCs across our entire enterprise (approximately 20 across 6 cities) eventually would have caught up on their own.
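
For reference, the post does not name the value, but to my knowledge the DWORD the KB article describes is "Repl Perform Initial Synchronizations" set to 0 under the NTDS Parameters key; confirm this against the article before using it. The sketch below only builds the `reg add` and `reg delete` command lines and runs nothing unless `dry_run=False` is passed on the DC itself. The function names are hypothetical.

```python
import subprocess

# Assumption: key path and value name as given in KB 2001093; verify first.
NTDS_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "Repl Perform Initial Synchronizations"


def build_reg_add(data=0):
    """Build the 'reg add' command that sets the DWORD (0 = skip initial sync)."""
    return ["reg", "add", NTDS_PARAMS, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", str(data), "/f"]


def build_reg_delete():
    """Build the 'reg delete' command that removes the value again."""
    return ["reg", "delete", NTDS_PARAMS, "/v", VALUE_NAME, "/f"]


def apply_workaround(dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = build_reg_add()
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs an elevated prompt on the DC
    return cmd


print(" ".join(apply_workaround(dry_run=True)))
```

As in the post, the "Active Directory Domain Services" service still has to be restarted afterward for the change to take effect. My understanding is that the article treats this as a temporary workaround, so `build_reg_delete` is included for removing the value once replication is healthy again.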

Needless to say, I'm keeping this one in my back pocket for the next time a power failure or anything else (servers rebooted out of order, or all at once during maintenance) takes our forest root DCs down together and they refuse to come back online.

Thanks for the input.

synaptix (Author) Commented:
Accepted this solution as it was verified to correct the problem. Other suggestions were either unrelated or not strictly relevant to the problem at hand. All input from other users is greatly appreciated.