DNS Critical Error Event ID 4015: Extended Error Debug Information is ""

Ever since I started deploying W2012 R2 Domain Controllers I have been seeing a bunch of DNS errors such as these:
Event ID 4015, DNS-Server-Service: The DNS server has encountered a critical error from the Active Directory. Check that the Active Directory is functioning properly. The extended error debug information (which may be empty) is "". The event data contains the error.

I haven't found any good info on this error and it doesn't tell us anything whatsoever; here is Microsoft's very descriptive resolution: http://technet.microsoft.com/en-us/library/cc735674(v=ws.10).aspx

There is a lot of speculation out there, like "You should always point to another DNS server first, not self", "don't disable IPv6", and "it may be slow links in VPN/MPLS-connected sites". I'd like to try and demystify what is going on here, or at least get some verbose logging somewhere to find out what the error is referring to. I hadn't seen this issue affect us negatively until yesterday, when a whole site couldn't resolve a URL/hostname served by a conditional forwarder. I saw WAY too many 4015 errors, so I just restarted the service; the errors went away and the hostname resolved. We have eight sites and many DCs. I have disabled IPv6 (fully, via the registry) on all DCs, and each Domain Controller points to itself and then 2 others. I have always learned that this is supported and the Microsoft-recommended way to do it. With all the speculation out there, I'd like to see some specific documentation backing up why I should re-enable IPv6 and/or point to another DNS server first. Additionally, it seems silly to have a single branch office DC/DNS server point its first DNS entry to another site - that's just me though.

Here is an exhaustive viewpoint on what best practices should be:

I could spend time diving into it but I think the link above very clearly spells out what best practices should be in various scenarios.
mcburn13 (Author) commented:
Yes this is exactly one of the articles I am talking about.  A lot of "you should do this and that" without anything concrete to back it up.
George Sas (IT Engineer) commented:
Do you have any errors on the AD replication services?
Do you run multiple domains in the forest?
Is the time synchronized across all DCs?
Are all your DNS zones OK?
Try enabling DNS debug logging and check the Dns.log file in the System32\Dns folder.

Here is also a nice article:

Please share a few details about the AD structure.
mcburn13 (Author) commented:
Again, it references that same article everyone else references - a lot of suggestions with no concrete evidence or Microsoft documentation to back it up. For every person that tells you that disabling IPv6 is bad and pointing to another DNS server first is good, there are probably 10 tried-and-tested engineers that will tell you the opposite.
George Sas (IT Engineer) commented:
You did not answer my questions. Please do so.
mcburn13 (Author) commented:
I have a lag server that is the only source of replication errors, but those are expected - I'm using a policy to block anything BUT replication to that server (no errant logins to it, etc.).

We have 3 forests with 2-way forest trusts between each (3 separate companies). Time is synchronized correctly; I've been through that exercise, but of course we have to check every now and again.

Zones seem OK - I've done some cleaning since I started: got rid of the "inProgress..." dupe zones and such, made sure the Name Servers are correct and all. I'm thinking about going from 3-octet to 2-octet reverse zones (not sure if that will help, hurt, or make no difference), e.g. instead of 10.0.1.x, 10.0.2.x, up to 10.0.200.x, just use 10.0.x.x. The one we have been having issues with twice this week is set with "w2000 compatibility"; I am guessing we can change that at this point...
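
To make the 3-octet vs. 2-octet idea concrete, here is a minimal Python sketch of how the reverse zone names work out. The function names are my own (hypothetical), and the subnets are just the examples from above; it only computes zone names and does not touch DNS.

```python
import ipaddress


def reverse_zone_name(cidr: str) -> str:
    """Return the in-addr.arpa zone name for an IPv4 /8, /16 or /24 network."""
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError("expected an IPv4 /8, /16 or /24 network")
    # Keep only the network octets, then reverse them for the zone name.
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


def covering_zone(cidrs, prefixlen=16):
    """Return the one wider reverse zone covering all given networks."""
    supernets = {
        ipaddress.ip_network(c).supernet(new_prefix=prefixlen) for c in cidrs
    }
    if len(supernets) != 1:
        raise ValueError("networks do not share a single /%d" % prefixlen)
    return reverse_zone_name(str(supernets.pop()))


# Example subnets taken from the post above.
subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.200.0/24"]
print([reverse_zone_name(s) for s in subnets])
print(covering_zone(subnets))
```

So the many per-subnet zones (1.0.10.in-addr.arpa, 2.0.10.in-addr.arpa, ...) would collapse into a single 0.10.in-addr.arpa zone.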

I enabled DNS debug logging on our two main DNS servers. I'm also considering changing our DHCP scopes so the workstations use the 2nd one first, kind of splitting the load with the servers that have DNS bindings. Will check the logs; I actually had to make them smaller - who wants to open a 500MB text file!

Anyway, not trying to be smug; I appreciate anyone's feedback here - I just want to know that if I am going to make more system-wide changes, it's based on real data, not just one or two blog posts that say "I think you should do it".
George Sas (IT Engineer) commented:
So I guess you have 2008 and 2012 DCs. Also 2003?
Do you get this error on ALL the DCs or only in one site?

You say you have a policy to block anything but replication. Is it a GPO or a firewall policy?
Just for fun, have you tried removing the policy to see if the error persists?
mcburn13 (Author) commented:
We've been getting these errors probably for as long as I've been here. I started deploying W2012 R2 DCs, but we have had a mixture of 2008, 2008 R2 and W2012. The domain functional level is 2003 (trying to change that soon!), forest is 2008.

The trusted forest we had that resolution issue with is on 2003 DFL/FFL - I just introduced several 2012 R2 DCs in preparation for decommissioning their 2003 DCs.
mcburn13 (Author) commented:
Well, still no luck. I enabled DNS debug logging and all it shows is the same exact goofy error (The DNS server has encountered a critical error from the Active Directory. Check that the Active Directory is functioning properly. The extended error debug information (which may be empty) is "". The event data contains the error). I opened a ticket with Microsoft on it, so we'll see. In the meantime I've started cleaning things up: turned off WINS, re-enabled IPv6 on all DCs, and pointed each DC to a partner DC first, then its own IP. I will also be cleaning up zones (consolidating many subnets into a 2-octet reverse lookup zone) and recreating the _msdcs zone properly...
mcburn13 (Author) commented:
Well, the outages have subsided. I have narrowed down at least one instance of this error to VM backups being taken of the PDC. I have since had the backup team move these to way after hours and stagger them. Now the errors occur just once per server (I see this on maybe 4-5 servers a night), right at the time the backup (Veeam snapshot) finishes. I'm still not sure whether this or something else storage-related caused some of these outages, because I am seeing some other suspicious things in AD and still get a lot of flags about "data not being collected about...."
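
For anyone wanting to check the same correlation on their own DCs, here is a rough Python sketch. It assumes you have already exported the Event ID 4015 timestamps and the backup job end times yourself; the function name, the sample times and the 5-minute window are all made up for illustration.

```python
from datetime import datetime, timedelta


def events_near_backup_end(event_times, backup_end_times,
                           window=timedelta(minutes=5)):
    """Split events into those within `window` of any backup end, and the rest."""
    near, other = [], []
    for ev in event_times:
        if any(abs(ev - end) <= window for end in backup_end_times):
            near.append(ev)
        else:
            other.append(ev)
    return near, other


# Hypothetical sample data, not taken from real logs.
events = [
    datetime(2015, 1, 10, 1, 32),
    datetime(2015, 1, 10, 2, 47),
    datetime(2015, 1, 10, 14, 5),
]
backup_ends = [
    datetime(2015, 1, 10, 1, 30),
    datetime(2015, 1, 10, 2, 45),
]

near, other = events_near_backup_end(events, backup_ends)
print(len(near), "event(s) near a backup end;", len(other), "unexplained")
```

Anything that lands in the "unexplained" bucket is worth a separate look, since it can't be blamed on the snapshot.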
George Sas (IT Engineer) commented:
What is the uptime of the DC?
Try moving the PDC role to another DC and see if the error still appears.
Also try enabling verbose mode on AD logging:
mcburn13 (Author) commented:
Yeah, I have done all that. Verbose DNS logging shows the same error.
George Sas (IT Engineer) commented:
In my last post I was talking about AD verbose logging, as the error appears to come from AD.

You could try shutting down the PDC during backup time to see if the error still occurs, and then let it boot up again.
Or, if it is not too much work, try disabling the old backup job and creating a new one to see if the error is related to that.
Dig into the backup logs and see if you spot anything strange there. It can be a minor glitch somewhere, and only by looking at ALL of AD as a whole will you be able to see where the error comes from.
Aaron Tomosky (Director of Solutions Consulting) commented:
First off, I'll give my opinion on why a server's first DNS entry should not be itself. Personally, I use one DC as my main DC, where I make all changes, including DNS additions. This main DC is the primary DNS server for all my other DCs, which lets them pick up that information faster than if they used themselves. That said, you never know when a computer will switch to using the secondary, so it's not really a foolproof method. There was a time when I had to point all DCs' DNS at a single DC with no secondary so that Netlogon would register the DC DNS records correctly, and it was nice and easy to just blank out the secondary. Also, some commands like nslookup use the primary DNS entry by default, even when the server has switched over to the secondary because the primary was unavailable. There could be other commands or services that do the same.

This brings me to my question; perhaps you have the same issue. I was working with a domain that was originally Win2k (I didn't know this, but it's the only explanation) and it was missing the top-level _msdcs zone. Here is what I did to fix it

mcburn13 (Author) commented:
I changed all my DNS servers to point to others first (unless they were in a site by themselves), enabled IPv6 and rebuilt the _msdcs zone (did the same as in that article). We now only see the error once, and it is WAY after hours. I am 99.99999% certain it is because Veeam is running a backup of the VM (takes a snapshot) and the PDC becomes unavailable for a few seconds to minutes. In addition to moving the backup window later, we staggered the DCs so no two are backing up at the same time. I still would love for Microsoft to BUCK UP and figure out how to give admins errors that actually mean something instead of "". Perhaps something like "there was an issue with SERVERA communicating with the PDC Emulator, THAT's why we threw this error". Just saying...
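
In case the ordering rule is useful to anyone, here it is as a small Python sketch. This is only an illustration: the function name and IP addresses are made up, the cap of three entries is just what we happened to use, and I'm assuming a DC that is alone in its site keeps itself first.

```python
def dns_client_order(dc_ip, same_site_dcs, remote_dcs, max_entries=3):
    """Suggest a DNS client order for one DC: partner first, then self."""
    partners = [ip for ip in same_site_dcs if ip != dc_ip]
    if partners:
        # A same-site partner goes first, then the DC's own IP.
        candidates = [partners[0], dc_ip] + partners[1:] + list(remote_dcs)
    else:
        # Assumption: a DC alone in its site points to itself first.
        candidates = [dc_ip] + list(remote_dcs)
    order = []
    for ip in candidates:
        if ip not in order:  # drop duplicates, keep first occurrence
            order.append(ip)
    return order[:max_entries]


# Hypothetical addresses for illustration only.
print(dns_client_order("10.0.1.10", ["10.0.1.10", "10.0.1.11"], ["10.0.2.10"]))
print(dns_client_order("10.0.9.10", ["10.0.9.10"], ["10.0.1.10", "10.0.1.11"]))
```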
mcburn13 (Author) commented:
Nothing else offered that helped troubleshoot.