Solved

Daily DNS failure resolved by stopping and restarting NETLOGON

Posted on 2008-06-26
Last Modified: 2008-07-03
Each day I have a server that can't contact the GCs (error 1355), fails the DCDIAG tests CONNECTIVITY and FSMOCHECK, and reports "Skipping all tests because server GE-PM is not responding to directory service requests".

I found a previous EE article that pointed out some DNS IP config issues with the Primary and Secondary DNS settings.  I resolved this by ensuring all DNS role servers had ONLY their own IP address in the DNS settings.  The server in question IS a DNS role server.

This server is also an application and DB server for an SQL environment.  The above issue appears to be connected to an error in the SQL environment that produces the message "not enough memory to complete the operation."  There is over 60 GB of drive space on all of my servers, so it is NOT a hard drive issue.

I believe the DNS failure is preventing access to the SYSVOL share(s), so user authentication is failing and causing the above app environment error.

My problem is that I can't clear the instability in DNS and keep it running.  None of my other DNS servers seem to be having any trouble.  I am getting bored with having to stop and restart the NETLOGON service each morning...

Thanks in advance for any suggestions!!

Question by:Firedart
17 Comments
 
LVL 6

Expert Comment

by:rehanahmeds
ID: 21875807
Until you find a solution for this, you can write a batch file that restarts the Netlogon service every morning.
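
As a stopgap, the restart could be scripted roughly as follows. This is only a sketch in Python rather than a literal batch file; it assumes the script runs elevated on the affected DC and is launched by a scheduled task, and the function names are made up for illustration. It shells out to the standard `net stop` / `net start` commands.

```python
import subprocess

SERVICE = "netlogon"  # service name as used with "net stop" / "net start"


def restart_commands(service=SERVICE):
    """Return the two commands a batch file would run, in order."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_service(service=SERVICE, run=subprocess.run):
    """Stop then start the service; report whether the start succeeded.

    A failed stop is ignored on purpose, because the service may
    already be stopped when the script runs.
    """
    stop_cmd, start_cmd = restart_commands(service)
    run(stop_cmd)
    return run(start_cmd).returncode == 0
```

Calling `restart_service()` from a task scheduled for early morning would mimic the manual workaround. It does nothing about the underlying cause.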

We were having an issue with SYSVOL as well, and in the end found out that SYSVOL had somehow lost its share; it wasn't shared anymore.
 
LVL 6

Expert Comment

by:rehanahmeds
ID: 21875822
Are you running SP2?
 
LVL 6

Expert Comment

by:rehanahmeds
ID: 21875874
One of the Microsoft support links points to the Windows Time service having stopped:

http://support.microsoft.com/kb/272686
 
LVL 6

Expert Comment

by:rehanahmeds
ID: 21875886
I think this link covers the same problem that happened to our server, when SYSVOL lost its share:

http://support.microsoft.com/kb/283133

Before restarting the Netlogon service, check whether SYSVOL is shared.
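
One way to make that check repeatable is to look through the output of `net share` for the two shares a DC normally publishes. The sketch below is an assumption-laden example: the helper names are hypothetical, and it simply treats the first word of each output line as a share name, which is crude but enough to spot a missing SYSVOL or NETLOGON entry.

```python
import subprocess

REQUIRED_SHARES = ("SYSVOL", "NETLOGON")


def shared_names(net_share_output):
    """Collect the first word of every non-empty line, upper-cased.

    Header and footer lines add a few harmless extra words.
    """
    names = set()
    for line in net_share_output.splitlines():
        parts = line.split()
        if parts:
            names.add(parts[0].upper())
    return names


def missing_shares(net_share_output, required=REQUIRED_SHARES):
    """Return the required share names that do not appear in the output."""
    present = shared_names(net_share_output)
    return [name for name in required if name.upper() not in present]


def check_local_shares():
    """Run "net share" on this machine and report missing DC shares."""
    result = subprocess.run(["net", "share"], capture_output=True, text=True)
    return missing_shares(result.stdout)
```

An empty list from `check_local_shares()` would suggest both shares are still present, so the share is probably not what is failing each morning.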
 
LVL 4

Expert Comment

by:antioed
ID: 21876424
The last link from rehanahmeds looks promising if all your DC-related DNS entries are definitely correct.  You should double-check all records on all DNS servers for DC entries that are not correct.  Check replication in AD Sites and Services...make sure all the partners are able to replicate and verify FSMO roles.  So DNS, Time, replication and verifying SYSVOL/NETLOGON shares are the steps I would be taking.

This link has some helpful information for problems like this:

http://support.microsoft.com/kb/305476/
 
LVL 38

Expert Comment

by:ChiefIT
ID: 21881299
Restarting the Netlogon service registers the SRV records in DNS. So your SRV records are going missing every morning. Let me see if I can find what might cause the SRV records to disappear. I have seen it before, but don't remember the fix.
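
To see whether the records really are gone before Netlogon is restarted, the usual DC locator names can be queried with `nslookup`. The sketch below only builds the names and commands. The domain `example.local` is a placeholder, the helper names are hypothetical, and the `gc` name assumes a single-domain forest where the forest root equals the domain.

```python
import subprocess


def dc_locator_srv_names(domain):
    """Return common SRV names that Netlogon registers for a DC."""
    domain = domain.strip().rstrip(".")
    return [
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.pdc._msdcs.{domain}",
        f"_ldap._tcp.gc._msdcs.{domain}",  # forest root assumed == domain
    ]


def nslookup_command(name, server=None):
    """Build an nslookup command line for one SRV name."""
    cmd = ["nslookup", "-type=SRV", name]
    if server:
        cmd.append(server)  # optional DNS server to ask
    return cmd


def query_srv(name, server=None):
    """Run nslookup and return its raw text output for manual review."""
    result = subprocess.run(nslookup_command(name, server),
                            capture_output=True, text=True)
    return result.stdout
```

Running `query_srv(n)` for each name in `dc_locator_srv_names("example.local")` in the morning, then again after the Netlogon restart, would show which records reappear.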
 
LVL 38

Expert Comment

by:ChiefIT
ID: 21881759
It sounds like you may be scavenging SRV records before they are refreshed. Since SRV records are refreshed within, I think, 24 hours, you would have to have your scavenging intervals set to 0 and 0.
 
LVL 38

Expert Comment

by:ChiefIT
ID: 21881794
So, a more likely scenario is a single-label domain name:
http://support.microsoft.com/kb/300684

Note that single-label domain names have problems registering and deregistering DNS records.
 

Author Comment

by:Firedart
ID: 21899305
ChiefIT,
   Thanks for the suggestions, but my domain name uses a normal TLD and is fully registered with Network Solutions...
 

Author Comment

by:Firedart
ID: 21899347
rehanahmeds,
   Wow, lots of ideas from you...
   First, we are still on SP1.
   The SYSVOL share does NOT appear to be losing its share.
   I have not seen the Time service stopped.  I will look for this.

My apologies to all the responders.  My SPAM filter was blocking traffic from this site.  It is unblocked now.

 
LVL 38

Expert Comment

by:ChiefIT
ID: 21901793
SP1 has a bug in it. You might experience things like DNS hanging. Intermittent comms on an SP1 machine are usually a result of this bug. Let me see if I can find the hotfix. Yep, here it is.

Windows Server 2003 Service Pack 1 has a discrepancy that can even cause a single NIC to be flooded.
http://support.microsoft.com/default.aspx?scid=kb;en-us;898060

http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23306595.html

The funny thing is SP2 has issues as well. (Well, I guess it isn't so funny.) I think this is what rehanahmeds was leading towards:
http://www.lan-2-wan.com/2003-SP2.htm

I do think you have aggressive scavenging or replication problems. Either can remove your SRV records. What is your scavenging set to in your DNS configuration?
 

Author Comment

by:Firedart
ID: 21903379
I have just set Scavenging to AUTOMATIC on a 7 day cycle.  Prior to that it was set to whatever the defaults are when the AUTO box is NOT marked.

Is this a good change to make??

The problem has somehow gotten worse in that the NET STOP/START of the NETLOGON service is no longer resolving the issue.  A full restart is now the only thing that solves the problem.  This was noticed THIS MORNING, BEFORE the AUTO 7 day scavenging setting was added/changed.

Thanks!
 
LVL 38

Expert Comment

by:ChiefIT
ID: 21904894
If, as you say, scavenging is set to the default, then there is a 7 day "no-refresh" period followed by a 7 day "refresh" period, so scavenging will be done on the 15th day.

I set the scavenging ages to 3 and 5. That gives a 9 day turnaround, one day longer than the 8 day DHCP lease. When doing this, you should check your DHCP lease duration too. An eight day lease duration keeps a pretty tidy DHCP address pool.

Settings like the above are not an overly aggressive scavenging policy and should work if this was your issue in the first place.
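
The arithmetic behind those numbers can be written down as a small worked example. It follows the counting used in this comment, where a record becomes a scavenging candidate on the day after the no-refresh and refresh intervals have both elapsed. The function names are illustrative only, and the real removal time also depends on when the server's scavenging pass runs.

```python
def first_scavenge_day(no_refresh_days, refresh_days):
    """Day number on which a never-refreshed record can first be scavenged."""
    return no_refresh_days + refresh_days + 1


def outlasts_lease(no_refresh_days, refresh_days, lease_days):
    """True if the record survives at least one full DHCP lease."""
    return first_scavenge_day(no_refresh_days, refresh_days) > lease_days


print(first_scavenge_day(7, 7))   # default intervals: 15th day
print(first_scavenge_day(3, 5))   # suggested intervals: 9th day
print(outlasts_lease(3, 5, 8))    # compared with an 8 day lease
```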



 
LVL 38

Expert Comment

by:ChiefIT
ID: 21904921
Since scavenging was not the issue, let's go another route:
Is this a multi-homed DC? Multi-homed means a domain controller with multiple IP addresses. That could mean a DC with one NIC and multiple IPs on it, or one with two or more NICs.

 
LVL 38

Expert Comment

by:ChiefIT
ID: 21904938
Oh, I looked up error 1355 to see what it reads:
http://support.ipswitch.com/kb/WP-20051107-DM01.htm

If you are using NTLMhash authentication, it will drop you out of Kerberos and crash the system. If these are the types of errors you are seeing, I can point you in the right direction for getting onto the right authentication protocol.
 

Author Comment

by:Firedart
ID: 21907904
ChiefIT,
    The DC is NOT multi-homed in either sense.  One NIC with one IP assigned.

    The NTLMhash auth might be the issue, as I am getting some KCC/Kerberos errors.

    I "inherited" this setup as pre-existing when I took the IT Manager job at this clinic.  I have found numerous setups on the various servers that are less than ideal and fall short of best practice.  If you can offer any suggestions on where to isolate and resolve this issue, I would greatly appreciate it!

Thanks!
 
LVL 38

Accepted Solution

by:
ChiefIT earned 500 total points
ID: 21909863
I know all too well about the word "inherited". So, some best practices were not followed, and you are building the setup up to them.

So, let's get started.
We have to get the KCC up. This is a prerequisite service to Netlogon. I do see a number of errors that may or may not be correlated. KCC could be the root of the problem. You may have replaced an NT4 DC with a 2003 server. If this is the case, the 2003 server would have needed to be compatible with NTLMhash for the duration of the NT server's existence. So would the clients. It sounds like you have a VLAN site (we'll call it site 2) that is trying to use NTLM, while the main site, the meat of your LAN with the PDCe (we'll call it site 1), will only use Kerberos. So, as time goes by, site 1 locks out site 2 more frequently.

NTLMhash is a pre-Windows ME protocol. It is used with old NT boxes. That doesn't mean that some post-Windows ME computers are not trying to authenticate via NTLMhash. If your servers at site 1 are set NOT to respond to NTLMhash authentication, you will not be able to build a trust relationship. If so, this would shut down the KCC and therefore knock down the Netlogon service. See the correlation?
There are three comments at this link to help you out:

http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Windows_2003_Active_Directory/Q_23132123.html

1) The first comment is to get the local machines to abide by Kerberos. You will want to do this first so you don't leave the client and member server machines on NTLMhash.

2) The second comment is to fix the server. It is a registry hack, and I HIGHLY, HIGHLY RECOMMEND you create a system restore point and a registry backup prior to completing this task.

3) The third comment is the solution of the article and explains the three authentication protocols.

Once we get the KCC up, Netlogon may follow.

Computers trying to use NTLMhash is not a common error. So, I would like to explore the possibility of a Journal Wrap with you. See the next section.
______________________________________________________________________________
Another thing we need to check is whether you are in a Journal Wrap situation. Journal Wrap is a problem with file replication. It occurs in one of two instances. What happens is this: you are trying to replicate your data out to another DC. The DC replication process is slow and only gets a portion of the file replication. So, services shut down and you are stuck with a partial replication. This can shut down Netlogon as well. If you notice, there are many LSA errors when this happens. You may be able to fix this without the registry hack called the "BurFlags" method.
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23387073.html

I found the steps in this article work, in most cases, to resolve a journal wrap.
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23407701.html

If you have to use the BurFlags method, this post may help you sort this out.
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23404565.html

This error would progressively worsen, shutting down the Netlogon service sooner as your DNS records are straightened out. This explains why the troublesome errors are happening at a faster rate. I think this is your error.
DCdiag reports and event log errors are helpful in tracking down these issues.
____________________________________________________________________________

I have plenty more possibilities. With the errors that you provided above, I am leaning towards a Journal Wrap. To further our troubleshooting, we could use copied and pasted DCdiag errors and event logs that may pinpoint the issues in greater detail.
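
To make the DCdiag output easier to paste and compare from day to day, the failed tests can be pulled out of a saved report. This is a rough sketch that assumes the English summary lines look like a run of dots followed by "SERVER passed test Name" or "SERVER failed test Name". That line format and the helper name are assumptions, so check them against your own output.

```python
import re

# Assumed summary line shape: "......... GE-PM failed test Connectivity"
RESULT_LINE = re.compile(
    r"\.+\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)", re.IGNORECASE
)


def failed_tests(dcdiag_output):
    """Return (target, test_name) pairs for every failed test line."""
    failures = []
    for line in dcdiag_output.splitlines():
        match = RESULT_LINE.search(line)
        if match and match.group(2).lower() == "failed":
            failures.append((match.group(1), match.group(3)))
    return failures
```

Feeding it the text of a report saved each morning would give a short list, such as the Connectivity and FsmoCheck failures mentioned in the question, that can be posted along with the matching event log entries.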