Solved

Domain goes offline when one Domain Controller reboots?

Posted on 2006-04-23
Last Modified: 2012-05-05
Hello Experts,

     This one's a bit complicated, so I'm going to do my best to break down the issue into small, bite-sized chunks.  If anything isn't clear, I'll be happy to try and clarify.

The Problem:

Our company has 2 domain controllers, Apollo and Artemis.  For the last few months, whenever Apollo is rebooted, the domain goes offline, as if Artemis weren't even there.  Users can't log in, Exchange stops responding to user requests, and so on.

When Apollo comes back up, the Netlogon service is in a paused state, and the network won't start operating properly again until someone logs onto Apollo at the console and restarts the Netlogon service.  This leaves the following event in my event logs:
---------------------------
Event Type:      Error
Event Source:      NTDS General
Event Category:      Service Control
Event ID:      2103
Date:            4/23/2006
Time:            9:24:34 AM
User:            NT AUTHORITY\ANONYMOUS LOGON
Computer:      APOLLO
Description:
The Active Directory database has been restored using an unsupported restoration procedure.
 
Active Directory will be unable to log on users while this condition persists. As a result, the Net Logon service has paused.
 ----------------------------

After a little digging on this and other websites, I found that this can be caused by MS DTC not properly recognizing a DC promotion/demotion, and I found event logs to support that theory, and following the suggestions in this post (http://www.experts-exchange.com/Operating_Systems/Windows_Server_2003/Q_21589620.html?query=MS+DTC+could+not+correctly+process+a+DC+Promotion%2FDemotion+event.+MS+DTC+will+continue+to+function+and+will+use+the+existing+security+settings.&clearTAFilter=true) I managed to clear that error.

There *had* been an issue where Artemis had to undergo a forced demotion (both servers thought they were the PDC emulator), and I'm assuming the MS DTC errors were a leftover fragment of that.

After making the registry changes listed in the link above, I rebooted Apollo, only to find that the domain still "dies" when Apollo is rebooted (apparently when the Netlogon service stops, since restarting that service seems to repair everything).

So now I turn to the experts.  I've not been able to come up with any answers on how to repair this issue, so I'm hoping someone else has seen this or something like it.

Thanks in advance, everyone.
Question by:wolfstar76
19 Comments
 
LVL 48

Accepted Solution

by:
Jay_Jay70 earned 2000 total points
ID: 16519115
Hi wolfstar76,

can you run dcdiag for me and let me know what tests fail? also try running it with the /fix switch before posting

this link may also be helpful as the errors may well have come from the previous FSMO issues you had
http://support.microsoft.com/?kbid=875495
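
For anyone who wants to pull just the failing tests out of a long dcdiag run before posting, here is a minimal Python sketch. It assumes the summary lines look like "......................... SERVER failed test Name"; the function name and the sample text (server "DC1") are made up for illustration and are not part of dcdiag itself.

```python
import re

# Matches dcdiag summary lines such as:
#   "......................... DC1 failed test Services"
RESULT_LINE = re.compile(r"\.{5,}\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)")


def failed_tests(dcdiag_output: str) -> list[tuple[str, str]]:
    """Return (target, test_name) pairs for every test reported as failed."""
    return [
        (match.group(1), match.group(3))
        for match in RESULT_LINE.finditer(dcdiag_output)
        if match.group(2) == "failed"
    ]


if __name__ == "__main__":
    # Hypothetical sample text, shaped like real dcdiag summary lines.
    sample = (
        "      Starting test: Services\n"
        "         ......................... DC1 failed test Services\n"
        "      Starting test: kccevent\n"
        "         ......................... DC1 passed test kccevent\n"
    )
    for target, test in failed_tests(sample):
        print(f"{target}: {test}")
```

Saving the output with something like `dcdiag > dcdiag.txt` and passing the file contents to `failed_tests` should give a short list that is easier to paste into a reply than the full log.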

Cheers!
 

Author Comment

by:wolfstar76
ID: 16519138
Last week I took the time to clean up my dcdiag problems (since some of those were, no doubt, adding to the problem).  I repaired a handful of KCC issues (also thanks, in no small part, to the MS DTC repairs).

Here's a fresh dcdiag (or, the errors, at least) for completeness.

-------------------------------------------------------
C:\Documents and Settings\dsnyder>dcdiag /fix

Domain Controller Diagnosis


      Starting test: Services
            w32time Service is stopped on [APOLLO]
         ......................... APOLLO failed test Services
     
      Starting test: frsevent
         There are warning or error events within the last 24 hours after the
         SYSVOL has been shared.  Failing SYSVOL replication problems may cause
         Group Policy problems.
         ......................... APOLLO failed test frsevent
      Starting test: kccevent
         ......................... APOLLO passed test kccevent
      Starting test: systemlog
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   08:54:52
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   08:54:53
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   08:54:54
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   08:54:55
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   08:54:56
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0xC0002715
            Time Generated: 04/23/2006   08:56:18
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0xC0001B77
            Time Generated: 04/23/2006   08:59:23
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x40011006
            Time Generated: 04/23/2006   09:08:52
            Event String: The connection was aborted by the remote WINS.   <------found and fixed earlier this morning, was an old WINS server in the list.
         An Error Event occured.  EventID: 0xC25A002E
            Time Generated: 04/23/2006   09:25:08
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0xC0001B6E
            Time Generated: 04/23/2006   09:25:34
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0xC0001B6F
            Time Generated: 04/23/2006   09:25:34
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0xC0001B6E
            Time Generated: 04/23/2006   09:27:02
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   09:51:29
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   09:51:29
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   09:51:30
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   09:51:31
            (Event String could not be retrieved)
         An Error Event occured.  EventID: 0x00000457
            Time Generated: 04/23/2006   09:51:32
            (Event String could not be retrieved)
         ......................... APOLLO failed test systemlog
     
      Starting test: FsmoCheck
         Warning: DcGetDcName(TIME_SERVER) call failed, error 1355
         A Time Server could not be located.
         The server holding the PDC role is down.
         Warning: DcGetDcName(GOOD_TIME_SERVER_PREFERRED) call failed, error 1355
         A Good Time Server could not be located.
         ......................... faysharpe.net failed test FsmoCheck
-------------------------------------
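
Since dcdiag could not retrieve the event strings, the hex values above are hard to look up. As a rough aid, the Python sketch below splits each 32-bit identifier into its parts, on the assumption that they follow the usual Windows message ID layout (top two bits severity, bits 16-27 facility, low 16 bits the event ID that Event Viewer displays). The helper name is invented for this example, and the severity bits come from the message ID itself, so they will not always match the event type that was logged.

```python
SEVERITY_NAMES = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_event_id(raw: int) -> dict:
    """Split a 32-bit event identifier into severity, facility and code."""
    return {
        "severity": SEVERITY_NAMES[(raw >> 30) & 0x3],  # top two bits
        "facility": (raw >> 16) & 0xFFF,                # bits 16-27
        "code": raw & 0xFFFF,                           # low 16 bits
    }


if __name__ == "__main__":
    # Distinct raw values copied from the systemlog test output above.
    for raw in (0x00000457, 0xC0002715, 0xC0001B77, 0x40011006,
                0xC25A002E, 0xC0001B6E, 0xC0001B6F):
        info = decode_event_id(raw)
        print(f"0x{raw:08X} -> event {info['code']} ({info['severity']})")
```

For instance, 0xC0001B77 comes out as event 7031 and 0x00000457 as event 1111, which can then be searched for in the System log on Apollo.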

Wow. . . several interesting tidbits here, glad you asked for that.  

There's a WINS error that's been corrected as of this morning (old server was still listed, it's been removed)

The w32time errors are interesting because I manually (re)set those just last week when I was working on this issue before.

I'm also surprised to see so many KCC errors, as, again, I had those all cleared up as of last week.  When I checked the replication topology in AD Sites and Services last week, there was only one connection between the servers, today there are still two.  (Apollo -> Artemis, and the reverse - both auto generated).

Anyhow, there's my results, I'm going to peruse the link you provided while you digest this reply.  :)
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16519153
ah k biggest problem we are looking at here is the FSMO errors which will be relating back to the DCs both holding the PDC emulator role

how did you resolve this? Did you seize a role or transfer it? can you confirm which DC holds the PDC emulator role at the moment, and for that matter, all the roles?

 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16519155
or i could be going a bit overkill ^^^^

check that the time service is enabled and started....
http://support.microsoft.com/default.aspx?kbid=272686
 

Author Comment

by:wolfstar76
ID: 16519176
At the time, since both servers believed themselves to be PDCE, I was unable to transfer or seize the role on either box (can't seize a role you already have, after all.  :-/)  To resolve that, I had to forcibly demote Artemis (dcpromo /forceremoval) - Artemis was chosen because it should *not* have been PDCE, and because it had the most outdated AD records.

I vaguely recall speaking with Microsoft Support about other issues around that time (come to think of it, I think we spoke with them about the domain disappearing when we demoted Artemis), and they agreed that a forced demotion was the best (only?) way to repair two DCs believing they are PDCE.

Looking over the link you gave me, the only entry I see that's relevant is the Event ID I posted above, and it would seem that the article is suggesting I transfer all my FSMO roles to Artemis, and then demote/promote Apollo.

I don't mind admitting, however, that that option frightens me, since Apollo being offline takes my domain with it.  I'd hate to demote Apollo and then have no "visible" domain.

FWIW - I tested making a minor edit to my user object to see if the DCs are at least replicating, and yes, they appear to be doing so just fine.  (I changed my Company Name (under Organization) to a bogus company on Apollo and saw that replicate to Artemis.  I then fixed it on Artemis and saw it replicate back to Apollo)
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16519195
see the good and lovely people at microsoft obviously didn't tell you the catch with forcibly removing the DC... if you do that then the roles don't get transferred. as you would know, when you seize a role back, you are supposed to format the machine or use a procedure like this one.....

http://www.petri.co.il/delete_failed_dcs_from_ad.htm

this uses the ntdsutil to clean your metadata and all traces of the failed DC in AD (except for sites and services, you still have to do that manually - nice bug huh!)

was your time service started?
 

Author Comment

by:wolfstar76
ID: 16519210
I believe I did the metadata cleanup, as I've run through those steps in the past (but then, my network has a long and sordid history of domain controllers giving up the ghost, so I could be recalling that process for a machine that has been out of service two years or more - funny how it all blurs together after a while, no?)

The time service, as you saw in the dcdiag, is not currently running, but I can start it now if you like.  I'm kinda keeping "hands-off" for this issue right now, because I'm tired of trying things that don't work.  ;)  
 

Author Comment

by:wolfstar76
ID: 16519225
Thinking over what you said about metadata cleanup - should I attempt to demote Artemis, re-run the cleanup and repromote to see if that cleans things up at all?  I'm certainly willing to give that a go.

The ntdsutil only shows the two servers in the site right now (and not, say, two instances of Artemis), however, if the GUID for Artemis isn't the same as what this is looking for, I can see where that would pose issues.
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16519238
what i would be doing is the following,

seeing Apollo is the happy sucker, demote artemis and power him down completely - i am hoping that you can run the normal dcpromo but i am guessing that you will have to once again forceremoval - see how you go

once artemis is demoted run dcdiag again on apollo and confirm the placement of FSMO roles, make sure that Apollo holds them and that everything is ok

run the cleanup on both machines, completely removing AD on artemis and any traces of him from apollo - make sure the time service is set to auto on apollo as well

remove artemis from sites and services and then try repromoting artemis back in

run dcdiag again and see how you go

***** This is providing that everything is ok and fine when artemis is down and apollo is up..... before you do anything, shut down artemis and run dcdiag on apollo, making sure the time service is enabled and that you have cleared your event logs
 

Author Comment

by:wolfstar76
ID: 16519243
My apologies, looks like I missed your post about the time server when I was typing my large reply.

I have just now started that service.
 

Author Comment

by:wolfstar76
ID: 16519252
Agreed, your plan sounds like what I was contemplating here (and yes, with Artemis down, everything appears to be happy here).

I will set to work on demoting artemis straightaway, then do all the cleanup.

I'll keep you posted in 45 minutes or so.

Also - the time service was set to auto, but not started.  Will research that further while working on the above.
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16519254
tis all good :)

I'm about to hit the sack, it's just after midnight here in aus and i have to get up early tomoz, though saying that i will be back in about 7 hours time so if you are getting nowhere just post or send me an email and i will get straight back to you

James
 

Author Comment

by:wolfstar76
ID: 16519422
Enjoy your night's rest.

I was able to peacefully demote Artemis, along with all the FSMO roles "she" was holding.  When I went to run the metadata cleanup, I was pleasantly surprised to see that there were no lingering entries for Artemis to be removed.

I manually removed the listing from Sites and Services, and just for kicks, I pulled Artemis from DNS ownership as well.

The promotion is going through now, and appears to be running smoothly.  Almost too smoothly, frankly, as I'm afraid this will end up not being my fix.

Sure enough, after a reboot of Apollo, same issues.

It's looking like I need to demote and repromote Apollo, but that leaves me filled with much consternation, since Apollo being offline drops my domain offline as well.

Heading home for now, but will keep checking this issue from there for new replies/suggestions.
 
LVL 12

Expert Comment

by:Rant32
ID: 16520958
Can you run these commands on both domain controllers:

Repadmin /showutdvec apollo dc=MyDomain,dc=ads
Repadmin /showutdvec artemis dc=MyDomain,dc=ads

replace "dc=MyDomain,dc=ads" with your domain NC.

These commands display the known USN for both domain controllers. If USNs are not the same for all domain controllers, replication will not occur and serious AD database inconsistencies will occur (or already have occurred, because you've told us replication seems to work).

If you've used any type of drive imaging/restoration software on your domain controllers (presumably Apollo) then read this article very carefully:
http://support.microsoft.com/?kbid=875495
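
To make the comparison concrete, here is a small Python sketch of the check as I read the USN rollback description in the KB article linked above: a DC is suspect when a partner has recorded a higher USN for it than the DC itself currently reports as its highest committed USN. The function name, the data layout and all the numbers are hypothetical; the values would have to be copied by hand from the repadmin output and from each DC's own highest committed USN.

```python
def rollback_suspects(own_highest_usn, partner_views):
    """Flag DCs whose partners have seen a higher USN than the DC now reports.

    own_highest_usn: {dc_name: highest USN the DC reports for itself}
    partner_views:   {observer_dc: {source_dc: USN the observer recorded}}
    Returns a list of (source_dc, observer_dc, seen_usn, own_usn) tuples.
    """
    suspects = []
    for observer, vector in partner_views.items():
        for source, seen_usn in vector.items():
            if source == observer:
                continue  # a DC's entry for itself proves nothing here
            own_usn = own_highest_usn.get(source)
            if own_usn is not None and seen_usn > own_usn:
                suspects.append((source, observer, seen_usn, own_usn))
    return suspects


if __name__ == "__main__":
    # Made-up numbers for illustration only, not taken from this thread.
    own = {"APOLLO": 41000, "ARTEMIS": 52000}
    views = {
        "ARTEMIS": {"APOLLO": 45500},   # ARTEMIS saw APOLLO at a higher USN
        "APOLLO": {"ARTEMIS": 51990},   # normal: slightly behind ARTEMIS
    }
    for source, observer, seen, local in rollback_suspects(own, views):
        print(f"{source}: {observer} recorded USN {seen}, "
              f"but {source} itself is only at {local}")
```

A partner being slightly behind is normal replication lag; only the reverse case, where the partner is ahead of the DC's own counter, points toward a rollback.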
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16521173
yah thats the link i posted above with the USN rollback issue :)

i think from what we have seen so far that you're going to have to transfer the roles to artemis and do your AD reinstall..... can't really see any way around it if it's the USN problem which so far it is....
 

Author Comment

by:wolfstar76
ID: 16577965
Just to keep this topic alive (and unabandoned): due to assorted scheduling issues, I haven't had a chance to rebuild the DC yet.

However, I've set aside some time to do that this weekend, and will follow up with the results next week.
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16581678
aight mate let us know :)
 

Author Comment

by:wolfstar76
ID: 16598259
I broke down and contacted Microsoft on this issue, and they agreed to reopen the previous case (since this is appearing to still be a failed promotion the more I dig into it).

With their help I did, indeed, need to follow the fix for the USN rollback error (demote and repromote the problematic DC).

It also turns out that, for whatever reason, in the last repromotion the NETLOGON and SYSVOL shares weren't correctly implemented, further adding to the complications.

With those added and the problem DC re-promoted things appear to be running a lot smoother today.  Specifically I can reboot either DC and still logon to my domain.  I'm still having issues with Exchange, but I'm going to open an Exchange question to help with that aspect of it if a few things I try today don't repair the problem.

Thank you, Jay, for your help and insight.
 
LVL 48

Expert Comment

by:Jay_Jay70
ID: 16600625
tis a pleasure mate

i am glad all is well


Question has a verified solution.
