osaadministrator

Domain Replication Errors and computer objects not showing up on all DC's

Hello, I am seeing many event ID 1864 entries in the Directory Service event log and event ID 675 entries in the Security event log.  I believe these issues are related, and that I have outdated domain controllers in my network.  After our domain admin left a while back, the passwords were changed. I think there was a replication issue before that, so the password change did not replicate to the other DCs, and this has snowballed.  I went digging today because I discovered event ID 5723 Netlogon errors for computers that exist in one site (and on that site's domain controller) but not on the other DCs.  That particular computer was recently formatted and added to the domain but only appears on that local DC.  We have 6 office locations around the country, each with 1 domain controller.  4 of the offices are connected via an MPLS network and 2 of the offices are connected via VPN devices.  All sites can ping each other and share files and printers.  All servers are running Server 2003 R2, either Std or Ent edition.

Here are some of the event entries:
This is the replication status for the following directory partition on the local domain controller.
Directory partition:
DC=ForestDnsZones,DC=MYDOMAIN,DC=com
The local domain controller has not recently received replication information from a number of domain controllers.   The count of domain controllers is shown, divided into the following intervals.
More than 24 hours:
1
More than a week:
1
More than one month:
1
More than two months:
1
More than a tombstone lifetime:
1
Tombstone lifetime (days):
60
 Domain controllers that do not replicate in a timely manner may encounter errors. It may miss password changes and be unable to authenticate. A DC that has not replicated in a tombstone lifetime may have missed the deletion of some objects, and may be automatically blocked from future replication until it is reconciled.

It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.
The reason that replication is not allowed to continue is that the two machine's views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.
Time of last successful replication:
2008-03-27 23:29:53
Invocation ID of source:
036cf820-f810-036c-0100-000000000000
Name of source:
3e20995e-48b1-4c31-bbe9-37cead2c6a7b._msdcs.osacorp.com
Tombstone lifetime (days):
60
 The replication operation has failed.

What is the best way to troubleshoot this and correct the issue?  Please note I am the sole administrator in one location for all 6 locations.  Any suggestions that can be done remotely are greatly appreciated.

Thank you in advance!
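
For reference, the date arithmetic behind the second event above can be sketched in a few lines of Python. This is only an illustration, assuming the block is triggered purely by the elapsed time since the last successful replication; the timestamp and the 60-day lifetime are copied from the event text, and the function name is made up for this example.

```python
from datetime import datetime, timedelta

# Values copied from the event log entry above.
last_success = datetime(2008, 3, 27, 23, 29, 53)
tombstone_lifetime = timedelta(days=60)

# Assumption: replication from this source is refused once the gap since
# the last successful replication exceeds the tombstone lifetime.
cutoff = last_success + tombstone_lifetime

def is_past_tombstone(now, last=last_success, lifetime=tombstone_lifetime):
    """Return True if the time since the last replication exceeds the lifetime."""
    return now - last > lifetime

print(cutoff)  # 2008-05-26 23:29:53
```

So any replication attempt from that source after late May 2008 would be expected to log this error.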
Darius Ghassem

Check this MS article out. Did you have a DC that was formatted but not restored? Also, run netdiag and dcdiag.

http://support.microsoft.com/?kbid=216498

From a command prompt, try running netdiag /fix; this could
repopulate the DNS records for your DC.

If you don't have the support tools installed, install them from your server
install disk.
d:\support\tools\setup.exe

Run dcdiag, netdiag and repadmin in verbose mode.
-> DCDIAG /V /C /D /E /s:yourdcname > c:\dcdiag.log
-> netdiag.exe /v > c:\netdiag.log (On each dc)
-> repadmin.exe /showrepl dc* /verbose /all /intersite > c:\repl.txt

**Note: Using the /E switch in dcdiag will run diagnostics against ALL DCs
in the forest. If you have a large number of DCs, this test could
generate significant detail and take a long time. Slow links to DCs
will also add to the testing time.

If you download a gui script I wrote it should be simple to set and run
(DCDiag and NetDiag). It also has the option to run individual tests
without having to learn all the switch options. The details will be output
in notepad text files that pop up automagically.

The script is located in the download section on my website at
http://www.pbbergs.com/windows/downloads.htm#DCDIAG

Just select both dcdiag and netdiag, and make sure verbose is set. (Leave the
default settings for dcdiag as set when selected.)

When complete search for fail, error and warning messages.
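
If the logs are long, the keyword search can be scripted. The following is a minimal Python sketch, assuming the logs were redirected to the paths used in the commands above and that Python is available on the machine where you review them. The function names are made up for this example, and a plain keyword match will also pick up harmless lines that merely mention one of the words.

```python
import re
from pathlib import Path

# Keywords from the advice above, matched case-insensitively.
PATTERN = re.compile(r"fail|error|warning", re.IGNORECASE)

def find_problem_lines(text):
    """Return (line_number, line) pairs for lines that mention a keyword."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits

def scan_log(path):
    """Read one log file and return its matching lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    text = Path(path).read_text(errors="replace")
    return find_problem_lines(text)

if __name__ == "__main__":
    # These paths match the redirect targets in the commands above.
    for log in (r"c:\dcdiag.log", r"c:\netdiag.log", r"c:\repl.txt"):
        if Path(log).exists():
            for number, line in scan_log(log):
                print(f"{log}:{number}: {line}")
```

Each hit is printed with its file name and line number so you can jump to the surrounding test output.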

osaadministrator

ASKER

dariusq: nice script, thank you for that.  It appears (from what I can read) that my issue is that domain controllers are showing up as tombstoned/retired invocations.  It is weird that user passwords are replicating but some machines are not.  To answer your question: no, no DCs were formatted but not restored.  I also looked at Sites and Services and found that not all of my DCs are listed in each other's NTDS Settings.  My primary DC (the first domain controller, with all 5 FSMO roles) is only replicating with one other DC and not with all of the DCs.  From your script it looks like that is because of the latency/retired status.  I'll paste a couple of items that are pointing me this way; please let me know if I am wrong, as I have only basic skills at this particular task.

Testing server: SITE1\DC1
      Starting test: Replications
         * Replications Check
         * Replication Latency Check
            DC=ForestDnsZones,DC=MYDOMAIN,DC=com
               Latency information for 4 entries in the vector were ignored.
                  4 were retired Invocations.  0 were either: read-only replicas and are not verifiably latent, or dc's no longer replicating this nc.  0 had no latency information (Win2K DC).  
            DC=DomainDnsZones,DC=MYDOMAIN,DC=com
               Latency information for 4 entries in the vector were ignored.
                  4 were retired Invocations.  0 were either: read-only replicas and are not verifiably latent, or dc's no longer replicating this nc.  0 had no latency information (Win2K DC).  
            CN=Schema,CN=Configuration,DC=MYDOMAIN,DC=com
               Latency information for 12 entries in the vector were ignored.
                  12 were retired Invocations.  0 were either: read-only replicas and are not verifiably latent, or dc's no longer replicating this nc.  0 had no latency information (Win2K DC).  
            CN=Configuration,DC=MYDOMAIN,DC=com
               Latency information for 12 entries in the vector were ignored.
                  12 were retired Invocations.  0 were either: read-only replicas and are not verifiably latent, or dc's no longer replicating this nc.  0 had no latency information (Win2K DC).  
            DC=MYDOMAIN,DC=com
               Latency information for 12 entries in the vector were ignored.
                  12 were retired Invocations.  0 were either: read-only replicas and are not verifiably latent, or dc's no longer replicating this nc.  0 had no latency information (Win2K DC).  
         * Replication Site Latency Check
         Site

Please let me know if you need other information from the log file.  I have obviously renamed the domain to MYDOMAIN and Site\Server to Site1\DC1 for posting.

Thank you again in advance for your help.
If you have more than one DC in a domain, then you should move the infrastructure master FSMO role onto a DC that isn't a global catalog. Also, if you can, please post the whole report. I would check to make sure all of your DCs' IP settings are correct and are pointing to the correct srv IP addresses. Make sure you run netdiag /fix after any changes. Did you restore the DC with the same IP, WINS, DNS, name, etc.?

http://support.microsoft.com/kb/260575
Re-running the test now to post for you.  Something in your script must have triggered something, as the replication information in Sites and Services has changed and those two missing client computers now show up on all DCs in Users and Computers.  All DCs are replicating through one DC now (which isn't the FSMO DC).  This particular DC actually travels around to temporary office locations when needed.  It is currently here, physically in the same building as the primary, but only connected through a VPN device, not on the same subnet, so it is maintaining its normal connectivity.  However, what will happen to replication when I shut it down and ship it?  Also, I believe all servers have GC.  I believe this was set up because of the cross-country distance between offices, so that local login would be faster and all information is local instead of traversing the WAN.

As for your question on restoring a DC with same IP, etc....  I haven't restored anything so I am not sure if I miscommunicated something earlier.  I have not done anything with the DC's to this point.
Did you do this yet?

http://support.microsoft.com/kb/260575

This was your comment above so I might have misunderstood.

That particular computer was recently formatted and added to the domain but only appears on that local DC
dariusq: I have posted the txt file in this message.  As before, I did a little cleansing of server names and domain.  I did not do the KB article you posted above.  The laptop in question was formatted, XP Pro installed (it previously had XP Home and was not a member of the domain), and added to the domain, but it only appeared in the local DC's computers list.  After I ran your script earlier, it now appears in all of them???  The user hasn't experienced any issues using his laptop or logging in; this is all on the server side, as it appeared I had a replication issue.  Sorry for the confusion.  Hope the text file helps figure this mess out.

Again, thank you.
dcdiag.log
I feel like the changing of the password had something to do with it, but I can't wrap my brain around it. I was just involved with a case in which someone changed the domain admin password, which stopped replication.  Follow the article linked above. Check to make sure the primary can ping the other DCs. I didn't see anything in the log except for a couple of failures in the reverse lookup zone. Look at the bottom of the log to view the failures and check what the IPs are associated with.
This article explains the four stages of a deleted AD object. It's a really good read for someone with Tombstoned objects.
http://support.microsoft.com/kb/248047

You may have to run NTDSUtil to remove DCs that have been tombstoned. This is called a metadata cleanup.
http://www.petri.co.il/delete_failed_dcs_from_ad.htm

I don't know where you are at in these processes.
FRS uses DNS records to replicate with external Domain controllers. So, in bringing up a new domain controller, (which is what you might have to do after removing metadata), you will need to register the domain controller's DNS Host A record and SRV records to itself and then force replicate to the PDCe.
To do this go to the command prompt and type.

IPconfig /flushdns
IPconfig /registerdns
Net Stop netlogon
Net Start netlogon

Then force replicate FROM the PDCe to the other DC:
http://windowsitpro.com/article/articleid/13396/how-do-i-force-replication-between-two-domain-controllers-in-a-site.html
dariusq: One question before I do the KB article: if I can log onto the domain controllers using the domain admin account (the one we changed the password on), shouldn't that indicate that the password change was successful on all DCs?  That is the confusing part.  I do get these errors in the security log:
Event ID 675
Pre-authentication failed:
       User Name:      Administrator
       User ID:            MYDOMAIN\administrator
       Service Name:      krbtgt/MYDOMAIN.COM
       Pre-Authentication Type:      0x2
       Failure Code:      0x18
       Client Address:      127.0.0.1

which of course leads back to the password change, yes?
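
For anyone decoding the event above: the failure code is a Kerberos error number. Below is a small Python lookup sketch with a handful of codes as defined in RFC 4120; the table is deliberately incomplete and the function name is made up for this example. Failure code 0x18 with pre-authentication type 0x2 (encrypted timestamp) generally indicates a wrong password, and a client address of 127.0.0.1 suggests the request came from the DC itself, so a service, scheduled task, or saved credential still using the old administrator password is a plausible suspect, though that is an assumption to verify rather than a diagnosis.

```python
# A few Kerberos error codes as defined in RFC 4120; not a complete list.
KERBEROS_ERRORS = {
    0x06: "KDC_ERR_C_PRINCIPAL_UNKNOWN (client not found in Kerberos database)",
    0x12: "KDC_ERR_CLIENT_REVOKED (client credentials have been revoked)",
    0x17: "KDC_ERR_KEY_EXPIRED (password has expired)",
    0x18: "KDC_ERR_PREAUTH_FAILED (pre-authentication information was invalid)",
    0x25: "KRB_AP_ERR_SKEW (clock skew too great)",
}

def describe_failure(code_text):
    """Translate a hex failure code such as '0x18' into a readable name."""
    code = int(code_text, 16)
    return KERBEROS_ERRORS.get(code, f"unlisted code {code_text}")

print(describe_failure("0x18"))
```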
Look this up and make sure the old password isn't listed here.

DHCP->Properties->Advanced Tab->Credentials button next to DNS Dynamic Updates Registration Credentials.

Back to the tombstone problem. What DC is getting the tombstone errors?
dariusq: I checked the DHCP credentials; however, I cannot tell what password is being used since it's hidden, so I just re-entered the current administrator password on all DCs.

As for the tombstone, not sure how to find out which one is causing the errors as all the servers have similar errors logged.  When I try using repadmin to see latencies I get the below response.
C:\Program Files\Support Tools>repadmin /showvector /latency MYDOMAIN
Caching GUIDs.
..
DsReplicaGetInfo() failed with status 8453 (0x2105):
    Replication access was denied.

Now the interesting part is that I ran repadmin /replsummary and it shows no errors on any of the servers (see attached repadmin-replsummary-post.txt).

but of course I still have this error in the event log, last one was 6/30 at 3 am.
This is the replication status for the following directory partition on the local domain controller.
 
Directory partition:
DC=ForestDnsZones,DC=osacorp,DC=com
 
The local domain controller has not recently received replication information from a number of domain controllers.   The count of domain controllers is shown, divided into the following intervals.
 
More than 24 hours:
1
More than a week:
1
More than one month:
1
More than two months:
1
More than a tombstone lifetime:
1
Tombstone lifetime (days):
60
 Domain controllers that do not replicate in a timely manner may encounter errors. It may miss password changes and be unable to authenticate. A DC that has not replicated in a tombstone lifetime may have missed the deletion of some objects, and may be automatically blocked from future replication until it is reconciled.
 
To identify the domain controllers by name, install the support tools included on the installation  CD and run dcdiag.exe.
You can also use the support tool repadmin.exe to display the replication latencies of the domain controllers in the forest.   The command is "repadmin /showvector /latency <partition-dn>".

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


ChiefIT: I'm going to look up that NTDS utility you mentioned tonight and will repost again later.

Thank you!
repadmin-replsummary-post.txt
OSAAdministrator,

Just want to make sure you know: if you run NTDSUtil, make sure you read Chief's prior statement. When you run a metadata cleanup, you might have to re-install. I think you might need to do that anyway, since it has been 60 days since the last replication. How is your AD physically set up? Does your DC connect through a VPN or firewall?
dariusq:

I ran the NTDS utility following the instructions in MS KB article 216498 (http://support.microsoft.com/kb/216498) and looked at each site, and only found the correct DCs by repeating steps 10 thru 12.  No buried bones there that I could find; after looking at each site I simply quit the utility and made no changes.  Is there a command to list servers by domain, in case there is something in the domain container that wasn't in a site?

DCs 1, 2, 3 and 4 are connected via an MPLS service.  DCs 5 and 6 are connected via VPN back to sites 1 and 2.  The primary DC holding the FSMO roles is DC2.  Each DC is physically in a different state (except DC6, which travels around on a temporary basis and is currently with DC2 but connected thru its VPN to maintain that subnet).

What I did notice is that all DCs are replicating in a hub/spoke setup via DC6 (which, as stated above, is the one that moves around).  I am not sure how or why this changed from my DC2, and I thought all DCs should replicate to each other all the time and only WINS should be hub/spoke?  This could just be ignorance on my part, with my little knowledge in this area...:)  What happens when I shut down DC6 to ship it out? Do DCs 1-5 automatically adjust their replication partners?  I guess I could try this today...

I also checked DNS and it shows my 6 DCs, and when I go into Sites and Services, right-click on AD and select Connect to Domain Controller, only my 6 servers show up; no extras there.

I just can't figure out what is tombstoned... and triggering these errors.  I leave for vacation later today thru next week.  If you think of anything today I'll try it.  Otherwise I'll catch up (and hopefully nothing will happen to my network/servers while I'm away).

Thanks again.
Thanks.  I read thru it.  Supposedly all internal traffic is wide open, although I cannot see what's happening at the ISP level on the MPLS.  The VPNs are set to allow all traffic within the tunnel.  It appears to be allowing the traffic, right? Otherwise that replsummary would have shown failures, correct?  Not sure how I can test ports since ping doesn't allow that.  I could try telnetting to everything but don't have the time..
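
On testing ports without telnetting to each one by hand: a short Python sketch like the one below could loop over the usual TCP ports for each DC. It assumes Python is available on an admin workstation at each site; the function names are made up for this example, and the port list covers only the fixed well-known ports. Replication over RPC also uses dynamically assigned high ports, and several of these services use UDP as well, neither of which this sketch checks.

```python
import socket

# Well-known TCP ports a domain controller normally listens on.
# Assumption: dynamic RPC ports and UDP traffic are out of scope here.
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global catalog",
}

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False

def check_dc(host, ports=AD_TCP_PORTS):
    """Map each port number to True/False for one domain controller."""
    return {port: port_open(host, port) for port in ports}
```

Calling check_dc with the name or IP address of each DC, from each site, returns a dictionary of port numbers and whether the connection succeeded. A False result only shows that the TCP connection failed from that location; it does not say whether a firewall, the VPN, or the service itself is the cause.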
When you look in sites and services do you see all of the DC in replication?
When I go into Sites and Services and expand all sites to the NTDS Settings I see each domain controller for that site.  nothing more and nothing less.  When I click on the NTDS Settings of each DC there is only one DC shown on the right pane, that is DC6 (my portable DC).  So DC1 thru DC5 only show DC6 when I click on their NTDS Settings.  When I click on DC6's NTDS settings it shows me DC1 thru DC5.  Hence my description of the hub and spoke set up.  Everything lists as Automatically Generated.
ASKER CERTIFIED SOLUTION
ChiefIT
This solution is only available to members.
ChiefIT: Thank you for the info.  I'm on vacation this week but was finally able to find some Internet access.  I will try to read these when I can or when I get back next week.  As for what you called 'my spoked method' for replication, I believe this to be automatic, as I did not set this up.  I would think that all servers should replicate to all servers in a non-spoked method.  As you suggested, all DCs do have a global catalog.  I agree with you on the stationary partner, and I think I posed the question earlier: if I turn off my DC6 for a few days, will KCC automatically reconfigure the replication?  I will read the articles as soon as I can, and thank you for the continued support.
I haven't heard from you in a while. How goes the battle?
Well, all seems to be running OK on the surface at this point.  I am still getting the same results from your script as in my June 30th post, with the 4 retired invocations and 12 retired invocations in the replication latency check.  Aside from that, there were DNS forwarders that are invalid.

In my Sites and Services, the NTDS settings for each domain controller still have the hub and spoke set up to my traveling DC (which has been stationary for a couple months now).  I was thinking about taking that offline for a few days and see if KCC adjusts the NTDS settings to a different server thinking if it does, then it is working properly and if it doesn't, there is a problem I still need to uncover.  I figured taking it off line would be better than demoting it in case there is another problem.  Agreed?

Yes, I agree.