Replication to one domain controller not working, others are fine.

Hi all,

A little background will help here.

We have multiple DCs across multiple sites. Site 1 (head office) has two DCs, DCSite1-001 and DCSite1-002. Our branch offices each have a single DC and are connected to head office via a full MPLS network. The site links are great: ping round-trip times are typically under 1 ms.

I have a problem between the main DC at our main site (DCSite1-001) and one of the remote-site DCs (DCSite2-001). The DCs at all the other sites can replicate to and from the main DC; it is only these two machines that have a problem. DCSite2-001 can replicate with DCSite1-002, just not with DCSite1-001. Again, all other servers in other sites replicate fine with DCSite1-001.

I have checked the obvious things, like time sync and updates, and everything looks OK. I have run PortQry, and it hangs when connecting from DCSite2-001 to DCSite1-001 over TCP port 389. The problem I have is that this server can connect to another server in site 1, just not to the main DC. Servers in other sites can connect to the main DC, so the port can't simply be blocked on the server itself.
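For anyone who wants to repeat the PortQry test from a script, here is a minimal Python sketch of the same TCP connect check, offered under stated assumptions. It separates an immediate refusal from a timeout. A timeout matches the "hangs" symptom above and usually suggests that packets are being dropped somewhere along the path rather than refused by the host. Besides LDAP on 389, the sketch probes the standard ports for the RPC endpoint mapper (135), Kerberos (88), SMB (445) and the Global Catalog (3268). The port list, the function names and the 3-second timeout are illustrative choices, not part of the original test. RPC replication also uses a dynamic high port (49152-65535 by default on Windows Server 2008 and later), and this sketch does not test that port.

```python
import socket
from typing import Dict

# Standard TCP ports used by AD DS (LDAP is the one PortQry hung on above).
AD_TCP_PORTS: Dict[str, int] = {
    "LDAP": 389,
    "RPC endpoint mapper": 135,
    "Kerberos": 88,
    "SMB": 445,
    "Global Catalog": 3268,
}


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection to host:port and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"  # host answered with a reset: reachable, nothing listening
    except socket.timeout:
        return "timeout"  # no answer within the timeout, like the PortQry hang
    except OSError:
        return "error"    # anything else: unreachable network, DNS failure and so on


def check_dc(host: str, timeout: float = 3.0) -> Dict[str, str]:
    """Probe every port in AD_TCP_PORTS on host and return {service: outcome}."""
    return {name: probe_tcp(host, port, timeout) for name, port in AD_TCP_PORTS.items()}
```

For example, run `check_dc("DCSite1-001.domainname.local")` from DCSite2-001 and compare the result with the same call made from DCSite1-002. A "timeout" on only one of the two paths would point at something in that path (a firewall or filter) rather than at the listening service.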

Any ideas?
Asked by: Hallidays

Radhakrishnan R (Senior Technical Lead) commented:
Hi,

Are there any DFS Replication errors in the event logs? If so, can you post them here? Also, download the Microsoft tool below and see if you can pinpoint anything about the replication.

http://www.microsoft.com/en-in/download/details.aspx?id=30005
Hallidays (Author) commented:
Hi,

There is a warning rather than an error, but it relates to DC2 in site 1 (DCSite1-002), because that is where DFS replicates to. AD replication is failing to DC1 in site 1 (DCSite1-001). I am running the AD Replication tool now.


The File Replication Service is having trouble enabling replication from DCSite1-002 to DCSite2-001 for c:\windows\sysvol\domain using the DNS name DCSite1-002.domainname.local. FRS will keep retrying.
 Following are some of the reasons you would see this warning.
 
 [1] FRS can not correctly resolve the DNS name  DCSite1-002.domainname.local from this computer.
 [2] FRS is not running on  DCSite1-002.domainname.local
 [3] The topology information in the Active Directory Domain Services for this replica has not yet replicated to all the Domain Controllers.
 
 This event log message will appear once per connection, After the problem is fixed you will see another event log message indicating that the connection has been established.
Hallidays (Author) commented:
Interesting - I ran the AD Replication Tool from DCSite1-001 (the main DC) and checked replication with DCSite2-001 and it says:

Domain Controller "DCSite2-001.domain.local" does not exist or could not be contacted.

However, I can ping the server by both name and IP.

Radhakrishnan R (Senior Technical Lead) commented:
Hi,

Which replication method are you using? The error indicates that it's FRS. Also, it looks to me as though some servers are using DFS and others are using FRS?

If that is the case, I would suggest using the same method at all sites and seeing whether it makes any difference.

If that doesn't work, you may need to reinitialize FRS using the BurFlags registry value, as described in https://support.microsoft.com/en-us/kb/290762/
Hallidays (Author) commented:
Hi, the error is with AD replication rather than file replication. We use DFS for file replication, and that seems to be working fine at the moment; it is just AD replication that isn't.
Will Szymkowski (Senior Solution Architect) commented:
What happens when you run repadmin /replsum? Does it fail?

Do you have connection objects set up between these two DCs? Have you let the KCC (Knowledge Consistency Checker) create these connections for you? They should be created automatically.

Will.
Radhakrishnan R (Senior Technical Lead) commented:
Hi,

Okay. Just to make sure that NTDS.DIT is fine, could you run an integrity check on the DCs and see whether it reports any failures? You can follow this procedure:

Open a command prompt and type: ntdsutil
ntdsutil: files
file maintenance: Integrity

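PS - Sometimes it may ask you to activate an instance first. If so, type "activate instance ntds" before the files command. The database must also be offline for the integrity check, so stop the AD DS service first (net stop ntds) or boot into Directory Services Restore Mode.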

Perform this on all the DCs and make sure that NTDS.DIT is fine on every domain controller.
Hallidays (Author) commented:
Hi Will,

repadmin /replsum shows an error between the two DCs in question:

(1726) The remote procedure call failed

It then says "Experienced the following operational errors trying to retrieve replication information: 1053 - DCSite1-001"

All other servers can replicate fine
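If you want to catch this failure automatically, for example from a scheduled task on DCSite1-001, here is a rough Python sketch. It runs repadmin /replsum and extracts the two kinds of error quoted in this post. The regular expressions are assumptions based only on those two messages, "(1726) The remote procedure call failed" and "1053 - DCSite1-001". repadmin's exact layout may differ between versions, so check the patterns against your own output. The function names are hypothetical.

```python
import re
import subprocess
from typing import List, Tuple

# Assumed patterns, modelled on the two messages quoted in this thread.
REPL_ERROR = re.compile(r"\((\d+)\)\s+(.+?)\s*$")    # "(1726) The remote procedure call failed"
OP_ERROR = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s*$")  # "1053 - DCSite1-001"


def parse_replsum(text: str) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Return (replication_errors, operational_errors) found in repadmin output.

    replication_errors: (code, message) pairs, e.g. (1726, "The remote ...").
    operational_errors: (code, server) pairs, e.g. (1053, "DCSite1-001").
    """
    repl: List[Tuple[int, str]] = []
    ops: List[Tuple[int, str]] = []
    for line in text.splitlines():
        m = REPL_ERROR.search(line)
        if m:
            repl.append((int(m.group(1)), m.group(2)))
            continue
        m = OP_ERROR.match(line)
        if m:
            ops.append((int(m.group(1)), m.group(2)))
    return repl, ops


def run_replsum() -> str:
    """Run repadmin /replsum (Windows with the AD DS tools) and return its stdout."""
    result = subprocess.run(["repadmin", "/replsum"], capture_output=True, text=True, check=False)
    return result.stdout
```

Usage: `repl, ops = parse_replsum(run_replsum())`, then raise an alert if either list is non-empty. The parsing is kept separate from the command so that it can be checked against saved output from a machine without repadmin.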
Hallidays (Author) commented:
Hi Rad,

Integrity check is fine.
Hallidays (Author) commented:
We are still having this issue - it looks like the DC isn't allowing RPC calls from our remote DC. All other servers are fine and the site links are ok, it is just this remote network that is having the issue.
Will Szymkowski (Senior Solution Architect) commented:
Have you tried rebooting the DC in that site? If this is something you cannot correct, you might need to demote this domain controller and re-promote it.

Will.
Hallidays (Author) commented:
Rebooting the DC that holds the operations master (FSMO) roles seems to make this work, but this specific remote DC is still the only one that cannot connect. We have 4 other remote DCs that continue to communicate fine while this one is having an issue. I have demoted it, re-promoted it, and removed it from and re-added it to the domain, but none of this makes any difference.
Hallidays (Author) commented:
Additionally, when the problem recurs (at random intervals), I get the following in the event log:

The Knowledge Consistency Checker (KCC) has detected problems with the following directory partition.
 
Directory partition:
CN=Configuration,DC=Domain,DC=local
 
There is insufficient site connectivity information for the KCC to create a spanning tree replication topology. Or, one or more directory servers with this directory partition are unable to replicate the directory partition information. This is probably due to inaccessible directory servers.
 
User Action
Perform one of the following actions:
- Publish sufficient site connectivity information so that the KCC can determine a route by which this directory partition can reach this site. This is the preferred option.
- Add a Connection object to a directory service that contains the directory partition in this site from a directory service that contains the same directory partition in another site.
 
If neither of the tasks correct this condition, see previous events logged by the KCC that identify the inaccessible directory servers.


All sites are still connected, and everything works fine over IP.
Hallidays (Author) commented:
This is the event that is logged right after the above:

The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.
 
Sites:
CN=Remote-Site,CN=Sites,CN=Configuration,DC=Domain,DC=local
Will Szymkowski (Senior Solution Architect) commented:
The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.

This error is generated because the default site link contains a site that has no network connectivity to some of the other sites in the same link. You need to remove this DC's site from the default site link, create a new site link, and add only the sites that can communicate with this DC.

That will correct the issue you are seeing above.

Will.
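To illustrate the reasoning behind these KCC events, here is a deliberately simplified Python model of site reachability. Sites are nodes, and two sites are adjacent if they share a site link. A site whose DCs are currently unreachable (for example, because RPC to them is failing) is treated as unusable: it can neither be reached nor routed through. This is a sketch built on assumptions, not the KCC's actual algorithm, which also takes link costs, schedules and bridgehead servers into account. Apart from Remote-Site, the site and link names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def unreachable_sites(local_site: str,
                      site_links: Dict[str, Set[str]],
                      down_sites: Iterable[str] = ()) -> Set[str]:
    """Return the sites that cannot be reached from local_site over site links.

    site_links maps a site-link name to the set of sites it contains.
    down_sites are treated as unusable (e.g. all their DCs are unreachable).
    """
    down = set(down_sites)
    all_sites = set().union(*site_links.values()) | {local_site}

    # Two sites are neighbours if at least one site link contains both.
    neighbours: Dict[str, Set[str]] = {s: set() for s in all_sites}
    for members in site_links.values():
        for s in members:
            neighbours[s] |= members - {s}

    # Breadth-first search from the local site, skipping unusable sites.
    seen = {local_site}
    queue = deque([local_site])
    while queue:
        site = queue.popleft()
        for nxt in neighbours[site]:
            if nxt not in seen and nxt not in down:
                seen.add(nxt)
                queue.append(nxt)
    return all_sites - seen
```

With every site in one default link, this model reports nothing until a site's DCs become unusable. `unreachable_sites("Site1", {"DEFAULTIPSITELINK": {"Site1", "Site2", "Remote-Site"}})` returns an empty set. The same call with `down_sites={"Remote-Site"}` returns `{"Remote-Site"}`. In other words, under this model the event can reflect unreachable DCs even when the site-link configuration itself covers every site.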

Hallidays (Author) commented:
Hi Will,

All of the DCs are connected via MPLS, so the default site link is correct. We are still having trouble with this issue.

Apologies for the late reply.
Will Szymkowski (Senior Solution Architect) commented:
Well, based on the error message, it cannot contact some of the DCs in the sites listed in the default site link.

I would suggest checking this again.

Will.
Seth Simmons (Sr. Systems Administrator) commented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.