Dale Forguson


Active Directory replication errors

The customer has 4 sites configured in AD Sites and Services. There is one domain with no sub-domains. Servers are a combination of 2012, 2016, and 2019. Site A has two DCs, one of which is the FSMO role holder for all 5 roles. Each of the other sites has one DC. The DC at Site D reports that it is the role holder for Schema Master; this is the only role that is not in sync across all DCs. All four sites are connected by site-to-site VPN, and each site is on a separate subnet. All DCs can ping all other DCs by name or IP address.

If I run repadmin /syncall /Adeq on the FSMO role holder at Site A, I see error 1722 "The RPC server is unavailable" from Site C to Site B.
If I run repadmin /replsum on the DC at Site D, it reports operational error 8341 for all 4 other DC replications. The source (Site A FSMO role holder) shows a 40% failure rate with error 8606: "Insufficient attributes were given to create an object. This object may not exist because it may have been deleted and already garbage collected"
The destination (Site D) shows the same failure rate, error code, and error message.
On the Site D DC, in Sites and Services, if I navigate to the NTDS Settings for the FSMO role holder at Site A and select "Replicate configuration from the selected DC", it succeeds.
Based on the Dcdiag results, this thread seems most similar to my issues: https://community.spiceworks.com/topic/2178528-the-target-principal-name-is-incorrect. I have followed the steps in the last post of that thread, which is marked as the solution, but my problems are not resolved.
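For anyone triaging a similar set of failures, a per-link view is often easier to read than the /replsum percentages. The following is a minimal sketch, assuming the output of `repadmin /showrepl * /csv` has been saved to a file and that it uses the column headers shown in the sample (check them against your own output). The server, site, and domain names in the sample are hypothetical.

```python
import csv
import io


def failing_links(csv_text):
    """List replication links with failures, worst first.

    Each entry is (source DSA, destination DSA, naming context,
    number of failures, last failure status).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    failures = []
    for row in reader:
        # Assumption: data rows are tagged showrepl_INFO in the first column;
        # anything else (for example error rows) is skipped.
        if row.get("showrepl_COLUMNS") != "showrepl_INFO":
            continue
        count = int(row["Number of Failures"] or 0)
        if count > 0:
            failures.append((
                row["Source DSA"],
                row["Destination DSA"],
                row["Naming Context"],
                count,
                row["Last Failure Status"],
            ))
    return sorted(failures, key=lambda item: item[3], reverse=True)


# Hypothetical sample data; real input would be the saved CSV file.
SAMPLE = (
    "showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,"
    "Source DSA Site,Source DSA,Transport Type,Number of Failures,"
    "Last Failure Time,Last Success Time,Last Failure Status\n"
    'showrepl_INFO,SiteD,DC-D1,"DC=example,DC=local",SiteA,DC-A1,RPC,12,'
    "2021-01-01 10:00:00,2020-12-01 09:00:00,8606\n"
    'showrepl_INFO,SiteB,DC-B1,"DC=example,DC=local",SiteC,DC-C1,RPC,3,'
    "2021-01-01 10:05:00,2020-12-15 09:00:00,1722\n"
    'showrepl_INFO,SiteA,DC-A2,"DC=example,DC=local",SiteA,DC-A1,RPC,0,'
    "0,2021-01-01 10:10:00,0\n"
)

for link in failing_links(SAMPLE):
    print(link)
```

Sorting by failure count puts the longest-broken links at the top, which is usually where to start.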
Michael Pfister

Please attach the output file of  
dcdiag /e /v > dcdiag.txt


on a DC of site A here.

Verify the connectivity between each site for the required ports.
PortQryUI can be helpful here: https://docs.microsoft.com/en-us/troubleshoot/windows-server/networking/use-portqry-verify-active-directory-tcp-ip-connectivity


All services up and running on the problematic DCs?

ASKER

I neglected to mention that the Site A FSMO role holder and the Site D DC are both Hyper-V VMs, though it may not matter. I assume you want to see dcdiag from the problematic DC at Site D. The server mentioned in Dcdiag, "CCFLEET-DC1", is offline. It was also at Site D; see the note in the Dcdiag output: dcdiag.txt. PortQry output to follow.
Site A (FSMO role holder) Dcdiag output: dcdiag.txt
Site A (FSMO role holder) PortQry output: portQryOutput.txt

I have to look at the other logs, but please run PortQry again from the DC in Site A against the DC in Site D.
Services can have several startup types in their normal state. Can you give me a better idea of what you're looking for?
RPCSS is running on the two servers which reported error 1722 "The RPC server is unavailable".
Site B - BRIDGETONDC3 has replication problems:   Error: 1722 (The RPC server is unavailable.)
 - Services up and running? Network/Firewall?
If the service is currently running, it's OK. It could be a network problem or a firewall blocking the high ports required for RPC.
This is helpful for network issues: https://support.microsoft.com/en-us/topic/07e52568-135b-aad9-b871-061ffcb6fc49
Scroll down near the end. There are the corresponding PortQry tests against a problematic server.
For example:
portqry -n <problem_server> -e 135
portqry -n <problem_server> -r 1024:5000
portqry -n <problem_server> -r 49152:65535
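If PortQry is not available on a machine, a plain TCP connect test can give a rough first answer. This is only a sketch under assumptions: the port list is the commonly cited set of AD-related TCP ports rather than anything specific to this environment, and unlike `portqry -e 135` it does not query the RPC endpoint mapper, so an open port 135 does not prove that the dynamic RPC ports are reachable.

```python
import socket


def probe_tcp(host, ports, timeout=3.0):
    """Map each port to True if a TCP connection succeeds within the timeout."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Assumed list of commonly cited AD TCP ports:
# DNS, Kerberos, RPC endpoint mapper, LDAP, SMB, LDAPS, global catalog.
AD_TCP_PORTS = [53, 88, 135, 389, 445, 636, 3268]

# Example call (replace the placeholder with the problem server's name):
# for port, is_open in probe_tcp("<problem_server>", AD_TCP_PORTS).items():
#     print(port, "open" if is_open else "closed or filtered")
```

A False result cannot distinguish a closed port from a firewall silently dropping packets, so treat it as a hint about where to look rather than a diagnosis.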
Is the Windows Firewall enabled on those servers? If yes, make sure the "Remote Event Log Management" rules are active and allowed for the Domain profile.
If not, enable them and rerun dcdiag on the DC at Site A.
The second port query reports a debug error: "Run-Time Check Failure #2 - Stack around variable 'my_ncb' was corrupted." I clicked Ignore to allow the query to complete. portQueryver2.txt
Windows Firewall was disabled for the Domain profile on two of the five servers. I have disabled it for the Domain profile on the remaining 3 for testing. dcdiag2.txt
Looks quite good (besides that run-time failure). I wonder why there are so many RPC errors in dcdiag output but portqry shows no problem.

Please try running on CcscDC2

eventvwr \\CccDC3
eventvwr \\BridgetonDC3

to verify dcdiag can read the remote event log.
Let's also verify DNS:

dcdiag /e /test:dns > dns.txt
It failed with a lengthy error message; I can post a snip if you want. The directions are: verify the network path, verify the computer is online, verify the firewall rules are enabled on the target. Should I have restarted after disabling the firewall? On one of the targets the firewall had already been disabled (and the server restarted) previously.
test:dns output: dns.txt
No, disabling the firewall doesn't require a restart and should take effect immediately.
I see several configuration errors in DNS that require attention. These appear to be mostly retired or moved servers that haven't been updated.


There are multiple DNS errors/problems that need to be fixed.

For example, BridgetonDC3 has DHCP enabled?

First of all I'd give it a static address, then verify DNS settings on all DCs.
Under the network card's TCP/IP settings, point the primary DNS server to the central DC's IP address and the second DNS server to the DC's own IP.

Chesterfield\CCCDC3 shows 2 IP addresses in DNS

Testing server: Chesterfield\CCCDC3
      Starting test: Connectivity
         The GUID based DNS Name resolved to several IPs
         (192.168.2.4, 192.168.2.5), but not all were pingable. Replication and
         other operations may fail if a non-pingable IP is chosen. The first
         pingable IP is 192.168.2.5.
         ......................... CCCDC3 passed test Connectivity


This should be resolved.
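To find out which of several registered addresses is the stale one, something along these lines may help. It is a rough sketch: the reachability check is a TCP connect to LDAP port 389 (an assumption; dcdiag itself uses ping), and the helper names are made up for illustration. The two addresses in the example are the ones from the dcdiag output above.

```python
import socket


def resolve_ipv4(name):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def tcp_reachable(address, port=389, timeout=2.0):
    """Assumed reachability check: can we open a TCP connection to the port?"""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def split_by_reachability(addresses, is_reachable=tcp_reachable):
    """Partition addresses into (reachable, unreachable) lists."""
    reachable, unreachable = [], []
    for address in addresses:
        (reachable if is_reachable(address) else unreachable).append(address)
    return reachable, unreachable


# Example with the addresses from the dcdiag output and a stand-in check
# that mirrors what dcdiag reported (only 192.168.2.5 answered):
ok, stale = split_by_reachability(
    ["192.168.2.4", "192.168.2.5"],
    is_reachable=lambda address: address == "192.168.2.5",
)
print("reachable:", ok, "candidates for cleanup:", stale)
```

An address that shows up as unreachable is only a candidate for removal from DNS; confirm that it does not belong to a second NIC that is still in use before deleting the record.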
Yes, all the orphaned entries of retired servers have to be removed.
Sorry, but it took several tries to find all of the configuration errors. I still have one problem I haven't found, but the tests finally passed. dnsver4.txt
The problem server in the original post seems to be completely out of the loop.
I tried to sync the problem server using Sites and Services. I selected the server ccscdc2 and clicked replicate from selected server. I can't identify the cause of the error. I can ping by name or IP.
Let's check dcdiag /e /v again.

CcscAppServer has a problem with its machine account in domain.

Running
dcdiag /s:localhost /repairmachineaccount


on this server might be able to fix this, but you'll need to demote/promote it afterwards.

Depending on what other jobs it does, it may be better to leave it demoted and continue running it as a member server. Create a fresh DC for that site instead.
Yesterday was a holiday here so today is "Monday". Dashing out the door. Will get back to you later.
The past week was very busy. To avoid interrupting production I decided to wait until the weekend. Over the weekend I spooled up a new DC at site CCSC with DNS and DHCP. I also demoted the two existing DCs at the site, disjoined them from the domain, and then rejoined them. I had to force demote CCSCDC2. I left them as member servers. This seems to have resolved all name resolution issues for endpoints at the site. I ran dcdiag /e /v on the new DC: dcdiag.txt. I suspect that I need to run Ntdsutil metadata cleanup.
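Since the verbose dcdiag output is long, a short script that pulls out only the failed tests can make before-and-after comparisons easier. This is a sketch that assumes the summary lines look like the "passed test" line quoted earlier in this thread, with "failed" in place of "passed"; the server name and failed test in the sample are hypothetical.

```python
import re
from collections import defaultdict

# Matches summary lines such as:
#   ......................... CCCDC3 passed test Connectivity
RESULT_LINE = re.compile(r"^\s*\.+\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)\s*$")


def failed_tests(dcdiag_text):
    """Return {server or domain name: [names of failed tests]}."""
    failures = defaultdict(list)
    for line in dcdiag_text.splitlines():
        match = RESULT_LINE.match(line)
        if match and match.group(2) == "failed":
            failures[match.group(1)].append(match.group(3))
    return dict(failures)


# First line is taken from this thread; the second is a hypothetical failure.
SAMPLE_OUTPUT = """
         ......................... CCCDC3 passed test Connectivity
         ......................... DC-X failed test Replications
"""

print(failed_tests(SAMPLE_OUTPUT))
```

When reading a redirected dcdiag.txt from disk, the file encoding may need to be specified explicitly, depending on how the output was captured.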
ASKER CERTIFIED SOLUTION
Michael Pfister

This solution is only available to members.
CCSCDC2 is the FSMO role holder at Site A. CCSCDC3 is the new DC I just spooled up on a VM. I ran dcdiag on CCSCDC3. Two servers were demoted at Site D, CCSCappserver and CCFleetDC1. They were removed from the domain and then added back as member servers, not domain controllers. Good point, Ntdsutil may kill the computer account. I do have the local user credentials. I should have run Ntdsutil before I added them back to the domain. I discovered today that workstations at Site D were having trust relationship errors when I tried to log in with the domain administrator account on a workstation. I disjoined and rejoined all of them to the domain, which was very time consuming but seems to have cleared up the issue. I'll run dcdiag again tomorrow. Does it matter which DC I run Ntdsutil on?
I think the workstations' password changes didn't replicate...
It shouldn't matter which DC you run ntdsutil on.