The majority of DFSR issues arise from Active Directory replication problems, an inadequate staging quota, sharing violations on open files, a corrupted DFSR database, unexpected (dirty) database shutdowns, conflicting data modifications, and accidental data deletion.
The end result is a high backlog and out-of-sync replicated folders, and ultimately DFSR replication failures, or data loss in the case of accidental data deletion.
AD replication failures block DFSR replicated folder initialisation
I created a new replication group and added a local site server and a remote site server to it. Even after a few hours, replication (initial sync) had not started.
The remote site does have an additional domain controller.
When initial sync (one-way sync) triggers, event ID 4102 should appear in the DFSR logs. DFS is an Active Directory-aware application and depends heavily on AD Sites and Services and on AD replication. A likely cause is an Active Directory replication failure to the remote site.
Whenever we create a DFS namespace or a DFS Replication group, its configuration is stored in the Active Directory domain partition. If AD replication is failing, the changes do not reach the remote domain controller, so the DFS server in that site never sees them and cannot start the initial sync (one-way sync).
To fix this issue, force AD replication between the local AD site and the remote AD site, then run dfsrdiag pollad from an elevated command prompt on the DFSR servers. The command polls Active Directory for configuration changes. If the DFSR initial sync still does not start after forcing replication, there is an underlying AD replication problem that needs further troubleshooting.
Once AD replication is fixed, the remote site DC updates its domain partition, and at the next polling interval the remote DFSR member detects the changes and starts an initial sync. DFSR then logs the trigger event, ID 4102.
This event tells us that replication of the DFS replicated folder has at least been triggered. We now need to wait until DFSR finishes replicating all data from the primary member and logs event ID 4104, which means that the initial sync is complete and both servers can replicate data authoritatively.
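The 4102 / 4104 pair can be turned into a quick status check. The following is a minimal Python sketch, assuming you have already exported the DFSR event IDs for the member in chronological order (how you export them is up to you); the function name and the three state strings are my own, not part of any DFSR tooling.

```python
from typing import Iterable

INITIAL_SYNC_STARTED = 4102    # initial sync (one-way sync) was triggered
INITIAL_SYNC_COMPLETED = 4104  # initial sync finished, member is in sync


def initial_sync_state(event_ids: Iterable[int]) -> str:
    """Return 'not started', 'in progress' or 'completed'.

    event_ids must be ordered oldest first. The most recent 4102 or 4104
    wins, so a database rebuild after an earlier 4104 reads as 'in progress'.
    """
    state = "not started"
    for event_id in event_ids:
        if event_id == INITIAL_SYNC_STARTED:
            state = "in progress"
        elif event_id == INITIAL_SYNC_COMPLETED:
            state = "completed"
    return state
```

If the state is still "not started" a few hours after creating the replication group, AD replication is the first thing to check.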
Note: fixing AD replication is outside the scope of this document. The task can be easy or complicated, depending on the health of Active Directory.
Improper staging area affects DFSR replication
After creating a DFSR replication group, the primary member triggers a one-way sync to the secondary members. Replication is very slow or almost stalled, and the backlog from the source to the destination server grows noticeably. Event IDs 4202, 4204, 4206, 4208 and 4212 are logged on the source server, the destination server, or both; they indicate a low staging quota.
Event ID: 4202 and 4204
Severity: Warning and informational
Event 4202 reports that staging space usage is above the watermark, and event 4204 reports that old staging files were successfully deleted from the staging area.
Event ID: 4206 and 4208
Event 4206 states that DFSR failed to clean up the staging area, and event 4208 states that the staging area is almost full.
Event ID: 4212
Event 4212 indicates that DFSR cannot replicate because the staging area is inaccessible.
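For triage, the staging events can be kept in a small lookup table. This is only a sketch: the descriptions paraphrase the list above, and the priority order (which event to look at first) is my own assumption, not something DFSR defines.

```python
from typing import Iterable, Optional, Tuple

STAGING_EVENTS = {
    4202: "staging space usage is above the watermark",
    4204: "old staging files were deleted successfully",
    4206: "DFSR failed to clean up the staging area",
    4208: "staging area is almost full",
    4212: "cannot replicate, staging area is inaccessible",
}

# Assumed triage order for this sketch: most serious first.
_PRIORITY = [4212, 4206, 4208, 4202, 4204]


def most_serious_staging_event(event_ids: Iterable[int]) -> Optional[Tuple[int, str]]:
    """Return (event_id, description) for the most serious staging event seen,
    or None if no staging-related event is present."""
    seen = set(event_ids)
    for event_id in _PRIORITY:
        if event_id in seen:
            return event_id, STAGING_EVENTS[event_id]
    return None
```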
Low staging quota:
An undersized staging area causes a replication loop, or can even halt replication. The source server replicates a file into the destination server's staging area, but the file is purged by the staging clean-up process before it can be moved into the replicated folder. The purged file then has to be replicated from the source server again. This repeats until the file finally reaches the replicated folder, and the lower the staging quota, the more often the clean-up process runs to free staging space.
Ideally the staging area would be as large as the data being replicated. Since that is rarely possible, increase the staging quota to the maximum you can afford, weighing the size of the data to be replicated against the free disk space on the primary server, the secondary server, or both, depending on where the events are logged.
Note that the initial sync needs the largest staging area. Once it has finished successfully, staging usage is limited to the data that changes on either side, so the quota can be lowered to save disk space. You may look at the blog post below to get a tentative size for the staging quota.
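As a rough starting point, Microsoft's commonly cited guidance is to make the staging quota at least as large as the 32 largest files in the replicated folder. The sketch below estimates that figure in Python. The function names are mine, the 32-file rule is that guidance rather than something stated in this article, and the walker skips the hidden DfsrPrivate folder so that staged copies are not counted.

```python
import heapq
import os
from typing import Iterable


def staging_quota_mb_from_sizes(sizes: Iterable[int], largest_n: int = 32) -> int:
    """Sum the largest_n file sizes (bytes) and round up to whole megabytes."""
    total_bytes = sum(heapq.nlargest(largest_n, sizes))
    return -(-total_bytes // (1024 * 1024))  # ceiling division


def suggested_staging_quota_mb(replicated_folder: str, largest_n: int = 32) -> int:
    """Walk a replicated folder and suggest a minimum staging quota in MB."""
    sizes = []
    for root, dirs, files in os.walk(replicated_folder):
        # Do not descend into DFSR's own private folder (staging, conflicts).
        dirs[:] = [d for d in dirs if d.lower() != "dfsrprivate"]
        for name in files:
            try:
                sizes.append(os.path.getsize(os.path.join(root, name)))
            except OSError:
                continue  # file vanished or is not readable, skip it
    return staging_quota_mb_from_sizes(sizes, largest_n)
```

Treat the result as a floor, not a target: for the initial sync a larger quota is still better if the disk can afford it.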
Open files / Sharing access violations
Open files and sharing access violations slow replication down. Event ID 4302 or 4304 is logged on the DFSR servers.
DFSR cannot replicate files that are left open or remain in use, or whose file handles were not closed on the source or destination because of sharing violations. This builds up a high backlog and slows replication.
If data being replicated remains open on either the source or the destination, the file system holds exclusive locks on it, which prevents it from moving from staging to its final destination (the replicated directory), or vice versa. These cases are logged as sharing violations: DFSR event 4302 on the destination server, or DFSR event 4304 on the source server.
DFSR has to wait until the files are closed. Alternatively, we can close the open sessions on the server from share management, but this is not recommended because data loss may occur.
Avoid replicating large files that stay open all the time (for example, virtual machine VHD files).
Avoid replicating roaming profile shares and users' PST files stored on network shares. If roaming profiles or PST files are part of DFSR, those users should log off or close the PST at the end of their working day.
DFSR database corruption or internal error causes replication failure
The DFSR member frequently logs the event below:
Error: The DFS Replication service has detected an unexpected shutdown on volume D:. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. The service has automatically initiated a recovery process. The service will rebuild the database if it determines it cannot reliably recover. No user action is required.
DFSR replication halts on a specific member with the event below:
Event ID: 2104
The DFS Replication service failed to recover from an internal database error on volume F:. Replication has been stopped for all replicated folders on this volume.
Error: 9203 (The database is corrupt (-1018))
Database: F:\System Volume Information\DFSR
Error: 9214 (Internal database error (-1605))
Database: D:\System Volume Information\DFSR
The DFSR database can become inaccessible or corrupt if a disk fails or develops bad sectors, or if excessive backlog pressure pushes the database out of sync. The backlog can reach several hundred thousand files.
The backlog can be checked with either CMD or PowerShell:
dfsrdiag backlog /rgname:<REPL_GROUP> /rfname:<REPL_FOLDER> /smem:<SRV_A> /rmem:<SRV_B> [/v]
dfsrdiag backlog /rgname:<REPL_GROUP> /rfname:<REPL_FOLDER> /smem:<SRV_B> /rmem:<SRV_A> [/v]
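If you want to track the backlog over time, the count can be pulled out of the command output. Here is a minimal Python sketch. It assumes the output contains a line like "Backlog File Count: N" when there is a backlog and the words "No Backlog" when the members are in sync; check the exact wording on your own servers, since it is an assumption here, and the function name is mine.

```python
import re
from typing import Optional

_COUNT_RE = re.compile(r"Backlog File Count:\s*(\d+)", re.IGNORECASE)
_NO_BACKLOG_RE = re.compile(r"No Backlog", re.IGNORECASE)


def parse_backlog_count(output: str) -> Optional[int]:
    """Extract the backlog file count from dfsrdiag backlog output.

    Returns 0 when the output reports no backlog, and None when the
    output is not recognised (for example, when the command failed).
    """
    match = _COUNT_RE.search(output)
    if match:
        return int(match.group(1))
    if _NO_BACKLOG_RE.search(output):
        return 0
    return None
```

Run the command in both directions, as above, and parse each output separately; a count that keeps growing in one direction points at the receiving member.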
In this case, replication stops in both directions on the affected member. To resolve the issue, the DFSR database needs to be rebuilt on the affected server. Steps are given below.
Log on to the DFSR server where data is not replicating. If disk space is available, locate the affected replication group, open its properties, and on the Staging tab increase the staging quota to the maximum affordable value. If you have already increased the staging area, skip this step.
Stop and disable the DFSR service on the member server.
If you added any data under the replicated folder on the affected member after replication failed, copy that data (or the entire folder if you are not sure) to another location, because during the rebuild that data will be moved to a pre-existing folder under the DFSR folder.
Enable the display of hidden files and protected operating system files so that you can locate the 'System Volume Information' folder on the drive where the DFSR replicated folder resides.
In our case, userdata is the actual replicated folder and System Volume Information is the folder where the DFSR database is stored. It is a system folder and is hidden by default. Only the SYSTEM account has full control of it, so you cannot open it until you take ownership.
Take ownership of this folder and grant the built-in Administrators group full control of it.
The DFSR folder is now visible. It contains the DFSR database along with checkpoint files, jrs files and staged files in chunks. We need to delete the entire DFSR folder. However, it contains staging files whose paths are more than 256 characters long, which are difficult to delete using the GUI.
Command-line tools must be used for that. From an elevated command prompt, run rd "c:\system volume information\dfsr" /s /q, which should delete the DFSR folder. The command may fail to remove the folder and its contents, though; it did on my lab servers. I therefore used a free, open-source utility named SuperDelete, which has worked reliably for me.
Folder deleted successfully.
Once the DFSR service has been re-enabled and started, event 4102 shows that DFSR has begun rebuilding its database. The process recreates the DFSR directory, including the database, under System Volume Information and triggers initial replication (one-way sync). Any new files copied into the replicated folder after the replication failure are moved to the pre-existing folder under DFSR.
The initial sync can take a significant amount of time, depending on the data size. Even though the data already exists in the replicated folder, time is still needed to stage the data, compute hashes and store them in the DFSR database.
Once initial replication is complete, DFSR logs event ID 4104, which states that all data is in sync and can now replicate in both directions.
DFSR Dirty (Unexpected) Shutdown Recovery (Applicable to only 2008 R2 / 2012 servers)
You see DFSR event ID 2213 on the DFSR server due to unexpected shutdown:
Log Name: DFS Replication
Event ID: 2213
The DFS Replication service stopped replication on volume D:. This occurs when a DFSR JET database is not shut down cleanly and Auto Recovery is disabled. To resolve this issue, back up the files in the affected replicated folders, and then use the ResumeReplication WMI method to resume replication.
1. Back up the files in all replicated folders on the volume. Failure to do so may result in data loss due to unexpected conflict resolution during the recovery of the replicated folders.
2. To resume the replication for this volume, use the WMI method ResumeReplication of the DfsrVolumeConfig class. For example, from an elevated command prompt, type the following command:
wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="C2D66758-E5C5-11E8-80C1-00155D010A0A" call ResumeReplication
For 2008 R2, Microsoft released a hotfix (KB 2663685) for DFSR that stops replication of a replicated folder after a dirty shutdown of the DFSR database. DFSR event ID 2213 is logged after a dirty shutdown and provides the command to resume replication manually. Dirty shutdowns can happen if a server reboots unexpectedly, hits a BSOD, or suffers hard drive corruption.
This is the default behaviour on Windows Server 2012.
To resolve the issue, we need to resume replication manually. Steps are given below.
Back up the files in all replicated folders on the volume. Otherwise, unexpected conflict resolution during recovery of the replicated folders may cause data loss.
Copy the WMIC command from step 2 in event ID 2213 recovery steps, and then run it from an elevated command prompt.
wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="C2D66758-E5C5-11E8-80C1-00155D010A0A" call ResumeReplication
If replication resumes successfully, DFSR logs event IDs 2212, 2218 and finally 2214 on the affected member.
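Retyping the volume GUID by hand is an easy place to make a mistake. As a small convenience, the Python sketch below pulls the GUID out of the text of event 2213 and rebuilds the same WMIC command. It assumes the event text contains volumeGuid="..." exactly as in the example above; the function name is mine.

```python
import re

# Matches volumeGuid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" in the event text.
_GUID_RE = re.compile(
    r'volumeGuid="([0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})"'
)


def resume_command_from_event(event_text: str) -> str:
    """Build the ResumeReplication WMIC command from event 2213 text."""
    match = _GUID_RE.search(event_text)
    if match is None:
        raise ValueError("no volumeGuid found in event text")
    guid = match.group(1)
    return (
        r"wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig "
        f'where volumeGuid="{guid}" call ResumeReplication'
    )
```

The function only produces the command string. Back up the replicated folders first, then run the command yourself from an elevated command prompt.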
Side note:
Microsoft stopped auto recovery after a DFSR dirty shutdown because, during auto recovery, the DFSR member could lose the replicated folder along with its data. This bug was discovered on 2008 R2 servers, so Microsoft introduced a hotfix for 2008 R2 (KB 2663685). After the hotfix is installed, new registry items are set on the server.
With this registry value set, there is no auto recovery of DFSR databases after a dirty shutdown, and replication must be resumed manually. This behaviour became the default on Windows Server 2012.
This was a temporary workaround from Microsoft to halt auto recovery of DFSR replicated folders. We must copy the replicated folder before resuming replication to avoid any data loss, and then run the command to resume replication as described above.
After Microsoft fixed the underlying issue, it released a hotfix (KB 2780453) for 2008 R2 and included it in the default Windows Server 2012 media. The hotfix resolves the data deletion issue during the DFSR database auto recovery process.
Once the above hotfix is installed, you can change the above registry value to 0 on 2008 R2 servers to enable auto recovery after a dirty shutdown.
On Windows Server 2012 you must create this registry value if it does not exist and set it to 0 to enable DFSR auto recovery. This also applies to 2012 domain controllers that replicate Sysvol with DFSR.
If the registry value is not set to 0 on a 2012 domain controller and the DC suffers an unexpected shutdown, the Sysvol folder stops replicating because of the dirty shutdown, and event ID 2213 appears in the DFSR logs. You must then resume replication manually with the above command.
With Windows Server 2012 R2 and Windows Server 2016, the above registry value is created by default when you install DFSR and is set to 0. In fact, even if you deleted the registry entry there would be no issue, so no action is required. The issue is resolved permanently.
If a 2012 R2 / 2016 server suffers a DFSR dirty shutdown, it starts auto recovery by default and logs DFSR events 2212, 2218 and 2214.
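The behaviour per OS version is easy to mix up, so here is the side note condensed into a small Python sketch. The parameter stop_on_auto_recovery is a hypothetical stand-in for the registry value discussed above (None means the value does not exist), and the 2008 R2 default for a missing value is my inference from the text: before the first hotfix, auto recovery simply ran.

```python
from typing import Optional

# What happens when the registry value is missing, per OS version.
_AUTO_RECOVERY_WHEN_MISSING = {
    "2008 R2": True,   # inferred: no hotfix value present, auto recovery runs
    "2012": False,     # manual resume is the default
    "2012 R2": True,   # auto recovery is the default
    "2016": True,      # auto recovery is the default
}


def auto_recovery_enabled(os_version: str, stop_on_auto_recovery: Optional[int]) -> bool:
    """Return True if DFSR recovers on its own after a dirty shutdown.

    An explicit registry value always wins: 0 enables auto recovery,
    any other value means replication must be resumed manually.
    Raises KeyError for an OS version this sketch does not cover.
    """
    if stop_on_auto_recovery is not None:
        return stop_on_auto_recovery == 0
    return _AUTO_RECOVERY_WHEN_MISSING[os_version]
```

Setting the value to 0 on 2008 R2 is only safe once the second hotfix (KB 2780453) is installed, which this sketch does not check.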
Note that accidental data deletion from a two-way DFSR replicated folder is not a technical issue; it is the default, by-design behaviour. DFSR is a multi-master replication technology, so once converged, all members of the replicated folder are treated as primary members and are authoritative for any action taken on the data. If data is deleted on one member, the deletion replicates to all members and the data is lost. Restoring from backup is the only solution in that case.
The majority of DFSR issues can be avoided by following best practices, as you can see in the article below.
In the next article, I will cover DFSR and DFSN accidental deletion recovery (backup and restore).
Happy replicating.