Avatar of SeeDk
SeeDk

asked on

DFSR not working in one direction

Been having problems with DFS for a while now.
These are two servers running Windows Server 2008 R2.
I did these steps already per advice on this site.

1. Disable replication on affected connections.
2. Install hotfix #2780453 and hotfix #2663685.
3. Sync files and reboot.
4. Rebuild the DFS database: http://webcache.googleusercontent.com/search?q=cache:r3nzkJg1t-IJ:sheeponline.net/recreate-dfsr-database.html+&cd=1&hl=en&ct=clnk&gl=us
5. Re-enable replication.

There were still some backlogged files at first, so it seems syncing was not complete, but fortunately the backlog was decreasing.
It seemed promising, and the backlog even completely cleared on the connection from Server A -> Server B.
A->B replication appears to be working perfectly, which is excellent.

However, the Server B -> Server A connection backlog stopped decreasing after a while and even began to increase.

It is holding steady now. I can manually see that syncing is working fine one way (A->B) but not from B->A.

There are two other replication groups on these two servers which are working fine both ways.

When I do dfsrdiag replicationstate, I can see that B is receiving updates from A but there are zero active outbound connections from B to A.

It looks like the replication from B->A has just stopped.
I don't see any errors in the event log.

What can I do to get this working again?
Avatar of Dan McFadden
Dan McFadden

Can you check on both servers for event ID 2213 in the DFS Replication event log?  

If found, in the event message, there is a section named "Recovery Steps."  There is a WMIC command defined in step 2.  Run this command.
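For reference, the Recovery Steps in that event point at the ResumeReplication WMI method; the command usually looks something like the one below, run from an elevated command prompt (the volume GUID here is just a placeholder, use the GUID shown in your event):

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" call ResumeReplication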

You can monitor the backlog with PowerShell. On a computer where the DFSR module is installed, run the following command from an account with the appropriate permissions:

Get-DfsrBacklog | sort folder, sendingmember | ft -a
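You can also target a single connection explicitly, something along these lines (the group, folder and server names below are placeholders); -Verbose additionally reports the backlog count it finds:

# Count the backlogged files reported for one sending/receiving pair
Get-DfsrBacklog -GroupName 'GroupShares' -FolderName 'Group' -SourceComputerName 'SRV-FILE-02' -DestinationComputerName 'SRV-FILE-01' -Verbose |
    Measure-Object | Select-Object -ExpandProperty Count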

Dan
Avatar of SeeDk
SeeDk

ASKER

Event ID 2213 is not present on either server.
I've been monitoring the backlog with dfsrdiag, the only drawback being that it will only show the first 100 files.
Does PowerShell show more than these?
Avatar of Dan McFadden
Dan McFadden

The output from the PowerShell command shows a summary of the Replication Group, Sending Member, Receiving Member, Folder (target folder), and Backlog count.

The output looks like:

Replication Group   Sending Member Receiving Member Folder         Backlog
-----------------   -------------- ---------------- ------         -------
GroupShares         SRV-FILE-01    SRV-FILE-02      Group          1
GroupShares         SRV-FILE-02    SRV-FILE-01      Group          0
Installation        SRV-FILE-01    SRV-FILE-02      Installation   0
Installation        SRV-FILE-02    SRV-FILE-01      Installation   0
UserShares          SRV-FILE-01    SRV-FILE-02      User           1
UserShares          SRV-FILE-02    SRV-FILE-01      User           6

Dan
Avatar of SeeDk

ASKER

Ah, I see, good to know PowerShell can do that.
I am getting that same information from a third party tool named DFSRmon. And I can confirm it is correct by checking on the server itself with dfsrdiag.

For some reason, B ->A is replicating a bit today although very, very slowly.
Maybe I will need to clear the database on this server again?
Avatar of Dan McFadden
Dan McFadden

Clearing the database will reset the replication status of the files on the server, meaning that DFSR will have to do a scan of the files it finds in the local target folder, then do a complete scan of the partner data to figure out what needs to be replicated.

I wouldn't clear the database.

How much data are you talking about? 10GB, 100GB, 1000GB? Number of files? Also, have you sized your Staging location appropriately?

On both servers, run this command (with the correct path) to determine if your Staging Quota is sized correctly.

(Get-ChildItem <replicatedfolderpath> -recurse -force | Sort length -descending | select -first 48 | measure -property length -sum).sum /1gb

This will return a number in GB, which should correspond to the Staging Quota column in DFS Manager when viewing the Membership tab on a replication object.  I usually add 10% to the number as a buffer.
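As a rough sketch with everything in one place (the path below is just a placeholder), the calculation with the buffer applied could look like this:

# Placeholder path; point this at the replicated folder you are checking
$path = 'D:\Shares\Group'
# The article uses the 32 largest files; bump this (e.g. to 48) for folders with many large files
$topN = 32
$minGB = (Get-ChildItem $path -Recurse -Force |
          Sort-Object Length -Descending |
          Select-Object -First $topN |
          Measure-Object -Property Length -Sum).Sum / 1GB
# Suggested staging quota in GB, including the ~10% buffer
[math]::Ceiling($minGB * 1.1)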

Reference Links:
1. https://blogs.technet.microsoft.com/askds/2011/07/13/how-to-determine-the-minimum-staging-area-dfsr-needs-for-a-replicated-folder/
2. https://technet.microsoft.com/en-us/library/cc754229.aspx

PS: I bumped the large file count from 32 to 48 due to the large number of files in the folder, many of which are very large.

Dan
Avatar of SeeDk

ASKER

The backlog total is now at 392889.

The number returned from that command on A is 52GB.
On B is it 76GB.
I get why: there are some large video files on both sides. But these have already been replicated, so their large size should not be affecting this, correct?

I set the quota on both sides to 20GB on Saturday when I did the steps mentioned in my original post.
The quota used to be much smaller before I started working on this (4GB) and there were constant errors in the event viewer as a result.

I haven't seen any errors in the event viewer about quota size since the change to 20GB, and A->B is still working well.
Avatar of Dan McFadden
Dan McFadden

You need to set the Staging Quota to at least 76GB! The purpose of the PowerShell command was to determine what your staging quota needs to be... at a minimum.

It is irrelevant whether your "larger" files have been replicated. The larger value from Server B is what both DFSR servers need to have configured on the replication group. The fact that Server A returns a smaller number indicates that there are files missing on A when compared to B.

What is probably happening: your DFSR Staging Quota is too small. Therefore the area where DFSR temporarily stores data chunks to replicate is constantly hitting its upper limit, and DFSR then has to purge content from the staging directory to make space for newer data that needs to be replicated. This is slamming your disks and making the process of replicating stretch out much longer than it needs to. You can check by filtering the DFS Replication event log for events 4202 and 4204.

I see this event pair about once every 3 to 4 days. If you are seeing it occur daily with high frequency, you have a DFSR Staging Quota size issue.
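A quick, rough way to check the frequency is to count those events straight from the DFS Replication log with PowerShell:

# Count staging-area events (4202 = staging space above high watermark, 4204 = cleanup finished)
Get-WinEvent -FilterHashtable @{ LogName = 'DFS Replication'; Id = 4202, 4204 } -ErrorAction SilentlyContinue |
    Group-Object Id |
    Select-Object Name, Count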

Also, you need to be sure that you have enough free space on the disk where the DFSR Staging directory is located. Do not over-allocate space on the disk; doing this will cause the partition to constantly report 0 bytes free.

My recommendation is to set the Staging Quota setting to 80GB on both partner servers in the replication group.

Dan
I also recommend that you run the script against all of your DFS replicated directories to ensure that each Staging Quota is properly sized, and that you set the value of the quota to at least that number on both target folders in the replication group.

Dan
Avatar of SeeDk

ASKER

I was getting constant Event ID's 4202 and 4204 when the staging quota was at 4GB (which is definitely too small).
But since the increase to 20GB it has stopped occurring frequently.

Following your suggestion of filtering for the events, I can see the last occurrence of 4202 on Server B was on 7/12 at 11:41:31 AM.
4204 last occurrence was on 7/12 at 11:41:34 AM.

So it got above the high watermark but cleared it in 3 seconds and hasn't recurred since then.
With this in mind, it appears to me that it does have enough space at the moment. Unless there's another Event ID/log which would show this is the problem?

The replication is still going slowly, though, and the events I do see relating to this replication group are all 4412 (detected a file changed on multiple servers).
Maybe this is slowing down the replication?

As for the other replication groups, I'm checking them daily and their backlogs are all 0 or low single digits (files in use) so they are working well.
Avatar of Dan McFadden
Dan McFadden

Again... I recommend setting the Staging Quota size to the result of the script. If one or more of the larger files needs to be replicated, you will most likely start having a replication slowdown.

This is all explained in the first article link I posted. This is not opinion; it's tuning best practice from Microsoft's official enterprise support blog for AD DS and more. Setting the quota to 20GB when the script calculates a minimum value of 76GB is a bad move.

Even if you have no backlog issues, a staging quota that is not properly sized is an accident waiting to happen. I have Replication Groups running with 0 backlog whose staging quotas are set to between 150GB and 250GB, as recommended by the script.


The 4412 events (file changed on multiple servers) are a side effect of replication. DFSR is just notifying you of the situation. If a user were to come to you and ask why they aren't seeing their changes in a DFSR-replicated file, you can use these events to possibly identify the issue.

These events may add a little bit of latency to replication, but it's not very perceptible.

Dan
Avatar of SeeDk

ASKER

OK, I will set the staging quota for B->A to 76GB.
I don't have enough disk space to allocate 56GB to A->B at the same time without compromising space available for users.
I can add more disk space this weekend.
Will keep you posted on the results.
Avatar of SeeDk

ASKER

Checked the backlog and it has decreased by only 38 files (out of 300k+) since the change to 76GB staging quota. Looks like that wasn't the cause of the slowness.

I also tried applying these "tuning" changes recommended by Microsoft but they didn't help: https://blogs.technet.microsoft.com/askds/2010/03/31/tuning-replication-performance-in-dfsr-especially-on-win2008-r2/

What else could be causing this?
SOLUTION
Avatar of Dan McFadden
Dan McFadden
This solution is only available to members.
Avatar of SeeDk

ASKER

1. They are on separate sites.
2. Not sure about this one, but since the issue is only occurring on one DFSR connection out of 10, bandwidth is likely not the cause.
3. Yes
4. 24/7
4a. Full
5. Symantec installed on both but I have disabled it.
6. These would be Event 4302 and 4304? None on Server B's side.
On A, there are many 4304 events. 25 occurred yesterday and 5 so far today (the count increased as I typed this). All of them are for the replicated folder with the backlog.
Avatar of Dan McFadden
Dan McFadden

Question 2 is about what kind of network connection the servers have: Ethernet, Fast Ethernet, Gigabit (10Mb, 100Mb, 1000Mb).

What is your bandwidth between your 2 sites?  How fast is your WAN link? (1Mb, 5Mb, 10Mb, 16Mb, 100Mb, etc.)

Unless your replication group members are all local to each other (in the same physical site network), I would not use "Full" for the bandwidth allowance on the replication schedule. I would use a setting that is appropriate for the amount of bandwidth between your 2 physical sites.

The tuning article is a good place to read through for configuration optimization. But nothing other than time will relieve the backlog; alternatively, you could remove the group, delete the directory on Server A, pre-stage (re-seed) the directory structure on Server A and (as quickly as possible) rebuild the replication group.

For how many days was replication for this group not functioning?

Dan
Avatar of SeeDk

ASKER

Network connection is 1000Mb.
Bandwidth is 1Gbps.

The backlog is being cleared so slowly, it would take over a year to wait it out.
Before the fixes mentioned in the original post, both A->B and B->A connections for this replication group were backlogged for months.
Afterwards, both were fine except for Monday where B->A seemed to stop.
A->B is working with no issues.
B->A started clearing again on Tuesday but the rate is too slow.

I am considering a pre-seed but am not sure of the right way to do it.
A has the most up to date files and everything updated this week has replicated to B.
But B has some older files which have not yet replicated to A.
Avatar of Dan McFadden
Dan McFadden

The re-seeding or pre-staging process is discussed in this article.

Link:  https://technet.microsoft.com/en-us/library/dn495044%28v=ws.11%29.aspx?f=255&MSPPError=-2147217396
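As a rough sketch of what that article walks through (the paths and log file below are placeholders), the pre-seeding copy is typically done with robocopy in backup mode so that timestamps and security descriptors are preserved and the DfsrPrivate folder is skipped:

robocopy "\\SRV-FILE-01\D$\Shares\Group" "D:\Shares\Group" /E /B /COPYALL /R:6 /W:5 /MT:64 /XD DfsrPrivate /LOG:C:\Temp\preseed.log /TEE

Before re-enabling replication, you can also spot-check that file hashes match on both members (for example with Get-DfsrFileHash from a box that has the DFSR PowerShell module).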

What is the size of the replicated folders?  MB, GB, TB?

This article may be of interest.

Link:  https://blogs.technet.microsoft.com/filecab/2013/08/21/dfs-replication-initial-sync-in-windows-server-2012-r2-attack-of-the-clones/

Dan
Avatar of SeeDk

ASKER

A: 806 GB
B: 829 GB

I did not re-seed this weekend but it looks like it needs to be done next weekend.

Something happened between Friday evening and Sunday morning which has caused the backlog to increase back to the same heights it was before the initial fix I did.
I noticed yesterday the backlog on both ends is hovering around 720k.
This can't be correct, since I was monitoring this all day on Friday and the backlog on A->B was around 9 (files in use) and B->A was around 320k.
B is a secondary server, not in use by users, so there is no reason for its backlog to increase.

What did occur late Friday evening was a restore process to recover one lost file.
The recovery was run against the 5PM backup from that same day.
We use Symantec Backup Exec to back up Server A via VSS.

Could there be a bug with DFS which caused this restore process to alter the backlog?

This connection is not seeing any activity now, though I do not see any errors in the event viewer.
Using SMS Trace on c:\windows\debug\dfsr01000.log, I see the last entry there is "MAIN 756 main Service Exiting Successfully".
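For anyone checking the same thing without SMS Trace, a rough equivalent is to scan the current DFSR debug logs for warning/error entries from PowerShell (default debug log location assumed):

# Rough filter; Select-String is case-insensitive by default
Select-String -Path 'C:\Windows\debug\Dfsr*.log' -Pattern 'error', 'warn' |
    Select-Object -Last 20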

The other replication groups are working well.
Avatar of SeeDk

ASKER

Reviewing the logs further, I see that a 4102 event was generated on 7/15/16 at 5:49 PM.
The restore process I mentioned consisted of two jobs run at 5:31 PM and 5:37 PM.

It seems shortly after the restore process, DFSR initialized the replicated folder. How could this have happened?
Avatar of SeeDk

ASKER

Per the solution here:
https://social.msdn.microsoft.com/Forums/en-US/a15c3cf3-68d0-4ff4-b5c9-b3795aacb80e/windows-2003-dfsr-problems?forum=winserverfiles

I discovered that:
1. There was no primary member set on this replication group after it initialized (event 4102) on 7/15.
2. There was no 4104 event generated on 7/9 (the first fix in my initial post), suggesting that the initial sync never completed then. Which I suppose makes sense, as the B->A backlog never cleared and is the reason I posted this question in the first place.
3. When I initialized the database on 7/9, a 4112 event was generated that showed Server A was assigned as primary. Not sure why this time it could not find a primary.

So I followed the command there to set Server A as primary and then ran the PollAD command.
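For reference, the commands in that thread take roughly this shape (the group, folder and member names below are placeholders; check dfsradmin membership set /? for the exact switches):

dfsradmin membership set /rgname:"ProblemRG" /rfname:"ProblemFolder" /memname:"DOMAIN\SRV-FILE-01" /isprimary:true
dfsrdiag pollad /member:DOMAIN\SRV-FILE-01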

A few minutes later I see a flurry of activity on A saying "DeleteIDRecords" and event 4112 informing me that it has been assigned as the primary member.

The backlog reduced to 1 on A->B for a moment but then increased back to 700k.
Now the backlog on both ends is decreasing steadily from their high points on Sunday morning.

So I am essentially back at where I started last week after doing the troubleshooting steps in my first post.

High backlog on both ends which is steadily decreasing.
Based on last time, it will take around 2 days for the servers to go through it all.
However, the last time A->B did eventually clear but B->A did not.

If this happens again, any errors/events I should keep an eye out for?
Also, it appears the Symantec restore process triggered an initialization for this group... any ideas how this could have happened?

UPDATE:
On 7/9, no 4102 event was generated either. Maybe because the old database was deleted? In any case, of course no 4104 event would be generated without a 4102 event, correct?
ASKER CERTIFIED SOLUTION
This solution is only available to members.
Avatar of SeeDk

ASKER

In the end, I decided to just manually go through the remaining backlog, removing duplicates and copying over what was missing.
Fortunately (in a way), there were folders with thousands of files on Server B which were not present on A, so copying those back sorted out a lot of the backlog.
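For reference, a rough way to find what exists on one member but not the other (the UNC paths below are placeholders) is to compare relative file lists with PowerShell:

# Placeholder paths for the two replicated folders
$rootA = '\\SRV-FILE-01\Group'
$rootB = '\\SRV-FILE-02\Group'
# Build relative path lists (PSIsContainer keeps this compatible with PowerShell 2.0)
$filesA = Get-ChildItem $rootA -Recurse -Force | Where-Object { -not $_.PSIsContainer } | ForEach-Object { $_.FullName.Substring($rootA.Length) }
$filesB = Get-ChildItem $rootB -Recurse -Force | Where-Object { -not $_.PSIsContainer } | ForEach-Object { $_.FullName.Substring($rootB.Length) }
# Files present on B but missing on A
Compare-Object $filesA $filesB | Where-Object { $_.SideIndicator -eq '=>' } | ForEach-Object { $_.InputObject }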
A 4104 event (DFSR finished initial replication) was logged on Server B on 7/22 and things have been running smoothly since then.

I will do a test restore from Backup Exec on Friday afternoon to confirm it doesn't break it again (since it was odd that the initialization on 7/15 occurred right after a restore of a file in that replication group).

Hopefully, nothing goes wrong but if it does...I will open another question. :)
Avatar of Dan McFadden
Dan McFadden

Sorry for not responding; I'm on vacation and only just got a reliable internet connection.

Glad you sorted it out.

BTW, 7000 files in backlog is something I wish for.  For perspective, my last DFSR issue had a 350K file backlog.  Fun, fun, fun...

Dan
Avatar of SeeDk

ASKER

The dfsrdiag commands I found in the other topic are what allowed me to get DFSR syncing on these two servers.