DFSR is not replicating SOME replication folders

I have two file servers, which we'll call fs-01 and fs-02. They have one replication group, which we'll call RG-01. Within that replication group are 25 sets of replicated folders. Its clear (by manually looking) that replication is not occurring in 3 of those 25 folder sets. Heres a dump of my troubleshooting so far

The servers are virtual running Windows 2008 R2 in a ESX5.1 environment. Underlying storage is iSCSI SAN. The servers are at 2 different locations (HQ and Hotsite) connected by a 1GB P2P network. The network is trunked across the P2P, though fs-02 is on a different subnet and is defined as such in sites and services.
I inherited this system about 5 days ago. Two weeks prior to that, a network architecture change (the trunking) caused some instabilities in the underlying SAN storage system. By all accounts those have been resolved now.
When I run the "Create Diagnostic Report" and create a propogation test and subsequent report the test files at 22 of the 25 folders are replicated nearly instantaneously (< 1 sec)
I ran the same propogation test in the three affected folders, and 3 days later they still show as "Incomplete tests" - which I take to mean they havent replicated.
If I run a dfsrdiag backlog /Rmem:FS-02 /Smem:FS-01 /RGName:RG-01 /RFName:"Folder name" from the command prompt it returns "No Backlog - member FS-02 is in sync with partner FS-01. Operation Succeeded. It even says this despite the fact that diagnostic test files still show as not yet synced in the report.
These replicated folders are GIGANTIC - 3.4 TB, 8 TB, and 1 TB with individual file sizes sometimes as large as 80 GB, but with (relatively speaking) few files in each individual subfolder - . The staging quote for these 3 volumes is set at 750 GB. All the other replicated folders have a staging quote of 10 GB.
I am ASSUMING that the files in FS-02 were in fact replicated, and not seeded, and that replication worked once upon a time for these folders. That said, I DO notice that these 3 folders all have special characters - ( and ) to be precise - in them.
There are no errors in the application or system logs on either server for DFS-Svc, DFS Replication, DFSR, or DFSR Audit. There are a handful of info notices for routine things like "DFS Server has finished initializing"
Ive looked at the logs in C:\windows\debug, and there are a LOT of them there, but nothing really sticks out as an error.
Anyone have any thoughts, or additional diagnostic tests I can run?
Who is Participating?
ArneLoviusConnect With a Mentor Commented:
special characters are not an issue, same goes for files and folders with a space at the end of the name (mac users...), then you have the joy of fixing them in two locations....

If you had "instabilities in the underlying SAN storage system", i'd check for disk corruption, however with the volume sizes that you have posted, I'd test by copying the files to "something else" or "somewhere else" locally at each site, using something that logs the copy output, possibly robocopy ?

You could try removing one of the folders from the replication group, refreshing both ends, clearing out staging etc and then adding it back agaiin.

I am presuming that you don't have iSCSI going over the link, just the DFS replication.

With the value of the storage for ~13TB 'm going to guess that the value of the data is not low, have you considered opening a Microsoft PSS case ?
Eric_PriceAuthor Commented:
It may quickly come to opening a case. Im a week on the job here and to be honest my past experience has been with one way replication and the old FRS. The system seems pretty straight forward, and m not averse to tinkering either, so long as I have good backups and Ive taken the time to ask for quick assistance either. No sense reinventing the wheel.

It is just the DFS replication.

Those are a couple of great suggestions. I think I'll try one of them today (Sunday) and then based on the results decide on Monday whether to move forward or call Microsoft. Since this replication is "only" to our hotsite, I dont feel QUITE the pressure I think I would if it were something the more directly affected day to day operations, but I still hate the thought of letting it go very long at all. Its almost like an invitation for disaster. lol
Eric_PriceAuthor Commented:
To add some additional information, the most critical folder (labeled "Analysis (Current)" shows no backlog on fs-01, but on fs-02 (the hot site server) dfsr diag reports that there is a backlog of over 900,000 files in that folder. Its clearly been successful replicating some over the past couple of weeks, because those conflicts on fs-01 end up in the conflicts/deleted folder.

The files on fs-01 are the only ones that will ever be modified. There the ones I want to keep at all costs.

PS - I have no good backup, since my predecessor has all the backups going over the P2P link (Appassure). It worked great as long as it was happy making incremental backups, but now it seems to want a new base, and as you can imagine one cant simply make a new base of 14 TB over a 1GB P2P link.

Given the sensitivity, a call to MS is in order methinks. And a stiff drink and a backup.
Creating Active Directory Users from a Text File

If your organization has a need to mass-create AD user accounts, watch this video to see how its done without the need for scripting or other unnecessary complexities.

ArneLoviusConnect With a Mentor Commented:
Nothing wrong with incremental backups that become synthetic full backups (such as Microsoft DPM, or even rsync with hardlinks to existing files, or snapshots) , just incremental backups though is a different kettle of fish...

Presuming you have at least 10Gb Ethernet at each site, I'd be very tempted to order first thing on Monday a box that can take a "quantity" of "inexpensive" disks and setup a backup server at the remote site that you can start ASAP and then bring back to the main site/redeploy when complete...

If you're just backing up files, then using rsync to a *nix box using zfs to do daily snapshots can be an inexpensive simple solution, robocopy in threaded mode can be faster in a LAN environment if you're just comparing modification times, but robocopy only copies whole files, so if you have large files that only have small changes, rsync can be significantly faster...

Anyway, call Microsoft PSS,  pay the ~£200 and open a case, they are open 24/7
PberConnect With a Mentor Solutions ArchitectCommented:
It sucks the logs are helping you out.

Have you seen this article: http://blogs.technet.com/b/askds/archive/2011/07/13/how-to-determine-the-minimum-staging-area-dfsr-needs-for-a-replicated-folder.aspx

Is your staging area larger than 32 of the largest files in the folders in question?
Eric_PriceAuthor Commented:
Im closing it out and spreading the wealth on the points. Thanks for the assist guys.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.