Hi there! We've been troubleshooting this issue for months without much success, so we wanted to put it out there to people smarter than us. ;)
Environment:
Dell c6100 xs23 (4 nodes)
The nodes are Hyper-V hosts. The guests include a file server whose share holds the users' redirected folders, plus two terminal server session hosts.
Issue:
Users will notice (multiple times a day) that:
-Desktop will flicker (icons go away and back quickly)
-File shares will have to be refreshed to see updates (so if another user saves a file, it won't just "appear" for the rest of the users viewing that folder)
-Chrome (which is using redirected data folders) will crash.
The only thing we see in the logs is delayed write failures. Combined with the symptoms above, that indicates to us that there is somehow a loss of connectivity between the session hosts and the file server (a simple Windows file server: no clustering, DFS or otherwise). Easy, right?
Here is what we've tried:
-turning off VMQ on the base NIC, the Hyper-V switch, and the host switch
-turning off SR-IOV on all
-running both the file server and one of the session hosts on the same physical node while running the other session host on a separate node. No change here, which eliminated most of our network suspicions, as the traffic between the two should literally be going only over the Hyper-V vSwitch.
-disabling LACP and using a single connection
-upgrading firmware on the Broadcom NICs (there was a problem on previous firmware, but it was a FULL disconnect requiring reboot).
-Looked at KB 2842111 - we don't see any massive number of handles, so didn't apply the fix.
-Looked at KB 2878182 - we don't see any non-responsive threads, so didn't apply.
-The actual event ID of the delayed write error is:
EventID 50 / Source: MUP or mrxsmb (some of both)
" {Delayed Write Failed} Windows was unable to save all the data for the file \;H:00000000387e736f\data\Users\iamauserexample .. nts\Chrome_Settings\CrashpadMetrics-active.pma. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere."
-no corresponding errors are seen on the file server that happen at the same time
-no corresponding errors on the Hyper-V host
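In case anyone wants to repeat the comparison behind the last two items, here is a rough Python sketch of how the timestamps could be lined up. It assumes the events from each machine have already been exported as (timestamp, event ID, source) rows. The function names, the tuple layout, the timestamp format, and the 60-second window are all illustrative assumptions, not part of any Windows tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp format of the exported rows (hypothetical).
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_time(text):
    """Convert an exported timestamp string into a datetime."""
    return datetime.strptime(text, TIME_FORMAT)


def uncorrelated_delayed_writes(client_events, server_events, window_seconds=60):
    """Return the Event ID 50 rows from a session host that have no event
    of any kind on the other machine within +/- window_seconds.

    Both arguments are lists of (timestamp, event_id, source) tuples.
    """
    window = timedelta(seconds=window_seconds)
    server_times = [parse_time(ts) for ts, _event_id, _source in server_events]

    unmatched = []
    for ts, event_id, source in client_events:
        if event_id != 50:
            continue  # only the delayed write failures are of interest
        when = parse_time(ts)
        # Keep the row only if nothing on the other machine is close in time.
        if not any(abs(when - other) <= window for other in server_times):
            unmatched.append((ts, event_id, source))
    return unmatched
```

If every Event ID 50 row comes back as unmatched for both the file server and the Hyper-V host exports, that is consistent with what we describe above: the session hosts report the failure, while the other side logs nothing around the same time.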
Interestingly, they previously had a Server 2008 instance which did not have the same issue. Together with all of the above, this leads us to believe it may be an issue with Server 2012 R2.