Win 2008 R2 File Server Latency Issues

First let me say this issue has yet to produce an event in event viewer.

We have a single Windows 2008 R2 file server (a VM on ESXi 5.1 in a redundant VSA environment) that has a problem.  It has only happened 7 times since May 2013, but it is a huge concern because it halts the entire company.  It also doesn't affect any of the other 8 VMs we have in the VSA configuration or the 7 VMs in vCenter (not in VSA).

The issue is that the file server slowly comes to a halt.  The latency starts off affecting only a few users and then escalates to the point where the server is unresponsive at the console, but will still service remote requests after 3-5 minutes.  The symptoms take anywhere from 3 to 5 hours to go from first appearing to bringing the company to a grinding halt. (The file server is very important to us.)

A reboot of the server immediately fixes the problem. However, we also have folder redirection turned on (stored on this server) for AppData Roaming, Desktop, and Favorites, so a reboot of the file server also requires a reboot of all the users' workstations, about 60 or so.

The file server has only File Services and FSRM installed, but the problem was occurring before FSRM was added.

Management wants an explanation and a resolution, and I basically have no idea where to start.  There are no logs, no events, and we simply do not currently have a third-party monitoring tool that would record these happenings for review.

VMware ESXi reports no unusual service request times for disk I/O, network, CPU, or RAM usage of the machine during these times.

It has happened at various times during working hours: morning, afternoon, and right before leaving.

In addition, if I wanted to start a new file server from scratch, can I build a new VM, attach the existing vmdk files to it as additional disks, boot into Windows, and get all the permissions and drive space without having to perform a restore of any kind? (I kind of suspect Windows will want to format those disks in Disk Management, but I'm not sure.)

Any suggestions on where to move next?

I've seen this behaviour with volume shadow copy services hanging trying to take a snapshot.  Do you take snapshots of your data during the day?

PriorityResearchAuthor Commented:
I believe AppAssure uses VSS to take snapshots every hour.  Our AppAssure server resides on a separate host with separate disk I/O, but still sits within the vCenter environment.

AppAssure has been present since we installed the VSA environment.

What tends to cause the hang?

Any ideas how to turn on logging to see if that is the issue, or how to take preventive steps to stop it from rearing its ugly head again?

Edit: Does Previous Versions also use VSS?
Previous Versions does also use VSS, and I've found that VSS doesn't play nice with multiple partners.

I have a terminal server that exhibits identical behaviour to yours.  Every so often it will develop a resource leak and eventually choke itself out to the point the server is not functional.  Because the leak is gradual the server never logs anything until it gets to the point that it can't log anything.  A reboot solves the problem until it happens again which can be anywhere from a day to a couple of months.

It's annoying for us to have this server drop off every once in a while, but it doesn't sound like it affects us as much as it does you.

A suggestion would be to install some monitors to keep an eye on your resources.  If you develop a leak it would be good to be able to stop it before it takes the server down.  As an MSP I have my own tools to use but I'm pretty sure you can configure the Windows monitors.  If not I'll help you find some third party software.
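As a rough illustration of what such a monitor could check, here is a minimal Python sketch that fits a straight line to periodic readings of some resource counter and flags a steady upward trend. The function names, the sample values, and the threshold are all hypothetical; how the readings get collected (Performance Monitor logs, SNMP, an agent) is left open, and a real leak may not grow this linearly.

```python
from statistics import mean


def leak_slope(samples):
    """Least-squares slope (units per second) for (seconds, value) samples."""
    if len(samples) < 2:
        return 0.0
    t_bar = mean(t for t, _ in samples)
    v_bar = mean(v for _, v in samples)
    denom = sum((t - t_bar) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0  # all readings share one timestamp, no trend to fit
    return sum((t - t_bar) * (v - v_bar) for t, v in samples) / denom


def looks_like_leak(samples, min_growth_per_hour):
    """True if the fitted trend grows at least min_growth_per_hour."""
    return leak_slope(samples) * 3600 >= min_growth_per_hour


# Hypothetical readings: a handle count sampled every 10 minutes for 2 hours.
growing = [(i * 600, 20000 + 500 * i) for i in range(12)]
steady = [(i * 600, 20000) for i in range(12)]

print(looks_like_leak(growing, 1000))  # trend is well above the threshold
print(looks_like_leak(steady, 1000))   # flat readings, no alert
```

The point of fitting a trend rather than alerting on a fixed ceiling is that a gradual leak can be caught hours before the server stops responding, which is when it can no longer log anything.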

Based on your detailed description of the issue, I think the problem and the solution lie with this one VM, and it has nothing to do with it being a VM.

PriorityResearchAuthor Commented:
I'd wholeheartedly agree with you on your last statement. I've requested SolarWinds in the past but have never been able to make the ROI seem logical to the powers that be. What third-party tools do you use or would you suggest? Windows performance monitors are local, and I'd really prefer something with central reporting, as we'd monitor more than just that one machine.
Agree, it sounds like a classic memory leak.

PriorityResearchAuthor Commented:
Thank you for the support so far.

I've checked in on a few things and noticed that for some reason my Previous Versions schedule had two triggers.

The first, which was intentional, ran every hour from 7am to 7pm daily.

The second was daily at noon.

Perhaps the combination of the two triggers was causing a hang occasionally?  If removing the duplicate does not resolve the issue, I will try disabling Previous Versions entirely (I don't want to, because recovering deleted or accidentally overwritten files that way is much faster than mounting a restore point and searching for them in AppAssure). Going to mark this as solved.
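To show where the two schedules collide, a quick Python sketch. It assumes the hourly trigger fires on the hour from 07:00 through 19:00 inclusive and the second fires at 12:00; the function and variable names are just for illustration.

```python
def overlapping_hours(first, second):
    """Return the hours of day at which both schedules fire."""
    return sorted(set(first) & set(second))


hourly_trigger = range(7, 20)  # 07:00 through 19:00, assumed inclusive
noon_trigger = [12]            # the unintended daily trigger

print(overlapping_hours(hourly_trigger, noon_trigger))
```

Under those assumptions the only collision is at noon, so two snapshot requests would be issued at the same moment once a day. That alone would not explain hangs at other times of day.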

I would try disabling Previous Versions for a couple of weeks.  If the problem seems resolved, then you have probably found the cause.

I've used the Zenith platform and found that the VSS support in the backup software was what caused the problem far more often.  Zenith's backup back end is StorageCraft.  Since I disabled the VSS support in the backup software, I think I've only had a couple of outages due to resource leaks.

Almost anything can be set up with SNMP traps that can be centrally managed for monitoring.  I've found it to be a pain to set up, but there are plenty of third-party tools.  SolarWinds is one of them, but I'm sure there are more.  Again, I have much of this built into my MSP software, so I haven't really spent a lot of time looking.