CPU drop / Latency Spike on Windows servers causes disconnect

Hi Experts,

I have a long-standing problem and cannot find a solution, even after opening tickets with both VMware and Microsoft.

We first found this issue with DFS on a Win 2003 server, but now, we are finding it on an SQL server causing a disconnect.

The server environment includes boot from SAN, and all of our disks are SAN-based. The file server has 4GB RAM and 1 processor. The SQL server has 8GB RAM and 4 CPUs.

Disk speed has been upgraded to flash as a test, but this hasn't solved the problem.

We also have a stretched cluster providing a mirrored environment through an IBM SVC implementation.

The logs give no indication of what is causing the spike in latency and the drop in CPU. We have McAfee and Altiris installed, and our backup agent is Avamar. To isolate the cause, we have tried removing everything from these boxes.

Any thoughts that help identify this issue will be greatly appreciated.
svillardi Asked:

Brad Groux, Senior Manager (Wintel Engineering), Commented:
Check the disk latency for the cluster. Response times of around 20 ms or higher can cause major issues, while 10-12 ms is the desired range for Windows clusters.

You can utilize Storport.sys to find out the I/O usage. Here's a TechNet blog post explaining the process.

http://blogs.technet.com/b/askcore/archive/2013/04/25/tracing-with-storport-in-windows-2012-and-windows-8-without-kb2819476-hotfix.aspx

Note that the vast majority of CRITSITS regarding clusters revolve around network throughput or disk I/O. Many people outgrow their storage and network without realizing it.
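
As a rough illustration of the thresholds above, here is a minimal Python sketch that sorts latency samples into buckets. The function names and the "elevated" middle bucket are my own assumptions for illustration; only the 10-12 ms and 20 ms figures come from this comment, and the samples would have to come from whatever monitoring you already have (Storport traces, perfmon, or the hypervisor).

```python
from statistics import mean

# Thresholds taken from the comment above; the bucket names are illustrative.
DESIRED_MS = 12.0   # upper end of the 10-12 ms desired range
PROBLEM_MS = 20.0   # around 20 ms or higher can cause major issues


def classify_latency(ms):
    """Return 'ok', 'elevated', or 'problem' for one latency sample in ms."""
    if ms >= PROBLEM_MS:
        return "problem"
    if ms > DESIRED_MS:
        return "elevated"
    return "ok"


def summarize(samples_ms):
    """Summarize a non-empty list of latency samples (in milliseconds)."""
    counts = {"ok": 0, "elevated": 0, "problem": 0}
    for ms in samples_ms:
        counts[classify_latency(ms)] += 1
    return {
        "mean_ms": mean(samples_ms),
        "max_ms": max(samples_ms),
        "counts": counts,
    }
```

Calling `summarize()` on a day's worth of samples gives a quick picture of how often the disks leave the desired range, which is more useful than a single average.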
Korbus Commented:
Since you are using SANs for storage, have you confirmed your network isn't simply getting jammed up?  If you connect JUST the SANs and servers together, eliminating the rest of the network, do you continue to have latency issues?
svillardi (Author) Commented:
I cannot isolate my environment. In a stretched cluster, all hosts see all storage, so the IBM SAN Volume Controller is the middleman. The zoning is set so that the hosts see only the SVC and the storage sees only the SVC; hosts do not see the storage directly, and vice versa. There is no way to isolate anything while keeping redundancy between our two sites across the stretched cluster, short of dismantling everything.


Brad - I don't know what CRITSITS is.

There has to be another way.
svillardi (Author) Commented:
VMware is showing the spike in disk latency. The question is, how do I find what's causing it?

The Win 2008 box was at SAS speed and I moved it to flash. I still get a spike in latency / drop in CPU so large that SQL loses its connection and has to be reset. This happens almost every day at about 8:03 in the morning.

The Win 2003 box is the DFS server. When I moved it to SAS (from NL-SAS), the random outages stopped lasting about 2-25 minutes, and each outage became much shorter.
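
Since the spike recurs at nearly the same time each morning, one way to confirm the pattern and look for other recurring times is to group spikes by minute of day across several days. This is only a sketch: the function name, the `(timestamp, latency_ms)` input shape, and the default 20 ms threshold (borrowed from the earlier comment) are assumptions, and the samples would need to be exported from your own monitoring.

```python
from datetime import datetime


def recurring_spike_minutes(samples, threshold_ms=20.0, min_days=2):
    """Find minutes of the day at which latency spikes on multiple days.

    samples: iterable of (datetime, latency_ms) pairs.
    Returns a sorted list of ((hour, minute), number_of_days) entries for
    every minute that had a spike on at least min_days distinct dates.
    """
    days_by_minute = {}
    for ts, ms in samples:
        if ms >= threshold_ms:
            key = (ts.hour, ts.minute)
            days_by_minute.setdefault(key, set()).add(ts.date())
    return sorted(
        (minute, len(days))
        for minute, days in days_by_minute.items()
        if len(days) >= min_days
    )
```

A minute that shows up on most days points to something scheduled (a backup, antivirus scan, inventory agent, or a job on another host sharing the same storage) rather than random load, so the result can be compared against the schedules of everything that touches the SAN.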
svillardi (Author) Commented:
CRITSIT -- Crisis Situation?
svillardi (Author) Commented:
Brad,

On my Win 2008 R2 box I tried to install KB978000 and it said "This update is not applicable to your computer."

I double checked and it's Windows 2008 R2, SP1.  For sure.

Thanks,

S.....
compdigit44 Commented:
We had a similar issue where I work with our IBM SVC and high latency on some servers. Long story short, the cause was excess pause frames, and the resolution was a firmware upgrade to the switch.