Severe disk latency on NFS stores from EMC VNXe

We have four 1TB NFS stores carved out on a VNXe3200. These stores are being used by two VMware servers. There are also LUNs set up on the EMC being used by a separate physical Windows Server.

2 of the 4 NFS stores are suffering from severe latency. It spikes every few seconds: it could be a couple hundred ms, but we've seen 20k+ ms. As best we can tell, the other two NFS stores are fine, and the LUNs are fine.

Three different EMC techs have looked at the EMC SAN and say it's fine. They pulled another set of logs and such and are analyzing them. The only connection is that both affected stores are using SPA, but they're not seeing issues with SPB, nor are the LUNs showing latency.

This is NFS, not iSCSI, so most issues with framing and such don't apply. Besides, everything is set up the same way and only 2 of the stores are affected. We had one VMware tech look, who didn't see anything on the VMware side, and we have a case open with VMware. So far no one can tell me anything of interest. We checked for and eliminated some of the more common issues, like the IOPS bug in VMware: all VMs are set for unlimited IOPS.

There are 5 VMs using these two stores and they are NOT at all taxing the CPU, RAM or storage. They are all very secondary servers. My busiest server is on one of the other stores and it's fine.

Any ideas? Since my team and I, 2 EMC engineers, and 2 techs from the installer that put in the SAN have all looked at this, I think we can forgo some of the "easy" items unless it's something really easily overlooked. The VMs used to be connected to a different SAN using iSCSI, but that was late last year. This issue seems to be fairly new, a matter of weeks, but it seems to be getting worse.
Brian_B Asked:
gheist Commented:
VMware KB 2016122 applies perfectly to your scenario (it is the same mechanism used by SIOC).
Carlos Ijalba, IT Systems Director, Commented:
You can go into service mode on the VNX and switch primary-secondary so SPB takes over and see if it has the same behaviour to rule out SPA.

In fact, you can even reboot SPA without affecting anything, but as usual it is always better to do it out of hours, just in case.

Probably will not fix it, but at least you rule out another component of the equation.
You can also consult EMC tech support about it.
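
The isolation reasoning here (do the slow stores share one storage processor?) can be sketched in a few lines of Python. This is only an illustration: the datastore names, the latency figures, the 100 ms threshold and the assumption that the two healthy stores sit on SPB are all made up for the example, and the function names are hypothetical rather than part of any VMware or EMC tool.

```python
from collections import defaultdict

# Hypothetical samples: (datastore, owning storage processor, latency in ms).
# The names and numbers are invented for illustration, not taken from real logs.
SAMPLES = [
    ("nfs1", "SPA", 4), ("nfs1", "SPA", 350), ("nfs1", "SPA", 21000),
    ("nfs2", "SPA", 6), ("nfs2", "SPA", 240), ("nfs2", "SPA", 900),
    ("nfs3", "SPB", 3), ("nfs3", "SPB", 5), ("nfs3", "SPB", 4),
    ("nfs4", "SPB", 2), ("nfs4", "SPB", 6), ("nfs4", "SPB", 5),
]


def flagged_by_sp(samples, threshold_ms=100):
    """Map each SP to the sorted datastores that had any sample above the threshold."""
    flagged = defaultdict(set)
    for datastore, sp, latency_ms in samples:
        if latency_ms > threshold_ms:
            flagged[sp].add(datastore)
    return {sp: sorted(stores) for sp, stores in flagged.items()}


def suspect_sps(samples, threshold_ms=100):
    """Return SPs for which every datastore they own has spiked."""
    owned = defaultdict(set)
    for datastore, sp, _ in samples:
        owned[sp].add(datastore)
    flagged = flagged_by_sp(samples, threshold_ms)
    return sorted(sp for sp in owned if owned[sp] == set(flagged.get(sp, [])))


print(flagged_by_sp(SAMPLES))
print(suspect_sps(SAMPLES))
```

If one SP owns every spiking store and the other owns none, failing over to the other SP is a cheap way to confirm or rule it out before digging elsewhere.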

Brian_B Author Commented:
@gheist: Yes, that is what we are seeing, except it's not NetApp. I'm not sure that matters much, but it's only happening on 2 of the NFS stores and never hitting the other two. If that was the issue, I would've expected to see it on all NFS stores. That being said, I don't see any harm in setting it to 64, so I will do that when I can reboot the servers.

@Carlos: We talked about doing that this weekend, so we may try that and see whether it isolates SPA as the issue or not. We have a ticket open with EMC. They have spent about 6 hrs looking at this SAN and are still looking at logs.
gheist Commented:
EMC and VMware are kind of the same family, but that does not mean nobody argues inside...
I just picked the first result in a Google search that changes the NFS queue depth; it could be EMC, or iodata as well. My point is to replicate the workings of SIOC to keep peak latency under control.
The problem is that one node with long queues can suppress others when pushed harder. (And what is the disk queue length of that direct LUN on Windows?)
Brian_B Author Commented:
So I put SPA into service mode and left it for a few hours; latency dropped to normal levels. I rebooted SPA and it's been fine since. EMC is seeing if they can explain it.
Carlos Ijalba, IT Systems Director, Commented:
Good stuff Brian,

At least now you've got a "HotFix" to sort it out if it happens again.

Let's wait for the tech support to see if they have an explanation.

It might have been something occasional, but I would upgrade the firmware as soon as the next version comes out, just in case.
Brian_B Author Commented:
According to the logs, they said the threads on SPA were not being released properly. They told me that upgrading to the latest version would correct this going forward. This SAN hasn't been upgraded since it was installed.

It's been fine since the reboot. We plan on upgrading this weekend.
Brian_B Author Commented:
While our issues seem to stem from a firmware bug (time will tell), the comments provided helped isolate it to just the SAN and saved us a lot of time hunting in other areas.