Severe disk latency on NFS stores from EMC VNXe

Brian_B
We have four 1TB NFS stores carved out on a VNXe3200. These stores are being used by two VMware servers. There are also LUNs set up on the EMC being used by a separate physical Windows Server.

2 of the 4 NFS stores are suffering from severe latency. Every few seconds it spikes; it could be a couple hundred ms, but we've seen 20k+ ms. Best we can tell, the other two NFS stores are fine, and the LUNs are fine.

Three different EMC techs have looked at the EMC SAN and say it's fine. They pulled another set of logs and such and are analyzing. The only connection is that both affected stores are using SPA, but they're not seeing issues with SPB, nor are the LUNs showing latency.

This is NFS, not iSCSI, so most issues with framing and such don't apply. Besides, everything is set up the same way and only 2 of the stores are affected. Had one VMware tech look; nothing stood out in VMware. Have a case open with VMware. So far no one can tell me anything of interest. Some of the more common issues, like the IOPS limit bug in VMware, we checked for and eliminated: all VMs are set for unlimited IOPS.

There are 5 VMs using these two stores and they are NOT at all taxing the CPU, RAM or storage. All very secondary servers. My busiest server is on one of the other stores and it's fine.

Any ideas? Since myself, my team, 2 EMC engineers, and 2 techs from the installer that put in the SAN have looked at this, I think we can forgo some of the "easy" items unless it's something really easily overlooked. The VMs used to be connected to a different SAN using iSCSI, but that was late last year. This issue seems to be fairly new, like weeks, but seems to be getting worse.

Top Expert 2015
Commented:
VMware KB 2016122 applies perfectly to your scenario (it is the same mechanism SIOC uses).
IT Systems Director
Top Expert 2014
Commented:
You can go into service mode on the VNX and switch primary-secondary so SPB takes over and see if it has the same behaviour to rule out SPA.

In fact you can even reboot SPA without affecting anything but as usual, always better to perform out of hours just in case.

Probably will not fix it, but at least you rule out another component of the equation.
You can also run it by EMC tech support.

Author

Commented:
@gheist: yes, that is what we are seeing, except it's not NetApp. Not sure that matters much, but it's only happening on 2 of the NFS stores and never hitting the other two. I would've expected, if that was the issue, to see it on all NFS stores. That being said, I don't see any harm in setting it to 64, so I will do that when I can reboot the servers.

@Carlos: we talked about doing that this weekend, so we may try that and see whether it isolates SPA as the issue or not. We have a ticket open with EMC. They have put about 6 hrs into looking at this SAN and are still looking at logs.
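
For reference, the queue depth change discussed above might look roughly like the following on each ESXi host. This is a sketch, assuming the advanced option in question is `NFS.MaxQueueDepth` (the setting KB 2016122 describes) and that the value 64 mentioned above is the one wanted; check the KB for your ESXi version before applying it.

```shell
# Sketch only: run in the ESXi shell on each host that mounts the NFS stores.
# Assumption: the option being discussed is NFS.MaxQueueDepth.

# Show the current NFS queue depth setting.
esxcli system settings advanced list -o /NFS/MaxQueueDepth

# Cap the NFS queue depth at 64, the value mentioned in this thread.
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64

# As noted in the thread, the hosts are rebooted afterwards so the change applies.
```

Running the `list` command again after the reboot is a simple way to confirm the new value stuck.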

Top Expert 2015

Commented:
EMC and VMware are kind of the same family, but that does not mean nobody argues inside...
I just picked the first Google result that changes the NFS queue depth; it could be EMC, or iodata as well. My point is to replicate the workings of SIOC to keep peak latency under control.
The problem is that one node with long queues can suppress the others when pushed harder (and what is the disk queue length of that direct LUN on Windows?)

Author

Commented:
So I put SPA into service mode and left it for a few hours; latency dropped to normal levels. Rebooted SPA and it's been fine since. EMC is seeing if they can explain it.
Carlos Ijalba, IT Systems Director
Top Expert 2014

Commented:
Good stuff Brian,

At least now you've got a "HotFix" to sort it out if it happens again.

Let's wait for the tech support to see if they have an explanation.

It might have been a one-off, but I would upgrade the firmware as soon as the next version comes out, just in case.

Author

Commented:
They said the threads on SPA were not being released properly, according to the logs. They told me that upgrading to the latest version would correct this going forward; this SAN hasn't been upgraded since it was installed.

It's been fine since the reboot. Plan on upgrading this weekend.

Author

Commented:
While our issues seem to stem from a firmware bug (time will tell), the comments provided helped isolate it to just the SAN and saved us a lot of time hunting in other areas.
