Brian_B

asked on

Severe disk latency on NFS stores from EMC VNXe

We have four 1 TB NFS stores carved out on a VNXe3200. These stores are used by two VMware servers. There are also LUNs set up on the EMC that are used by a separate physical Windows server.

Two of the four NFS stores are suffering from severe latency. It spikes every few seconds: it could be a couple hundred ms, but we've seen 20,000+ ms. As best we can tell, the other two NFS stores are fine, and so are the LUNs.
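One way to put numbers on spikes like these is to count how many latency samples per store cross a threshold and record the worst one. The sketch below is only illustrative: the function name, the 100 ms threshold, the store names, and the sample values are all assumptions made up for the example, not data from this environment.

```python
def summarize_spikes(samples_ms, threshold_ms=100.0):
    """Return (spike_count, worst_ms, spike_fraction) for latency samples in ms."""
    if not samples_ms:
        return 0, 0.0, 0.0
    # A "spike" here is any sample at or above the (assumed) threshold.
    spikes = [s for s in samples_ms if s >= threshold_ms]
    return len(spikes), max(samples_ms), len(spikes) / len(samples_ms)


# Hypothetical samples, for illustration only.
store_samples = {
    "nfs_store_a": [4, 6, 250, 5, 7, 20500, 6],
    "nfs_store_b": [3, 5, 4, 6, 5, 4, 5],
}

for name, samples in store_samples.items():
    count, worst, fraction = summarize_spikes(samples)
    print(f"{name}: {count} spikes, worst {worst} ms, {fraction:.0%} of samples")
```

Comparing the per-store summaries side by side makes it easier to show support that only some stores are affected.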

Three different EMC techs have looked at the EMC SAN and say it's fine. They pulled another set of logs and are analyzing them. The only connection is that both affected stores are using SPA, but they're not seeing issues with SPB, nor are the LUNs showing latency.

This is NFS, not iSCSI, so most issues with framing and such don't apply. Besides, everything is set up the same way and only two of the stores are affected. One VMware tech had a look and didn't see anything on the VMware side. We have a case open with VMware. So far no one can tell me anything of interest. We checked for and eliminated some of the more common issues, like the IOPS limit bug in VMware: all VMs are set for unlimited IOPS.

There are five VMs using these two stores, and they are NOT at all taxing the CPU, RAM or storage. They're all very secondary servers. My busiest server is on one of the other stores and it's fine.

Any ideas? Since my team and I, 2 EMC engineers, and 2 techs from the installer that put in the SAN have all looked at this, I think we can forgo some of the "easy" items unless it's something really easily overlooked. The VMs used to be connected to a different SAN using iSCSI, but that was late last year. This issue seems to be fairly new, a matter of weeks, but it seems to be getting worse.
SOLUTION
gheist

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER CERTIFIED SOLUTION
Carlos Ijalba

Brian_B

ASKER

@gheist: Yes, that is what we are seeing, except it's not NetApp. I'm not sure that matters much, but it's only happening on 2 of the NFS stores and never hitting the other two. If that were the issue, I would've expected to see it on all NFS stores. That being said, I don't see any harm in setting it to 64, so I will do that when I can reboot the servers.

@Carlos: We talked about doing that this weekend, so we may try that and see whether it isolates SPA as the issue or not. We have a ticket open with EMC. They have spent about 6 hours looking at this SAN and are still looking at logs.
EMC and VMware are kind of the same family, but that does not mean nobody argues inside...
I just picked the first Google search result that changes NFS queue depth; it could be EMC, or iodata as well. My point is to replicate the workings of SIOC to keep peak latency under control.
The problem is that one node with long queues can suppress the others when pushed harder. (And what is the disk queue length of that direct LUN on Windows?)
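To make the queue-depth argument a bit more concrete, here is a rough sketch based on Little's law (average time in the system = outstanding requests / throughput). The function name and all the numbers are illustrative assumptions, not measurements from this array or settings taken from any vendor document.

```python
def avg_latency_ms(outstanding_ios: int, throughput_iops: float) -> float:
    """Little's law: average time in system = items in system / throughput."""
    if throughput_iops <= 0:
        raise ValueError("throughput_iops must be positive")
    # Multiply first so the result is expressed in milliseconds.
    return outstanding_ios * 1000.0 / throughput_iops


# Hypothetical backend that sustains 2,000 IOPS for one host.
for depth in (64, 256, 4096):
    latency = avg_latency_ms(depth, 2000)
    print(f"queue depth {depth:>4}: ~{latency:.0f} ms average latency")
```

The point of the sketch is only that, for a fixed throughput, average latency grows in proportion to the number of outstanding requests, which is why capping the queue bounds the peak.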
Brian_B

ASKER

So I put SPA into service mode and left it for a few hours; latency dropped to normal levels. I rebooted SPA and it's been fine since. EMC is seeing if they can explain it.
Good stuff Brian,

At least now you've got a "HotFix" to sort it out if it happens again.

Let's wait for the tech support to see if they have an explanation.

It might have been something occasional, but I would upgrade the firmware as soon as the next version comes out, just in case.
Brian_B

ASKER

They said that, according to the logs, the threads on SPA were not being released properly. They told me that upgrading to the latest version would correct this going forward; this SAN hasn't been upgraded since it was installed.

It's been fine since the reboot. We plan on upgrading this weekend.
Brian_B

ASKER

While our issues seem to stem from a firmware bug (time will tell), the comments provided helped isolate it to just the SAN and saved us a lot of time hunting in other areas.