Brian_B

Severe disk latency on NFS stores from EMC VNXe

We have four 1 TB NFS stores carved out on a VNXe3200. These stores are used by two VMware servers. There are also LUNs set up on the EMC that are used by a separate physical Windows server.

Two of the four NFS stores are suffering from severe latency. Every few seconds there's a spike; it could be a couple hundred ms, but we've seen 20,000+ ms. As best we can tell, the other two NFS stores are fine, and the LUNs are fine.

Three different EMC techs have looked at the EMC SAN and say it's fine. They pulled another set of logs and are analyzing them. The only connection is that both affected stores are using SPA, but they're not seeing issues with SPB, nor are the LUNs showing latency.
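One way to make the "only SPA" observation checkable is to pool latency samples by the storage processor that owns each store. This is only a rough sketch: the datastore names, the store-to-SP mapping (two stores on each SP), and every sample value are made up for illustration, and real numbers would have to come from something like esxtop or the vCenter performance charts.

```python
from collections import defaultdict
from statistics import median

# Hypothetical mapping of datastore -> owning storage processor (an assumption).
OWNER = {"nfs1": "SPA", "nfs2": "SPA", "nfs3": "SPB", "nfs4": "SPB"}

# Hypothetical latency samples in milliseconds, for illustration only.
SAMPLES = {
    "nfs1": [4, 250, 6, 20000, 5],
    "nfs2": [3, 180, 7, 900, 4],
    "nfs3": [3, 4, 5, 4, 3],
    "nfs4": [2, 5, 4, 3, 4],
}

def summarize_by_owner(samples, owner, spike_ms=100):
    """Return {owner: (median_ms, max_ms, spike_count)} pooled over its datastores."""
    pooled = defaultdict(list)
    for store, values in samples.items():
        pooled[owner[store]].extend(values)
    return {
        sp: (median(vals), max(vals), sum(v >= spike_ms for v in vals))
        for sp, vals in pooled.items()
    }

summary = summarize_by_owner(SAMPLES, OWNER)
for sp, (med, worst, spikes) in sorted(summary.items()):
    print(f"{sp}: median={med} ms, max={worst} ms, spikes>={100}ms: {spikes}")
```

If the spikes cluster on one SP while the median stays low everywhere, that points at intermittent stalls on that SP rather than at a sustained load problem.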

This is NFS, not iSCSI, so most issues with framing and such don't apply. Besides, everything is set up the same way and only two of the stores are affected. We had one VMware tech look, who didn't see anything on the VMware side. We have a case open with VMware. So far no one can tell me anything of interest. We checked for and eliminated some of the more common issues, like the IOPS bug in VMware: all VMs are set for unlimited IOPS.

There are five VMs using these two stores and they are NOT at all taxing the CPU, RAM, or storage. They're all very secondary servers. My busiest server is on one of the other stores and it's fine.

Any ideas? Since myself, my team, two EMC engineers, and two techs from the installer that put in the SAN have looked at this, I think we can forgo some of the "easy" items unless it's something really easily overlooked. The VMs used to be connected to a different SAN using iSCSI, but that was late last year. This issue seems to be fairly new, a matter of weeks, but it seems to be getting worse.
Tags: Storage, VMware, Storage Hardware

Last comment: Brian_B, 8/22/2022 (Mon)
SOLUTION
gheist

ASKER CERTIFIED SOLUTION
Carlos Ijalba

Brian_B

ASKER
@gheist: Yes, that is what we are seeing, except it's not NetApp. Not sure that matters much, but it's only happening on two of the NFS stores and never hitting the other two. If that were the issue, I would've expected to see it on all NFS stores. That being said, I don't see any harm in setting it to 64, so I will do that when I can reboot the servers.

@Carlos: We talked about doing that this weekend, so we may try that and see whether it isolates SPA as the issue. We have a ticket open with EMC. They have put in about 6 hours looking at this SAN and are still looking at logs.
gheist

EMC and VMware are kind of the same family, but that doesn't mean nobody argues inside...
I just picked the first Google result that changes the NFS queue depth; it could be EMC or iodata as well. My point is to replicate the workings of SIOC to keep peak latency under control.
The problem is that one node with long queues can suppress others when pushed harder. (And what is the disk queue length of that direct LUN on Windows?)
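A back-of-the-envelope sketch of why capping the queue depth bounds peak latency: if requests in a queue are served one at a time and each takes roughly the same service time, the last request in a full queue waits about depth times service time. The 5 ms service time and the two larger depths are assumptions picked for illustration, not measurements from this array or defaults of any product.

```python
def worst_case_wait_ms(queue_depth, service_ms):
    """Approximate wait of the last request in a full FIFO queue served one at a time."""
    return queue_depth * service_ms

# 64 is the value discussed in this thread; the other depths are hypothetical.
for depth in (64, 1024, 4096):
    wait = worst_case_wait_ms(depth, 5.0)
    print(f"depth={depth}: last request waits about {wait:.0f} ms")
```

Real arrays service requests in parallel, so this overstates the absolute numbers, but the wait still scales with how deep the queue is allowed to get.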
Brian_B

ASKER
So I put SPA into service mode and left it for a few hours; latency dropped to normal levels. Rebooted SPA and it's been fine since. EMC is seeing if they can explain it.
Carlos Ijalba

Good stuff Brian,

At least now you've got a "HotFix" to sort it out if it happens again.

Let's wait for the tech support to see if they have an explanation.

It might have been a one-off, but I would upgrade the firmware as soon as the next version comes out, just in case.
Brian_B

ASKER
They said the threads on SPA were not being released properly, according to the logs. They told me that upgrading to the latest version would correct this going forward. This SAN hasn't been upgraded since it was installed.

It's been fine since the reboot. We plan on upgrading this weekend.
Brian_B

ASKER
While our issues seem to stem from a firmware bug (time will tell), the comments provided helped isolate it to just the SAN and saved us a lot of time hunting in other areas.