compdigit44

ESXi 6 Slow Storage Migration

Our VMware environment consists of a Windows 2012 R2 vCenter Server running vCenter 6 Update 3 and 110 ESXi 6 hosts running ESXi 6 Update 2, all of which are IBM x240 blades using the same Emulex driver version. Our backend storage consists of three IBM Gen3 XIVs connected via FC and FCoE. What I have noticed is that Storage vMotion and template deployments are taking much longer than normal. For example, a coworker was deploying a VM from one of our templates, which is small and under 60GB, and it took 40 minutes to complete. What was strange is that it crawled throughout the entire process, then went from 70% to completed. Also, I am doing Storage vMotions now and they are slow as well. When I checked the vmkernel log and searched for one of the device IDs I am migrating from, I see the following:

 "NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x89 (0x439e001a5c40, 26417684) to dev "eui.00173800337707f5" on path "vmhba1:C0:T13:L3" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE"

Our SANs are not overloaded; I had our storage admin check. I feel this is storage related since deployments and Storage vMotions are slow, yet general VM performance overall seems OK. Thoughts?
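For scale, the deployment described above works out to a very low average copy rate. A minimal sketch of the arithmetic, assuming the full 60GB was copied and decimal units (the template is stated to be under 60GB, so this is an upper bound). The function name is made up for illustration, and the 10GbE figure is the raw line rate before any protocol overhead:

```python
def effective_throughput_mb_s(size_gb: float, minutes: float) -> float:
    """Average copy rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / (minutes * 60)


# Figures from the post: a template of (at most) 60GB deployed in 40 minutes.
rate = effective_throughput_mb_s(60, 40)

# Raw 10 Gb/s line rate expressed in MB/s (10,000 Mb/s divided by 8 bits per byte).
ten_gbe_mb_s = 10_000 / 8

print(f"average rate: {rate:.1f} MB/s")
print(f"share of raw 10GbE line rate: {rate / ten_gbe_mb_s:.1%}")
```

Since the job reportedly crawled and then jumped from 70% to done, the average hides a very uneven copy rate.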
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

And your environment is built as what?

Number of Storage vMotion VMkernels, 10GbE, dedicated network, jumbo frames, VLANs?

P.S. If you go browsing through the VMkernel logs, you'll see lots of those, and other noise, so you *may* not necessarily be able to blame the storage guys!
compdigit44

ASKER

I am not sure what you mean by "environment is built as what"?

Our management and vMotion traffic uses two NICs and is 10Gb; jumbo frames are not being used. Yes, we do use VLANs; our management traffic and vMotion use the same VLAN, which will be changing shortly. Also, does Storage vMotion traffic get copied over the network using the vMotion vmk, or is it handled by the SAN, which is VAAI compliant?
Network Design ???

So you have 10GbE - tick box for that.

Management and vMotion on the same NICs - not recommended, not best practice - no tick box there.

No jumbo frames - no tick box there, not best practice.

Management and vMotion on the same VLAN - no tick box there, not best practice.

All vMotion traffic traverses the vMotion network, so vMotion and Storage vMotion will go over the management network, which is not good.

Use as many NICs as possible for vMotion.

So....as a percentage for Best Practice.... I bet your design does not score highly, maybe less than 50%!
As to our setup not being best practice, I totally agree. This is why we are changing everything as our new datacenter is built. My problem is that this setup has been in place for a while now, yet vMotions seem slower. Also, VM deployments are slower as well. Just to confirm, Storage vMotion traffic goes out the vMotion vmk but does not go over the network and is not handled by the SAN, correct?
"Just to confirm, Storage vMotion traffic goes out the vMotion vmk but does not go over the network and is not handled by the SAN, correct?"

If your SAN is VAAI compliant, the data should be moved by the SAN, as vCenter Server sends the job to the SAN, and it is then monitored by the SAN.

If everything was perfect and working within tolerance, then something has changed or broken.

Now, traffic may not traverse the management or vMotion interfaces if you are migrating this VM on the same host; again, VAAI should help here.

I would look at the networks and check that it's not using any of them, check that nothing has changed on the SAN, and check whether VAAI is not being used.

It's interesting that you make an observation to template deployment.....being slow which has nothing to do with management interface/vMotion - which suggests VAAI.....is not working or assisting you.

Which is back to SAN again, is this the same on all hosts ?

So template deployment - slow

storage vMotion - slow

VAAI -enabled ? is VAAI the issue ?
Andrew, thank you for the reply. My understanding is that if your SAN is VAAI compliant, then when doing a Storage vMotion on a host to another datastore seen by the host, VAAI should be used and vCenter is just reporting the status at that point. I feel better knowing that I am not losing my mind yet. How can I tell in the log if a Storage vMotion went over the network or used the SAN's VAAI?
You should easily be able to tell by looking at network traffic spikes in vCenter Server.

If there is an increase in traffic when you perform a template or Storage vMotion operation, then there is the proof.

You could also take one host out of the cluster, e.g. no production VMs on it, and turn VAAI on and off...

Perform the same operation, and this will confirm if VAAI is in use or not. It would also be a good check to ensure the VAAI plugins are installed and enabled!
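A rough sketch of the on/off test above, assuming the three standard advanced settings that gate the VAAI block primitives on ESXi. The Python here only builds the esxcli command lines so they can be reviewed first; the helper names are made up for illustration, and the commands are meant for a test host with no production VMs on it:

```python
# ESXi advanced settings that gate the VAAI block primitives.
VAAI_SETTINGS = {
    "/DataMover/HardwareAcceleratedMove": "XCOPY (full copy, used by clones and Storage vMotion)",
    "/DataMover/HardwareAcceleratedInit": "WRITE SAME (block zeroing)",
    "/VMFS3/HardwareAcceleratedLocking": "ATS (atomic test and set locking)",
}


def list_commands():
    """Commands that show the current value of each VAAI setting."""
    return [f"esxcli system settings advanced list -o {opt}" for opt in VAAI_SETTINGS]


def set_commands(enabled: bool):
    """Commands that switch every VAAI setting on (1) or off (0)."""
    value = 1 if enabled else 0
    return [
        f"esxcli system settings advanced set -o {opt} -i {value}"
        for opt in VAAI_SETTINGS
    ]


# Print the commands for review; run them by hand in the ESXi shell.
for cmd in list_commands() + set_commands(enabled=False):
    print(cmd)
```

Timing the same template deployment with the settings at 0 and again at 1 should show whether the offload is actually helping.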

Check for the following SCSI opcodes, which are specifically related to VAAI:

0x93      0x41      0x42      0x89      0x83

That log in your OP shows VAAI is enabled! 0x89 is a VAAI SCSI opcode.

This could be one to refer to the storage vendor.
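To check for those opcodes in bulk rather than by eye, something like the following could be run over the vmkernel log. It is only a sketch: the regex assumes the two line layouts quoted in this thread (NMP `Cmd 0x89 (...)` and ScsiDeviceIO `Cmd(...) 0x89`), the function name is made up, and the opcode names are the standard SCSI command names.

```python
import re
from collections import Counter

# SCSI opcodes used by the VAAI block primitives.
VAAI_OPCODES = {
    0x41: "WRITE SAME(10)",
    0x42: "UNMAP",
    0x83: "EXTENDED COPY (XCOPY)",
    0x89: "COMPARE AND WRITE (ATS)",
    0x93: "WRITE SAME(16)",
}

# Matches "Cmd 0x89 (...)" (NMP) and "Cmd(0x...) 0x89," (ScsiDeviceIO).
CMD_RE = re.compile(r"Cmd(?:\(0x[0-9a-fA-F]+\))? 0x([0-9a-fA-F]+)")


def count_vaai_commands(lines):
    """Count logged commands per VAAI primitive, ignoring other opcodes."""
    counts = Counter()
    for line in lines:
        match = CMD_RE.search(line)
        if match:
            name = VAAI_OPCODES.get(int(match.group(1), 16))
            if name:
                counts[name] += 1
    return counts


# The line from the original post.
sample = ('NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x89 (0x439e001a5c40, 26417684) '
          'to dev "eui.00173800337707f5" on path "vmhba1:C0:T13:L3" Failed: '
          'H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE')
print(count_vaai_commands([sample]))
```

On a host, the same function can be fed an open file handle for `/var/log/vmkernel.log` instead of the sample list.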
When I last checked, VAAI was enabled for all options on all of our hosts, but I will check again. If it is, check the SAN?
Yes, that opcode in your first post is VAAI...

We have seen SANs that, to maintain current IOPS for VMs, throttle back any activity which would impact performance, e.g. VAAI.

As originally pointed out, these commands, can often be ignored... see here

https://kb.vmware.com/s/article/1036874

That SCSI sense code suggests an issue with the device, e.g. the SAN LUN, and the error is MISCOMPARE - MISCOMPARE DURING VERIFY OPERATION.

Whether that is relevant in this case, you would need to check if this LUN is in use, e.g. as source or destination.

Storage Vendor would be able to help more..
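For reference, this is roughly how the status fields in that log line break down. A minimal sketch: the lookup tables cover only a handful of standard SCSI values (not the full set), the function name is made up, and the regex assumes the "Valid sense data" layout shown in the original post.

```python
import re

# Partial lookup tables of standard SCSI values; anything else decodes as None.
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY",
                 0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL"}
SENSE_KEYS = {0x0: "NO SENSE", 0x2: "NOT READY", 0x3: "MEDIUM ERROR",
              0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
              0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND", 0xE: "MISCOMPARE"}
ASC_ASCQ = {(0x1D, 0x00): "MISCOMPARE DURING VERIFY OPERATION"}

STATUS_RE = re.compile(
    r"H:0x([0-9a-f]+) D:0x([0-9a-f]+) P:0x([0-9a-f]+) "
    r"Valid sense data: 0x([0-9a-f]+) 0x([0-9a-f]+) 0x([0-9a-f]+)",
    re.IGNORECASE,
)


def decode_status(line):
    """Decode host/device/plugin status and sense data from a vmkernel line."""
    match = STATUS_RE.search(line)
    if not match:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,      # 0x0 means the host side reported no error
        "device_status": DEVICE_STATUS.get(device),
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key),
        "additional_sense": ASC_ASCQ.get((asc, ascq)),
    }


# Status portion of the line from the original post.
print(decode_status("Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE"))
```

So D:0x2 is the array returning CHECK CONDITION, and 0xe 0x1d 0x0 is the sense key plus additional sense code that spell out the miscompare.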
I believe VAAI was being used on the host at the time, since I am seeing the following in the vmkernel logs: "2018-05-19T02:54:22.476Z cpu9:33487)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x89 (0x439e12899d00, 26417684) to dev "eui.0017380033770689" on path "vmhba1:C0:T13:L6" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE"

Also, when I run a performance report on the host for the time the Storage vMotion was running, I am seeing read and write latency averages of 2ms or less. Yet when I do a VM deployment it takes 40 minutes, hangs at 70%, then poof, done.
0x89 is VAAI.

Is that LUN though a LUN that's being used ?

MISCOMPARE - MISCOMPARE DURING VERIFY OPERATION

It's debatable whether it is relevant or not. Is that error always there every 5 mins, or only during a VAAI operation?
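One way to answer the "every 5 mins or only during an operation" question is to pull the timestamps of the miscompare lines and look at the gaps between them. A sketch only, assuming the timestamp layout seen in the log lines in this thread (for example 2018-05-19T02:54:22.476Z); the function names are made up:

```python
import re
from datetime import datetime

TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z")
MISCOMPARE = "sense data: 0xe 0x1d 0x0"


def miscompare_times(lines):
    """Return sorted timestamps of lines carrying the miscompare sense data."""
    times = []
    for line in lines:
        if MISCOMPARE not in line:
            continue
        match = TS_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return sorted(times)


def gaps_seconds(times):
    """Seconds between consecutive events."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Usage on the host (path as quoted elsewhere in this thread):
# with open("/var/log/vmkernel.log") as log:
#     print(gaps_seconds(miscompare_times(log)))
```

Evenly spaced gaps would point at something periodic, while bursts that line up with a deployment or Storage vMotion would point at the operation itself.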
I ended up opening a case with VMware support and they came back with the following:
I have reviewed the logs, we are seeing a lot of ATS Miscompare and write commands (0x89).
When the ATS command writes to the metadata record, the normal sequence of events is that it reads what was written and compares it to what is in memory.
This is done by using COMPARE AND WRITE command 0x89. If both copies match, the ATS command is successful.
However, due to the excessive load on the array, some of the responses to the 0x89 command fail.
As a result, the host assumes that the heartbeat record update/create operation failed, and the datastore becomes inaccessible. This is an ATS miscompare.
An example of a /var/log/vmkernel.log message would look like this: vmkernel.1:2018-05-19T19:47:47.477Z cpu0:33487)ScsiDeviceIO: 2651: Cmd(0x439e001a5c40) 0x89, CmdSN 0x3fcc7f6 from world 26417684 to dev "eui.00173800337707f5" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
The vmkernel messages above are just a small portion of the miscompares, this is something you may want to address with your storage vendor, they will tell you if this ATS HB needs to be disabled or not.


I checked with our storage admin and our arrays are not even close to being overloaded, and they are 100% VAAI supported. This only seems to have become an issue recently, yet there have been no changes to any of our hosts or driver versions.
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)