Hyper-V guest disk bottleneck

I have a Hyper-V environment where all physical hosts connect to a PowerVault 3800f (10 x 15k SAS drives in RAID 10) over Fibre Channel, with one virtual disk per host and 4 hosts in total.

On the SAN performance logs I'm seeing latency spikes of around 50ms on each virtual disk, but average latency is around 2ms.  

On the VMs at the same time I'm seeing latency spikes of 600ms - once I even saw 1100ms.

This tells me that something is going wrong between the SAN and the VM.

Write cache hit is constant at 100%, read cache hit is usually around the 70% mark - read % is usually below 40%.
Maximum combined IOPS of the two RAID controllers is around 9k.

On the VMs, memory and processor usage is unremarkable, though there is high paging on all of them (they use dynamic memory - maybe I should stick to fixed?).

The VM with the particular issue has processor usage of around 50% (25 virtual processors), so it's split over NUMA nodes.

Where should I be looking for the delay between the SAN and the VM?
Cliff Galiher commented:
Probably nowhere. It sounds like you are hitting the limits of your chosen configuration, and the numbers are roughly in line with what I'd expect. 10 disks in RAID 10 effectively limits you to five simultaneous spindle writes, and even with high-speed disks, four VMs will generate enough random seeking that performance can tank temporarily without notice.

More, smaller disks always beat fewer, larger disks. Tiered storage and/or lots of spindles is the cure; there's a reason those deployments are so popular now.
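A quick back-of-envelope sketch of what that array can sustain. This is illustrative only: the ~180 random IOPS per 15k spindle figure is a common rule of thumb, not a measured value for this PowerVault.

```python
# Rough random-IOPS estimate for 10 x 15k SAS in RAID 10.
# ASSUMPTION: ~180 random IOPS per 15k spindle (rule of thumb).

SPINDLES = 10
IOPS_PER_SPINDLE = 180          # rough figure for one 15k SAS disk
RAID10_WRITE_PENALTY = 2        # each write lands on two mirrored disks

read_iops = SPINDLES * IOPS_PER_SPINDLE                           # all 10 spindles serve reads
write_iops = SPINDLES * IOPS_PER_SPINDLE // RAID10_WRITE_PENALTY  # effectively 5 spindles

print(f"~{read_iops} random read IOPS, ~{write_iops} random write IOPS")
```

With four hosts sharing the array, sustained random writes anywhere near that write ceiling mean requests start queuing, and queuing is exactly what shows up as latency spikes.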

devon-lad (Author) commented:
Why is there such a disparity between the latency reported by the SAN and that reported by the VM, though? Where are those 500ms going?
Cliff Galiher commented:
Hyper-V is going to try to organize requests when under load. It may decide that delaying one VM to handle others is better than letting the writes go through in a true FIFO fashion, and it's usually right. Seeking for VM 1, then 3, then 1, then 3, then 4 will be far more expensive than doing VM 1, 1, 1 then 3, 3, 3, but the latter does show up as big latency in VM 3. Overall, the tradeoff is still a good one, which is why your SAN latency is lower than the VM latency. If left to run FIFO, the SAN latency would go up because of the extra random seeking, and that would impact all VMs significantly.
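A toy model of why grouping helps. Counting how often consecutive requests come from different VMs is a rough proxy for expensive head movement between VHDs; the request sequence below is made up for illustration.

```python
# Compare VM-to-VM transitions for a FIFO interleaved order vs. a
# batched (grouped per VM) order. Fewer transitions ~ fewer big seeks.

def transitions(order):
    """Count how often consecutive requests come from different VMs."""
    return sum(1 for a, b in zip(order, order[1:]) if a != b)

fifo = [1, 3, 1, 3, 4, 1, 3, 4, 1]   # served strictly in arrival order
batched = sorted(fifo)               # grouped per VM: 1,1,1,1,3,3,3,4,4

print(transitions(fifo), transitions(batched))  # prints: 8 2
```

The batched order cuts the cross-VHD seeks from eight to two, at the cost of making the last VM in each batch wait, which is the extra latency the guest sees.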

devon-lad (Author) commented:
Ah, I see.

Ok - thanks for your help once again Cliff.

More disks then.
devon-lad (Author) commented:
Cliff - I was just thinking about this some more. The VM showing the highest latencies is an RD Session Host, and the only other VMs on the same host are a DC and a licensing server. So the session host is pretty much the only VM doing anything on that host, which means there should be no need for Hyper-V to intervene and reorganise disk requests from different VMs, because the lion's share would be coming from the session host.

That being the case, why would the latencies on the SAN for the virtual disk used by this particular Hyper-V host be so different from the latencies reported by the session host VM?
Cliff Galiher commented:
Because of the resequencing, as I explained. Background tasks are still running on those other VMs, and it doesn't take much on that few spindles.

Let's look at an example using arbitrary fake numbers (easy math).

Your SAN is getting saturated and starts reporting an additional 5ms write delay per write request.

You have three VMs all trying to write.

Hyper-V stacks three writes for VM1, then three writes for VM2, then three writes for VM3. The SAN won't have to bounce around writing to the VHDs as much.

VM1 only suffers the latency of the SAN: 5ms.
VM2 had to wait for the three writes of VM1. By its first write, it has waited 15ms, plus the SAN hasn't had time to catch up, so latency is still 5ms.

By the time VM3 can write, it has waited through VM1 *and* VM2. It'll report a delay of 30ms, but the SAN is still only reporting 5ms. That's a 6x difference.

See what I mean? You have saturated your SAN. I almost guarantee it. But you can use Hyper-V performance counters to see for yourself.
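The arithmetic above can be sketched in a few lines of Python. This is the same toy model, not a Hyper-V simulation: each write occupies the SAN for 5ms, and each VM's first-write latency is its queue wait plus the SAN's own 5ms.

```python
# Three VMs, three queued writes each, served as one batch per VM.

SAN_LATENCY_MS = 5
WRITES_PER_VM = 3

elapsed = 0
observed = {}  # latency each VM sees on its first write
for vm in (1, 2, 3):
    observed[vm] = elapsed + SAN_LATENCY_MS    # queue wait + SAN latency
    elapsed += WRITES_PER_VM * SAN_LATENCY_MS  # this VM's batch occupies the SAN

print(observed)  # VM1 sees 5ms, VM2 20ms, VM3 35ms; the SAN reports 5ms throughout
```

The SAN honestly reports 5ms per request the whole time; the extra latency the later VMs observe lives entirely in the queue in front of the SAN.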
devon-lad (Author) commented:

For some reason, I never thought of checking the counters on the host itself; I was only looking at the VM and the SAN. You were right - the host showed a lot of nasty spikes in latency.

I've stuck 8 x 15k disks in RAID 10 into the host and moved the problem VM's VHD onto it. It's trundling along with under 10ms latency most of the time, only spiking to 20ms now and then, so a big improvement.

Thanks for your help.