devon-lad

asked on

Network performance in Hyper-V with VMQ and NIC teaming

This follows on from https://www.experts-exchange.com/questions/28591820/Poor-network-throughput-in-Hyper-V-guests.html

4 x Hyper-V hosts, each with 4 x 1Gbps NICs.  1 NIC on each host dedicated to management subnet, other 3 teamed using LACP.  NIC team assigned to virtual switch.

Although we initially saw marked performance improvements after disabling VMQ, things began to slow down again as we put the system under more load while apps/users were migrated from the old system.

I have since found a Broadcom driver update for Win2012 which is supposed to fix the VMQ issues, so I've installed it and enabled VMQ on all NICs.  There have been reports of some performance improvement since doing this.

However, the most noticeable bottleneck is still between VMs on the same host using the same virtual switch.

I should point out that we're still in the testing/migration stage, so the NIC teams are currently broken up, with 1 or 2 of their members being used to provide connections from the old system.  Those NICs have been removed from the teams in the management OS on each host, so on one of the hosts there is a team with only one member.

So I understand untidy configuration may be causing problems at the moment - but speeds are still significantly slower than I'd expect from a single 1Gbps card.

I've been doing some research on VMQ, specifically these sources...

http://www.hyper-v.nu/archives/mvaneijk/2013/01/nic-teaming-hyper-v-switch-qos-and-actual-performance-part-3-performance/

http://www.microsoft.com/en-us/download/details.aspx?id=30160

and I'm unclear on whether I should be explicitly assigning processor cores to team members - or is this a pointless exercise because I'm using LACP on the switch?  Or should I switch to switch-independent teaming in Hyper-V Port mode and then assign processors to NICs?
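
For reference, the kind of explicit assignment I mean would be something like the following PowerShell - the adapter names and core numbers are only placeholders for illustration:

# Example only: give each team member its own slice of cores for VMQ processing.
# "NIC2", "NIC3", "NIC4" are placeholder adapter names, not my real ones.
# (On a hyper-threaded host, only even-numbered logical processors are used for VMQ.)
Set-NetAdapterVmq -Name "NIC2" -BaseProcessorNumber 2 -MaxProcessors 2
Set-NetAdapterVmq -Name "NIC3" -BaseProcessorNumber 4 -MaxProcessors 2
Set-NetAdapterVmq -Name "NIC4" -BaseProcessorNumber 6 -MaxProcessors 2

# Check the resulting assignments
Get-NetAdapterVmq | Format-Table Name, Enabled, MaxProcessors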

Also, I don't understand why any of this would affect traffic between VMs on the same virtual switch.  Does the traffic still go to the network card?  I thought it was handled within Hyper-V.

Any suggestions/advice appreciated.
ASKER CERTIFIED SOLUTION
Cliff Galiher
devon-lad

ASKER

Cliff,

Thanks for the quick response.

So enabling VMQ (even if it was working) would not be the cause of the perceived performance improvements, as only 1Gbps cards are being used?

What about the suggested registry fix (forgot to mention this - haven't tried it yet)?

http://www.reddit.com/r/sysadmin/comments/2k7jn5/after_2_years_i_have_finally_solved_my_slow/

Regarding changing NICs to Intel - I will bear this in mind for future, but for the moment we're stuck with BCM.
SOLUTION
Thanks for the comprehensive explanation.

So I'll go back and disable VMQ and report back on speeds between VMs on the same virtual switch, on different virtual switches and on different hosts, and see where we stand.  This will need to be done later as the dev team are using the system at present.
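
For the record, I'll time the test copies with something like this (the paths are just placeholders):

# Rough timing of a test copy - source and destination paths are placeholders
$result = Measure-Command {
    Copy-Item -Path '\\VM1\TestShare\testfile.bin' -Destination 'C:\Temp\'
}
$result.TotalSeconds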
VMQ is a mechanism that allows virtual switch traffic to be processed by the various cores in a CPU. Without VMQ, only one core would be processing those packets.

In a Gigabit setting the point is moot since 100MB/Second or thereabouts is not going to tax any modern CPU core.

In a 10GbE setting where we have an abundance of throughput available to the virtual switch things change very quickly. We can then see a single core processing the entire virtual switch being a bottleneck.

In that setting, and beyond, VMQ starts tossing vSwitch processes out across the CPU's cores to distribute the load. Thus, we essentially eliminate the CPU core as a bottleneck source.

For whatever reason, Broadcom has not disabled this setting on their 1Gb NICs as the spec calls for. This has caused no end of grief over the last several years for countless Hyper-V admins. With Broadcom NICs, one needs to disable Virtual Machine Queues (VMQ) on _all_ Broadcom Gigabit physical ports in a system to avoid what becomes a vSwitch traffic bottleneck.
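
In PowerShell terms, a rough sketch of that check-and-disable would be something like the following (the Broadcom description filter is only an example - verify it against your own adapter names first):

# List the current VMQ state on every adapter
Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled

# Disable VMQ on every Broadcom physical port
$broadcom = Get-NetAdapter -Physical | Where-Object { $_.InterfaceDescription -like '*Broadcom*' }
foreach ($nic in $broadcom) {
    Disable-NetAdapterVmq -Name $nic.Name
}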
Ok, so VMQ disabled again - but still seeing very slow performance particularly noticeable during logon when profiles are being loaded and GPs being applied.

Profiles are all pretty small at the moment, i.e. less than 50MB, but are taking up to 30 seconds to load.  So it's going to be pretty slow when the system is in production and profiles start growing.  Group policy can take another 30 seconds - maybe around 15 GPOs applied per user.

Users are logging into a session host (Win2008R2) which is on the same physical host (Win2012R2) as the profile server (Win2012R2) and same virtual switch.  So I would have thought a 50MB profile would load instantly.

However, I might be barking up the wrong tree here assuming it was a Hyper-V-related issue as I've just tested copying a 400MB file to and from each of the servers and it takes around 2 seconds.

Because of the VMQ issue on the previous thread (which was solved by Cliff's instruction to disable VMQ) I had wrongly assumed a similar issue had arisen.

Starting to look more like DNS related.
OK - so there were some DNS issues caused by a multi-homed DC providing AD replication between the old and new systems.  I've got rid of this and put a switch in to handle the routing.  So now everything is working OK - no errors in the logs, and dcdiag is all fine.

However, we're still experiencing slow logins - loading profiles and applying group policy.  Quite often the roaming profile isn't synced due to slow network performance.  Profiles are very small - less than 100MB - yet they load slowly, even though transferring 500MB files between the session hosts and the profile server takes only a second or so.
Turn on performance counters. Find out where the bottleneck is. Don't guess. That got you in trouble earlier (in this thread.)  If I were a gambling man, I'd guess one or more of your VMs is just plain misbehaving.  Common culprits include, but are not limited to:

Slow disks/bad disk setup. (You wouldn't believe how many times I see people running 4 VMs on a set of 4 consumer 7.2k SATA disks in RAID 5 and then being surprised that they have bad performance.) Copying one large file will use a RAID cache efficiently, but a bunch of users with profiles scattered everywhere will not.  Disk I/O matters.

Ignoring other networking bottlenecks. The profile may not be where the hang is occurring. It could be applying group policies from a DC that isn't on the same host... so the external network matters. Or it could be a bad printer driver. Or...

A corrupt VM. I see poor (and unpredictable) performance far too often because someone decided to follow "advice I read on the internet" when they moved to Hyper-V and just grabbed disk2vhd and converted their existing servers, old drivers in the HAL and all. There is a reason Microsoft has never supported this method of P2V: it injects difficult-to-troubleshoot and nearly impossible-to-fix issues on a regular basis.

Now mind you, I don't know which of these (if any) applies to you. But since 500MB files are copying fine, we've virtually (excuse the pun) eliminated the virtual switch and/or VMQ as the culprit. Which leaves way too many other possibilities to effectively just start guessing at this point. I only mention the above, not as guesses, but as an illustration of how broad the potential causes can be in any given situation.
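
As a starting point, something like this will sample the usual suspects every few seconds (the counter names assume an English-locale install; adjust the list to taste):

# Sample disk, network, vSwitch and CPU counters every 5 seconds for 5 minutes
Get-Counter -Counter @(
    '\PhysicalDisk(_Total)\Avg. Disk Queue Length',
    '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer',
    '\Network Interface(*)\Bytes Total/sec',
    '\Hyper-V Virtual Switch(*)\Bytes/sec',
    '\Processor(_Total)\% Processor Time'
) -SampleInterval 5 -MaxSamples 60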
Cliff - thanks for that.  Yes, already have performance counters on.  The hosts aren't showing any kind of disk bottlenecks or any other kind of stress.  It's a fibre channel SAN with 15k SAS disks - so very much under-utilised at the moment.  These are all new VMs, not converted from physical machines.

Anyway, we're going off at a tangent now - so will leave it at that.

Thanks
FYI - I had been looking at the disk performance counters on the hosts.  Figured I'd get it from the horse's mouth and check the performance counters on the SAN itself.  It's showing average IOPS at 300 - OK, so reasonable but a bit high.  But the max IOPS is pegged at around 1.3k, which, if my maths is correct, is roughly the maximum IOPS you'd expect from 10 x 15k disks in RAID 10.
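
For what it's worth, my back-of-envelope maths was along these lines (the ~180 IOPS per 15k spindle and the 70/30 read/write split are just assumptions):

# Rough RAID 10 IOPS estimate - per-disk figure and read/write mix are assumptions
$disks = 10
$iopsPerDisk = 180                 # ballpark for a 15k SAS spindle
$readRatio = 0.7
$writePenalty = 2                  # RAID 10 write penalty
$raw = $disks * $iopsPerDisk
$effective = $raw / ($readRatio + (1 - $readRatio) * $writePenalty)
'{0:N0} effective IOPS' -f $effective    # ~1,385 with these numbers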
IOPS are good for measuring theoretical maximum performance, but not as good at determining if you are hitting real world limits, given other factors such as fragmentation, odd disk/controller/firmware/cache interactions, etc. Disk queue depth is better at finding bottlenecks, especially when measured over time and looking for correlations to perceived slowdowns.
Ah...maybe my calculation was incorrect - it's just jumped to 1.9k.  Anyway - higher than expected for the current load.

Queue lengths on the hosts are averaging 1 or below - and are only spiking out of hours during backup etc.

So on the face of it, it looks OK, doesn't it?
Anecdotally, yes.
Have been doing some further monitoring/testing.  Obviously my max IOPS calculation is totally wrong as I've seen the SAN jump to almost 4000 IOPS sometimes.  

Disk latency on each of the hosts rarely goes above 2ms, most of the time below 1ms.  

So certainly appears that the SAN is up to the job.

But just noticed something.  Although VMQ has been disabled on all NICs, it still shows as Enabled on the NIC team in the multiplexor driver.  Could this be the issue?
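
To check what the OS itself reports, I'm using something like this (the team shows up as the multiplexor adapter):

# VMQ state as reported for the team (multiplexor) interface
$team = Get-NetAdapter | Where-Object { $_.InterfaceDescription -like '*Multiplexor*' }
Get-NetAdapterVmq -Name $team.Name | Format-List Name, Enabled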
Shouldn't be. And again, since raw file copies aren't having a performance issue, you really have eliminated the lower levels of the network stack, from the NIC through the virtual switch. It doesn't even seem to be a worthwhile place to keep looking.
Get-NetAdapterVmq shows disabled for the NIC team even though it shows as enabled in the driver properties.

I found yesterday that I was unable to get more than around 320Mbps on the virtual disk assigned to a particular host (checking the SAN performance counters) - this was the same whichever type of file transfer I tested (host to host, VM to host, VM to VM).

I also found that file transfer speeds were averaging 200Mbps (taken from the Windows file copy dialog) - again source/destination didn't seem to affect this.

I've just disabled VMQ on the NIC team and now getting bursts of almost 3Gbps throughput for the same host on the SAN.

Host to host is now around 900Mbps.

VM to VM on the same host is still slow.  I'm wondering if I should recreate the virtual switches, as they were originally created when VMQ was enabled on everything?
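
Before rebuilding anything, I'll just check what the existing switches and VM adapters report - my understanding (not confirmed) is that a VmqWeight of 0 means VMQ is effectively off for that virtual adapter:

# Existing virtual switches and the physical adapters they're bound to
Get-VMSwitch | Format-List Name, SwitchType, NetAdapterInterfaceDescription

# Per-VM adapter VMQ weight (0 = VMQ not requested for that adapter)
Get-VMNetworkAdapter -VMName * | Format-Table VMName, SwitchName, VmqWeight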
Just looking at the SAN counters again - it's now showing max transfer speeds of 5Gbps and IOPS of almost 9000.

These are the best I've seen so far - could this have all still been VMQ related?

To clarify for others  - the SAN connections themselves are not affected by the VMQ setting - but the jump in throughput would indicate the hosts are able to move data around quicker now.
VM to VM on same host is actually ok.  Single file transfer is around 1.5Gbps now.
SOLUTION
Apologies for the delay in tidying this up.

I think my testing has generally been a bit inconsistent, and so I was led to the wrong conclusions.

I believe disabling VMQ has improved things, but I would say the real culprits here have been two issues - AD replication problems caused by incorrect routing tables (profiles are held on DFS shares), and possibly the adverse effects of IPv6 being unbound from all NICs (both physical and virtual).
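
For anyone hitting the same thing, the IPv6 binding state is easy to check on each host and guest (the adapter name in the re-enable line is a placeholder):

# Is IPv6 still bound to each adapter? (ms_tcpip6 is the IPv6 protocol binding)
Get-NetAdapterBinding -ComponentID ms_tcpip6 | Format-Table Name, Enabled

# Re-enable it on a specific adapter if needed
Enable-NetAdapterBinding -Name 'Ethernet' -ComponentID ms_tcpip6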

After tidying all these items up, performance is as I would expect it to be.

I think Cliff summed up the VMQ issue in his first post - which was what the original question was about.