Network performance in Hyper-V with VMQ and NIC teaming

This follows on from a previous thread about a VMQ issue on this setup.

4 x Hyper-V hosts, each with 4 x 1Gbps NICs.  On each host, 1 NIC is dedicated to the management subnet and the other 3 are teamed using LACP.  The NIC team is assigned to a virtual switch.

Although we initially saw marked performance improvements after disabling VMQ, things began to slow down again as we put the system under more load while apps/users were migrated from the old system.

I have since found a Broadcom driver update for Win2012 which is supposed to fix the VMQ issues.  So I've installed this and enabled VMQ on all NICs.  There have been reports of some performance improvement since doing this.

However, the most noticeable bottleneck is still between VMs on the same host using the same virtual switch.

I should point out that we're still in the testing/migration stage, so the NIC teams are currently broken, with 1 or 2 of their members being used to provide connections from the old system.  These NICs have been removed from the teams in the management OS on each host, so on one of the hosts there is a team with only one member.

So I understand that the untidy configuration may be causing problems at the moment - but speeds are still significantly slower than I'd expect from even a single 1Gbps card.

I've been doing some research on VMQ, specifically these sources...

and I'm unclear on whether I should be explicitly assigning processor cores to team members - or is this a pointless exercise because I'm using LACP on the switch?  Or should I switch to switch-independent teaming in Hyper-V Port mode and then assign processors to NICs?

Also, I don't understand why any of this would affect traffic between VMs on the same virtual switch.  Does the traffic still go to the network card?  I thought it was handled within Hyper-V.

Any suggestions/advice appreciated.
Cliff Galiher Commented:
If VMs are on the same virtual switch, the traffic will not go through the card. However, parts of the card's driver code necessarily run in the switch's memory space, so a bad driver can still kill performance. So my suggestions are as follows:

1) Disable VMQ. Even when a driver claims to fix the Broadcom bug, VMQ offers *zero* benefit on 1Gb NICs. In fact, Windows will not use VMQ even when it is enabled. Only the card's firmware notices the difference (which is what triggers the bug in Broadcoms).

2) If at all possible, don't use Broadcom NICs. They are consistently the cause of more pain and suffering than any other single thing in Windows. But their silicon is cheap, making for better markup margins, so you see them as the built-in adapters in budget servers. I strongly encourage disabling any Broadcom NICs and buying better cards. Intel is a performance booster almost every time.
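Point 1 can be scripted across the hosts; a minimal sketch using the inbox NetAdapter cmdlets on Server 2012 (the '1 Gbps' link-speed filter is an assumption - swap in your adapter names if it matches the wrong ports):

```powershell
# Show the current VMQ state of every adapter
Get-NetAdapterVmq

# Disable VMQ on all physical 1Gbps adapters
Get-NetAdapter -Physical |
    Where-Object { $_.LinkSpeed -eq '1 Gbps' } |
    Disable-NetAdapterVmq
```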

devon-lad (Author) Commented:

Thanks for the quick response.

So enabling VMQ (even if it was working) would not be the cause of the perceived performance improvements, as only 1Gbps cards are used?

What about the suggested registry fix? (Forgot to mention this - haven't tried it.)

Regarding changing NICs to Intel - I will bear this in mind for future, but for the moment we're stuck with BCM.
Cliff Galiher Commented:
You can stop reading at "But I started reading about VMQ and I wanted it!"

That tells you everything you need to know. They wanted it and were bound and determined to enable it, even though it does *nothing* for them.

As I said above, even with VMQ enabled on a 1Gb NIC, Windows will not use VMQ; the registry entry forcibly changes this behavior so Windows will use VMQ. But here's the thing.....

VMQ speeds up network traffic by sending network traffic to multiple cores for processing, reassembling frames into packets, and so forth.

But here's the rub. A reasonably modern processor core can easily do that task for 1Gb/s. It can, in fact, handle roughly 4Gb/s, give or take, depending on the exact processor speed and model. So the processor is already sitting idle, waiting for the NIC to fire off an interrupt saying it has more data to send. Spreading that across another core just leaves two cores needing to coordinate, both sitting idle more often than not. You can actually take a performance hit by applying that registry key.

So yes, s/he probably saw an improvement.  Not because what they did was right, but because they were insisting on running with VMQ enabled, and the driver bug was biting them. By making the OS use VMQ, they were bypassing the bug... just by a different path. They were getting better performance than they were with the driver bug, but not as good as they could've gotten by disabling VMQ. And that is why I don't ever recommend reddit for good advice. There is no shortage of bad techs wanting to espouse their great discoveries. And no system of accountability to know/realize the advice is bad.

Just as an anecdotal aside, the only reason that registry key exists is because many years ago, a single processor wasn't fast enough to keep up with a gigabit NIC. So RSS/VMQ were of benefit. But Moore's Law has made it obsolete. But like many things in windows, there is legacy code that either exists for backwards compatibility or exists because nobody has gotten around to ripping it out during code refactor.

devon-lad (Author) Commented:
Thanks for the comprehensive explanation.

So I'll go back and disable the VMQ and report back on speeds between VMs on the same virtual switch, different virtual switch and different hosts and see where we stand.  Will need to be done later as the dev team are using the system at present.
Philip Elder (Technical Architect - HA/Compute/Storage) Commented:
VMQ is a mechanism that allows virtual network traffic to be processed by the various cores in a CPU. Without VMQ, only one core processes those packets.

In a Gigabit setting the point is moot, since 100MB/second or thereabouts is not going to tax any modern CPU core.

In a 10GbE setting where we have an abundance of throughput available to the virtual switch things change very quickly. We can then see a single core processing the entire virtual switch being a bottleneck.

In that setting, and beyond, VMQ starts tossing vSwitch processes out across the CPU's cores to distribute the load. Thus, we essentially eliminate the CPU core as a bottleneck source.
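On 10GbE hosts where VMQ does pay off, the queues can be spread across specific cores; a sketch, where the adapter name and core numbers are placeholders to adapt to your own host's topology:

```powershell
# Pin this adapter's VMQ queues to cores 2 onwards, using up to 8 cores
# (values depend on the host's NUMA/core layout; '10GbE-Port1' is a
# placeholder adapter name)
Set-NetAdapterVmq -Name '10GbE-Port1' -BaseProcessorNumber 2 -MaxProcessors 8
```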

For whatever reason, Broadcom has not disabled this setting in their 1Gb NICs as per the spec. This has caused no end of grief over the last number of years for countless Hyper-V admins. With Broadcom NICs, one needs to disable Virtual Machine Queues (VMQ) on _all_ Broadcom Gigabit physical ports in a system to avoid what becomes a vSwitch traffic bottleneck.
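A hedged sketch of doing that across all Broadcom Gigabit ports at once, matching on the driver description (verify what the filter catches before running it on a production host):

```powershell
# Disable VMQ on every Broadcom physical port, then confirm the result
Get-NetAdapter -Physical |
    Where-Object { $_.InterfaceDescription -like '*Broadcom*' } |
    Disable-NetAdapterVmq

Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled
```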
devon-lad (Author) Commented:
Ok, so VMQ is disabled again - but still seeing very slow performance, particularly noticeable during logon when profiles are being loaded and GPs are being applied.

Profiles are all pretty small at the moment i.e. less than 50MB, but taking up to 30 seconds to load.  So going to be pretty slow when the system is in production and profiles start growing.  Group policy can take another 30 seconds - maybe around 15 GPs applied per user.

Users are logging into a session host (Win2008R2) which is on the same physical host (Win2012R2) as the profile server (Win2012R2) and same virtual switch.  So I would have thought a 50MB profile would load instantly.

However, I might be barking up the wrong tree here assuming it was a Hyper-V-related issue as I've just tested copying a 400MB file to and from each of the servers and it takes around 2 seconds.

Because of the VMQ issue on the previous thread (which was solved by Cliff's instruction to disable VMQ) I had wrongly assumed a similar issue had arisen.

Starting to look more like DNS related.
devon-lad (Author) Commented:
Ok - so there were some DNS issues caused by a multi-homed DC providing AD replication between the old and new system.  Have got rid of this and put a switch in to handle the routing.  So now all working ok - no errors in logs - dcdiag all fine.

However, still experiencing slow logins - loading profiles and group policy.  Quite often the roaming profile isn't synced due to slow network performance.  Profiles are very small - less than 100MB - yet, transferring 500MB files between the session hosts and profile server takes a second or so.
Cliff Galiher Commented:
Turn on performance counters. Find out where the bottleneck is. Don't guess - that got you in trouble earlier in this thread. If I were a gambling man, I'd guess one or more of your VMs is just plain misbehaving. Common culprits include, but are not limited to:

Slow disks/bad disk setup. (You wouldn't believe how many times I see people run 4 VMs on a set of four 7.2k consumer SATA disks in RAID 5 and then act surprised that they get bad performance.) Copying one large file will efficiently use a RAID cache, but a bunch of users with profiles scattered everywhere will not. Disk I/O matters.

Ignoring other networking bottlenecks. The profile server may not be where the hang is occurring. It could be group policies being applied from a DC that isn't on the same network segment - the external network matters. Or it could be a bad printer driver. Or....

A corrupt VM. I see poor (and unpredictable) performance far too often because someone decided to follow "advice I read on the internet" when they moved to Hyper-V and just grabbed disk2vhd and converted their existing servers - old drivers in the HAL and all. There is a reason Microsoft has never supported this method of P2V; it injects difficult-to-troubleshoot, and nearly impossible-to-fix, issues on a regular basis.

Now mind you, I don't know which of these (if any) applies to you. But since 500MB files are copying fine, we've virtually (excuse the pun) eliminated the virtual switch and/or VMQ as the culprit. Which leaves way too many other possibilities to effectively just start guessing at this point. I only mention the above, not as guesses, but as an illustration of how broad the potential causes can be in any given situation.
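As a starting point for "turn on performance counters", something like the following captures the usual suspects on a host; the counter set, sample window, and output path are only suggestions:

```powershell
# Sample disk, CPU and NIC counters every 5s for 5 minutes, saved to a log
$counters = @(
    '\PhysicalDisk(_Total)\Avg. Disk Queue Length',
    '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer',
    '\Processor(_Total)\% Processor Time',
    '\Network Interface(*)\Bytes Total/sec'
)
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 60 |
    Export-Counter -Path 'C:\PerfLogs\baseline.blg' -FileFormat BLG
```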
devon-lad (Author) Commented:
Cliff - thanks for that.  Yes, already have performance counters on.  The hosts aren't showing any kind of disk bottlenecks or any other kind of stress.  It's a fibre channel SAN with 15k SAS disks - so very much under-utilised at the moment.  These are all new VMs, not converted from physical machines.

Anyway, we're going off at a tangent now - so will leave it at that.

devon-lad (Author) Commented:
FYI - I had been looking at disk performance counters on the hosts.  Figured I'd get it from the horse's mouth and check the performance counters on the SAN itself.  They show average IOPS at 300 - ok, reasonable, but a bit high.  But the max IOPS is pegged at around 1.3k which, if my maths is correct, is roughly the maximum IOPS you'd expect from 10 x 15k disks in RAID 10.
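For anyone checking the maths, a back-of-envelope sketch (assuming roughly 180 IOPS per 15k SAS spindle - the real figure varies with drive model and access pattern):

```powershell
$spindles     = 10
$iopsPerDisk  = 180   # rough figure for one 15k SAS disk
$writePenalty = 2     # RAID 10: each logical write costs two physical writes

$maxReadIops  = $spindles * $iopsPerDisk                   # 1800 (all reads)
$maxWriteIops = ($spindles * $iopsPerDisk) / $writePenalty # 900 (all writes)
```

A sustained ~1.3k sits between the all-write and all-read ceilings, which is plausible for a mixed workload; controller cache can push short bursts well beyond either figure.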
Cliff Galiher Commented:
IOPS are good for measuring theoretical maximum performance, but not as good at determining if you are hitting real world limits, given other factors such as fragmentation, odd disk/controller/firmware/cache interactions, etc. Disk queue depth is better at finding bottlenecks, especially when measured over time and looking for correlations to perceived slowdowns.
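A sketch of measuring queue depth over time along those lines; the interval, duration, and output path are arbitrary choices:

```powershell
# Log per-disk queue length every 15s for an hour, then look for sustained
# high values that line up with the reported slowdowns
Get-Counter '\PhysicalDisk(*)\Current Disk Queue Length' `
    -SampleInterval 15 -MaxSamples 240 |
    ForEach-Object { $_.CounterSamples } |
    Select-Object Timestamp, Path, CookedValue |
    Export-Csv 'C:\PerfLogs\disk-queue.csv' -NoTypeInformation
```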
devon-lad (Author) Commented:
Ah...maybe my calculation was incorrect - it's just jumped to 1.9k.  Anyway - higher than expected for the current load.

Queue lengths on the hosts are averaging 1 or below - and are only spiking out of hours during backup etc.

So on the face of it, it looks ok doesn't it??
Cliff Galiher Commented:
Anecdotally, yes.
devon-lad (Author) Commented:
Have been doing some further monitoring/testing.  Obviously my max IOPS calculation is totally wrong as I've seen the SAN jump to almost 4000 IOPS sometimes.  

Disk latency on each of the hosts rarely goes above 2ms, most of the time below 1ms.  

So certainly appears that the SAN is up to the job.

But I've just noticed something.  Although VMQ has been disabled on all NICs, it still shows as Enabled on the NIC team in the multiplexor driver.  Could this be the issue?
Cliff Galiher Commented:
Shouldn't be. And again, since raw file copies aren't having a performance issue, you really have eliminated the lower levels of the network stack, from NIC through the virtual switch. It doesn't even seem to be a worthwhile place to keep looking.
devon-lad (Author) Commented:
Get-NetAdapterVmq shows disabled for the NIC team even though it shows as enabled in the driver properties.

I found yesterday that I was unable to get more than around 320Mbps on the virtual disk assigned to a particular host (checking the SAN performance counters) - this was the same whichever type of file transfer I tested (host to host, vm to host, vm to vm)

I also found that file transfer speeds were averaging 200Mbps (taken from the Windows file copy dialog) - again source/destination didn't seem to affect this.

I've just disabled VMQ on the NIC team and now getting bursts of almost 3Gbps throughput for the same host on the SAN.
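For anyone hitting the same thing: the team surfaces as its own adapter via the multiplexor driver, so VMQ has to be disabled on it separately from the physical members. A sketch, where 'VMTeam' is a placeholder for your team interface's name:

```powershell
# Check VMQ on the team adapter (Microsoft Network Adapter Multiplexor
# Driver) as well as the members, then disable it on the team itself
Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled

Disable-NetAdapterVmq -Name 'VMTeam'
```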

Host to host is now around 900Mbps.

VM to VM on the same host is still slow.  I'm wondering if I should recreate the virtual switches, as they were originally created when VMQ was enabled on everything?
devon-lad (Author) Commented:
Just looking at the SAN counters again - it's now showing max transfer speeds of 5Gbps and IOPS of almost 9000.

These are the best I've seen so far - could this have all still been VMQ related?

To clarify for others  - the SAN connections themselves are not affected by the VMQ setting - but the jump in throughput would indicate the hosts are able to move data around quicker now.
devon-lad (Author) Commented:
VM to VM on same host is actually ok.  Single file transfer is around 1.5Gbps now.
Philip Elder (Technical Architect - HA/Compute/Storage) Commented:
I suggest reading Jose Barreto's blog: Using file copy to measure storage performance – Why it's not a good idea and what you should do instead.

IOmeter, Jetstress, SQLIO, and DiskSpd are utilities designed to truly iron out how storage is going to perform. We primarily use IOmeter at this time to test all storage-related environments.

As Jose says, file copies are not the way to test for performance.
devon-lad (Author) Commented:
Apologies for the delay in tidying this up.

I think my testing has generally been a bit inconsistent, and so I was led to the wrong conclusions.

I believe disabling VMQ has improved things, but I would say the real culprits here have been two issues - AD replication problems caused by incorrect routing tables (profiles are held on DFS shares), and possibly the adverse effects of IPv6 being unbound from all NICs (both physical and virtual).

After tidying all these items up, performance is as I would expect it to be.

I think Cliff summed up the VMQ issue in his first post - which was what the original question was about.