1 CPU thread pegged at 100% at almost all times by ntoskrnl.exe

We've had users complain about the slowness of a Hyper-V VM running on an HP ProLiant DL380p Gen8 server. I can't see anything wrong with the VM itself, but the host server is showing strange issues.

The first thing I noticed is that one of the CPU threads is almost constantly pegged at 100% by System:

[Image: 1 CPU thread at 100%]

Digging deeper with Process Explorer, I can see that it is ntoskrnl.exe that is the cause, but that is the system kernel, so it doesn't narrow things down much:

[Image: Process Explorer]

Deeper still with Windows Performance Analyzer, the exact libraries in use are hal.dll and PSHED.dll:

[Image: Windows Performance Analyzer]

My thinking was that this is either a driver or a hardware issue, based on the components involved. So I started with some hardware checks using HP's System Management Homepage. The first thing I noticed was that one of the 4 memory modules was reporting "degraded". I had the bad module removed and booted back up, and everything looked fine initially. But about an hour later a different CPU thread was pegged at 100%.

Ok, so maybe it's a driver or different device issue. No errors showing in device manager, so I went through and disabled as many non-critical devices as I could - no change.
I also updated the drivers for the HP iLO as it didn't seem properly installed - no change.

At this point I contacted HP Enterprise support. They downloaded their Active Health System logs, but didn't find any issues. They noted that the memory installed wasn't official HP SmartMemory, so they may not be getting full diagnostics data.
Shortly after the call I saw that another memory module was now degraded. This wasn't there 5 minutes earlier, when I was on the phone with HP:

[Image: 2nd bad module]

The iLO was reporting specifically that the module had "Exceeded the corrected memory error threshold":

[Image: corrected memory threshold]

Now, I could pull this new faulty module, but then the server would be down to 32GB and it could be weeks before replacements arrive. And I would imagine the odds of 2 modules being faulty are low, so I'm thinking something else is to blame here, possibly the motherboard?

This is where I am at, so any suggestions are welcome. Do I replace all the memory with HP SmartMemory? Is the motherboard faulty, and does the entire server need to be replaced?

Bear in mind that IT has no direct physical access to this server; any physical interaction is performed by local users under instruction from IT. So even a BIOS update would be a big ask, in case things go wrong. This is also a prod server, so downtime/reboots need to be kept to a minimum.
TeekayShipping Asked:
Adam Brown, Sr Solutions Architect, Commented:
HAL.DLL and PSHED.DLL are both associated with hardware. HAL.DLL is the Hardware Abstraction Layer and PSHED.DLL is the Platform-Specific Hardware Error Driver. If both of these are using up significant resources, the cause is probably going to be faulty hardware, so pulling that RAM is going to have to be the next troubleshooting step. If there is a notice that a specific piece of RAM has exceeded error thresholds, it's a very good bet that it's a bad stick of RAM. RAM is not something that handles errors well, and doesn't usually generate errors of any type.
andyalder, Saggar maker's framemaker, Commented:
It's not a DIMM issue, as that would affect all cores, not just one. ECC threshold exceeded is similar to a SMART error on a disk: it's just a pre-failure situation, unless the BIOS has disabled the module as a precaution, in which case you would be running low on RAM.

Do any of the individual VMs show excessive CPU usage and is it the same one each time?
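
The "corrected memory error threshold" can be pictured as a per-module counter: errors that ECC has already corrected are tallied, and a module is flagged once its tally passes a limit, while the server keeps running. The sketch below only illustrates that idea. The function name, the module labels, and the threshold value are hypothetical, and this is not HPE's actual algorithm.

```python
from collections import Counter
from typing import Iterable, List


def degraded_dimms(corrected_errors: Iterable[str], threshold: int) -> List[str]:
    """Return the modules whose corrected-error count exceeds `threshold`.

    `corrected_errors` is one module label per corrected error event.
    The labels and the threshold are illustrative assumptions.
    """
    counts = Counter(corrected_errors)
    return sorted(dimm for dimm, n in counts.items() if n > threshold)


# Hypothetical event stream: four corrected errors on DIMM 1, two on DIMM 3.
events = ["DIMM 1", "DIMM 3", "DIMM 1", "DIMM 1", "DIMM 3", "DIMM 1"]
print(degraded_dimms(events, threshold=3))  # ['DIMM 1']
```

A module flagged this way has not yet produced an uncorrected error, which is why it is treated as a pre-failure warning and not as a crash.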
TeekayShipping, Author, Commented:
"If there is a notice that a specific piece of RAM has exceeded error thresholds, it's a very good bet that it's a bad stick of RAM"

I am going to try to replace all the memory next. I just thought that having 2 modules go bad in such a short amount of time was suspicious.

"Do any of the individual VMs show excessive CPU usage and is it the same one each time?"

The VM looks fine in terms of CPU usage, nothing out of the ordinary (host on left, VM on right):

[Image: CPU compare]

The affected thread does change. Initially it was thread 0, but after removing the first bad module and rebooting, it moved to thread 4.
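
One way to confirm which thread is pegged, and whether it moves after a reboot, is to log per-thread utilization over time and flag any thread that stays near 100% in most samples. This is a minimal sketch, assuming the samples are collected separately (for example from a performance counter export). The function name, the 95% threshold, the 90% fraction, and the sample data are all illustrative and not taken from this server.

```python
from typing import List, Sequence


def pegged_threads(samples: Sequence[Sequence[float]],
                   threshold: float = 95.0,
                   min_fraction: float = 0.9) -> List[int]:
    """Return indexes of logical CPUs at or above `threshold` percent
    in at least `min_fraction` of the samples.

    Each sample is one utilization reading per logical CPU.
    """
    if not samples:
        return []
    hits = [0] * len(samples[0])
    for sample in samples:
        for cpu, pct in enumerate(sample):
            if pct >= threshold:
                hits[cpu] += 1
    needed = min_fraction * len(samples)
    return [cpu for cpu, n in enumerate(hits) if n >= needed]


# Hypothetical data: 4 logical CPUs, 5 samples, thread 0 busy throughout.
example = [
    [100.0, 3.0, 5.0, 2.0],
    [99.0, 4.0, 7.0, 1.0],
    [100.0, 2.0, 6.0, 3.0],
    [98.0, 5.0, 4.0, 2.0],
    [100.0, 3.0, 5.0, 96.0],
]
print(pegged_threads(example))  # [0]
```

Running the same check on samples taken before and after a reboot would show whether the pegged index changes, as it did here from thread 0 to thread 4.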
andyalder, Saggar maker's framemaker, Commented:
You only have one VM and you have assigned 12 virtual CPUs to it on a machine that has 12 cores available? That's very odd, there will be an unnecessary amount of context switching going on. Try reducing the number of vCPUs assigned to the VM.
Philip Elder, Technical Architect - HA/Compute/Storage, Commented:
I have two very thorough EE articles on all things Hyper-V:

Some Hyper-V Hardware and Software Best Practices
Practical Hyper-V Performance Expectations

As mentioned by @Handy Holder, assigning more vCPUs and vRAM to a VM can actually have a detrimental effect on performance.

In our long history of building servers (we've been at it since the late 1990s), we have found that installing RAM that is not on the manufacturer's approved hardware list is a bad idea. Period.

We found that out the hard way when Intel refused to honour their hardware warranty due to non-approved RAM being installed. Fortunately, the RAM vendor was okay with swapping what we had with Intel Certified RAM. And, go figure, once we swapped the RAM the spontaneous reboots the server was experiencing went away.
andyalder, Saggar maker's framemaker, Commented:
HPE don't invalidate the server warranty for non-certified RAM; they just slow it down a bit to the Intel spec rather than HPE's overclock spec, and charge you if they go out to change the motherboard but find the RAM is causing the problem. There's a statement from them somewhere to that effect. This server isn't crashing though, so any RAM errors have been ECC corrected.

Philip, do you know whether Hyper-V uses multiple threads to manage the allocation of resources to VMs or is that normally a single process?
TeekayShipping, Author, Commented:
Thanks for all the input, at least I feel I'm going in the right direction investigating the memory as the cause.

In regards to the VM setup, I wasn't involved in the design nor am I in a position to make changes, however I will certainly forward these recommendations to those that do make the decisions.

This VM setup is replicated in over 100 different locations on the same hardware, but this is the only one where I've seen this kind of issue. So I wouldn't put it down to being the cause of this particular issue.
arnold Commented:
check the VM allocations of processors to make sure you distribute them across versus

Look at the process list and which host processes are tied to the CPU 0

What manifestations are experienced as a consequence of this high utilization?