We've had users complain about slowness on a Hyper-V VM running on an HP ProLiant DL380p Gen8 server. I can't see anything wrong with the VM itself, but the host server is showing strange issues.
The first thing I noticed is that one of the CPU threads is almost constantly pegged at 100% by the System process.
Digging deeper with Process Explorer, I can see that ntoskrnl.exe is the cause, but that is the system kernel, so it doesn't narrow things down much.
Going deeper still with Windows Performance Analyzer, the exact library in use is hal.dll.
My thinking was that this is either a driver or a hardware issue, based on the components involved. So I started with some hardware checks using HP's System Management Homepage. The first thing I noticed was that one of the 4 memory modules was reporting "degraded". I had the bad module removed and booted back up, and everything looked fine initially. But about an hour later a different CPU thread was pegged at 100% again.
OK, so maybe it's a driver issue or a different device. There were no errors showing in Device Manager, so I went through and disabled as many non-critical devices as I could - no change.
I also updated the drivers for the HP iLO as it didn't seem properly installed - no change.
At this point I contacted HP Enterprise support. They downloaded their Active Health System logs, but didn't find any issues. They noted that the memory installed wasn't official HP SmartMemory, so they may not be getting full diagnostics data.
Shortly after the call I saw that another memory module was now degraded. This wasn't there 5 minutes earlier, while I was on the phone with HP.
The iLO was reporting specifically that the module had "Exceeded the corrected memory error threshold".
Now, I could pull this new faulty module, but then the server would be down to 32GB, and it could be weeks before replacements arrive. I would also imagine the odds of 2 modules being faulty at the same time are low, so I'm thinking something else is to blame here, possibly the motherboard?
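To put a rough number on that hunch, here is a minimal sketch in Python. It assumes module faults are independent of each other and uses a made-up 2% chance of any single module being faulty; I don't have a real failure rate, so `P_FAULT` and the function name are purely illustrative.

```python
from math import comb


def prob_at_least_k_faulty(n_modules: int, p_fault: float, k: int) -> float:
    """Probability that at least k of n modules are faulty,
    assuming each module fails independently with probability p_fault."""
    return sum(
        comb(n_modules, i) * p_fault**i * (1 - p_fault) ** (n_modules - i)
        for i in range(k, n_modules + 1)
    )


# Hypothetical figure, not a measured rate: 2% chance per module.
P_FAULT = 0.02

# 4 modules were installed originally; how likely are 2 or more bad ones?
print(f"P(>=2 of 4 faulty) = {prob_at_least_k_faulty(4, P_FAULT, 2):.4%}")
```

Under those assumptions two independent bad modules would be rare. If the faults share a cause (the slots, the board, or the non-HP memory itself), the independence assumption no longer holds, which is exactly why I suspect something other than the modules.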
This is where I am at, so any suggestions are welcome. Do I replace all the memory with HP SmartMemory? Is the motherboard faulty, and does the entire server need to be replaced?
Bear in mind that IT has no direct physical access to this server; any physical interaction is performed by local users under instruction from IT. So even a BIOS update would be a big ask, in case things go wrong. This is also a prod server, so downtime and reboots need to be kept to a minimum.