Jon Pletcher

asked on

CPU issues on VMHOST and child VMs

I have a host with two 6-core CPUs, for a total of 12 cores. It is one of four physical hosts in the cluster. When updating an ESXi host, I move its VMs to the other three hosts so I can put it in maintenance mode and apply the patches. I've done this a million times. Today, however, when I did this, one of the destination hosts became very unresponsive and the VMs on it ground to a halt.

I know that some of the VMs recently had additional vCPUs added, bringing them to 8 vCPUs each. One of these was on the physical host that was having issues. Unfortunately, the vCenter Server VM was also on the struggling host, and its CPU was pegged at 100% and never let up. The physical host showed that it had plenty of CPU, but I know that can be misleading: even when the host shows free CPU, a single busy VM can tie up cores and trip up other VMs that have 8 vCPUs and need 8 cores open at once to get a CPU cycle.
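To put rough numbers on that (a sketch only; the vCenter VM's vCPU count is an assumption, since it isn't stated in the thread):

```python
# Back-of-the-envelope scheduling pressure for the host described above.
physical_cores = 12   # two 6-core CPUs (from the thread)
sql_vm_vcpus = 8      # the recently widened SQL VM (from the thread)
vcenter_vcpus = 4     # ASSUMPTION: typical sizing, not stated in the thread

# Whenever the 8-vCPU VM is given a time slice, only 4 physical cores remain
# for everything else on the host, vCenter included.
leftover = physical_cores - sql_vm_vcpus
print(f"Cores left while the SQL VM runs: {leftover}")
print(f"vCenter alone wants {vcenter_vcpus} of those {leftover} cores, "
      f"so it accumulates ready time instead of running")
```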

When I moved the server with 8 vCPUs to another physical host, the vCenter Server suddenly went from 100% CPU to less than 10%. It's as if the two VMs could not be on the same physical host without problems.

Are there counters I can check that would show the VMs were having to wait for CPU cycles, even though the host appeared to have plenty of CPU available? Or does anyone have any other ideas about what may have gone wrong here?
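The counter that captures this is CPU Ready (cpu.ready.summation): time a vCPU spent ready to run but waiting for physical cores, which is exactly what stays hidden when the host-level CPU chart looks fine. In esxtop it is the %RDY column. Below is a minimal pyVmomi sketch for pulling it; the vCenter address, credentials, and VM name are placeholders, not values from this thread.

```python
# Sketch: pull cpu.ready.summation for a VM over recent realtime samples.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()
perf = content.perfManager

# Resolve the counter ID for cpu.ready.summation.
ready_id = next(c.key for c in perf.perfCounter
                if f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}"
                   == "cpu.ready.summation")

# Find the VM by name (placeholder name).
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "SQL01")

# Realtime stats are 20-second samples; instance "" is the whole-VM
# aggregate, summed across all vCPUs.
spec = vim.PerformanceManager.QuerySpec(
    entity=vm, intervalId=20, maxSample=15,
    metricId=[vim.PerformanceManager.MetricId(counterId=ready_id, instance="")])
result = perf.QueryPerf(querySpec=[spec])

vcpus = vm.config.hardware.numCPU
for info, point in zip(result[0].sampleInfo, result[0].value[0].value):
    # The counter is milliseconds of ready-but-not-running time per 20 s
    # sample; divide by the vCPU count for a per-vCPU figure. A commonly
    # quoted rule of thumb is that sustained values above ~5% per vCPU
    # indicate contention.
    pct = point / 20000 * 100 / vcpus
    print(info.timestamp, f"{pct:.1f}% ready per vCPU")

Disconnect(si)
```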
Andrew Hancock (VMware vExpert PRO / EE Fellow / British Beekeeper)

It will be down to resources, e.g. what CPU and memory was in use?

See my EE article:

HOW TO: Performance Monitor vSphere 4.x or 5.0

What did you do to resolve it?
Jon Pletcher (Asker)
Here is what I did, in order: what led to the problem, and what resolved it.

1. Started moving VMs off HOST04 to HOST03 so I could update HOST04 with ESXi patches
2. The second-to-last VM I moved was the vCenter VM
3. The last server I moved over was an 8-vCPU SQL server
4. Suddenly all VMs on HOST03 became unusable
5. I tried to get into vCenter to see what was going on, but that VM was unusable
6. I started shutting down VMs on HOST03, hoping to at least get into vCenter
7. I was finally able to log in to the vCenter Server VM, but vCenter itself wouldn't open; it was timing out and unresponsive. I noticed its CPU was spiked at 100% and never let up
8. I hard powered down the vCenter VM and booted it back up
9. I was slowly able to open vCenter, but CPU was again at 100% and never let up
10. I migrated the last SQL server, the one with 8 vCPUs, back over to HOST04 (a scripted equivalent of this migration is sketched after the list)
11. As soon as the migration finished, the vCenter Server's CPU dropped to almost zero, and all VMs worked fine again
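For reference, the migration in step 10 can also be scripted. A minimal pyVmomi sketch; the VM name, target host name, and credentials are assumptions, not values from this thread:

```python
# Sketch: vMotion a VM to another host with pyVmomi.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

def find(vimtype, name):
    """Return the first managed object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

vm = find(vim.VirtualMachine, "SQL01")    # placeholder VM name
target = find(vim.HostSystem, "HOST04")   # placeholder host name

# MigrateVM_Task performs the vMotion; this priority matches the UI default.
task = vm.Migrate(pool=None, host=target,
                  priority=vim.VirtualMachine.MovePriority.defaultPriority)
WaitForTask(task)
Disconnect(si)
```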
Okay, so you should have some good statistics to look at in the vCenter Server performance charts...
Unfortunately, it looks like vCenter was hung the whole time, so statistics are missing for the time of the incident.
The host had plenty of RAM, but now I am seeing vCenter alerting about a bad memory module in that physical host. That could very well be what was causing the issue. It wasn't alerting at the time of the loss of responsiveness, but it started shortly after.

Given I don't have stats to look at from when the problem was happening (vCenter was locked up and not collecting stats), I'm going to have to guess this was the issue.
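For future incidents, note that each ESXi host serves the same API directly and keeps roughly the last hour of 20-second realtime samples locally, so stats can still be read while vCenter is down. A minimal sketch, with the hostname and credentials assumed:

```python
# Sketch: connect straight to an ESXi host (no vCenter needed) and read its
# quick stats; the earlier QueryPerf sketch works unchanged against this
# connection. Hostname and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="host03.example.com", user="root", pwd="password",
                  sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)
host = view.view[0]   # the only HostSystem when connected directly
qs = host.summary.quickStats
print(host.name, f"{qs.overallCpuUsage} MHz CPU, "
                 f"{qs.overallMemoryUsage} MB RAM in use")
Disconnect(si)
```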

Thank you guys for your help. You were possibly right about RAM being the issue, just not from oversubscribing, but from the RAM going physically bad.
If you have any sort of management processor like IPMI, AMT, iLO, IMM, or DRAC, you can find and remove the failing module (and replace it under warranty).
Yep, we have iLO cards, and I will be contacting HP for a replacement.
While you're contacting them, you can ask whether they'll let you remove the module to restore partial functionality in the meantime.
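Alongside iLO/IPMI, the hardware health data that triggered the vCenter alert is also exposed through ESXi's own API. A minimal pyVmomi sketch for listing memory module status, with the hostname and credentials assumed:

```python
# Sketch: list memory hardware health as ESXi reports it (the same data the
# vCenter hardware-status alarm reads). Hostname and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="host03.example.com", user="root", pwd="password",
                  sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)
host = view.view[0]

health = host.runtime.healthSystemRuntime
if health and health.hardwareStatusInfo:
    for mem in health.hardwareStatusInfo.memoryStatusInfo or []:
        # status.key is typically "Green", "Yellow", "Red", or "Unknown"
        print(mem.name, "->", mem.status.key)
Disconnect(si)
```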