Anonymous KH asked:

VMware VMs shut down automatically

Dear Experts,

We have installed a customised ISO for ESXi 6.7 Update 1 on a ThinkSystem SR530.

It uses an Intel Xeon Silver 4114 10C 85W 2.2GHz 2400MHz processor.

We created 3 VMs on it.

The issue is that the VMs shut down without warning.

My boss says that it could be due to the processor.

Has anyone faced this issue before?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

No. Why does he suspect the processor?

Only the VMs shut down, not the host?
Anonymous KH

ASKER

He says he suspects that the CPU is causing the issue, but I don't know how or where he got that information.

The host stays operational; only the VMs shut down.
That makes no sense at all. If there were a CPU issue, the entire host would PSOD.

Have you gone through the event logs on the VMs at the time the shutdown occurred? Anything in there such as a BSOD?
A VM just shut down again!
Go through the logs, then.
Where do I navigate to view the logs?
Can you answer these questions to help figure out the solution?
Do all the VMs shut down at the same time?
Did you check the Event Viewer on each VM?
Did you check whether there are any updates or patches for ESXi?

The Event Viewer will usually reveal the cause of the shutdown.
@Anonymous KH
Right-click Computer and select Manage, then select Event Viewer.
Check mostly the System log in Event Viewer.
I have not seen all VMs shut down together yet.

System event error: The previous system shutdown at 9:22:29 am on 11/9/2019 was unexpected.

I am not too sure about patching. I know there is a U3.
An application (/bin/vmx) running on ESXi host has crashed (9 time(s) so far). A core file might have been created at /vmfs/volumes/5d5ae0ad-5751d5b0-504c-0894ef755cd8/PC01/vmx-zdump.001.
Type: Warning
Time: Wednesday, September 11, 2019, 09:42:09 +0800

Error message on PC01 on ESX in ha-datacenter: We will respond on the basis of your support entitlement.
Type: Info
Time: Wednesday, September 11, 2019, 09:42:09 +0800

PC01 on ESX in ha-datacenter is powered off
Right,

Check the VMX dump and post the error here.

Thanks
Alex
Hi! Alex,

How do I do that?
Go into vCenter, find the LUN that the VM resides on, go to the logs folder, and see if they are in there. Alternatively, they are on the ESXi host itself; you'll need to WinSCP to the box and navigate to

"/vmfs/volumes/5d5ae0ad-5751d5b0-504c-0894ef755cd8/PC01/vmx-zdump.001"

Thanks
Alex
Hi! Alex,

I generated the logs from the ESXi host instead.

I can see the folder structure down to PC01.

The files there are of type FRAG-000XX. Is it those files?

I don't see the particular file vmx-zdump.001.
Those are fragments; you can send them to VMware and they can rebuild the dump file.

It's definitely an issue with your ESXi rather than the VMs. Is it happening on just one box in a cluster, or on a standalone box?
It is just one physical Lenovo server.
Until we get the fragments rejoined into a dump (again, by VMware), I can't recommend anything. It may be something as simple as updating ESXi to a newer version.
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Is there any way to see what caused the VM to shut down? I looked in the server's event log. There are entries, but none that says something happened before the server shut down.
2019-09-22T12:41:32.608Z| vcpu-1| W115: MONITOR PANIC: vcpu-0:VMM fault 14: src=MONITOR rip=0xfffffffffc0a0000 regs=0xfffffffffc406400
2019-09-22T12:41:32.608Z| vcpu-1| I125: Core dump with build build-11675023
2019-09-22T12:41:32.608Z| vcpu-3| I125: Exiting vcpu-3
2019-09-22T12:41:32.608Z| vcpu-5| I125: Exiting vcpu-5
2019-09-22T12:41:32.608Z| vcpu-6| I125: Exiting vcpu-6
2019-09-22T12:41:32.608Z| vcpu-4| I125: Exiting vcpu-4
2019-09-22T12:41:32.608Z| vcpu-2| I125: Exiting vcpu-2
2019-09-22T12:41:32.608Z| vcpu-0| I125: Exiting vcpu-0
2019-09-22T12:41:32.608Z| vcpu-7| I125: Exiting vcpu-7
2019-09-22T12:41:32.628Z| vcpu-1| I125: Writing monitor file `vmmcores.gz`
2019-09-22T12:41:32.652Z| vcpu-1| W115: Dumping core for vcpu-0
2019-09-22T12:41:32.652Z| vcpu-1| I125: VMK Stack for vcpu 0 is at 0x451a14193000
2019-09-22T12:41:32.652Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:33.157Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:33.157Z| vcpu-1| W115: Dumping core for vcpu-1
2019-09-22T12:41:33.332Z| mks| W115: Panic in progress... ungrabbing
2019-09-22T12:41:33.332Z| mks| I125: MKS: Release starting (Panic)
2019-09-22T12:41:33.332Z| mks| I125: MKS: Release finished (Panic)
2019-09-22T12:41:34.160Z| vcpu-1| I125: VMK Stack for vcpu 1 is at 0x451a04713000
2019-09-22T12:41:34.160Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:34.668Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:34.669Z| vcpu-1| W115: Dumping core for vcpu-2
2019-09-22T12:41:34.669Z| vcpu-1| I125: VMK Stack for vcpu 2 is at 0x451a1a693000
2019-09-22T12:41:34.669Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:35.167Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:35.167Z| vcpu-1| W115: Dumping core for vcpu-3
2019-09-22T12:41:35.167Z| vcpu-1| I125: VMK Stack for vcpu 3 is at 0x451a15d93000
2019-09-22T12:41:35.167Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:35.666Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:35.666Z| vcpu-1| W115: Dumping core for vcpu-4
2019-09-22T12:41:35.666Z| vcpu-1| I125: VMK Stack for vcpu 4 is at 0x451a13b93000
2019-09-22T12:41:35.666Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:36.160Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:36.160Z| vcpu-1| W115: Dumping core for vcpu-5
2019-09-22T12:41:36.161Z| vcpu-1| I125: VMK Stack for vcpu 5 is at 0x451a09193000
2019-09-22T12:41:36.161Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:36.657Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:36.657Z| vcpu-1| W115: Dumping core for vcpu-6
2019-09-22T12:41:36.657Z| vcpu-1| I125: VMK Stack for vcpu 6 is at 0x451a19d13000
2019-09-22T12:41:36.657Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:37.151Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:37.152Z| vcpu-1| W115: Dumping core for vcpu-7
2019-09-22T12:41:37.152Z| vcpu-1| I125: VMK Stack for vcpu 7 is at 0x451a1ad93000
2019-09-22T12:41:37.152Z| vcpu-1| I125: Beginning monitor coredump
2019-09-22T12:41:37.645Z| vcpu-1| I125: End monitor coredump
2019-09-22T12:41:49.333Z| vcpu-1| W115: A core file is available in "/vmfs/volumes/5d5ae0ad-5751d5b0-504c-0894ef755cd8/SVR/vmx-zdump.001"
2019-09-22T12:41:49.333Z| vcpu-1| I125: Msg_Post: Error
2019-09-22T12:41:49.333Z| vcpu-1| I125: [msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vcpu-1)
2019-09-22T12:41:49.333Z| vcpu-1| I125+ vcpu-0:VMM fault 14: src=MONITOR rip=0xfffffffffc0a0000 regs=0xfffffffffc406400
2019-09-22T12:41:49.333Z| vcpu-1| I125: [msg.panic.haveLog] A log file is available in "/vmfs/volumes/5d5ae0ad-5751d5b0-504c-0894ef755cd8/SVR/vmware.log".  
2019-09-22T12:41:49.333Z| vcpu-1| I125: [msg.panic.requestSupport.withoutLog] You can request support.  
2019-09-22T12:41:49.333Z| vcpu-1| I125: [msg.panic.requestSupport.vmSupport.vmx86]
2019-09-22T12:41:49.333Z| vcpu-1| I125+ To collect data to submit to VMware technical support, run "vm-support".
2019-09-22T12:41:49.333Z| vcpu-1| I125: [msg.panic.response] We will respond on the basis of your support entitlement.
2019-09-22T12:41:49.333Z| vcpu-1| I125: ----------------------------------------
Looks like the VM just crashed and stopped.
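For anyone who has to check several vmware.log files for the same crash signature, here is a minimal Python sketch that pulls out the MONITOR PANIC lines. It assumes the timestamp| thread| level: message layout seen in the excerpt above; the function name and the returned field names are made up for illustration and are not part of any VMware tool.

```python
import re

# Prefix layout taken from the vmware.log excerpt: "<timestamp>| <thread>| <level>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\| (?P<thread>[^|]+)\| (?P<level>[A-Z]\d+)[:+] (?P<msg>.*)$"
)
# Shape of the panic message as it appears in the excerpt.
PANIC_RE = re.compile(
    r"MONITOR PANIC: (?P<vcpu>vcpu-\d+):VMM fault (?P<fault>\d+): "
    r"src=(?P<src>\w+) rip=(?P<rip>0x[0-9a-f]+)"
)


def find_monitor_panics(log_text):
    """Return one dict per MONITOR PANIC line found in the given log text."""
    panics = []
    for line in log_text.splitlines():
        prefix = LINE_RE.match(line.strip())
        if prefix is None:
            continue  # not a log line in the expected layout
        panic = PANIC_RE.search(prefix.group("msg"))
        if panic is None:
            continue
        panics.append(
            {
                "time": prefix.group("ts"),
                "reporting_thread": prefix.group("thread"),
                "faulting_vcpu": panic.group("vcpu"),
                "fault": int(panic.group("fault")),
                "source": panic.group("src"),
                "rip": panic.group("rip"),
            }
        )
    return panics
```

Running it over the logs of each VM shows whether every crash reports the same fault number and rip value, which is useful detail to hand to the hardware vendor or VMware.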
Hi! Andrew,

Is this issue related to the processor?

I saw an article that says the processor can have an EPT misconfiguration. Could this be the issue?

If not, I really don't know what is causing the server to crash. I also have no way to contact VMware other than through this channel.
Hi! Andrew,

Yes, this time the main server crashed.
I read these:

1. https://kb.vmware.com/s/article/50113028
2. https://support.hpe.com/hpsc/doc/public/display?docId=mmr_sf-EN_US000018412

They say that it is a processor issue.

But I cannot find grounds for upgrading the BIOS and firmware, as I don't see any matching fix in the release notes.
1. Do you see "EPT misconfiguration" in the logs (vmkernel.log)?

2. The VMware KB states...

Confirm the BIOS, Firmware, and processor microcode are all up-to-date. If the issue persists the CPU will need to be replaced by the hardware vendor.
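
If you have copied vmkernel.log off the host, a small Python sketch like the one below can do the search for you. The function name is hypothetical, and the only thing it assumes about the log is that the phrase "EPT misconfiguration" would appear somewhere on a line, as the question above suggests.

```python
def find_marker_lines(log_text, marker="EPT misconfiguration"):
    """Return (line_number, line) pairs whose text contains the marker, ignoring case."""
    needle = marker.lower()
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if needle in line.lower()
    ]
```

Read the file with open(path, errors="replace") and pass its contents in. An empty result only means the phrase is absent from that one file, so rotated copies of the log would need the same check.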
Hi! Andrew,

There is no record that the hardware or processor is at fault.

In the vmkernel.log, is there any text that I can look out for to see if there is anything wrong with the ESXi OS?
Hi! Andrew,

Is this a critical error?

2019-09-29T05:40:34.859Z cpu12:2097867)ScsiDeviceIO: 3068: Cmd(0x459a4af67940) 0x85, CmdSN 0x1e00 from world 2099310 to dev "naa.600605b00df1846024ed98c76df02c79" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-09-29T08:10:35.596Z cpu3:2097867)ScsiDeviceIO: 3068: Cmd(0x459a71a3f200) 0x85, CmdSN 0x1e14 from world 2099310 to dev "naa.600605b00df1846024ed98c76df02c79" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-09-29T10:40:36.318Z cpu1:2097867)ScsiDeviceIO: 3068: Cmd(0x459a6ec4b900) 0x85, CmdSN 0x1e28 from world 2099310 to dev "naa.600605b00df1846024ed98c76df02c79" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-09-29T13:10:37.055Z cpu1:2097867)ScsiDeviceIO: 3068: Cmd(0x459a4ae10740) 0x85, CmdSN 0x1e3c from world 2099310 to dev "naa.600605b00df1846024ed98c76df02c79" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-09-29T15:40:37.804Z cpu11:2097867)ScsiDeviceIO: 3068: Cmd(0x459a421288c0) 0x85, CmdSN 0x1e50 from world 2099310 to dev "naa.600605b00df1846024ed98c76df02c79" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-09-29T18:10:38.532Z cpu10:2097867)ScsiDeviceIO: 3068: Cmd(0x459a6edc7040) 0x85, CmdSN 0x1e64 from world 2099310 to dev "naa.600605b00df1846024ed98c76df02c79" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
That error is just noise and not critical.
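
To show why, here is a rough Python sketch that decodes one of those ScsiDeviceIO lines. The lookup tables are deliberately partial and only cover the values that appear in the lines above; the names come from the SCSI standards (opcode 0x85 is ATA PASS-THROUGH (16), device status 0x2 is CHECK CONDITION, sense key 0x5 is ILLEGAL REQUEST, and ASC/ASCQ 0x20/0x00 is INVALID COMMAND OPERATION CODE). The function name and the returned keys are invented for this example.

```python
import re

# Field layout copied from the ScsiDeviceIO lines quoted in this thread.
SCSI_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+),.*? to dev "(?P<dev>[^"]+)" failed '
    r"H:(?P<host>0x[0-9a-f]+) D:(?P<device>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) "
    r"Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)"
)

# Partial tables: only the codes seen in the thread are listed.
OPCODES = {0x85: "ATA PASS-THROUGH (16)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}


def decode_scsi_failure(line):
    """Translate the hex fields of one ScsiDeviceIO failure line into names."""
    match = SCSI_RE.search(line)
    if match is None:
        return None  # line does not have the expected layout
    opcode = int(match.group("opcode"), 16)
    device_status = int(match.group("device"), 16)
    sense_key = int(match.group("key"), 16)
    asc = int(match.group("asc"), 16)
    ascq = int(match.group("ascq"), 16)
    return {
        "device": match.group("dev"),
        "command": OPCODES.get(opcode, "unknown"),
        "host_status": int(match.group("host"), 16),
        "device_status": DEVICE_STATUS.get(device_status, "unknown"),
        "sense_key": SENSE_KEYS.get(sense_key, "unknown"),
        "additional_sense": ASC_ASCQ.get((asc, ascq), "unknown"),
    }
```

Decoded this way, each line says the device rejected a command it does not implement, with host status 0x0 (no transport problem). That fits a rejected pass-through request rather than a failing disk.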

1. Do you see "EPT misconfiguration" in the logs (vmkernel.log)?

Troubleshooting...

1. Update all firmware.
2. Update ESXi OS.

Check if VMs and ESXi still crash.
No, I don't see EPT misconfiguration in the logs.

I cannot update the firmware and ESXi OS as this is a production server.

I do not have any evidence to say that upgrading the firmware or OS will resolve the issue.
Okay, well, if you don't see any EPT misconfiguration in the logs, then the processor issue described in the VMware KB is possibly not your issue.

I'm afraid we all have to upgrade production servers at some time. It's called emergency downtime, and I'm sure that updating your server is better than having a production server that keeps crashing and VMs that are not stable.

The first things to check with most ESXi problems are:

1. Hardware fault - so contact Lenovo.
2. Firmware issues - update firmware.
3. ESXi OS out of date - update ESXi 6.7 to latest version.

Only once you have done 2 and 3 will you know whether you have faulty hardware. If the crashes continue, you'll have to test your hardware, memory and CPU, which also means taking the server down for testing.
Hi! Andrew,

I have sent the logs to Lenovo, and Lenovo said that there is nothing wrong with the hardware.

For the firmware upgrade, I need to give my boss a technical explanation of how the new firmware can resolve the issue, and I have failed to find one.
It's a general recommendation for fault finding.
Do you have a step-by-step guide on how I can upgrade the ESXi OS and firmware?
1. Power off all VMs.

2. Enable Maintenance Mode.

3. If your host has internet access, type the following:

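# Temporarily open the outbound HTTP client rule so the host can reach the online depot
esxcli network firewall ruleset set -e true -r httpClient
# Apply the image profile from the VMware online depot
esxcli software profile update -p ESXi-6.7.0-20190802001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml
# Close the firewall rule again
esxcli network firewall ruleset set -e false -r httpClient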

This will upgrade the host from its current version to ESXi-6.7.0-20190802001-standard (Build 14320388).
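
If you want the sequence written down for a change request before running anything, the steps can be sketched as a small Python helper that only builds the command list and executes nothing. The function name is hypothetical. The maintenance-mode command on the first line is an addition for step 2 and is not one of the commands posted in this thread, so check it against your own host before relying on it.

```python
def build_upgrade_commands(profile, depot_url):
    """Return the esxcli commands for an online profile update, in run order.

    Nothing is executed here; the list is meant for review or a runbook.
    """
    return [
        # Step 2 of the post. Assumption: not quoted in the thread itself.
        "esxcli system maintenanceMode set --enable true",
        # Step 3: open the firewall rule, update the profile, close the rule.
        "esxcli network firewall ruleset set -e true -r httpClient",
        f"esxcli software profile update -p {profile} -d {depot_url}",
        "esxcli network firewall ruleset set -e false -r httpClient",
    ]


if __name__ == "__main__":
    commands = build_upgrade_commands(
        "ESXi-6.7.0-20190802001-standard",
        "https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml",
    )
    print("\n".join(commands))
```

Keeping the profile name and depot URL as parameters makes it easy to reuse the same runbook for a later patch level.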
Hi! Andrew,

About upgrading ESXi to 6.7 Update 3:

1. The current ESXi was installed from the customised ISO. Will the CLI upgrade still work?
2. How can I back up the VMs in case the host server crashes?
3. Approximately how long will the downtime be?
4. Do I need to prepare anything else before I start the upgrade?

I have downloaded the 6.7 Update 3 ISO file for LNV. What would the procedure be if I want to upgrade from the ISO file?
1. Yes.
2. Using your normal backup software.
3. 10-30 minutes.
4. No.

Insert the CDROM in the server and boot from it.
After upgrading, there have been no issues with any VMs shutting down by themselves.
Yes, we are still monitoring on a daily basis.