Server randomly crashing with no trace of memory.dmp
Hey gents, we have been experiencing several shutdown errors with our ESXi host. We have the company's FS/DC running off this host; I believe it's 6.0. Last night around 4 or 5 PM EST the servers shut down again. I have attached some Event Viewer logs. Server OS is 2012 R2. After the server went down it did not automatically reboot. I checked for memory.dmp files and could not find any under %systemroot% (I don't think they are hidden), so I am wondering if it's related to power. Also, our Datto backup stopped working 3 hours prior. Its metadata showed a communication loss to one of the shared partitions on the server, their E:\ volume, where all their shares sit. Here is the metadata log.
Tue 23/04/19 1:01:02 pm - metrics {"name":"HighlyRandomUnidentifiableMimetype","description":"Detects file entries that contain a known extension, unidentifiable mimetype, and are highly random","result":false,"occurrences":0,"total":0,"percent":0,"threshold":0.05,"minimum_samples":20,"maximum_occurrences":9223372036854775807,"files":[]}
Tue 23/04/19 1:01:02 pm - Rule "HighlyRandomUnidentifiableMimetype" did not detect ransomware. 0 out of 0 files checked showed signs of ransomware (0%%).
Tue 23/04/19 1:01:02 pm - metrics {"name":"KnownRansomwareExtensions","description":"Detects file entries that contain a known ransomware extension","result":false,"occurrences":0,"total":0,"percent":0,"threshold":0.005,"minimum_samples":20,"maximum_occurrences":3,"files":[]}
Tue 23/04/19 1:01:02 pm - Rule "KnownRansomwareExtensions" did not detect ransomware. 0 out of 0 files checked showed signs of ransomware (0%%).
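One thing worth noting in those entries is that both rules checked 0 files. A minimal Python sketch for reading one metrics entry is below. It assumes the payload is ordinary JSON and that "minimum_samples" is the number of files a rule wants before its result means anything; that reading of the field is my assumption, not something Datto has confirmed here. The function name summarize_metric is made up for illustration.

```python
import json


def summarize_metric(raw_json):
    """Summarize one 'metrics' payload from the backup log.

    Assumption: 'minimum_samples' is the file count the rule needs
    before its verdict is meaningful.
    """
    metric = json.loads(raw_json)
    return {
        "rule": metric["name"],
        "files_checked": metric["total"],
        "detected": metric["result"],
        "enough_samples": metric["total"] >= metric["minimum_samples"],
    }


# Payload copied from the log above, shortened to the fields used here.
sample = (
    '{"name":"KnownRansomwareExtensions","result":false,'
    '"occurrences":0,"total":0,"minimum_samples":20}'
)
summary = summarize_metric(sample)
print(summary)
```

If "enough_samples" comes back false with 0 files checked, the scan most likely saw nothing on the volume at all, which would fit the E:\ volume being unreachable at the time rather than the volume being clean.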
There are various log files with this. I do recall the metadata showing a corrupt volume on their E:\, which is where their shares are and which they claimed was inaccessible as of yesterday.
Datto did something on their end and got the local and offsite backups to run, which also restored connectivity to the E:\ volume internally. I am just unsure what they did. However, everything is backing up as normal now. I am looking at the server logs and here is what they show.
The previous system shutdown at 1:40:21 PM on 4/22/2019 was unexpected.
Event 6008
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Kernel Power (I think it lost power)
The reason supplied by user CIRCUITCLINICAL\admin.del for the last unexpected shutdown of this computer is: Other (Unplanned) Reason Code: 0xa000000 Problem ID: Bugcheck String: Comment: Event ID 1076
Since I was not there I cannot tell if this was related to power or an application crash. Any assistance would be greatly appreciated. The server has gone down 3 times so far this year.
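For anyone going through exported event text like the above, here is a rough Python sketch that pulls the shutdown time out of an Event 6008 message and the "Bugcheck String" field out of an Event 1076 message. The function names are made up for illustration, and the regular expressions only assume the English message wording quoted in this post. Real 6008 messages sometimes carry invisible left-to-right marks around the date, so the sketch strips those first.

```python
import re
from datetime import datetime

# Patterns follow the English message text quoted in the post above.
SHUTDOWN_6008 = re.compile(
    r"previous system shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"on (\d{1,2}/\d{1,2}/\d{4}) was unexpected"
)
BUGCHECK_1076 = re.compile(r"Bugcheck String:\s*(.*?)\s*Comment:")


def unexpected_shutdown_time(message):
    """Return the shutdown time from an Event 6008 message, or None."""
    cleaned = message.replace("\u200e", "")  # drop left-to-right marks
    match = SHUTDOWN_6008.search(cleaned)
    if match is None:
        return None
    time_part, date_part = match.groups()
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %I:%M:%S %p")


def bugcheck_string(message):
    """Return the Bugcheck String field of an Event 1076 message, or None."""
    match = BUGCHECK_1076.search(message)
    return match.group(1) if match else None


event_6008 = "The previous system shutdown at 1:40:21 PM on 4/22/2019 was unexpected."
event_1076 = (
    "The reason supplied by user CIRCUITCLINICAL\\admin.del for the last "
    "unexpected shutdown of this computer is: Other (Unplanned) "
    "Reason Code: 0xa000000 Problem ID: Bugcheck String: Comment:"
)
print(unexpected_shutdown_time(event_6008))
print(repr(bugcheck_string(event_1076)))
```

An empty bugcheck string, together with no memory.dmp, is consistent with the guest being cut off (power loss, hard reset, or a hang) rather than blue-screening, though it does not prove it.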
Server Hardware, Microsoft Server OS, Windows Server 2012, VMware
Andrew N. Kowtalo
ASKER
There have been complaints in the past where every internal user would lose access to the E:\ share (\\servername\share) and it would miraculously come back up. As an update, the DC gives out DNS to everything.
A ping to the local server showed no packet drops.
Ping statistics for 172.16.30.3:
Packets: Sent = 209, Received = 209, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 1ms, Maximum = 39ms, Average = 2ms
Control-C
^C
C:\Users\kowtaloa>
C:\Users\kowtaloa>
I asked Datto what they did to fix the problem. They told me the drives were refreshed and the backups were kicked off.
Olgierd Ungehojer
What server is it? Is it a Dell?
What kind of UPS do you have there? You can check for power events on the UPS.
Andrew N. Kowtalo
ASKER
Hi Olgierd. Here is the server make and model.
VMware, Inc. VMware Virtual Platform
System Serial Number: VMware-56 4d e9 7b 4c 52 6f 1a-55 dd a2 16 ea 44 1f 04
Enclosure Type: Other
Windows Server 2012 R2 Standard (build 9600.19327)
Primary Domain Controller
Install Language: English (United States)
System Locale: English (United States)
Installed: 8/24/2017 6:30:50 PM
Boot Mode: BIOS (Secure Boot not supported)
Server Roles:
Web Server (IIS)
Active Directory Domain Services
DHCP Server
DNS Server
Network Policy and Access Services
Remote Desktop Services
Remote Access
File and Storage Services
407.65 Gigabytes Usable Hard Drive Capacity
236.38 Gigabytes Hard Drive Free Space
NECVMWar VMware SATA CD00 [Optical drive]
VMware Virtual disk SCSI Disk Device (322.12 GB) -- drive 1, s/n 6000c2922c438118ed210b882f6ec330
VMware Virtual disk SCSI Disk Device (85.90 GB) -- drive 0, s/n 6000c29adee7a80accd9a77163cde322
I am wondering if Kernel Power was selected from the drop-down as the reason for the crash when they logged back into the server and the error box popped up.
I am unsure if there are event logs on the UPS I can check.
It is an APC power supply with no network connection, so there is no way I can access it to check any logs.
Olgierd Ungehojer
I mean, if your server is powered off, your host and VM are powered off and you have to physically go and power it on. Am I right?
Your problem looks hardware related, but I am not sure, because the information you gave me is about your virtual machine.
Do you have the problem only with your VM? Does your ESXi host still have power and keep running when the issue occurs?
Andrew N. Kowtalo
ASKER
The physical server remained on but the ESXi host went down. So they hard powered the server down, and that brought both the physical box and the ESXi host back up, which fixed it. So it looks like the ESXi host is the culprit.
Almost all server makes have a hardware log you can interrogate, either during POST or via the remote management interface. As Olgierd Ungehojer asks, what make/model is the hardware?
Andrew N. Kowtalo
ASKER
The server is in a rack. If I am logged in, where can I find the make and model? I ran Belarc but that's all it told me, so I can't tell if it's an HP or a Dell.
RAFA
Hello
What model is your physical server?
You can restart it and run a hardware diagnostic to validate which hardware component is actually failing.
On the physical server, you can also check whether the memory or power supply LEDs are showing an alarm.
iLO might be sharing one of the onboard ports; the extra NIC port is an option. It might not be configured with an IP address. It is easy enough to check the log and configure iLO for next time via the BIOS, but that needs a reboot. If HPE tools are installed you can use hponcfg under VMware without rebooting, however you need the current iLO password to do that. I don't know if they still tie a cardboard label on with the iLO default password any more.
RAFA, diags aren't much use if something only crashes occasionally. Assuming it crashes once per day you would have to run diags for a couple of weeks to verify the hardware as good.
I started fixing hardware at 21 years old, am now pushing 60, and still think diags are a waste of time, since POST will catch the big faults and the best way to test hardware is to throw an OS at it. All the major components report any errors to the BMC, so diags are no better than running VMware to isolate faults.
Andrew N. Kowtalo
ASKER
@andyalder so what do you recommend, iLO? I am being told ESXi is the culprit, not the physical server. It seems when ESXi goes down the hardware remains on and running.
I will see if I can obtain the ESXi logs and post them once we get them.
Andrew N. Kowtalo
ASKER
Attached are the log files from the ESXi host: ESXI-Logs.zip
RAFA
Hello,
Reviewing the logs you provided, I noted the following error in vmkernel.log:
1) 2019-04-05T13:26:05.799Z cpu1:66170) nhpsa: hpsa_vmkScsiCmdDone: 5238: Sense data: error code: 0x70, key: 0x5, info: 00 00 00 00, cmdInfo: 00 00 00 00, CmdSN: 0xf7e0, worldId: 0x1077b, Cmd: 0x85, ASC: 0x20, ASCQ: 0x0
This message is generated because your disk or device does not support the 0x85 command. 0x85 is ATA PASS-THROUGH (16). In the vmkernel log it shows up as this code:
Cmd: 0x85, ASC: 0x20, ASCQ: 0x0
2) In addition, there is another event:
2019-04-05T13:20:50.780Z cpu10:13596655) FSS: 6751: Failed to open file 'hpilo-d0ccb0'; Requested flags 0x5, world: 13596655 [sfcb-smx], (Existing flags 0x5, world: 13596657 [sfcb-smx]): Busy
Neither event is necessarily a problem. Some devices do not and cannot provide the requested information. If the storage device is working properly, the log messages can be ignored.
I have also attached a reference link where you can validate what I'm indicating, since there are several cases where this same error occurs with HP servers and different versions of ESXi, even the latest version, 6.7.
It would also be good to know whether the host has had problems again: has it rebooted or shut down unexpectedly?
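For reference, the sense data in event 1 can be decoded by hand against the standard SCSI tables: sense key 0x5 is ILLEGAL REQUEST, ASC/ASCQ 0x20/0x00 is INVALID COMMAND OPERATION CODE, and opcode 0x85 is ATA PASS-THROUGH (16). A small Python sketch of that lookup follows. The tables hold only a handful of entries as an illustration, and the function name decode_sense is made up.

```python
import re

# Small illustrative subsets of the standard SCSI tables, not complete lists.
SENSE_KEYS = {
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}
ADDITIONAL_SENSE = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}
OPCODES = {0x85: "ATA PASS-THROUGH (16)"}

# Matches "key: 0x5", "Cmd: 0x85", "ASC: 0x20", "ASCQ: 0x0" in a vmkernel line.
FIELD = re.compile(r"\b(key|Cmd|ASC|ASCQ): (0x[0-9a-fA-F]+)")


def decode_sense(line):
    """Translate the hex fields of a vmkernel sense-data line into names."""
    fields = {name: int(value, 16) for name, value in FIELD.findall(line)}
    return {
        "sense_key": SENSE_KEYS.get(fields.get("key"), "unknown"),
        "additional_sense": ADDITIONAL_SENSE.get(
            (fields.get("ASC"), fields.get("ASCQ")), "unknown"
        ),
        "command": OPCODES.get(fields.get("Cmd"), "unknown"),
    }


log_line = (
    "nhpsa: hpsa_vmkScsiCmdDone: 5238: Sense data: error code: 0x70, "
    "key: 0x5, info: 00 00 00 00, cmdInfo: 00 00 00 00, CmdSN: 0xf7e0, "
    "worldId: 0x1077b, Cmd: 0x85, ASC: 0x20, ASCQ: 0x0"
)
decoded = decode_sense(log_line)
print(decoded)
```

Read that way, the line says the controller rejected a command it does not implement. A MEDIUM ERROR or HARDWARE ERROR sense key in the same log would be much more worrying.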
To date, the host has worked correctly.
In order to reboot the ESXi host we have had to physically power down the server and bring it back up, which forces the host to restart, and then it works again.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
How is ESXi shutting down?
Purple Screen of Death, or is it just going off?
Your version of ESXi is old...
Latest version is ESXi-6.5.0-20190304001-standard (Build 13004031)
Andrew N. Kowtalo
ASKER
Andrew, thanks for the reply. The entire location loses access to their local DC/FS, which runs on the ESXi host. The physical hardware remains on but access to local files is lost. The only way to resolve the issue is to restart the DC by powering it down and bringing it back up. The host has reported some failures in the logs.
Also, I am not sure they are willing to upgrade since 6.5 is 5 grand. Are there significant changes in this version that have fixed issues?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Okay, is the issue with
1. ESXi Hypervisor crashing or restarting ???
2. Virtual Machine issues (they will have issue with Guest VMs if 1. above crashes) ???
They are already on ESXi 6.5.
471.87 days of ESXi host uptime does not suggest an issue with the host hypervisor. BUT that was the initial General Release of ESXi 6.5 (i.e. the FIRST release).
There have been 22 bug fixes since your version.....
Your version is two years old...
I would recommend...
1. Update firmware in Host Server.
2. Update ESXi 6.5 to current version.
Then see if the guest VM issues continue. Also make sure the guest VMs are ALL running with VMXNET3 interfaces and not E1000. (VMware Tools MUST be installed.)
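One way to check the adapter type without clicking through every VM is to read each VM's .vmx file, where the adapter shows up as a line like ethernet0.virtualDev = "vmxnet3". A minimal Python sketch is below. The function name and sample text are made up for illustration, and an adapter with no virtualDev line (which falls back to a default for the guest OS) will not be reported by it.

```python
import re

# Matches lines such as: ethernet0.virtualDev = "e1000"
VIRTUAL_DEV = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"',
    re.MULTILINE | re.IGNORECASE,
)


def non_vmxnet3_adapters(vmx_text):
    """Return {adapter: type} for adapters whose virtualDev is not vmxnet3."""
    return {
        adapter: device
        for adapter, device in VIRTUAL_DEV.findall(vmx_text)
        if device.lower() != "vmxnet3"
    }


# Hypothetical .vmx excerpt, not taken from the asker's host.
sample_vmx = '''
ethernet0.present = "TRUE"
ethernet0.virtualDev = "vmxnet3"
ethernet1.present = "TRUE"
ethernet1.virtualDev = "e1000"
'''
flagged = non_vmxnet3_adapters(sample_vmx)
print(flagged)
```

Any adapter it flags would need to be swapped for VMXNET3 in the VM settings, with VMware Tools installed in the guest first so the driver is present.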
Andrew N. Kowtalo
ASKER
To be honest, I am not sure where the break is happening. I can just tell you their FS/DC and ESXi are all running off one server. Once all access to the files on the host is lost, the only way to restore it is to physically power cycle the server. When access does go down the physical server remains on, so the device isn't powering off. The logs in the VM suggested controller failures, which is why I was thinking it was possibly the hypervisor. It's hard to tell.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
I will recommend these changes to the client and see how they want to proceed.
Andrew N. Kowtalo
ASKER
Andrew, I also found out there is an incorrect DNS setting on their vmxnet3 adapter. I am going to adjust that to use the DC's IP for the primary and the loopback for the secondary, then update VMware Tools tonight, reboot, and see what that does going forward. Incorrect-DNS.JPG
Andrew N. Kowtalo
ASKER
Now it has the correct DNS and will have VMware Tools updated tonight. Correct-DNS.JPG
Andrew, it has been that way for a long time; I just made the discovery. It is now correct, and hopefully the Tools update also updates the driver. By the way, they are running 6.5.