Server randomly crashing with no trace of memory.dmp

Andrew N. Kowtalo
Hey gents, we have been experiencing several unexpected shutdowns with our ESXi host. The company's FS/DC runs off this host; I believe it's 6.0. Last night around 4 or 5 PM EST the servers shut down again. I have attached some Event Viewer logs. The server OS is 2012 R2. After the server went down it did not automatically reboot. I checked for memory.dmp files and could not find any under %systemroot% (I don't think they are hidden), so I am wondering if it's related to power. Also, our Datto backup stopped working 3 hours prior; its metadata showed a communication loss to one of the shared partitions on the server, the E:\ volume where all their shares sit. Here is the metadata log.

Tue 23/04/19 1:01:02 pm - metrics {"name":"HighlyRandomUnidentifiableMimetype","description":"Detects file entries that contain a known extension, unidentifiable mimetype, and are highly random","result":false,"occurrences":0,"total":0,"percent":0,"threshold":0.05,"minimum_samples":20,"maximum_occurrences":9223372036854775807,"files":[]}
Tue 23/04/19 1:01:02 pm - Rule "HighlyRandomUnidentifiableMimetype" did not detect ransomware. 0 out of 0 files checked showed signs of ransomware (0%%). 
Tue 23/04/19 1:01:02 pm - metrics {"name":"KnownRansomwareExtensions","description":"Detects file entries that contain a known ransomware extension","result":false,"occurrences":0,"total":0,"percent":0,"threshold":0.005,"minimum_samples":20,"maximum_occurrences":3,"files":[]}
Tue 23/04/19 1:01:02 pm - Rule "KnownRansomwareExtensions" did not detect ransomware. 0 out of 0 files checked showed signs of ransomware (0%%). 

There are various log files with this. I recall the metadata showing a corrupt volume on their E:\, which is where their shares are and which they said had been inaccessible since yesterday.

Datto did something on their end and got the local and offsite backups to run, which also restored connectivity to the E:\ volume internally; I am just unsure what they did. Everything is now backing up as normal. I am looking at the server logs and here is what they show.

The previous system shutdown at 1:40:21 PM on 4/22/2019 was unexpected.
Event 6008

The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Kernel Power (I think it lost power)

The reason supplied by user CIRCUITCLINICAL\admin.del for the last unexpected shutdown of this computer is: Other (Unplanned)
 Reason Code: 0xa000000
 Problem ID: 
 Bugcheck String: 
 Comment:  Event ID 1076

Since I was not there, I cannot tell whether this was related to power or an application crash. Any assistance would be greatly appreciated. The server has gone down 3 times so far this year.
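
The shutdown-reason text above (Event ID 1076) can be read mechanically. This is a minimal Python sketch, assuming the event text has been pasted into a string as shown; the helper name `parse_shutdown_reason` is hypothetical and not part of any Windows tool. An empty Bugcheck String would be consistent with Windows not having recorded a blue screen for that shutdown, which fits the missing memory.dmp, but it does not by itself prove a power loss.

```python
import re

# Event text as pasted in the question (Event ID 1076).
EVENT_1076_TEXT = """\
The reason supplied by user CIRCUITCLINICAL\\admin.del for the last unexpected shutdown of this computer is: Other (Unplanned)
 Reason Code: 0xa000000
 Problem ID: 
 Bugcheck String: 
 Comment:  Event ID 1076
"""

def parse_shutdown_reason(text):
    """Extract the 'Field: value' pairs from the pasted event text (hypothetical helper)."""
    fields = {}
    pattern = r"\s*(Reason Code|Problem ID|Bugcheck String|Comment):\s*(.*)$"
    for line in text.splitlines():
        match = re.match(pattern, line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

fields = parse_shutdown_reason(EVENT_1076_TEXT)
# An empty string here means no bugcheck text was recorded in the event.
bugcheck_recorded = bool(fields.get("Bugcheck String"))
print(fields)
print("bugcheck recorded:", bugcheck_recorded)
```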
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
There have been complaints in the past where every internal user would lose access to the E:\ share (\\servername\share) and it would then miraculously come back. As an update, the DC provides DNS to everything.

A ping to the local server showed no packet drops.

Ping statistics for 172.16.30.3:
    Packets: Sent = 209, Received = 209, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 39ms, Average = 2ms
Control-C
^C
C:\Users\kowtaloa>

C:\Users\kowtaloa>


I asked Datto what they did to fix the problem; they told me the drives were refreshed and the backups were kicked off.
Olgierd Ungehojer, Senior Network Administrator

Commented:
What server is it? Is it a Dell?
What kind of UPS do you have there? You can check the power events on the UPS.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Hi Olgierd.   Here is the server make and model.  

VMware, Inc. VMware Virtual Platform
System Serial Number: VMware-56 4d e9 7b 4c 52 6f 1a-55 dd a2 16 ea 44 1f 04
Enclosure Type: Other

Windows Server 2012 R2 Standard (build 9600.19327)
Primary Domain Controller
Install Language: English (United States)
System Locale: English (United States)
Installed: 8/24/2017 6:30:50 PM
Boot Mode: BIOS (Secure Boot not supported)

Server Roles:
    Web Server (IIS)
    Active Directory Domain Services
    DHCP Server
    DNS Server
    Network Policy and Access Services
    Remote Desktop Services
    Remote Access
    File and Storage Services

Board: Intel Corporation 440BX Desktop Reference Platform
BIOS: Phoenix Technologies LTD 6.00 04/05/2016

2.10 gigahertz Intel Xeon E5-2620 v4 (6 installed)
2048 kilobyte primary memory cache
64-bit ready
Not hyper-threaded

407.65 Gigabytes Usable Hard Drive Capacity
236.38 Gigabytes Hard Drive Free Space

NECVMWar VMware SATA CD00 [Optical drive]

VMware Virtual disk SCSI Disk Device (322.12 GB) -- drive 1, s/n 6000c2922c438118ed210b882f6ec330
VMware Virtual disk SCSI Disk Device (85.90 GB) -- drive 0, s/n 6000c29adee7a80accd9a77163cde322



I am wondering if Kernel Power was selected from the drop-down as the reason for the crash when they logged back into the server and the error box popped up.

I am unsure if there are event logs on the UPS I can check.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
There is an APC power supply with no network connection, so I have no way to access it to check any logs.
Olgierd Ungehojer, Senior Network Administrator

Commented:
I mean, if your server is powered off, your VM host is powered off too, and you have to physically go and power it on. Am I right?
Your problem looks hardware-related, but I am not sure, because you gave me information about your virtual machine.

Do you have the problem only with your VM? And does your ESXi host still have power and keep running when you have the issue?
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
The physical server remained on but the ESXi host went down. They hard-powered the server down, and that brought both the physical box and the ESXi host back up, which fixed it. So it looks like the ESXi host is the culprit.
Top Expert 2014

Commented:
Almost all server makes have a hardware log you can interrogate either during POST or via the remote management interface. As Olgierd Ungehojer asks, what make/model is the hardware?
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
The server is in a rack. If I am logged in, where can I find the make and model? I ran Belarc but that's all it told me; I can't tell if it's an HP or a Dell.
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
Hello

What model is your physical server?

You can restart it and run a hardware diagnostic to validate which hardware component is actually failing.

On your physical server, can you see whether the memory or power supply LEDs are showing an alarm?

I remain attentive to your comments.

regards...
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Thank you. The server is an HP ProLiant DL160; I managed to get the cover off.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
I am not sure if iLO is running.
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
Hello

Could you tell me if it is Gen6, Gen8, or Gen9, please?

I remain attentive to your comments.

regards...
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Gen9, thank you.
Olgierd Ungehojer, Senior Network Administrator

Commented:
By connecting through Remote Management you can check your hardware. This is a guide: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01325507. The server should have an extra NIC for it.
Top Expert 2014

Commented:
iLO might be sharing one of the onboard ports; the extra NIC port is an option. It might not be configured with an IP address. It is easy enough to check the log and configure iLO for next time via the BIOS, but that needs a reboot. If the HPE tools are installed you can use hponcfg to configure it under VMware without rebooting; however, you need the current iLO password to do that. I don't know if they still tie on a cardboard label with the iLO default password.
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
Hello,
Could you perform the diagnostic test on the server?
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
RAFA, what kind of diagnostic test?
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
Hello,

To validate which component is failing at the hardware level.

regards..
Top Expert 2014

Commented:
RAFA, diags aren't much use if something only crashes occasionally. Assuming it crashes once per day you would have to run diags for a couple of weeks to verify the hardware as good.

I started fixing hardware at 21 years old, am now pushing 60, and still think diags are a waste of time, since POST will catch the big faults and the best way to test hardware is to throw an OS at it. All the major components report any errors to the BMC, so diags are no better than running VMware to isolate faults.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
@andyalder so what do you recommend, iLO? I am being told ESXi is the culprit, not the physical server. It seems that when ESXi goes down the hardware remains on and running.
Top Expert 2014

Commented:
Still worth checking the IML for errors, see the section under "Integrated Management Log" in http://cdn.cnetcontent.com/61/78/6178231e-6699-4b5d-835e-bce904c69555.pdf
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
I just did a search for this on the server and do not see it. Is this something that needs to be installed manually?
iml.JPG
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
If ESXi is responsible, migrate your VMs to other hosts.

Place the affected host in maintenance mode and reinstall ESXi on the affected server.
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
If you want, you can download the ESXi host logs and share them for review.

Here is a reference link on how to do it.

https://www.altaro.com/vmware/introduction-esxi-vm-log-files/
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
I will see if I can obtain the ESXi logs and post them once we get them.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Attached are the log files from the ESXi host.
ESXI-Logs.zip
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
Hello,
Reviewing the logs you provided, note the following error in vmkernel.log:

1) 2019-04-05T13:26:05.799Z cpu1:66170)nhpsa: hpsa_vmkScsiCmdDone:5238: Sense data: error code: 0x70, key: 0x5, info: 00 00 00 00, cmdInfo: 00 00 00 00, CmdSN: 0xf7e0, worldId: 0x1077b, Cmd: 0x85, ASC: 0x20, ASCQ: 0x0
This message is generated because your disk or device does not support the 0x85 command. 0x85 is ATA PASS-THROUGH (16). The vmkernel log refers to it with this code:
Cmd: 0x85, ASC: 0x20, ASCQ: 0x0

2) In addition, there is another event:

2019-04-05T13:20:50.780Z cpu10:13596655)FSS: 6751: Failed to open file 'hpilo-d0ccb0'; Requested flags 0x5, world: 13596655 [sfcb-smx], (Existing flags 0x5, world: 13596657 [sfcb-smx]): Busy
Neither event is necessarily a problem. Some devices do not and cannot provide the requested information. If the storage device is working properly, the log messages can be ignored.
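
To illustrate how the fields in the first message map to the explanation above, here is a minimal Python sketch that decodes the sense data. The lookup tables are deliberately tiny and cover only the values seen in this thread (sense key 0x5, ASC/ASCQ 0x20/0x00, opcode 0x85); the function name `decode_sense` is hypothetical, and a real tool would need the full SCSI code tables.

```python
import re

# Only the codes that appear in this thread; anything else decodes as "unknown".
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}
OPCODES = {0x85: "ATA PASS-THROUGH (16)"}

# The "Sense data" portion of the vmkernel.log line quoted above.
LINE = ("Sense data: error code: 0x70, key: 0x5, info: 00 00 00 00, "
        "cmdInfo: 00 00 00 00, CmdSN: 0xf7e0, worldId: 0x1077b, "
        "Cmd: 0x85, ASC: 0x20, ASCQ: 0x0")

def decode_sense(line):
    """Pull key, Cmd, ASC and ASCQ out of the log line and name them (hypothetical helper)."""
    def grab(label):
        match = re.search(rf"\b{label}:\s*0x([0-9a-fA-F]+)", line)
        return int(match.group(1), 16) if match else None

    key, cmd, asc, ascq = grab("key"), grab("Cmd"), grab("ASC"), grab("ASCQ")
    return {
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "additional_sense": ASC_ASCQ.get((asc, ascq), "unknown"),
        "command": OPCODES.get(cmd, "unknown"),
    }

result = decode_sense(LINE)
print(result)
```

Read together, the three decoded values say that the device rejected a command it does not implement, which matches the explanation that the message can be ignored when the storage is otherwise healthy.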

I am also attaching a reference link where you can verify this; there are several cases where this same error occurs with HP servers on different versions of ESXi, even the latest version, 6.7.

https://communities.vmware.com/thread/559310

It would also be good to know whether you have had problems with the host again: has it rebooted or shut down unexpectedly?
To date, the host has worked correctly.

I remain attentive to your comments..
RAFA, IT CONSULTANT
Distinguished Expert 2018

Commented:
Can you please tell me which version of ESXi you have installed, as well as the ESXi build number?

It would also be nice to know the model of your HBA card.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Each time the disconnection has occurred, the server's physical hardware remained on; however, ESXi shut down.


VMware ESXi™

Copyright © 1998-2016 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents.
Version: 1.8.0
Build number: 4516221
ESXi version: 6.5.0
ESXi build number: 4564106
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
To reboot the ESXi host we have had to physically power down the server and bring it back up, which forces the host to restart, and then it works again.
Top Expert 2014

Commented:
You can't get to the IML from a virtual machine; it has no access to the host's hardware.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Hi gents, have you been able to come up with any updates on this?
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Hi guys, just following up on this to see if you have located any potential resolution.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Where can I get the model of the HBA card?
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
I don't know if this will have the HBA card info (I'm not sure how to get that), but here is the entire About text.
esxi-about.txt
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Here is more hardware info.

CPU
FREE: 14.8 GHz
USED: 2 GHz (12%)
CAPACITY: 16.8 GHz
MEMORY
FREE: 10.13 GB
USED: 21.62 GB (68%)
CAPACITY: 31.75 GB
STORAGE
FREE: 811.38 GB
USED: 298.87 GB (27%)
CAPACITY: 1.08 TB
Host

Version: 6.5.0 (Build 4564106)
State: Normal (not connected to any vCenter Server)
Uptime: 471.87 days
You are running HPE Customized Image ESXi 6.5.0 version 650.9.6.0.28 released on November 2016 and based on ESXi 6.5.0 Vmkernel Release Build 4564106.
SSH is enabled on this host. You should disable SSH unless it is necessary for administrative purposes.

Hardware
  Manufacturer: HP
  Model: ProLiant DL160 Gen9
  CPU
    Logical processors: 16
    Processor type: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    Sockets: 1
    Cores per socket: 8
    Hyperthreading: Yes, enabled
  Memory: 31.75 GB
  Virtual flash
    Capacity: 0 B
    Used: 0 B
    Free: 0 B
  Networking
    Hostname:
    IP addresses: 1. vmk0: 172.16.30.2
    DNS servers: 1. 8.8.8.8, 2. 4.2.2.1
    Default gateway: 172.16.30.1
    IPv6 enabled: No
    Host adapters: 2
    Networks
      Name        VMs
      VM Network  3
  Storage
    Physical adapters: 2
    Datastores
      Name        Type   Capacity  Free
      datastore1  VMFS5  1.08 TB   811.38 GB
Configuration
  Image profile: HPE-ESXi-6.5.0-OS-Release-iso-650.9.6.0.28 (Hewlett Packard Enterprise)
  vSphere HA state: Not configured
  vMotion: Not supported
System Information
  Date/time on host: Monday, May 06, 2019, 18:20:47 UTC
  Install date: Thursday, August 24, 2017, 17:28:53 UTC
  Asset tag: unknown
  Service tag: 2M273102RK
  BIOS version: U20
  BIOS release date: Sunday, September 11, 2016, 20:00:00 -0400

Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Does this help any?
hba.JPG
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

Commented:
How is ESXi shutting down?

Purple Screen of Death, or just going off?

Your version of ESXi is old...

Latest version is ESXi-6.5.0-20190304001-standard (Build 13004031)
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Andrew, thanks for the reply. The entire location loses access to their local DC/FS, which runs on the ESXi host. The physical hardware remains on but access to local files is lost. The only way to resolve the issue is to restart the DC by powering it down and bringing it back up. The host has reported some failures in the logs.

Also, I am not sure they are willing to upgrade, since 6.5 is 5 grand. Are there significant changes in this version that have fixed issues?
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

Commented:
Okay, is the issue with

1. The ESXi hypervisor crashing or restarting?

2. Virtual machine issues (they will have issues with the guest VMs if 1. above crashes)?

They are already on ESXi 6.5.

471.87 days of ESXi host uptime does not suggest an issue with the host hypervisor. BUT yours is the initial General Release of ESXi 6.5 (i.e. the FIRST release).
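
The arithmetic behind the uptime point can be sketched in a few lines of Python. The three values are taken from this thread (the host's date/time and uptime from the host summary, and the guest's Event 6008 shutdown time); the variable names are illustrative. The guest time is local and the host time is UTC, but a few hours of offset does not matter against a gap of more than a year.

```python
from datetime import datetime, timedelta

host_clock = datetime(2019, 5, 6, 18, 20, 47)       # "Date/time on host" (UTC)
uptime = timedelta(days=471.87)                      # "Uptime" from the host summary
guest_shutdown = datetime(2019, 4, 22, 13, 40, 21)   # Event 6008 time (guest local time)

# When the hypervisor last booted, according to its own uptime counter.
last_host_boot = host_clock - uptime

# True only if the host booted after the guest's unexpected shutdown.
host_rebooted_since_outage = last_host_boot > guest_shutdown

print("last host boot:", last_host_boot.date().isoformat())
print("host rebooted since the guest outage:", host_rebooted_since_outage)
```

If the reported uptime is accurate, the hypervisor's last boot falls well before the guest's unexpected shutdown, which is why the uptime figure points away from a host-level restart.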

There have been 22 bug fixes since your version...

Your version is two years old...

I would recommend...

1. Update firmware in Host Server.
2. Update ESXi 6.5 to current version.

Then see if the guest VM issues continue. Also make sure the guest VMs are ALL running with VMXNET3 interfaces and not E1000. (VMware Tools MUST be installed.)
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
To be honest, I am not sure where the break is happening. I can only tell you that their FS/DC and ESXi are all running off one server. Once all access to the host files is lost, the only way to restore access is to physically power-cycle the server. When access does go down the physical server remains on, so the device isn't powering off. The logs in the VM suggested controller failures, which is why I was thinking it was possibly the hypervisor. It's hard to tell.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

Commented:
Please see above post.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

Commented:
To be honest with you, having a very old ESXi 6.5 exposed to the internet is not a good idea.
VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017
Commented:
I would recommend...

0. Remove the public IP address or port forward!
1. Update firmware in Host Server.
2. Update ESXi 6.5 to current version.
3. Check network interfaces in the VMs.

Then see if the guest VM issues continue. Also make sure the guest VMs are ALL running with VMXNET3 interfaces and not E1000. (VMware Tools MUST be installed.)
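
One way to check the adapter type without opening each VM's settings is to read the `ethernetN.virtualDev` entries in the VM's .vmx file. This is a minimal Python sketch under stated assumptions: the sample .vmx text is invented for illustration, the helper names are hypothetical, and an adapter that has no `virtualDev` line at all (it then uses a default type) is not reported by this sketch.

```python
import re

# Invented sample of the relevant .vmx lines; a real file has many more entries.
SAMPLE_VMX = '''\
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet1.present = "TRUE"
ethernet1.virtualDev = "vmxnet3"
'''

def adapter_types(vmx_text):
    """Map each ethernetN entry to its virtualDev value, lower-cased."""
    pattern = re.compile(
        r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"',
        re.IGNORECASE | re.MULTILINE,
    )
    return {nic.lower(): dev.lower() for nic, dev in pattern.findall(vmx_text)}

def non_vmxnet3(vmx_text):
    """List the adapters whose declared type is anything other than vmxnet3."""
    types = adapter_types(vmx_text)
    return sorted(nic for nic, dev in types.items() if dev != "vmxnet3")

print(adapter_types(SAMPLE_VMX))
print("adapters to review:", non_vmxnet3(SAMPLE_VMX))
```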
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
I think we will need to hide that.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Fantastic help as always, gents. I will do as recommended.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
I will recommend these changes to the client and see how they want to proceed.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Andrew, I also found out there is an incorrect DNS setting on their vmxnet3 adapter. I am going to adjust that to use the DC's IP for the primary and the loopback for the secondary, then update VMware Tools tonight, reboot, and see what that does going forward.
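
The intended DNS layout (the DC's own IP first, loopback second) can be expressed as a small check. This is a minimal Python sketch; the function name is hypothetical, the DC address 172.16.30.3 is an assumption based on the server pinged earlier in the thread, and the "before" values are invented for illustration because the actual incorrect setting is only visible in the attached screenshot.

```python
import ipaddress

def check_dc_dns(dns_servers, dc_ip):
    """Return a list of problems with a DC's DNS client settings (hypothetical helper)."""
    problems = []
    addrs = [ipaddress.ip_address(s) for s in dns_servers]
    dc = ipaddress.ip_address(dc_ip)
    # The first entry should be the DC itself.
    if not addrs or addrs[0] != dc:
        problems.append("primary DNS is not the DC's own IP")
    # Anything that is neither the DC, loopback, nor a private address is flagged.
    for addr in addrs:
        if addr != dc and not addr.is_loopback and not addr.is_private:
            problems.append(f"{addr} is a public resolver")
    return problems

DC_IP = "172.16.30.3"  # assumption: the server pinged earlier is the DC

planned = check_dc_dns([DC_IP, "127.0.0.1"], DC_IP)
example_bad = check_dc_dns(["8.8.8.8", DC_IP], DC_IP)  # invented "before" values
print("planned settings:", planned)
print("example bad settings:", example_bad)
```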
Incorrect-DNS.JPG
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Now it has the correct DNS and will have VMware Tools updated tonight.
Correct-DNS.JPG
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017
Commented:
Incorrect DNS does not help. I'm surprised DNS functioned and AD was not complaining.

I assume this is on the server. It does not account for the crashing, although who knows what happens with an incorrect IP configuration.
Andrew N. Kowtalo, Support Center Engineer

Author

Commented:
Andrew, it has been that way for a long time; I just made the discovery. It is now correct, and hopefully the Tools update refreshes the driver. By the way, they are running 6.5.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017
Commented:
Yes, I know, a very early original ESXi 6.5 which has been updated 22 times, over the last two years...!
