Andrew N. Kowtalo
 asked on

Server randomly crashing with no trace of memory.dmp

Hey gents, we have been experiencing several unexpected shutdowns on our ESXi host. The company's FS/DC runs on this host; I believe it is ESXi 6.0. Last night around 4 or 5 PM EST the servers shut down again. I have attached some Event Viewer logs. The server OS is Windows Server 2012 R2. After the server went down it did not automatically reboot. I checked for memory.dmp files and could not find any under %systemroot% (I don't think they are hidden), so I am wondering if it is related to power. Also, our Datto backup stopped working 3 hours prior. Its metadata showed a communication loss to one of the shared partitions on the server, the E:\ volume where all their shares sit. Here is the metadata log.

Tue 23/04/19 1:01:02 pm - metrics {"name":"HighlyRandomUnidentifiableMimetype","description":"Detects file entries that contain a known extension, unidentifiable mimetype, and are highly random","result":false,"occurrences":0,"total":0,"percent":0,"threshold":0.05,"minimum_samples":20,"maximum_occurrences":9223372036854775807,"files":[]}
Tue 23/04/19 1:01:02 pm - Rule "HighlyRandomUnidentifiableMimetype" did not detect ransomware. 0 out of 0 files checked showed signs of ransomware (0%%). 
Tue 23/04/19 1:01:02 pm - metrics {"name":"KnownRansomwareExtensions","description":"Detects file entries that contain a known ransomware extension","result":false,"occurrences":0,"total":0,"percent":0,"threshold":0.005,"minimum_samples":20,"maximum_occurrences":3,"files":[]}
Tue 23/04/19 1:01:02 pm - Rule "KnownRansomwareExtensions" did not detect ransomware. 0 out of 0 files checked showed signs of ransomware (0%%). 



There are various log files with this. I do recall the metadata showing a corrupt volume on E:\, which is where their shares are and which they reported as inaccessible as of yesterday.

Datto did something on their end and got the local and offsite backups to run, which also restored connectivity to the E:\ volume internally; I am just unsure what they did. However, everything is backing up as normal now. I am looking at the server logs, and here is what they show.

The previous system shutdown at 1:40:21 PM on 4/22/2019 was unexpected.
Event 6008

The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Kernel Power (I think it lost power)

The reason supplied by user CIRCUITCLINICAL\admin.del for the last unexpected shutdown of this computer is: Other (Unplanned)
 Reason Code: 0xa000000
 Problem ID: 
 Bugcheck String: 
 Comment:  Event ID 1076
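
For readers trying to interpret the events quoted above, here is a minimal Python sketch of how they might be read together. The event ID 41 for the Kernel-Power message is an assumption (the post does not give the ID), and `summarize_shutdown` is a hypothetical helper written for illustration. It is a rough heuristic, not a diagnosis.

```python
# Meanings of the Windows System-log events quoted in the post.
# ID 41 for the Kernel-Power message is assumed; the post does not state it.
EVENT_MEANINGS = {
    41: "Kernel-Power: the system rebooted without cleanly shutting down first",
    1076: "USER32: a user supplied a reason for the last unexpected shutdown",
    6008: "EventLog: the previous system shutdown was unexpected",
}


def summarize_shutdown(event_ids, bugcheck_string=""):
    """Return rough notes about a set of unexpected-shutdown events."""
    notes = [EVENT_MEANINGS[e] for e in sorted(event_ids) if e in EVENT_MEANINGS]
    if bugcheck_string.strip():
        notes.append("A bugcheck was recorded, so a memory dump may exist")
    else:
        notes.append(
            "No bugcheck recorded: consistent with a hard power-off or a hang, "
            "which would also explain a missing memory.dmp"
        )
    return notes


for note in summarize_shutdown([6008, 41, 1076], bugcheck_string=""):
    print("-", note)
```

The empty "Bugcheck String" in Event 1076 is the interesting part: when Windows never records a bugcheck, there is usually nothing for it to write to a dump file.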



Since I was not there, I cannot tell whether this was related to power or an application crash. Any assistance would be greatly appreciated. The server has gone down 3 times so far this year.
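
The manual search for dump files described in the question can be scripted. Below is a minimal sketch that looks in the two default Windows locations, `MEMORY.DMP` and the `Minidump` folder under `%SystemRoot%`. The function name `find_crash_dumps` is hypothetical, and the sketch assumes the dump path was never changed under Startup and Recovery.

```python
import os
from pathlib import Path


def find_crash_dumps(system_root):
    """List crash dump files in the default locations under system_root."""
    root = Path(system_root)
    found = []
    full_dump = root / "MEMORY.DMP"
    if full_dump.is_file():
        found.append(full_dump)
    minidump_dir = root / "Minidump"
    if minidump_dir.is_dir():
        found.extend(sorted(minidump_dir.glob("*.dmp")))
    return found


if __name__ == "__main__":
    # Falls back to the usual install path if the variable is not set.
    system_root = os.environ.get("SystemRoot", r"C:\Windows")
    dumps = find_crash_dumps(system_root)
    if dumps:
        for dump in dumps:
            print("Found dump:", dump)
    else:
        print("No dump files found under", system_root)
```

An empty result does not rule out a crash. It only shows that Windows did not get as far as writing a dump, which fits a power loss or a frozen guest.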
Server Hardware, Microsoft Server OS, Windows Server 2012, VMware

Last Comment
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

8/22/2022 - Mon
Andrew N. Kowtalo

ASKER
There have been complaints in the past where every internal user would lose access to the E:\ share (\\servername\share) and it would miraculously come back up. As an update, the DC provides DNS for everything.

Running a ping to the local server showed no packet drops.

Ping statistics for 172.16.30.3:
    Packets: Sent = 209, Received = 209, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 39ms, Average = 2ms
Control-C
^C
C:\Users\kowtaloa>

C:\Users\kowtaloa>


I asked Datto what they did to fix the problem. They told me the drives were refreshed and the backups were kicked off.
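
A single clean ping run like the one above cannot catch an outage that only happens a few times a year. One option is to capture the ping summary on a schedule and keep the numbers. Here is a minimal sketch of the parsing step, assuming the Windows `ping` summary format shown in this post; `parse_ping_summary` is a hypothetical helper name.

```python
import re

# Matches the "Packets: Sent = N, Received = N, Lost = N" summary line.
PING_SUMMARY = re.compile(
    r"Sent = (?P<sent>\d+), Received = (?P<received>\d+), Lost = (?P<lost>\d+)"
)


def parse_ping_summary(text):
    """Extract packet counts from the summary printed by Windows ping."""
    match = PING_SUMMARY.search(text)
    if match is None:
        raise ValueError("no ping summary found")
    counts = {name: int(value) for name, value in match.groupdict().items()}
    sent = counts["sent"]
    counts["loss_percent"] = 100.0 * counts["lost"] / sent if sent else 0.0
    return counts


sample = "    Packets: Sent = 209, Received = 209, Lost = 0 (0% loss),"
print(parse_ping_summary(sample))
```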
Olgierd Ungehojer

What server is it? Is it a Dell?
What kind of UPS do you have there? You can check for power events on the UPS.
Andrew N. Kowtalo

ASKER
Hi Olgierd.   Here is the server make and model.  

VMware, Inc. VMware Virtual Platform
System Serial Number: VMware-56 4d e9 7b 4c 52 6f 1a-55 dd a2 16 ea 44 1f 04
Enclosure Type: Other

Windows Server 2012 R2 Standard (build 9600.19327)
Primary Domain Controller
Install Language: English (United States)
System Locale: English (United States)
Installed: 8/24/2017 6:30:50 PM
Boot Mode: BIOS (Secure Boot not supported)

Server Roles:
    Web Server (IIS)
    Active Directory Domain Services
    DHCP Server
    DNS Server
    Network Policy and Access Services
    Remote Desktop Services
    Remote Access
    File and Storage Services

Board: Intel Corporation 440BX Desktop Reference Platform
BIOS: Phoenix Technologies LTD 6.00 04/05/2016

2.10 gigahertz Intel Xeon E5-2620 v4 (6 installed)
2048 kilobyte primary memory cache
64-bit ready
Not hyper-threaded

407.65 Gigabytes Usable Hard Drive Capacity
236.38 Gigabytes Hard Drive Free Space

NECVMWar VMware SATA CD00 [Optical drive]

VMware Virtual disk SCSI Disk Device (322.12 GB) -- drive 1, s/n 6000c2922c438118ed210b882f6ec330
VMware Virtual disk SCSI Disk Device (85.90 GB) -- drive 0, s/n 6000c29adee7a80accd9a77163cde322



I am wondering if Kernel Power was selected from a drop-down as the reason for the crash when they logged back into the server and the error box popped up.

I am unsure if there are event logs on the UPS I can check.
Andrew N. Kowtalo

ASKER
There is an APC power supply with no network connection, so there is no way I can access it to check any logs.
Olgierd Ungehojer

I mean, if your server is powered off, your host VM is powered off too, and you have to physically go and power it on. Am I right?
Your problem looks hardware-related, but I am not sure, because you gave me information about your VM.

Do you have the problem only with your VM? And does your ESXi host have power and keep running when the issue occurs?
Andrew N. Kowtalo

ASKER
The physical server remained on but the ESXi host went down. So they hard powered the server down, which brought both the physical box and the ESXi host back up and fixed it. So it looks like the ESXi host is the culprit.
andyalder

Almost all server makes have a hardware log you can interrogate either during POST or via remote management interface. As Olgierd Ungehojer says what make/model is the hardware?
Andrew N. Kowtalo

ASKER
The server is in a rack. If I am logged in, where can I find the make and model? I ran Belarc but that's all it told me; I can't tell if it's an HP or a Dell.
RAFA

Hello

What model is your physical server?

You can restart it and run a hardware diagnostic to validate which hardware component is actually failing.

On your physical server, can you see whether the memory or power supply LEDs are showing an alarm?

I remain attentive to your comments.

regards...
Andrew N. Kowtalo

ASKER
Thank you. The server is an HP ProLiant DL160; I managed to get the cover off.
Andrew N. Kowtalo

ASKER
I am not sure if ILO is running.
RAFA

Hello

Could you tell me if it is Gen6, gen8, gen9 please?

I remain attentive to your comments.

regards...
Andrew N. Kowtalo

ASKER
Gen 9 thank you
Olgierd Ungehojer

By connecting through remote management you can check your hardware. This is a guide: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01325507. The server should have an extra NIC for it.
andyalder

iLO might be sharing one of the onboard ports; the extra NIC port is an option. It might not be configured with an IP address. It is easy enough to check the log and configure iLO for next time via the BIOS, but that needs a reboot. If the HPE tools are installed you can use hponcfg under VMware without rebooting; however, you need the current iLO password to do that. I don't know if they still tie a cardboard label on with the iLO default password.
RAFA

Hello,
Could you perform the diagnostic test on the server?
Andrew N. Kowtalo

ASKER
RAFA what kind of diagnostic test?
RAFA

Hello,

to validate which component fails at the hardware level

regards..
andyalder

RAFA, diags aren't much use if something only crashes occasionally. Assuming it crashes once per day you would have to run diags for a couple of weeks to verify the hardware as good.

I started fixing hardware at 21 years old, am now pushing 60, and still think diags are a waste of time, since POST will catch the big faults and the best way to test hardware is to throw an OS at it. All the major components report any errors to the BMC, so diags are no better than running VMware to isolate faults.
Andrew N. Kowtalo

ASKER
@andyalder so what do you recommend, iLO? I am being told ESXi is the culprit, not the physical server. It seems that when ESXi goes down the hardware remains on and running.
andyalder

Still worth checking the IML for errors, see the section under "Integrated Management Log" in http://cdn.cnetcontent.com/61/78/6178231e-6699-4b5d-835e-bce904c69555.pdf
Andrew N. Kowtalo

ASKER
I just searched for this on the server and do not see it. Is this something that needs to be installed manually?
iml.JPG
RAFA

If the ESXi is responsible, migrate your vm to other hosts.

Place the affected host in maintenance mode and reinstall the ESXi on the affected server.
RAFA

If you want, you can download the ESXi host logs and share them for review.

Here is a reference link on how to do it.

https://www.altaro.com/vmware/introduction-esxi-vm-log-files/
Andrew N. Kowtalo

ASKER
I will see if I can obtain the ESXi logs and post them once we get them.
Andrew N. Kowtalo

ASKER
Attached are the log files from the ESXI host
ESXI-Logs.zip
RAFA

Hello,
Reviewing the logs you provided, I noted the following error in vmkernel.log:

1) 2019-04-05T13:26:05.799Z cpu1: 66170) nhpsa: hpsa_vmkScsiCmdDone: 5238: Sense data: error code: 0x70, key: 0x5, info: 00 00 00 00, cmdInfo: 00 00 00 00, CmdSN: 0xf7e0, worldId: 0x1077b, Cmd: 0x85, ASC: 0x20, ASCQ: 0x0
This message is generated because your disk or device does not support command 0x85, which is ATA PASS-THROUGH (16). The vmkernel log refers to it with these fields:
Cmd: 0x85, ASC: 0x20, ASCQ: 0x0

2) There is an additional event:

2019-04-05T13:20:50.780Z cpu10: 13596655) FSS: 6751: Failed to open file 'hpilo-d0ccb0'; Requested flags 0x5, world: 13596655 [sfcb-smx], (Existing flags 0x5, world: 13596657 [sfcb-smx]): Busy
Neither event is necessarily a problem. Some devices do not, and cannot, provide the requested information. If the storage device is working properly, the log messages can be ignored.
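
The decoding of the first event can be reproduced with a short Python sketch. The lookup tables below cover only the values in that one log line (sense key 0x5, ASC/ASCQ 0x20/0x00, opcode 0x85) using their standard SCSI names, and `decode_sense` is a hypothetical helper name, not part of any VMware tool.

```python
import re

# Lookup tables limited to the values that appear in the quoted log line.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}
OPCODES = {0x85: "ATA PASS-THROUGH (16)"}

# Matches "key: 0x5", "Cmd: 0x85", "ASC: 0x20" and "ASCQ: 0x0",
# but not "CmdSN: ..." or "cmdInfo: ...".
FIELD = re.compile(r"\b(key|Cmd|ASC|ASCQ): (0x[0-9a-fA-F]+)")


def decode_sense(line):
    """Translate the SCSI sense fields of one vmkernel line into names."""
    fields = {name: int(value, 16) for name, value in FIELD.findall(line)}
    asc_ascq = (fields.get("ASC"), fields.get("ASCQ"))
    return {
        "command": OPCODES.get(fields.get("Cmd"), "unknown"),
        "sense_key": SENSE_KEYS.get(fields.get("key"), "unknown"),
        "additional_sense": ASC_ASCQ.get(asc_ascq, "unknown"),
    }


LINE = (
    "nhpsa: hpsa_vmkScsiCmdDone: 5238: Sense data: error code: 0x70, "
    "key: 0x5, info: 00 00 00 00, cmdInfo: 00 00 00 00, CmdSN: 0xf7e0, "
    "worldId: 0x1077b, Cmd: 0x85, ASC: 0x20, ASCQ: 0x0"
)

print(decode_sense(LINE))
```

Read together, the fields say the device rejected an ATA pass-through command it does not implement, which matches the reading above that the message is noise rather than a fault.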

I am also including a reference link where you can verify this; there are several cases of this same error occurring on HP servers with different versions of ESXi, even the latest version, 6.7.

https://communities.vmware.com/thread/559310

It would also be good to know whether you have had problems with the host again. Has it rebooted or shut down unexpectedly?
To date, the host has worked correctly.

I remain attentive to your comments..
RAFA

Please tell me which version of ESXi you have installed, along with the ESXi build number.

It would also be nice to know the model of your HBA card.
Andrew N. Kowtalo

ASKER
Each time the disconnection problem has occurred, the server's physical hardware remained on but ESXi shut down.


VMware ESXi™

Copyright © 1998-2016 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents.
Version: 1.8.0
Build number: 4516221
ESXi version: 6.5.0
ESXi build number: 4564106
Andrew N. Kowtalo

ASKER
To reboot the ESXi host we have had to physically power down the server and bring it back up, which forces the host to restart, and then it works again.
andyalder

You can't get to the IML from a virtual machine, it has no access to the host's hardware.
Andrew N. Kowtalo

ASKER
Hi Gents have you been able to come up with any updates on this?
Andrew N. Kowtalo

ASKER
Hi guys just following up on this to see if you located any potential resolution.
Andrew N. Kowtalo

ASKER
Where can I get the model of the HBA card?
Andrew N. Kowtalo

ASKER
I don't know if this will have the HBA card info, and I'm not sure how to get that, but here is the entire About output.
esxi-about.txt
Andrew N. Kowtalo

ASKER
here is more hardware info.

CPU: 12% used (USED: 2 GHz, FREE: 14.8 GHz, CAPACITY: 16.8 GHz)
MEMORY: 68% used (USED: 21.62 GB, FREE: 10.13 GB, CAPACITY: 31.75 GB)
STORAGE: 27% used (USED: 298.87 GB, FREE: 811.38 GB, CAPACITY: 1.08 TB)

Host
    Version: 6.5.0 (Build 4564106)
    State: Normal (not connected to any vCenter Server)
    Uptime: 471.87 days

You are running HPE Customized Image ESXi 6.5.0 version 650.9.6.0.28 released on November 2016 and based on ESXi 6.5.0 Vmkernel Release Build 4564106.
SSH is enabled on this host. You should disable SSH unless it is necessary for administrative purposes.

Hardware
    Manufacturer: HP
    Model: ProLiant DL160 Gen9
    CPU
        Logical processors: 16
        Processor type: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
        Sockets: 1
        Cores per socket: 8
        Hyperthreading: Yes, enabled
    Memory: 31.75 GB
    Virtual flash
        Capacity: 0 B
        Used: 0 B
        Free: 0 B

Networking
    Hostname:
    IP addresses: vmk0: 172.16.30.2
    DNS servers: 8.8.8.8, 4.2.2.1
    Default gateway: 172.16.30.1
    IPv6 enabled: No
    Host adapters: 2
    Networks: VM Network (3 VMs)

Storage
    Physical adapters: 2
    Datastores: datastore1 (Type: VMFS5, Capacity: 1.08 TB, Free: 811.38 GB)

Configuration
    Image profile: HPE-ESXi-6.5.0-OS-Release-iso-650.9.6.0.28 (Hewlett Packard Enterprise)
    vSphere HA state: Not configured
    vMotion: Not supported

System Information
    Date/time on host: Monday, May 06, 2019, 18:20:47 UTC
    Install date: Thursday, August 24, 2017, 17:28:53 UTC
    Asset tag: unknown
    Service tag: 2M273102RK
    BIOS version: U20
    BIOS release date: Sunday, September 11, 2016, 20:00:00 -0400

Andrew N. Kowtalo

ASKER
Does this help any?
hba.JPG
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

How is ESXi shutting down?

With a Purple Screen of Death, or is it just going off?

Your version of ESXi is old...

Latest version is ESXi-6.5.0-20190304001-standard (Build 13004031)
Andrew N. Kowtalo

ASKER
Andrew, thanks for the reply. The entire location loses access to their local DC/FS, which runs on the ESXi host. The physical hardware remains on but access to local files is lost. The only way to resolve the issue is to restart the DC by powering it down and bringing it back up. The host has reported some failures in the logs.

Also, I am not sure they are willing to upgrade, since 6.5 is 5 grand. Are there significant changes in this version that fix issues?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Okay, is the issue with

1. The ESXi hypervisor crashing or restarting?

2. The virtual machines (they will have issues with guest VMs if 1. above crashes)?

They are already on ESXi 6.5.

471.87 days of ESXi host uptime does not suggest an issue with the host hypervisor. BUT that was the initial General Release of ESXi 6.5 (i.e. the FIRST release).

There have been 22 bug fixes since your version...

Your version is two years old...

I would recommend...

1. Update firmware in Host Server.
2. Update ESXi 6.5 to current version.

Then see if the guest VM issues continue. Also make sure the guest VMs are ALL running with VMXNET3 interfaces and not E1000. (VMware Tools MUST be installed.)
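
The uptime argument above can be checked with a little date arithmetic. This is a minimal sketch using only figures already posted in the thread: the host clock (Monday, May 06, 2019, 18:20:47 UTC), the uptime (471.87 days), and the date of the guest's Event 6008 (4/22/2019). The helper name `last_boot` is hypothetical, and the result is only as good as the uptime counter itself.

```python
from datetime import date, datetime, timedelta, timezone


def last_boot(reported_at, uptime_days):
    """Estimate when the host last booted from its clock and uptime counter."""
    return reported_at - timedelta(days=uptime_days)


# Values copied from the host summary posted earlier in the thread.
host_clock = datetime(2019, 5, 6, 18, 20, 47, tzinfo=timezone.utc)
boot = last_boot(host_clock, 471.87)

# Date of the guest's unexpected shutdown (Event 6008). Its time zone is not
# stated, so only the calendar date is compared.
guest_event = date(2019, 4, 22)

print("Estimated host boot:", boot.date().isoformat())
print("Host booted before the guest event:", boot.date() < guest_event)
```

If the counter is accurate, the host has been up since January 2018, more than a year before the guest's unexpected shutdown. That points at the guest VMs or their storage and networking rather than a hypervisor restart.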
Andrew N. Kowtalo

ASKER
To be honest, I am not sure where the break is happening. I can just tell you their FS/DC and ESXi all run off one server. Once all access to the hosted files is lost, the only way to restore access is to physically power cycle the server. When access does go down, the physical server remains on, so the device isn't powering off. The logs in the VM suggested controller failures, which is why I was thinking it was possibly the hypervisor. It's hard to tell.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Please see above post.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

To be honest with you, having a very old ESXi 6.5 exposed to the internet is not a good idea.
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Andrew N. Kowtalo

ASKER
I think we will need to hide that.
Andrew N. Kowtalo

ASKER
Fantastic help as always gents.   I will do the recommended.
Andrew N. Kowtalo

ASKER
I will recommend these changes to the client and see how they want to proceed.
Andrew N. Kowtalo

ASKER
Andrew, I also found out there is an incorrect DNS setting on their vmxnet3 adapter. I am going to adjust it to use the DC's IP for the primary and the loopback for the secondary, then update VMware Tools tonight, reboot, and see what that does going forward.
Incorrect-DNS.JPG
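
The DNS change described above (DC's own IP as primary, loopback as secondary) can be expressed as a simple check. This is only a sketch: `check_dc_dns` is a hypothetical helper, 172.16.30.3 is assumed to be the DC because it is the server pinged earlier in the thread, and the "bad" example values are made up, since the actual incorrect settings appear only in the screenshot.

```python
import ipaddress


def check_dc_dns(dns_servers, dc_ip):
    """Return warnings about a domain controller's DNS client settings."""
    warnings = []
    addresses = [ipaddress.ip_address(server) for server in dns_servers]
    if not addresses or addresses[0] != ipaddress.ip_address(dc_ip):
        warnings.append("primary DNS is not the DC's own IP")
    for address in addresses:
        if not (address.is_private or address.is_loopback):
            warnings.append(f"{address} is a public resolver with no view of the internal AD zone")
    return warnings


# Planned settings from the post: DC IP first, loopback second (IP assumed).
print(check_dc_dns(["172.16.30.3", "127.0.0.1"], "172.16.30.3"))

# Hypothetical bad configuration, for illustration only.
print(check_dc_dns(["8.8.8.8", "172.16.30.3"], "172.16.30.3"))
```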
Andrew N. Kowtalo

ASKER
Now it has the correct DNS and will have VMware Tools updated tonight.
Correct-DNS.JPG
SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Andrew N. Kowtalo

ASKER
Andrew, it has been that way for a long time; I just made the discovery. It is now correct, and hopefully the Tools update refreshes the driver. By the way, they are running 6.5.
SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.