Stan J

PSOD on ESXi 6.5

Got in this morning and had a slew of emails that the ESXi Server went down late yesterday.

Our IT guy connected a console to the host and sure enough, PSOD.

System has been up for 75 days

I was able to extract the vmkernel log from the core dump.

Error is ---
@Bluescreen: Spin Count exceeded - possible deadlock with pcpu 48

vmkwarning log has:
 vmx-thread-5 vpn 0x600bc9e5 status: "Invalid address (bad00026)
 vmx-thread-6 Lali-box: vpn 0x600bca07 status: "Invalid address (bad00026)
 

The ESXi logs and events show a running VM associated with the error.

In the vCenter event log, I see several events repeating with a red X:
  The user world daemon for Kali-box could not fault in a page.
  The virtual machine is terminated as further progress is impossible

I don't think it is a hardware issue (CPU, RAM, Disk)?
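For anyone who can only read logs on the host and not copy them off, tallying how often the `Invalid address (bad00026)` status shows up, and for which threads, can help when describing the problem to support. The following is a minimal Python sketch, not a VMware tool. The regular expression assumes the line layout matches the two vmkwarning lines quoted above, and `tally_invalid_address` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout, modelled on the two vmkwarning lines quoted in this post:
#   vmx-thread-N [optional VM name:] vpn 0x... status: "Invalid address (badXXXXX)
STATUS_RE = re.compile(
    r"(?P<thread>vmx-thread-\d+).*?"
    r"vpn\s+(?P<vpn>0x[0-9a-fA-F]+).*?"
    r"Invalid address \((?P<code>bad[0-9a-fA-F]+)\)"
)


def tally_invalid_address(lines):
    """Count 'Invalid address' entries per (thread, status code)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[(match.group("thread"), match.group("code"))] += 1
    return counts


# Sample input: the lines as typed in the post, plus the bluescreen line,
# which the pattern is expected to skip.
SAMPLE = [
    'vmx-thread-5 vpn 0x600bc9e5 status: "Invalid address (bad00026)',
    'vmx-thread-6 Lali-box: vpn 0x600bca07 status: "Invalid address (bad00026)',
    "@Bluescreen: Spin Count exceeded - possible deadlock with pcpu 48",
]

if __name__ == "__main__":
    for (thread, code), n in sorted(tally_invalid_address(SAMPLE).items()):
        print(thread, code, n)
```

To run it against a real log, pass the open file object instead of `SAMPLE`; the log's path and exact format on a given host are not confirmed here.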
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Most PSODs are caused by hardware, compatibility, or driver issues.

1. Are you running the latest build of ESXi ?

2. Are you using the latest firmware updates for your server ?

3. Is the server on the HCL ?

4. Can you reproduce the error PSOD ?

5. Check hardware using vendor diagnostics, e.g. CPU, Memory.

6. But a spinlock error is caused by a thread lockup, and it requires VMware support to find the issue. (Basically, a process locked up and held a lock for a very long time, a waiting CPU hit its timeout, and the host crashed, i.e. PSOD.)
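As a toy illustration of point 6 only: the sketch below busy-waits on a lock and gives up after a fixed number of failed attempts, which is the general idea behind a "spin count exceeded" check. It is not how the VMkernel implements its spinlocks; `spin_acquire`, `SpinCountExceeded`, and the spin budget are hypothetical names and values chosen for the example.

```python
import threading


class SpinCountExceeded(RuntimeError):
    """Raised when the lock could not be taken within the spin budget."""


def spin_acquire(lock, max_spins):
    """Busy-wait on `lock`; give up after `max_spins` failed attempts."""
    for _ in range(max_spins):
        # Non-blocking attempt: returns True only if the lock was free.
        if lock.acquire(blocking=False):
            return
    raise SpinCountExceeded(f"spin count exceeded after {max_spins} attempts")


if __name__ == "__main__":
    lock = threading.Lock()
    lock.acquire()  # simulate a holder that never releases the lock
    try:
        spin_acquire(lock, max_spins=1000)
    except SpinCountExceeded as exc:
        print(exc)
```

In user space the waiter can raise an exception and carry on. A kernel that finds a CPU stuck spinning has no safe way to continue, which is why the host halts instead.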
Stan J

ASKER

1. Are you running the latest build of ESXi ?
No, 6.5 U1

2. Are you using the latest firmware updates for your server ?
Probably Not

3. Is the server on the HCL ?
I believe so

4. Can you reproduce the error PSOD ?
No. I tried running the same VM and the PSOD does not occur; the VM is still running.

5. Check hardware using vendor diagnostics, e.g. CPU, Memory.
Can't take the system down that long. With 512 GB RAM, diagnostics will take days.
The logs don't seem to point to memory

6. But a spinlock error is caused by a thread lockup, and requires the support of VMware, to find the issue.  
That would require sending logs to support, and we are not permitted to take any files off the ESXi host.
Well, in that case you have 1, 2, and 3 to do, and then see if it comes back.

It could be random, but that is unusual.

Spinlock needs logs sending to VMware Support.
Stan J

ASKER

The most we could do with logs is tell them what we see, ask what they want to see, and type in the information.

We can look into updates.

There is a lot of info at VMware on spinlocks, but nothing tied to this exact issue.
VMware Support will only be able to work on a Bundle Upload.

You will have to discuss that with them, or try what I've suggested in 1, 2, and 3,

or do nothing and see if it recurs within 7 days.
Member_2_231077

If it's a decent server it will have a hardware error log stored on the motherboard that you can query. What make/model is it?
Stan J

ASKER

it is a SuperMicro and I have contacted the vendor
That could be an issue which you need to discuss with Supermicro, as their hardware may not necessarily be on the Hardware Compatibility List.

Good Luck with Supermicro.
Stan J

ASKER

It has been running fine as a dual node for 2 years.  

I will double check, but if it is like the other servers we have, it was built to VMware HCL specs.
Weird things happen; hopefully it's a hardware fault, then.

We don't use Supermicro for production, and neither do any of our clients; they use Tier 1 vendors because of support issues. (Response and time to fix are longer than 1 hour, or 4 hours at worst, which does not meet our SLAs.)

Let us know what they recommend and the fix.
Stan J

ASKER

Will do.

This is a dev and test environment.

Our next server set may be Dell. I have worked with Dell in the past and we even have field reps assigned here locally.

The vendor's first suggestion was what you suggested: send the support bundle to VMware. But we can't remove files, so my guess is the next steps will be

firmware and an ESXi update.

I will let you know.
We would recommend starting with firmware updates and the latest build of ESXi 6.5,

which is ESXi-6.5.0-20191204001-standard (Build 15256549) as of 21 Feb 2020.

and then see if the problem returns.
The IPMI guide for SuperMicro mentions the BMC event log but whether that log lists any hardware errors is hard to tell. If a machine just crashes once I generally put it down to cosmic rays; there's no way to completely eliminate bit-flips in the CPUs etc and the smaller scale the lithography gets the more likely they are to occur.
Stan J

ASKER

I contacted the vendor and no logs are kept on the motherboard, at least none that would be of any use.

Also,
we just had another PSOD on that node.  Same VM looks to have caused it.  

The logs showed a message .... Invalid address (bad00026)

Found the below KB article that mentions “Invalid address (bad00026)” which is one of the entries in the logs I searched.

https://kb.vmware.com/s/article/2151113

The current build number of our ESXi is 5969303.

The patch that corrects the (bad00026) issue is in release ESXi650-201712001, Build 7388607.

Not saying the patch is the answer, but it looks promising.
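The build numbers quoted in this thread make for a quick sanity check on whether a host already carries the fix. This is a minimal sketch that assumes build numbers only increase within the 6.5 line and that later builds include earlier fixes; `includes_fix` is a hypothetical helper, not part of any VMware tooling.

```python
# Build numbers as quoted in this thread.
INSTALLED_BUILD = 5969303    # current build on the affected host
FIX_BUILD = 7388607          # ESXi650-201712001, the release with the bad00026 fix
LATEST_65_BUILD = 15256549   # ESXi-6.5.0-20191204001-standard, latest as of 21 Feb 2020


def includes_fix(installed_build, fix_build):
    """Return True if the installed build is at or past the fix build.

    Assumption: within one ESXi release line, a higher build number
    means a later, cumulative patch level.
    """
    return installed_build >= fix_build


if __name__ == "__main__":
    print("installed host has fix:", includes_fix(INSTALLED_BUILD, FIX_BUILD))
    print("latest 6.5 build has fix:", includes_fix(LATEST_65_BUILD, FIX_BUILD))
```

The comparison only says whether a build is new enough to contain the patch; it says nothing about whether that patch actually resolves this PSOD.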
It's always recommended to apply the latest builds available.

You are 24 patches behind, 2 years out of date.

The latest patch will have all the other fixes incorporated, so, as I recommended:

Apply firmware updates where available, especially for the BIOS, storage controller, and network interfaces, and then apply ESXi-6.5.0-20191204001-standard (Build 15256549), the latest as of 21 Feb 2020.
Stan J

ASKER

Yes, I am waiting on the vendor to supply the firmware updates. I will then get the latest ESXi 6.5 release, or may upgrade to the latest 6.7 release along with vCenter.
That's a huge jump to 6.7 (just to fix a buglet and a PSOD)! But if it's test & dev, you might as well wait until March for the new release!
Stan J

ASKER

I don't know if we would jump to the new release, 7.

I usually wait for 1 cycle of patches or updates before upgrading.