Stan J
asked on
PSOD on ESXi 6.5
Got in this morning and had a slew of emails that the ESXi Server went down late yesterday.
Our IT guy connected a console to the host and sure enough, PSOD.
System has been up for 75 days
I was able to extract the vmkernel log from the core dump.
Error is ---
@Bluescreen: Spin Count exceeded - possible deadlock with pcpu 48
vmkwarning log has:
vmx-thread-5 vpn 0x600bc9e5 status: "Invalid address (bad00026)
vmx-thread-6 Kali-box: vpn 0x600bca07 status: "Invalid address (bad00026)
Checking the ESXi Logs and events, it shows a VM running and with the error
In the Event Log in vCenter, I see several events repeating with a red X:
The user world daemon for Kali-box could not fault in a page.
The virtual machine is terminated as further progress is impossible
I don't think it is a hardware issue (CPU, RAM, Disk)?
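For anyone who has to read these logs on the host itself, here is a minimal sketch of pulling the page-fault entries out of vmkwarning text. The regular expression is fitted only to the two excerpts typed above (with the VM name as shown in the vCenter event); real vmkwarning lines carry timestamps and other fields, so the pattern and the `find_faults` helper are assumptions, not a VMware tool.

```python
import re

# Pattern fitted to the two excerpts quoted in this thread (an assumption:
# real vmkwarning lines have timestamps and extra fields around this part).
FAULT_RE = re.compile(
    r"(?P<thread>vmx-thread-\d+)\s+"
    r"(?:(?P<vm>[^:\s]+):\s+)?"            # VM name is optional in the excerpts
    r"vpn\s+(?P<vpn>0x[0-9a-fA-F]+)\s+"
    r"status:\s+\"(?P<status>[^(\"]+?)\s*\((?P<code>[0-9a-fA-F]+)\)"
)


def find_faults(lines):
    """Return one dict per line that looks like a page-fault warning."""
    faults = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            faults.append(match.groupdict())
    return faults


sample = [
    'vmx-thread-5 vpn 0x600bc9e5 status: "Invalid address (bad00026)',
    'vmx-thread-6 Kali-box: vpn 0x600bca07 status: "Invalid address (bad00026)',
]

for fault in find_faults(sample):
    print(fault["thread"], fault["vm"], fault["vpn"], fault["code"])
```

Grouping the hits by VM name and status code makes it easier to see whether one VM is behind every fault.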
ASKER
1. Are you running the latest build of ESXi ?
No, we are on 6.5 U1.
2. Are you using the latest firmware updates for your server ?
Probably Not
3. Is the server on the HCL ?
I believe so
4. Can you reproduce the error PSOD ?
No. I tried running the same VM and the PSOD does not occur; the VM is still running.
5. Check hardware using vendor diagnostics, e.g. CPU, Memory.
Can't take the system down that long. With 512 GB RAM, diagnostics will take days.
The logs don't seem to point to memory
6. But a spinlock error is caused by a thread lockup, and requires the support of VMware, to find the issue.
That would require sending logs to support, and we are not permitted to take any files off the ESXi host.
Well in that case you got 1,2,3 to do... and see if it comes back.
it could be random.... but it is unusual.
Spinlock needs logs sending to VMware Support.
ASKER
The most we could do with the logs is tell them what we see, ask what they want to see, and type the information in.
we can look into updates ....
there is a lot of info at VMware on spinlock, but not tied to this exact issue
VMware Support will only be able to work on a Bundle Upload.
You will have to discuss with them, or try what I've suggested 1,2,3
or do nothing, and see if it re-occurs within 7 days.
If it's a decent server it will have a hardware error log stored on the motherboard that you can query. What make/model is it?
ASKER
it is a SuperMicro and I have contacted the vendor
That could be an issue which you need to discuss with Supermicro, as their hardware may not necessarily be on the Hardware Compatibility List.
Good Luck with Supermicro.
ASKER
It has been running fine as a dual node for 2 years.
I will double check, but if it is like the other servers we have, it was built to VMware HCL specs.
Weird things happen, hopefully hardware fault then.
We don't use Supermicro for Production, nor do any of our clients; they use Tier 1 vendors, because of support issues. (Response and time to fix is longer than 1 hour, or 4 hours at worst, which does not meet our SLAs.)
let us know what they recommend and the fix.
ASKER
Will do.
This is a dev and test environment.
Our next server set may be Dell. I have worked with Dell in the past and we even have field reps assigned here locally.
The vendor's first suggestion was what you suggested: send the support bundle to VMware. But we can't remove files, so my guess is the next steps will be
firmware and an ESXi update.
I will let you know.
We would recommend starting with firmware updates, and latest ESXi build of ESXi 6.5.
which is ESXi-6.5.0-20191204001-standard (Build 15256549) as of 21 Feb 2020.
and then see if the problem returns.
The IPMI guide for SuperMicro mentions the BMC event log but whether that log lists any hardware errors is hard to tell. If a machine just crashes once I generally put it down to cosmic rays; there's no way to completely eliminate bit-flips in the CPUs etc and the smaller scale the lithography gets the more likely they are to occur.
ASKER
I contacted the vendor and no logs are kept on the motherboard, at least none that would be of any use.
Also,
we just had another PSOD on that node. Same VM looks to have caused it.
The logs showed a message .... Invalid address (bad00026)
Found the below KB article that mentions “Invalid address (bad00026)” which is one of the entries in the logs I searched.
https://kb.vmware.com/s/article/2151113
The current build number of our ESXi is 5969303.
The patch that corrects the (bad00026) issue is in release ESXi650-201712001, Build 7388607.
Not saying the patch is the answer, but it looks promising.
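As a quick sanity check on the build numbers mentioned in this thread, here is a rough sketch. It assumes that within the 6.5 release line a higher build number means a later, cumulative patch; `includes_fix` is just an illustrative helper, not a VMware API.

```python
# Build numbers as quoted in this thread.
INSTALLED_BUILD = 5969303    # ESXi 6.5 U1, the asker's current build
FIX_BUILD = 7388607          # ESXi650-201712001, per the KB article
LATEST_65_BUILD = 15256549   # latest 6.5 build as of 21 Feb 2020, per this thread


def includes_fix(installed: int, fix: int) -> bool:
    """True if the installed build is at or past the build carrying the fix.

    Assumption: builds within one release line are cumulative, so a higher
    build number contains every earlier fix.
    """
    return installed >= fix


print(includes_fix(INSTALLED_BUILD, FIX_BUILD))   # False
print(includes_fix(LATEST_65_BUILD, FIX_BUILD))   # True
```

So the host as it stands predates the fix, while the latest 6.5 build mentioned earlier would already include it.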
It's always recommended to apply the latest builds available.
You are 24 patches behind, 2 years out of date.
The latest patch will have all the other fixes incorporated, so, as I recommended:
Apply firmware updates where available, especially for the BIOS, storage controller and network interfaces, and apply ESXi-6.5.0-20191204001-standard (Build 15256549), the latest as of 21 Feb 2020.
ASKER
Yes, I am waiting on the vendor to supply the firmware updates. I will then get the latest ESXi 6.5 release, or may upgrade to the latest 6.7 release along with vCenter.
That's a huge jump to 6.7 (just to fix a buglet and a PSOD!), but if it's test & dev, you might as well wait until March for the new release!?
ASKER
I don't know if we would jump to the new release of 7.
I usually wait for 1 cycle of patches or updates before upgrading.
1. Are you running the latest build of ESXi ?
2. Are you using the latest firmware updates for your server ?
3. Is the server on the HCL ?
4. Can you reproduce the error PSOD ?
5. Check hardware using vendor diagnostics, e.g. CPU, Memory.
6. But a spinlock error is caused by a thread lockup, and requires the support of VMware to find the issue. (Basically, a process locked up and held a thread for a very long time, then timed out and crashed, i.e. a PSOD.)
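To illustrate the idea behind "Spin Count exceeded", here is a toy sketch in Python. It is not how the VMkernel implements spinlocks; `spin_acquire`, `SpinCountExceeded` and the spin limit are made up for illustration. A waiter retries a lock a bounded number of times and treats running out of attempts as a possible deadlock.

```python
import threading


class SpinCountExceeded(RuntimeError):
    """Raised when the lock could not be taken within the spin budget."""


def spin_acquire(lock: threading.Lock, max_spins: int) -> int:
    """Busy-wait on the lock; return how many attempts it took.

    Illustrative only: a hypervisor would halt the host at this point,
    whereas this toy simply raises an exception.
    """
    for spins in range(1, max_spins + 1):
        if lock.acquire(blocking=False):
            return spins
    raise SpinCountExceeded(f"gave up after {max_spins} attempts")


lock = threading.Lock()
lock.acquire()  # simulate another holder that never lets go

try:
    spin_acquire(lock, max_spins=1000)
except SpinCountExceeded as err:
    print("possible deadlock:", err)
```

The point is that the waiter cannot tell a slow holder from a stuck one; all it knows is that it waited far longer than any healthy holder should need.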