VMware 6.0 U3 host: VM server would not start, host would not go into Maintenance Mode
Have a critical Win 2016 Server VM that will not start. Others on this host were running OK. It got to 35%, then seemed to just stop. Shut down the VMs and tried to go into Maintenance Mode. This failed as well. Finally just rebooted the host and tried again. There was again a problem with this one VM. After waiting nearly 30-40 minutes, the VM crept up to 40%, then 70%, and finally came up. After dealing with the usual Win 2016 update delays / problems (another hour), the server VM seems to be OK.
I am concerned about how and why this happened. Cannot find any obvious indicators or errors in the events or VMware monitoring. Also, the iDRAC on the Dell shows no clear hardware issues.
How best to troubleshoot?
Thanks!
Virtualization, VMware, esxi6
Last Comment
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
8/22/2022 - Mon
rindi
The only thing I can think of is that you are low on resources, either on your Physical Server, or on your VM. Maybe you need to assign it more RAM, add more disks to your arrays and enlarge the Virtual disk of your VM etc.
ziceman
ASKER
Not a resource thing. The host is barely breaking a sweat from the standpoint of CPU / RAM / Storage.
rindi
That may be true for normal operation. But if your VM had a problem, with HD integrity for example, and had to run a chkdsk during bootup because of that, and your other VMs were using the same physical disks at the same time, this would slow the chkdsk down. It can then take ages to finish, depending on the sizes of the partitions and the RAID you are using.
So, for example, it is better to create several RAID arrays with more disks, and then distribute these arrays evenly among your VMs. That should improve the disk performance of your VMs.
ziceman
ASKER
I am thinking this was more than a "performance" issue. When there is a VM boot issue or a forced disk check, the VM console will show this to be the case. One can see the POST screen and the OS boot problems. If there was a forced chkdsk, this would be visible. The console showed absolutely nothing for this VM, and this condition lasted nearly an hour. Also, the failure to go into Maintenance Mode with "General Error" is concerning.
David Johnson, CD
What are your disk queues on the host? What are the configured memory and vCPU for the VM?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Troubleshooting is performed at the ESXi shell.
What is the make and model of the server ?
Also as you are probably aware by now 6.0 U3 is now end of life.
The answer will be in /var/log/vmkernel.log, but if you have rebooted the host, then unless you kept the logs or have them sent to a syslog server, the logs may be gone.
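As a stopgap until a proper syslog server is in place, here is a minimal Python sketch of a UDP listener that appends whatever it receives to a file, so the messages survive a host reboot. It assumes the host has been configured to forward its logs over UDP to port 514 on the machine running the script. The file name `esxi-syslog.log` and the helper names are made up for illustration, and this is not a substitute for a real syslog server.

```python
import socketserver

# Hypothetical output file; pick a location on storage that is NOT on the ESXi host.
LOG_PATH = "esxi-syslog.log"


def format_record(source_ip, payload):
    """Turn one received datagram into a single log line tagged with the sender's IP."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    return f"{source_ip} {text}\n"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(format_record(self.client_address[0], data))


def serve(host="0.0.0.0", port=514):
    """Listen for syslog datagrams until interrupted."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` starts the listener. Binding to port 514 usually needs elevated privileges, and UDP delivery is best effort, so treat the resulting file as a convenience copy.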
ziceman
ASKER
Hi Andrew, yes, I was thinking that the logs might be lost at this point. Would have taken some time to look through them or copy them off, but was not entirely sure where to go.
Understood regarding 6.0 U3. We tend to lag behind, as much of the enterprise Dell hardware is provided by a refurbisher.
Have you ever seen this behavior, where a VM seems to crawl to life with no visual feedback in the console - then, afterward, all is well?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
"Have you ever seen this behavior, where a VM seems to crawl to life with no visual feedback in the console - then, afterward, all is well?"
Yes.
ziceman
ASKER
OK. My primary goal here is to try to anticipate and be proactive, but I cannot find any indication of problems. The logs now look clean (of course), and the Dell iDRAC is all green and healthy. Resource utilization is nominal and overall performance seems fine. But *something* happened, and I would like to do more sleuthing, however it might be achieved.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Make sure you set up a syslog server, so all the logs are kept off the ESXi host.
The vmware.log for the VM will still be present.
You also never answered the question of whether the VM was running on a snapshot at the time, or whether it is still running on a snapshot.
How do you back up?
Are you using FREE ESXi or Licensed with Support?
We also still do not know the make and model of the host server, or the build of ESXi 6.0.
So there are lots of unanswered questions that may help the situation if you want to be "proactive". To be honest and frank, running a "Critical" Guest OS on an unsupported platform is a risk,
which you may want to escalate higher up the food chain.
ziceman
ASKER
Looks like it is running on a snapshot, but I am not seeing other snapshots listed.
It is backed up with both Veeam and Altaro
The license is commercial with paid support. 6.0.0 Update 3 (Build 10719132) running on a Dell PowerEdge R620.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
That is definitely the cause of a PERFORMANCE issue!
"Looks like it is running on a snapshot, but I am not seeing other snapshots listed."
That's normal when a third-party backup program cannot remove its snapshot for whatever reason, which is why you should check after every backup!
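One way to make that check routine is to look at the VM's folder for snapshot disk files. The sketch below is plain Python working on a list of file names, and it assumes the usual naming pattern in which snapshot disks carry a six-digit suffix such as `-000001.vmdk`, with a `-delta.vmdk` or `-sesparse.vmdk` data file. The folder listing and the `server01` name are hypothetical, and the pattern is only a heuristic, so a hit means "go and look", not proof.

```python
import re

# Heuristic: snapshot disks are conventionally named "<base>-NNNNNN.vmdk",
# with a matching "-delta.vmdk" or "-sesparse.vmdk" data file.
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta|-sesparse)?\.vmdk$", re.IGNORECASE)


def find_snapshot_disks(filenames):
    """Return the file names that look like snapshot (delta) disks."""
    return sorted(name for name in filenames if SNAPSHOT_DISK.search(name))


# Hypothetical listing of a VM folder; the names are for illustration only.
listing = [
    "server01.vmx",
    "server01.vmdk",
    "server01-flat.vmdk",
    "server01-000001.vmdk",
    "server01-000001-delta.vmdk",
    "server01-Snapshot1.vmsn",
    "vmware.log",
]

leftovers = find_snapshot_disks(listing)
if leftovers:
    print("VM appears to be running on a snapshot:", ", ".join(leftovers))
else:
    print("No snapshot disks found.")
```

Running this against a fresh listing after each backup window would flag a VM that is still on a snapshot even when Snapshot Manager shows nothing.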
How to get rid of it:
1. Take a new snapshot (it does not have to have memory ticked!)
2. Wait 120 seconds for the Guest OS to stabilise.
3. DELETE ALL.
4. WAIT and BE PATIENT... it could take seconds, minutes, days or weeks, depending on how many snapshots there are, how large they are, and how slow/fast the existing storage is.
But do not cancel, do not abort, do not shut down, do not restart the host. Any meddling can cause snapshot corruption and data loss.
You may also experience more performance issues during the merge and deletion.
I wrote this Article for EE 9 years ago, and it is still relevant today.
Whilst the snapshot is deleting/merging, I would recommend reading it. Do not be tempted to watch the progress, as it can take a while. Grab a coffee, go for a walk!
It can appear to hang, but it will complete if you just leave it alone. Remember: no fiddling!
Your ESXi build is 7 patches behind, and 2.5 years out of date compared to the last build available, I would also recommend updating.
ziceman
ASKER
OK. I think a big part of the problem has been found. The backups had been running in Veeam, and it was decided to take a look at Altaro. The jobs were to be scheduled to not overlap, but at some point recently they did. So it seems that temp snapshots created by the backup software were conflicting.
The last backup finished after 9 hours, and the removal of the temp snapshot had the VM down for over 40 minutes.
Not sure if this is the entire root cause of the behavior on Fri, as the power outage occurred at 5PM-8PM. No backups should have been running at this time.
ziceman
ASKER
It is now running on an original snapshot from April 03, 2019
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Umm, not ideal. Remove the snapshot.
It could be very large! Not ideal for a critical VM!
ziceman
ASKER
Agreed. Why is there only 1 snapshot in use from way back then? And it seemed to revert back to it automatically when the temp snapshot from the backup was finally deleted.
David Johnson, CD
You still need to merge that old snapshot. This may take a long time. Surprised no one noticed this for over 2 years.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Unknown. It's possible it's stuck or too large; we really need to see a folder/directory screenshot.
Repeat the above procedure and you'll soon know if there is an issue.
ziceman
ASKER
It was a huge miss indeed, and now we have a huge (literally) mess. I never manually created any snapshots, so it was either done by one of my colleagues or by a backup program that did not clean up afterwards - a situation exactly as described in the 9-year-old article. It should not have been missed. It is not a good situation at all, so I am greatly in need of some insight and recommendations on how best to avoid a worst-case scenario.
When the backup program last completed overnight, removing the temp snapshot took nearly an hour, and the VM was not accessible during this time frame. The other VM was OK, but this main one seemed dead in the water. The host disk I/O did not seem to go crazy and there was not much CPU or RAM utilization, but removal of the backup temp snapshot hung the VM for pretty much the whole hour. The storage is an 8-disk SATA array - Free: 1.47 TB, Used: 2.16 TB (60%), Capacity: 3.64 TB.
Here are the other hardware specs: Dell PowerEdge R620, 12 CPUs x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 47.96 GB RAM.
The VM: 12 vCPUs, 16 GB memory, Hard disk 1: 250 GB, Hard disk 2: 750 GB.
I am very concerned (and it seems rightly so) that removing this one could lock up the VM for days.
We could then restore a backup from the night before to the other host, but that would certainly not be ideal.
Should I do a backup restore to the other host this evening to potentially be ready? We cannot have the VM down for days, and I do not want to follow up a terrible oversight with a worse mistake. Trying to figure out the right next steps.
Here is the directory screenshot. As expected, the snapshots are enormous.
David Johnson, CD
You could restore to the other host. Hopefully it only has the VM disks and not the snapshot, but it will probably have the avhd, so it would be the same.
ziceman
ASKER
In the end, I decided to let Veeam do its thing. Since a little bit of downtime was going to be acceptable (just not many hours or potentially days), I shut down the VM and started a manual backup. Once completed, I restored it to the other VMware host. When that finished, I checked out the particulars. The disk files looked great and there was no snapshot. Powered the sucker up without network connectivity and took some time to look around. All seemed to be good, so I restarted with the network connected. As of this AM, the customer is saying the system is operating fine. The entire process took about 5 hours, from 11:30PM to 4AM, and at least there was a meaningful progress indicator. The most horrifying aspect of the other large-snapshot-delete nightmare stories was the "not knowing".
Thanks so much for all the recommendations and insight. I dodged a bullet here. Need to now upgrade the ESXi version ASAP, install vCenter, and make sure no other client hosts are unknowingly running on a snapshot.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
You can also set Alarms to warn you of Snapshots using vCenter Server, and it also records performance statistics to a database for more than 24 hours!
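For anyone without vCenter yet, the idea behind such an alarm can be sketched in a few lines of plain Python: flag any snapshot older than a chosen age. This is not vCenter's alarm mechanism, just an illustration of the rule. The inventory list, the VM and snapshot names, and the 3-day threshold are assumptions for the example; only the April 03, 2019 snapshot date and the 8/22/2022 thread date come from this thread.

```python
from datetime import date


def snapshot_age_days(created, today):
    """Age of a snapshot in whole days."""
    return (today - created).days


def stale_snapshots(snapshots, today, max_age_days=3):
    """Return (vm, snapshot, age_in_days) for snapshots older than max_age_days.

    `snapshots` is a list of (vm_name, snapshot_name, created_date) tuples.
    The 3-day default is an arbitrary example threshold, not a VMware setting.
    """
    return [
        (vm, name, snapshot_age_days(created, today))
        for vm, name, created in snapshots
        if snapshot_age_days(created, today) > max_age_days
    ]


# Hypothetical inventory; in practice this would come from whatever tool lists snapshots.
inventory = [
    ("critical-server", "old snapshot", date(2019, 4, 3)),
    ("other-vm", "pre-patch", date(2022, 8, 21)),
]

report = stale_snapshots(inventory, today=date(2022, 8, 22))
for vm, name, age in report:
    print(f"{vm}: snapshot '{name}' is {age} days old")
```

Run daily against a real snapshot list, a check like this would have raised a warning years before the snapshot chain grew to its current size.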
ziceman
ASKER
Greatly appreciate both the extensive knowledge and responsiveness with VMware. Your reliable presence and participation on this topic is perhaps the sole reason I have maintained the EE subscription.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)