ziceman asked:

VMware 6.0 U3 host: VM server would not start, host would not go into Maintenance Mode

Have a critical Win 2016 Server VM that will not start. Others on this host were running OK. It got to 35%, then seemed to just stop. Shut down the VMs and tried to go into Maintenance Mode. This failed as well. Finally just rebooted the host and tried again. There was again a problem with this one VM. After waiting nearly 30-40 minutes, the VM crept up to 40%, then 70%, and finally came up. After dealing with the usual Win2016 Update delays / problems (another hour), the server VM seems to be OK.

I am concerned about how and why this happened. Cannot find any obvious indicators or errors in the events or VMware monitoring. Also, the iDRAC on the Dell shows no clear hardware issues.

How best to troubleshoot?

Thanks!


rindi:

The only thing I can think of is that you are low on resources, either on your Physical Server, or on your VM. Maybe you need to assign it more RAM, add more disks to your arrays and enlarge the Virtual disk of your VM etc.
ziceman (ASKER):

Not a resource thing. The host is barely breaking a sweat from the standpoint of CPU / RAM / Storage. 
That may be true of normal operation. But if your VM had a problem, with HD integrity for example, and had to run a chkdsk during bootup because of that while your other VMs were using the same physical disks, this would slow the chkdsk down. It can then take ages to finish, depending on the sizes of the partitions and the RAID you are using.

So, for example, it is better to create several RAID arrays with more disks, and then distribute these arrays evenly among your VMs. That should improve the disk performance of your VMs.
ziceman (ASKER):

I am thinking this was more than a "performance" issue. When there is a VM boot issue or forced disk check, the VM console will show this to be the case. One can see the POST screen and the OS boot problems. If there was a forced chkdsk, this would be visible. The console showed absolutely nothing for this VM, and this condition lasted nearly an hour. Also, the failure to go into Maintenance Mode with "General Error" is concerning.
What are your disk queues on the host? What are the configured memory and vCPU for the VM?
Troubleshooting is performed at the ESXi shell.

What is the make and model of the server ?

Also as you are probably aware by now 6.0 U3 is now end of life.

The answer will be in /var/log/vmkernel.log

but if you have rebooted the host, unless you kept the logs, or have them sent to a syslog server, the logs may be gone.

Is the VM running on a snapshot?
ziceman (ASKER):

Hi Andrew, yes, I was thinking that the logs might be lost at this point. Would have taken some time to look through them or copy them off, but was not entirely sure where to go.

Understood regarding 6.0 u3. We tend to lag behind as much of the enterprise DELL hardware is provided by a refurbisher.

Have you ever seen this behavior where a VM seems to crawl to life with no visual feedback in the console - then, afterward, all is well?
"Have you ever seen this behavior where a VM seems to crawl to life with no visual feedback in the console - then, afterward, all is well?"

Yes.
ziceman (ASKER):

OK. My primary goal here is to try to anticipate and be proactive, but I cannot find any indication of problems. The logs now look clean (of course), and DELL IDRAC is all green and healthy. Resource utilization is nominal and overall performance seems fine.  But *something* happened, and I would like to do more sleuthing - however it might be achieved. 
Make sure you set up a syslog server, so that all the logs are kept off the ESXi host.

The vmware.log for the VM will still be present.

You never answered the question: was the VM running on a snapshot at the time, or is it still running on a snapshot?

How do you back up?

Are you using FREE ESXi or Licensed with Support?

We still do not know the make and model of the host server, or the build of ESXi 6.0.

So there are lots of unanswered questions which may help the situation if you want to be "proactive". To be honest and frank, running a "Critical" Guest OS on an unsupported platform is a risk,

which you may want to escalate higher up the food chain.
ziceman (ASKER):

Looks like it is running on a snapshot, but I am not seeing other snapshots listed.

It is backed up with both Veeam and Altaro

The license is commercial with paid support. 6.0.0 Update 3 (Build 10719132) running on a Dell PowerEdge R620.

That is definitely the cause of a PERFORMANCE issue!

"Looks like it is running on a snapshot, but I am not seeing other snapshots listed."

That's normal, when a third party backup program cannot remove the snapshot for whatever reason, which is why you should check after every backup!
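
A check like that can be partly automated. The following is only a minimal sketch in Python, assuming the usual VMware naming for snapshot disks ("<name>-00000N.vmdk", with "-delta.vmdk" or "-sesparse.vmdk" extents). The file listing is hypothetical, not the asker's actual folder, and the function name is made up for illustration.

```python
import re

# Assumption: snapshot (delta) disks follow the usual "<name>-00000N.vmdk"
# naming, with their data in "-delta.vmdk" or "-sesparse.vmdk" extents.
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta|-sesparse)?\.vmdk$", re.IGNORECASE)


def find_snapshot_disks(filenames):
    """Return the file names that look like snapshot delta disks."""
    return sorted(name for name in filenames if SNAPSHOT_DISK.search(name))


# Hypothetical VM folder listing, for illustration only.
listing = [
    "server01.vmx",
    "server01.vmdk",
    "server01-flat.vmdk",
    "server01-000001.vmdk",
    "server01-000001-delta.vmdk",
    "server01_1.vmdk",
    "server01_1-000001.vmdk",
    "vmware.log",
]

leftovers = find_snapshot_disks(listing)
if leftovers:
    print("VM appears to be running on a snapshot:", leftovers)
```

An empty result after a backup window suggests the backup tool cleaned up after itself; any match is worth a look in the datastore browser.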

How to get rid of it:

1. Take a new snapshot (it does not have to have memory ticked!)

2. Wait 120 seconds for the Guest OS to stabilise.

3. DELETE ALL.

4. WAIT and BE PATIENT. It could take seconds, minutes, days or weeks, depending on how many snapshots there are, how large they are, and how slow or fast the existing storage is.

But do not cancel, do not abort, do not shut down, and do not restart the host. Any meddling can cause snapshot corruption and data loss.

You may also experience more performance issues during the merge and deletion.

I wrote this Article for EE, 9 years ago and it is still relevant today

HOW TO: VMware Snapshots :- Be Patient

Whilst the snapshot is deleting/merging, I would recommend reading it. Do not be tempted to watch the progress; it can take a while. Grab a coffee, go for a walk!

It can appear to hang, but it will complete if you just leave it to complete. Remember No fiddling!

Your ESXi build is 7 patches behind, and 2.5 years out of date compared to the last build available, I would also recommend updating.
ziceman (ASKER):

OK. I think a big part of the problem has been found. The backups had been running in Veeam, and it was decided to take a look at Altaro. The jobs were to be scheduled to not overlap, but at some point recently - they did. So, it seems that temp snapshots created by the backup software were conflicting.

The last backup finished after 9 hours, and the removal of the temp snapshot had the VM down for over 40 minutes.

Not sure if this is the entire root cause for the behavior on Fri, as the power outage occurred at 5PM-8PM. No backups should have been running at this time. 
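
To catch overlapping jobs before they happen, the two schedules can be compared as time windows. This is a rough sketch in Python: only the 9-hour run time comes from the thread, while the start times, dates, and the second job's duration are hypothetical placeholders.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, duration_a, start_b, duration_b):
    """True if two job windows (start plus duration) share any time."""
    end_a = start_a + duration_a
    end_b = start_b + duration_b
    return start_a < end_b and start_b < end_a


# Hypothetical schedule: a 20:00 job that runs 9 hours (the run time
# mentioned above) and a second job starting at 02:00 the next morning.
first_start = datetime(2021, 1, 1, 20, 0)
second_start = datetime(2021, 1, 2, 2, 0)

clash = windows_overlap(
    first_start, timedelta(hours=9),
    second_start, timedelta(hours=2),
)
print("Backup windows overlap:", clash)
```

Using the longest observed run time rather than the expected one is the safer input here, since a slow night is exactly when the jobs collide.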
ziceman (ASKER):

It is now running an original snapshot from April 03, 2019

Umm, not ideal. Remove the snapshot.

It could be very large! Not ideal for a critical VM!

Need to improve VMware Administration!
ziceman (ASKER):

Agreed. Why is there only 1 snapshot in use from way back then? It also seemed to revert back to it automatically when the temp snapshot from the backup was finally deleted.
You still need to merge that old snapshot. This may take a long time. Surprised no one noticed this for over 2 years.
Unknown. It's possible it's stuck or too large; we really need a folder/directory screenshot.

Repeat the above procedure you'll soon know if there is an issue.
ziceman (ASKER):

It was a huge miss indeed, and now we have a huge, literal mess. I never manually created any snapshots, so it was either done by one of my colleagues or by a backup program that did not clean up afterwards - a situation exactly as described in the 9-year-old article. Should not have been missed. It is not a good situation at all, so I am greatly in need of some insight and recommendations on how best to avoid a worst-case scenario.

When the backup program last completed overnight, removing the temp snapshot took nearly an hour - and the VM was not accessible during this time frame. The other VM was OK, but this main one seemed dead in the water. The host disk I/O did not seem to go crazy and there was not much CPU or RAM utilization, but removal of the backup temp snapshot hung the VM pretty much the whole hour. The storage is an 8-disk SATA array:
   FREE: 1.47 TB
   USED: 2.16 TB (60%)
   CAPACITY: 3.64 TB

Here are the other hardware specs:
   Dell PowerEdge R620
   12 CPUs x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
   47.96 GB RAM
The VM:

CPU   12 vCPUs
Memory   16 GB
Hard disk 1   250 GB
Hard disk 2   750 GB

I am very concerned (and it seems rightly so) that removing this one could lock up the VM for days.

We could then restore a backup from the night before to the other host, but that would certainly not be ideal.

Should I do a backup restore to the other host this evening to potentially be ready? We cannot have the VM down for days, and I do not want to follow up a terrible oversight with a worse mistake. Trying to figure out the right next steps.

Here is the directory screenshot. As expected, the snapshots are enormous.
User generated image


You could restore to the other host. Hopefully it only has the VM disks and not the snapshot, but it will probably have the avhd, so it would be the same.
ziceman (ASKER):

Is there no escape?
You've got the following choices:

1. Delete the snapshot and wait. You could do it powered OFF.

2. Or you could use CLONE. Make sure the VM is powered off.

Cloning will give you a new VM with zero snapshots. After you have checked that everything is working, delete the original.
Your biggest issue is that your storage system is poor because it uses SATA! There are not many IOPS on a SATA datastore.
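
Before cloning onto the same datastore, it is worth confirming there is room for a full second copy. A back-of-the-envelope sketch in Python, using the free space and disk sizes quoted earlier in the thread. It assumes thick-provisioned disks, binary units, and a 10% safety margin, none of which are confirmed by the thread, and it ignores swap and log files.

```python
GB_PER_TB = 1024  # assumption: the client reports binary units


def clone_fits(free_tb, disk_sizes_gb, headroom=0.10):
    """True if a full copy of the disks fits while keeping `headroom` of the free space spare."""
    needed_gb = sum(disk_sizes_gb)
    usable_gb = free_tb * GB_PER_TB * (1 - headroom)
    return needed_gb <= usable_gb


# Figures from the thread: 1.47 TB free, disks of 250 GB and 750 GB.
fits = clone_fits(1.47, [250, 750])
print("Clone fits on the datastore:", fits)
```

With these numbers the copy needs 1000 GB against roughly 1355 GB of usable free space, so it would fit on paper; the slow SATA array remains the real constraint on how long it takes.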
ziceman (ASKER):

If the VM is powered off, would the deletion be potentially quicker?

Same question regarding the CLONE process?

ziceman (ASKER):

Because this is a small, 2-host setup running Server Essentials, vCenter is not available.

I came across this - https://interworks.com/blog/ijahanshahi/2013/06/20/vmware-performing-safe-removal-large-snapshots/

Any legitimacy or merit within?
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow / British Beekeeper)

ziceman (ASKER):

In the end, I decided to let Veeam do its thing. Since a little bit of downtime was going to be acceptable (just not many hours or potentially days), I shut down the VM and started a manual backup. Once completed, I restored it to the other VMware host. When that finished, I checked out the particulars. The disk files looked great and there was no snapshot. Powered the sucker up without network connectivity and took some time to look around. All seemed to be good, so I restarted with the network connected. As of this AM, the customer is saying the system is operating fine.
The entire process took about 5 hours - from 11:30PM to 4AM, and at least there was a meaningful progress indicator. The most horrifying aspect of the other large snapshot delete nightmare stories was the "not knowing".

Thanks so much for all the recommendations and insight. I dodged a bullet here. Need to now upgrade the ESXi version ASAP, install vCenter, and make sure no other client hosts are unknowingly running on a snapshot.
You can also set Alarms to warn you of Snapshots using vCenter Server, and it also records performance statistics to a database for more than 24 hours!
ziceman (ASKER):

Thanks, Andrew. Noted.

Greatly appreciate both the extensive knowledge and responsiveness with VMware. Your reliable presence and participation on this topic are perhaps the sole reason I have maintained the EE subscription.