I am having an issue with VDR. On several of my VMs, the backup fails with:
Failed to create snapshot for Saints DC, error -3941 ( create snapshot failed).
Points to note:
1. VDR has been rebooted several times
2. Disks have been formatted with 1MB blocks, but backups fail on VMs/disks smaller than 256GB
3. Manually snapshotting a problem VM without memory but with quiesce ticked also fails
4. Backups work on "simple" VMs: standalone servers with no additional disks, etc.
5. I don't know if this is a coincidence, but all the VMs it is failing on have multiple disks. It fails even if I select only the OS disk for backup.
6. Backups fail whether the destination is a locally mounted disk or a network share
VMware
Last Comment
hongedit
8/22/2022 - Mon
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Okay, the issue is the Snapshot failure. If the snapshot cannot proceed correctly, the vDR Backup job will fail.
hongedit
ASKER
Ok. So what should I be looking at?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
What you have tested by snapshotting the machine with Memory unticked but Quiesce ticked is the same snapshot function that vDR triggers. If the snapshot cannot quiesce the virtual machine, the vDR job will fail.
The issue may reside with VMware Tools in the virtual machine.
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
adding a cdrom iso to the virtual machine.
Yes, you can restart Network Management Agents on the host, the host will disconnect from vCenter for a few moments, but the VMs will still function on the ESX host.
hongedit
ASKER
Hmm, see this in log - this was a manual snapshot I tried which failed:
May 18 19:37:05.229: vmx| SnapshotVMXTakeSnapshotComplete done with snapshot 'Test': 0
May 18 19:37:05.229: vmx| Msg_Reset:
May 18 19:37:05.229: vmx| [msg.checkpoint.save.fail2.std3] An error occurred while saving the snapshot:
May 18 19:37:05.229: vmx| The destination file system does not support large files.----------------------------------------
May 18 19:37:05.229: vmx| Vix: [6062 vmxCommands.c:2532]: VMAutomationCreateSnapshotCallback: Got CreateSnapshot callback, snapshotErr = The destination file system does not support large files (5:C), UID = 0
May 18 19:37:05.877: vcpu-0| HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/4dc86dbf-b7058470-1d9b-001b217f910d/Saints DC/Saints DC.vmdk'
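For anyone digging through a long vmware.log, the line that matters above is the one carrying snapshotErr. A minimal sketch, assuming the log lines keep the same layout as the excerpt above (the regular expression and the function name are my own, not anything shipped by VMware):

```python
import re

# Matches the "snapshotErr = <message> (<code>)" part of a Vix callback line.
# The pattern is an assumption based only on the log excerpt in this thread.
SNAPSHOT_ERR = re.compile(
    r"snapshotErr = (?P<msg>.+?) \((?P<code>[0-9A-Za-z]+:[0-9A-Za-z]+)\)"
)

def find_snapshot_errors(log_text):
    """Return a list of (message, code) tuples for each snapshotErr found."""
    return [(m.group("msg"), m.group("code")) for m in SNAPSHOT_ERR.finditer(log_text)]

sample = (
    "May 18 19:37:05.229: vmx| Vix: [6062 vmxCommands.c:2532]: "
    "VMAutomationCreateSnapshotCallback: Got CreateSnapshot callback, "
    "snapshotErr = The destination file system does not support large files (5:C), UID = 0"
)

for message, code in find_snapshot_errors(sample):
    print(code, message)
```

Running it over the whole log file rather than a single line should list every failed snapshot attempt in one go.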
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Are the VMDKs over 256GB?
hongedit
ASKER
Thanks.
1. There is plenty of space on the datastore for snapshots - as I said, 149GB of which 107GB is free
The block size is 1MB, which allows files up to 256GB - I am within these limits.
Should I still explore the option of relocating the snapshot files?
hongedit
ASKER
I don't even see how one could run into this scenario - if the datastore is formatted with a 1MB block size, VMware will not even let you create a VMDK bigger than 256GB anyway.
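One way this scenario could still arise, offered as an assumption and not something confirmed in this thread: snapshot delta files are written to the VM's working directory, and a delta can grow as large as its parent disk. If a VM has a second disk on a datastore with a larger block size, the datastore holding the snapshot files may be unable to hold a delta for it. A rough sketch of that check, using the nominal VMFS-3 limits (the real limits are slightly lower) and made-up disk names and sizes:

```python
# Nominal VMFS-3 maximum file size: block size in MB -> limit in GB.
VMFS3_MAX_FILE_GB = {1: 256, 2: 512, 4: 1024, 8: 2048}

def disks_over_limit(disk_sizes_gb, working_dir_block_mb):
    """Return the disks whose delta file could not fit on the datastore
    that holds the snapshot files (the VM's working directory)."""
    limit = VMFS3_MAX_FILE_GB[working_dir_block_mb]
    return {name: size for name, size in disk_sizes_gb.items() if size >= limit}

# Hypothetical VM: OS disk on the 1MB-block datastore, data disk elsewhere.
example_disks = {"os.vmdk": 40, "data.vmdk": 300}

print(disks_over_limit(example_disks, 1))  # working directory on a 1MB-block datastore
print(disks_over_limit(example_disks, 2))  # working directory on a 2MB-block datastore
```

If the first call returns any disks, moving the working directory to a datastore with a larger block size is the thing to try.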
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
First, I would quickly restart the Network Management Agents from the console of the affected host. This is very quick, and a quick win if it works.
Then change the location of the snapshot file to another datastore.
hongedit
ASKER
Oh, can I also add this vital information.
I have successfully backed up this VM before using VDR, and very little has changed in terms of size growth etc.
So I don't think it is this!
I did notice, though, that one of the VMs that fail has VMNAME-Snapshot#.vmsn files in the datastore, with a very small size (a few KB), although the rest that fail do not have these.
hongedit
ASKER
How do I restart these Management Agents?
Also note that on one host, there is a mix of some that work and some that don't - surely if it was the host, none would work?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
What can happen is that snapshots can hang (in the VM, and on the host server), which gives the "Another task is already in progress" message, and that is also indicative of the -3941 error.
The only way to cure this hung snapshot is to restart the network management agents.
hongedit
ASKER
Ok. How do I do this?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Yes, it's supposed to delete it.
hongedit
ASKER
Thanks.
In that case, when the SAS array is healthy again, I can try creating a separate dedicated snapshot datastore for all VM snapshots. I guess it will only need to be as big as the biggest VMDK + x% for growth?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
It depends on the rate of change of the writes and block changes in the VM, in the backup window.
So if your backups of Exchange run at 9.00am, probably the busiest time of the day, your snapshot will be larger than if you run them at 3.00am.
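As a back-of-the-envelope illustration of that sizing rule, here is a minimal sketch. The change rates, window length and 20% headroom are made-up numbers for illustration only, and the cap at the disk size is an approximation (a delta can grow to roughly the size of its parent disk):

```python
def estimate_snapshot_gb(change_rate_gb_per_hour, window_hours, disk_size_gb, headroom=0.2):
    """Rough delta size: data changed during the backup window plus headroom,
    capped at the size of the parent disk."""
    estimate = change_rate_gb_per_hour * window_hours * (1 + headroom)
    return round(min(estimate, disk_size_gb), 1)

# Hypothetical Exchange VM with a 200GB disk and a 2 hour backup window.
busy = estimate_snapshot_gb(10, 2, disk_size_gb=200)    # backup at a busy time
quiet = estimate_snapshot_gb(0.5, 2, disk_size_gb=200)  # backup in a quiet period
print(busy, quiet)
```

Measure the real rate of change on your own VMs before sizing a snapshot datastore from numbers like these.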
I apologize if I've posted this before for you to read. (I'm now losing track of who I write to with regard to snapshots.)
I also know that you are not creating snapshots yourself, but you should be aware of what they do, and why they are required by Backup Applications. Undeleted snapshots can be potentially very dangerous if you are not aware that a VM is running on a snapshot disk. You can now set up alarms using vCenter and Nagios to monitor snapshots! I've seen many large organisations' VMs fail because their admins were not keeping an eye on snapshots!
A snapshot is a way to preserve a point in time when the VM was running OK before making changes. A snapshot is NOT a way to get a static copy of a VM before making changes. When you take a snapshot of a VM, a delta file gets created and the original VMDK file gets converted to a Read-Only file. There is an active link between the original VMDK file and the new delta file. Anything that gets written to the VM actually gets written to the delta file.
The correct way to use a snapshot is when you want to make some change to a VM, like adding a new app or a patch; something that might damage the guest OS. After you apply the patch or make the change and it’s stable, you should really go into Snapshot Manager and delete the snapshot, which will commit the changes to the original VM, delete the snap, and make the VMDK file RW.
The official stance is that you really shouldn’t have more than one snap at a time and that you should not leave them out there for long periods of time. Adding more snaps and leaving them there a long time degrades the performance of the VM. If the patch or whatever goes badly, or for some reason you need to get back to the original unmodified VM, that’s possible as well.
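To make the delta-file behaviour concrete, here is a toy model in Python. It is only an illustration of the idea described above (read-only base, writes redirected to a delta, delete commits the delta); the class and method names are invented and it has nothing to do with how ESX actually implements VMDK files:

```python
class ToyDisk:
    """Toy model of a disk with at most one snapshot."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # stands in for the original VMDK
        self.delta = None         # stands in for the delta file, if a snapshot exists

    def take_snapshot(self):
        if self.delta is not None:
            raise RuntimeError("this toy model allows only one snapshot")
        self.delta = {}           # base is treated as read-only from now on

    def write(self, block, data):
        target = self.base if self.delta is None else self.delta
        target[block] = data

    def read(self, block):
        if self.delta is not None and block in self.delta:
            return self.delta[block]
        return self.base[block]

    def revert(self):
        self.delta = {}           # throw away everything written since the snapshot

    def delete_snapshot(self):
        self.base.update(self.delta)  # commit the delta into the base disk
        self.delta = None             # base is writable again


disk = ToyDisk({0: "original"})
disk.take_snapshot()
disk.write(0, "patched")
print(disk.read(0), disk.base[0])  # the VM sees the new data, the base is untouched
disk.delete_snapshot()
print(disk.base[0])                # the change is now committed to the base
```

The longer the snapshot stays in place, the more blocks pile up in the delta, which is why old snapshots get large and slow.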
I highly recommend reading these 2 articles on snaps:
I'm finding mixed info on redirecting the default location of snapshot files.
So I need to powerdown the VM, edit the config file with the path.
Then:
Do I need to "unregister" and "reregister" the VM or can I boot it back up?
I must admit unreg/rereg sounds a bit risky considering the problems I have been having lately!
Also, does the new Snapshot Datastore need to be connected to the ESXi Server or VM?
Do I need to redirect the configuration file also?
Getting confused :s
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
The reason you need to Unregister from vCenter is that vCenter locks and caches the VMX files. If you edit the VMX while the VM is registered, the changes will not be saved, because the cached copy will be written back. So you need to Unregister first.
If you don't want to Unregister/Re-register the machine, you can always shut down the machine, select Edit Settings of the VM, then Options, and under Advanced, General, Configuration Parameters, add a new row with the variable below.
variable to add is workingDir
value is "<new_path_location>"
new datastore must be available to the host ESXi server. If it's available to the Server, it's available to the VM.
No need to redirect the config VMX file, just make the modifications in the VMX for the new snapshot location.
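If you go the VMX-editing route, the change is a single key = "value" line. A minimal sketch of making that edit on a copy of the file contents, assuming the usual VMX line format; the function name is my own, and the datastore path and sample VMX text are placeholders, not values from this environment:

```python
import re

def set_working_dir(vmx_text, new_path):
    """Return vmx_text with workingDir set to new_path, replacing an
    existing workingDir line or appending one if there is none."""
    line = f'workingDir = "{new_path}"'
    pattern = re.compile(r"^[ \t]*workingDir[ \t]*=.*$", re.MULTILINE | re.IGNORECASE)
    if pattern.search(vmx_text):
        # A function is used as the replacement so the path is inserted literally.
        return pattern.sub(lambda m: line, vmx_text, count=1)
    separator = "" if vmx_text.endswith("\n") or not vmx_text else "\n"
    return vmx_text + separator + line + "\n"

# Hypothetical VMX fragment and hypothetical snapshot datastore path.
original = 'displayName = "Saints DC"\nworkingDir = "."\n'
updated = set_working_dir(original, "/vmfs/volumes/snapshot-ds/Saints DC")
print(updated)
```

Do this only with the VM powered off, and keep a backup copy of the original VMX before writing anything back.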
Upon looking at the VMX file, the workingDir value was "." which explains why it didn't work. Looks like the GUI is very experimental.
VDR is now running!
The snapshot datastore now has 6 VMDKs, all the same size though. Why is this?
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Glad we've got vDR working again for you.
I prefer to allocate a Snapshot LUN, which has plenty of space, for snapshots, if you have thin provisioned your Virtual Machine LUNs, you don't have to worry about running out of disk space.
The GUI is experimental. (We always edit the VMX directly, because that's the way we've always done it, but I know there are plenty of "Children of the GUI" around that like Windows and GUIs, and don't like command prompts, command lines or vi! Call me an old Geek!)
Changed Block Tracking is enabled, which is how it tracks changed blocks on the disks for faster incrementals. Once the backup is finished, just make sure they disappear. Don't be too worried if they don't disappear immediately, but just keep an eye on them between backups.
Does VDR scheduling just pick a random time in the backup window to start?
Because I just edited the schedule of the VM backup jobs, and they all started at once. (I have a separate backup job for each VM currently. They all have slightly different schedules but most start at 7pm.)
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Yes, vDR scheduling is a bit weird; it triggers when the data source is out of date.
So if they were all "out of date" sources, they would all trigger. Be careful with this, because at present vDR does not look at the performance of the datastore, and it can flatline a datastore with ease!
Vizioncore/Veeam have options to limit the number of backups per LUN to x, to stop this.
How we have configured client sites is to group services into unique backup jobs, stagger them at different times, and use the option to kill a backup job if it overruns into another job.
As it's doing incrementals, after your first backup, jobs should complete quickly each day, because the delta between each backup should be small.
So we specify different hourly windows for services, and most services are on different luns as well.
e.g. different Exchange Stores on different LUNs, different Exchange Servers on different LUNs, SQL etc., and DCs on separate LUNs, so if we have a LUN failure on the SAN, we've spread the services and we don't have a massive service outage.
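To illustrate the staggering idea, here is a small sketch that hands each service group its own hourly window, starting from the 7pm start mentioned earlier. It only plans the windows; vDR itself has no such API that I know of, and the group names and function name are purely illustrative:

```python
from datetime import datetime, timedelta

def stagger_jobs(groups, first_start="19:00", window_minutes=60):
    """Assign each backup group a consecutive, non-overlapping window.
    Returns {group: (start, end)} with times as HH:MM strings."""
    start = datetime.strptime(first_start, "%H:%M")
    schedule = {}
    for index, group in enumerate(groups):
        begin = start + timedelta(minutes=index * window_minutes)
        end = begin + timedelta(minutes=window_minutes)
        schedule[group] = (begin.strftime("%H:%M"), end.strftime("%H:%M"))
    return schedule

# Hypothetical service groups, each assumed to sit on its own LUN.
for group, (begin, end) in stagger_jobs(["Exchange", "SQL", "DCs"]).items():
    print(f"{group}: {begin} - {end}")
```

The resulting windows would then be entered by hand into each backup job's schedule, so that no two jobs hit the same datastore at once.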