Removing old snapshots created when virtual disks were in Independent non-persistent mode

gopher_49
I have a VM that reboots whenever I try to create a snapshot.  It started when the VM had disks in Independent Non-persistent mode, which was enabled by accident.  Since then I've changed the drives to Normal mode; however, whenever I try to create a snapshot using VEEAM or any other third-party backup application, the snapshot creation fails and then the VM reboots.  VEEAM tech support says I need to call VMware and have them consolidate these snapshots.  Does anyone have any other suggestions?
Top Expert 2009

Commented:
Consolidate snapshots? Were there snapshots?
Can you screenshot the VM directory?
Top Expert 2010

Commented:
Do you need the snapshots? Keeping snapshots for an extended period of time is *not* a good idea; besides potentially corrupting the VM, it takes up disk space on the datastore. You can easily check if you have snapshots by right-clicking on the VM -> Snapshots -> Snapshot Manager. If you have a 'tree' of snapshots, you'll see it here. You can then click on the 'Delete All' button and commit all the data in the tree of snapshots to the parent disk, which will then also remove the tree of snapshots.

If you could provide a screenshot of the Snapshot Manager of the VM within vSphere Client, that would give us a better idea.

Regards,
~coolsport00

Author

Commented:
Attached is a snapshot of the VM's datastore.
backend1-VM-datastore-folder.jpg
Top Expert 2010

Commented:
Yep...lot of snapshots :)  Do you need them "gopher"?

Author

Commented:
I don't need these snapshots..  At this point all I care about is the Exchange db.  I can always restore the Exchange db to another server if something really bad happened..  My goal is to be able to simply take snapshots of this server via VEEAM.  Currently when trying it reboots the VM and I get a series of errors...

In the snapshot manager I don't see any way to delete these.  How can I manually delete these snapshots?
Top Expert 2010

Commented:
Can you please post a screenshot of the Snapshot Manager for this VM?
Thx.

Author

Commented:
attached is the snapshot manager of backend1.
backend1-snapshot-manager.jpg
Top Expert 2010

Commented:
Ok...just as you said...nothing to delete in the GUI. OK...so, we'll have to do this the hard way (to me...me no likey cmd line) :)

OK...so, according to this guide:
http://www.esxguide.com/esx/content/view/2/25/

If you use the cmd "removesnapshots", this will commit the data to the parent disk. You said you don't 'need' the data in the snaps, but it won't hurt anything to be careful. So, the command would be something like this:
[root@esxhost root]# vmware-cmd /vmfs/volumes/44ebf538-51cc7998-2525-00145e1b556a/backend1/backend1.vmx removesnapshots
...where that string of numbers is the VMFS volume name ESX assigns to the datastore this VM is on.

You can use WinSCP to 'see' the vmfs volume directory your VM is on (download it here: http://winscp.net/eng/download.php). From this windows-based SSH GUI, you see the directory you'll need to type in the command line to remove the snapshots.

~coolsport00

Author

Commented:
I'm running ESXi and I have console access via my HP RILO board.  Can I still use these commands on ESXi?
Top Expert 2010

Commented:
They should work, but you need to enable SSH. See:
http://www.yellow-bricks.com/2008/08/10/howto-esxi-and-ssh/

If this isn't clear enough (I had to browse a few the 1st time I did it), you can search for "Enable SSH in ESXi" and it will bring up a host of URLs you can browse to help. When SSH is enabled, you can then use WinSCP and also the cmd line commands. Some commands are different between ESX/ESXi, but snapshots are common to both server types, so those commands should work.

~coolsport00

Author

Commented:
SSH is enabled, for VMware enabled it some time ago...  I'll probably do this on the weekend to be safe, don't you agree, or is it pretty hard to mess up?
Top Expert 2010

Commented:
Nah...it's just a simple command is all. Just log in to the host (unsupported), and run:
vmware-cmd /vmfs/volumes/44ebf538-51cc7998-2525-00145e1b556a/backend1/backend1.vmx removesnapshots

Nothing happens to the VM except snapshots that are associated with it are committed/deleted. Fairly seamless process really. But, before doing anything, always make a copy of your VM (just in case).

~coolsport00
Top Expert 2010

Commented:
And don't forget...the "44ebf538-51cc7998-2525-00145e1b556a" number is not gonna be the same as what's on your ESXi host; you'll need to change it in the cmd above to reflect the file path of your VM, but everything else should be the same.

Author

Commented:
ok...  I think I understand the command... Seems really simple.. Now... To copy my VM files will I need to power down the VM?
Top Expert 2010

Commented:
Yes; you can use Veeam FastSCP (http://www.veeam.com/vmware-esxi-fastscp.html) to copy it to a local drive if you'd like, or a USB drive connected to your workstation.

Author

Commented:
ok...  this weekend I'll power down the VM and try this..  So, should I see the command line interface with winSCP?  I only see the file system.  
Top Expert 2010

Commented:
Nope, not in WinSCP. That is just for the file path...to give you a better picture of the full path to your VM file. You'll have to use your ILO (or be at the console directly) and go into 'unsupported' mode and use the cmd above.

~coolsport00

Author

Commented:
I'm at the console via SSH Putty...  SSH is enabled, however, I don't see the correct prompt..  I only see:

~ #

I already logged in, however, I don't see a prompt like mentioned in the links you sent me..  I thought I would see [root@vmhost1]#
Top Expert 2010

Commented:
No...that'll be a bit different. You're in the right area :)

~coolsport00
Top Expert 2010

Commented:
What you can do is create a small VM (if you have 10-12GB of disk space to play with). Create a snapshot in Snapshot Mgr, add a .txt file to the desktop, create another snapshot, add a 2nd .txt file to the desktop. Then, use the cmd I gave you above to remove the snapshots via cmd line and see what the behavior is.

~coolsport00

Author

Commented:
ok... so..  I'm at the ~ # prompt, however, it won't take the command:

vmware-cmd -l

I think I'm supposed to go into the shell a little further...
Top Expert 2010

Commented:
ah...actually, that is the problem (forgot). ESXi uses something different. Let me find out the cmd for you...sorry "gopher". I helped someone a couple wks ago with ESXi cmd and I had to tweak it a bit... :)

~coolsport00

Author

Commented:
no worries.. I think I found the prefix.. It starts with esxcfg-xxxx.  I'll wait to see what you find.
Top Expert 2010

Commented:
Well, according to VMware (http://kb.vmware.com/kb/1002310), the "vmware-cmd" command is supposed to work in ESXi, but I explicitly remember it not working, as indeed you're seeing. Try your esxcfg and let me know (I'm still gonna 'look' around)...

Author

Commented:
To be honest I think I'm going to try these commands on a test machine I have at my office.. I'm not feeling too comfortable messing with their command line interface!  lol!  I have a test machine that is not on the network.  I'll try what I've found and whatever you find and we'll see what happens.  This will allow me to get a game plan together for when I do the real thing.
Top Expert 2010

Commented:
Oh..thought you created a small test box. Yeah...dont test on your prod VM :)  Ok...I'll do a bit more research to see what I come up with and post...

~coolsport00
Top Expert 2010

Commented:
Since I'm not at my work, what is the specific error displayed when you run that cmd? Invalid parameter or parameter not found?

Author

Commented:
I think it said parameter not found...  I'll have to try on a test ESXi host and test VM for I'm not comfortable with the ESXi console...  I'll try it first thing in the morning.
Top Expert 2010

Commented:
Looks like you need the "vim-cmd" command:
http://communities.vmware.com/thread/210801
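
For anyone following along, here is a minimal sketch of how the vim-cmd route might look from the ESXi Tech Support Mode shell. It assumes that `vim-cmd vmsvc/getallvms` prints a header row followed by one row per VM, with the Vmid in the first column and the name in the second, and that the VM name contains no spaces. `vmid_for_name` is a hypothetical helper written for this post, not something that ships with ESXi.

```shell
#!/bin/sh
# Hypothetical helper: read `vim-cmd vmsvc/getallvms` output on stdin and
# print the Vmid of the VM whose Name column equals $1.
# Assumes column 1 is Vmid, column 2 is Name, and the name has no spaces.
vmid_for_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1; exit }'
}

# Usage on the ESXi host (nothing below is executed by this sketch):
#   id=$(vim-cmd vmsvc/getallvms | vmid_for_name backend1)
#   vim-cmd vmsvc/snapshot.removeall "$id"   # commit and remove all snapshots
```

The lookup is kept separate from the removal so the Vmid can be checked by eye before anything is committed.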

~coolsport00
Top Expert 2009

Commented:
Nice try, Mr cool, but please take note: if the snapshots are out of sync or missing in the GUI, you can't simply run that command. It won't work; to ESX, the VM appears to have no snapshots at all, so what is there to remove?

As stated by the Veeam tech, the proper way is to consolidate them using the vmkfstools -i command.

For detailed steps, refer to http://kb.vmware.com/kb/1007849

-vExpert 2010
Top Expert 2010

Commented:
Well, I was going by VMware's KB above that the command *can* be run when snaps don't appear in the GUI (http://kb.vmware.com/kb/1002310).

Thanks for the additional info/KB! :)

~coolsport00

Author

Commented:
ryder0707,

This seems like a pretty in-depth process.  I'm thinking I might get assistance from VMware on this one.  It seems like I'll need to power down the VM and do it on the weekend...  Of course I'll back up the entire VM folder first.

Author

Commented:
Do I only do step 3 or all of the steps?  It seems that I have to do all of the steps.
Top Expert 2009

Commented:
Mr cool, yeah, you can run it if you want, but what's the point if it won't fix anything?

Gopher, yeah...do all, don't skip
if you are not sure...just watch the video...you will do fine
the same KB article has helped many EE users before
it may take a while if you are not familiar with unix-like commands...patience is a virtue

-vExpert 2010

Author

Commented:
ok...  I'll power down the VM this weekend.  Backup the VM files and then give it a shot.

I'll update the ticket this weekend, for there's no way I'm going to mess with a production server on the weekdays.  Currently I'm backing up the Exchange db at the db level with CA BrightStor's Exchange agent.. so... The important data is being backed up every evening.
and Knowledge is power :)

Author

Commented:
how will my commands be different since I'm using ESXi and not ESX?
Top Expert 2009

Commented:
same...try it

-vExpert 2010
The esxcfg-* commands which pertain to ESX and the vicfg-* commands which pertain to ESXi are one and the same.
They differ only in the way you run them.

Author

Commented:
ok.. well.. In the video tutorial the presenter uses the command: vmware-cmd -l   What would I type when connected to ESXi?  Also, my prompt shows ~ # which is different from the ESX prompt.

Author

Commented:
I guess what I am getting at is that vmware-cmd -l does not work with ESXi.

Author

Commented:
attached is my console snapshot.  I need to get deeper into the shell for it's not taking my commands...
console-snapshot.jpg
Top Expert 2009

Commented:
Oooh...yes, not all commands are available; it's like a mini console. Anyway, that command just lists the VMs registered on the host and the path to each .vmx file...you can just browse the vmfs directory

/vmfs/volumes/<UUID>/<VMDIR>/<VMNAME>.vmx
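
If browsing by hand is tedious, a small sketch like the one below could locate the .vmx file instead. `find_vmx` is a hypothetical helper, and it assumes the busybox `find` on the host supports `-name`; the datastore-name entries under /vmfs/volumes are symlinks to the UUID directories, and `find` does not follow them by default, so each file should be reported once under its UUID path.

```shell
#!/bin/sh
# Hypothetical helper: print the path of every <name>.vmx found under a
# volumes root. $1 = VM name, $2 = root to search (default /vmfs/volumes).
find_vmx() {
    find "${2:-/vmfs/volumes}" -name "$1.vmx" 2>/dev/null
}

# Usage on the host (nothing below is executed by this sketch):
#   find_vmx backend1
```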

-vExpert 2010

Author

Commented:
ok.. I used WinSCP to access the file system...  I'll try a few other commands that are safe...
Please note that using ssh to access the ESXi Host might void warranty.
The recommended way is to use the RemoteCLI or PowerCLI or vMA to access the ESXi Host.
I assume you are using ESXi 4.0.

Author

Commented:
I'm using ESXi v4.  I have direct access to the console via my RILO board on my HP server.  Should I use the RILO remote connection so I'm directly accessing the console?

Author

Commented:
the video tutorial mentioned committing snapshots.  Does this mean it will revert back to the snapshot?  My goal is to consolidate the snapshots and then hopefully get rid of them.  I do not want to go back to the snapshot.  My goal is to get rid of snapshots so my VEEAM backups will work.

Author

Commented:
It seems that this tutorial commits the snapshots and uses them.. I don't want to use these snapshots.  I simply want to discard them so I can back up from VEEAM in the future.  My VM currently powers on and functions. I just can't take snapshots.
Top Expert 2010

Commented:
Hi gopher_49. After working with you on the independent non-persistent problem I am pretty sure you are mainly interested in getting VEEAM to work and not losing your Exchange db. If that is the case, as a last resort (and I mean last resort, assuming the other experts' methods for consolidating the snapshots don't work) you can manually get to where you need to be with the following steps...

1. Stop mail flow to exchange on your edge server
2. Perform full backup of your exchange database
3. Shut down the server
4. Remove the backend1 server from inventory (DO NOT DELETE FROM DISK)
5. From ssh cd to the /vmfs/volumes/..../backend1
6. vi backend1.vmx

7. Here is where it gets a little complicated - we will be editing the entries for scsi0:0, scsi0:1, and scsi0:2. They will have the form scsi0:0.fileName = "backend1.vmdk-delta.REDO_......vmdk" pointing to your last snapshotted image. These have the date 7/6/2010 in the datastore browser image you posted earlier in this thread. You will change each of the three to point to the base volume:
scsi0:0.fileName = "backend1.vmdk"
scsi0:1.fileName = "backend1_1.vmdk"
scsi0:2.fileName = "backend1_2.vmdk"
This will revert your disk contents to date 7/3/2010 which is where we were when they were all set to independent non-persistent.

ALL DATA WILL BE LOST SINCE THAT DATE IF YOU USE THIS METHOD.

But this is the point where we were when we started the previous issue.
At this point you can, via your ssh session:
rm backend1.vmdk-ctk*
rm backend1.vmdk-REDO*
rm backend1.vmdk-delta*
rm backend1.vmdk-0000*

do the rm for the backend1_1 and backend1_2 extra vmdk files; at the end the only files with a .vmdk extension you should have are:
backend1.vmdk
backend1_1.vmdk
backend1_2.vmdk

8. Import the backend1 vmx file into inventory
9. Edit settings and double check that independent should not be checked on your hard disks
10. Bring up the server
11. Restore exchange
12. Test the veeam backup will work

It sounds convoluted, I know. If you can post your backend1.vmx file I will double check that I have the scsi0:n numbers correct. This is pretty much a hack but (again, only if the other methods the experts have recommended to consolidate snapshots don't work) it will get you where you need to be.
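
Before editing anything in step 7, it may help to print what the .vmx currently points at. The sketch below is only an illustration: `vmx_disk_files` and `looks_like_snapshot` are hypothetical helpers, and the naming patterns in `looks_like_snapshot` (a six-digit suffix such as name-000001.vmdk, or REDO/delta in the name) are assumptions based on the files discussed in this thread, not an exhaustive rule.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <file>" for every scsiN:M.fileName
# entry in the .vmx file given as $1.
vmx_disk_files() {
    sed -n 's/^\(scsi[0-9]*:[0-9]*\)\.fileName *= *"\(.*\)" *$/\1 \2/p' "$1"
}

# Assumed naming: snapshot descriptors look like name-000001.vmdk, and the
# non-persistent redo logs in this thread carry REDO or delta in the name.
looks_like_snapshot() {
    case "$1" in
        *-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk|*REDO*|*delta*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the host (nothing below is executed by this sketch):
#   vmx_disk_files backend1.vmx | while read dev file; do
#       looks_like_snapshot "$file" && echo "$dev points at a snapshot: $file"
#   done
```

If nothing is flagged, the .vmx is probably already pointing at the base disks and step 7 may not need any edits.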

Good Luck

Author

Commented:
bgoering,

I plan to back up my entire VM folder so if anything gets messed up I can always simply copy the files back and mount them. However, based on the online tutorial at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007849 and the info noted at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002310, which one should I do?  Both of these methods mention committing snapshots...  I simply want to get rid of my snapshots so VEEAM will work.  Will these methods (not yours) revert to the snapshots that I'm working on?  I simply want to consolidate them and/or delete them.
gopher_49, ESXi can be accessed using either of the tools I mentioned above.
Using Tech Support Mode is only supported for collecting logs when you log a call support with VMware.
Top Expert 2010

Commented:
gopher_49 I would try the article in the 2nd link first - it seems to be much simpler and if it clears your snapshots so much the better. However, if any of the snapshot files remain on disk after you are done the veeam is unlikely to work. I can't remember, are you using the licensed version or the free version of ESXi? If it is the free version I don't believe veeam supports it.

Author

Commented:
I'm using a paid version of ESXi.  I agree in regards to the second option.. now...  When it says it's committing the delta snapshot files does this mean it's simply building a snapshot file?  I'm scared that it's trying to revert my currently mounted VM when it mentions committing snapshots.
Top Expert 2010

Commented:
When snapshots are "committed" it means the delta is being applied to the base disk (or to a previous snapshot level) in order to bring it up to date. Once the data is committed the snapshot file is deleted - and no data has been lost.

What is troubling in your case is that you are unable to see the snapshots in snapshot manager. I am thinking you may have had an incompatibility between your Veeam solution and the previous independent non-persistent disk configuration.

I am going to speculate on how it might have gotten into this state:
Basically, with the independent non-persistent option a snapshot is created when you power on the vm and any disk changes are logged to that snapshot while the vm is running. When the vm is powered off the snapshot is automatically deleted and your disk state is returned to the base disk thus explaining what was happening with your original issue. ( http://www.experts-exchange.com/Software/Server_Software/Email_Servers/Exchange/Q_26296389.html )

Now enter Veeam. I don't use Veeam, but I suspect it uses a process where the vm disk i/o is quiesced, a Veeam snapshot is created, then a backup is done from the previous, quiesced snapshot level (which, remember, was created automatically when the vm was powered on due to the independent non-persistent disk configuration you had). Once the backup is complete, Veeam should delete (commit) the snapshot that it created to hold changes while the backup was running.

Now what happens if the vm gets powered down and reverts to its base disk before Veeam finishes? I suspect (but don't know) that VMware doesn't care and just reverts to the base disk and deletes the snapshot levels that it is aware of. If that is the case then there are probably snapshots created by Veeam still lying around. Now when you try to run the Veeam backup again, it is unable to create the snapshot names it wants to use and thus fails.

The only way to be sure a snapshot isn't being used is to follow the instructions in the first link you posted above - find your current level (where the .vmx scsi0:n entries are pointing to) and chase the CIDs backward until you get to your base disk in order to determine what files VMware thinks are in the chain for the disk. At this point you can delete any files not in use in the chain. If, for example, the backend1.vmdk-ctk* entry isn't being used anywhere in the chain, then it may be safely deleted. Now you may get lucky and find that Veeam-specific names are not anywhere in the chain, and once you delete those files Veeam may work again - however, keep following the steps in the article because it still doesn't resolve why the snapshot doesn't appear in snapshot manager where it can be deleted through the vSphere client GUI.
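
To illustrate the chain chasing, here is a rough sketch that walks from one descriptor back to the base disk. It relies on the `CID`, `parentCID` and `parentFileNameHint` keys found in VMDK descriptor files, and on `parentCID=ffffffff` marking a disk with no parent. It also assumes the hint is a file name relative to the same directory (the hint can be an absolute path, which this sketch does not handle). `desc_value` and `chase_chain` are hypothetical names, and the script only reads files; it never deletes anything.

```shell
#!/bin/sh
# Read one key from a VMDK descriptor file: $1 = file, $2 = key.
# Double quotes and carriage returns are stripped from the value.
desc_value() {
    sed -n "s/^$2=//p" "$1" | head -n 1 | tr -d '"\r'
}

# Walk from descriptor $1 back to the base disk, printing one line per link:
#   <file> CID=<cid> parentCID=<parent cid>
# Returns 1 if a parent is missing or its CID does not match the child's
# parentCID. Assumes parentFileNameHint is relative to the same directory.
chase_chain() {
    cur=$1
    dir=$(dirname "$cur")
    while :; do
        cid=$(desc_value "$cur" CID)
        pcid=$(desc_value "$cur" parentCID)
        [ -n "$pcid" ] || { echo "no parentCID in $cur" >&2; return 1; }
        echo "$(basename "$cur") CID=$cid parentCID=$pcid"
        # ffffffff marks a base disk with no parent.
        [ "$pcid" = "ffffffff" ] && return 0
        hint=$(desc_value "$cur" parentFileNameHint)
        parent="$dir/$hint"
        [ -f "$parent" ] || { echo "missing parent: $hint" >&2; return 1; }
        [ "$(desc_value "$parent" CID)" = "$pcid" ] || {
            echo "CID mismatch between $(basename "$cur") and $hint" >&2
            return 1
        }
        cur=$parent
    done
}

# Usage on the host (nothing below is executed by this sketch):
#   chase_chain /vmfs/volumes/<UUID>/backend1/<descriptor named in the .vmx>
```

Any descriptor that never shows up in the printed chain for any of the three disks is a candidate for the "not in use" list, but it is worth confirming against the KB steps before removing it.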

However, I don't expect that to happen in your case because the timestamps on all of the snapshot vmdk files appear to be newer than your base volume. But it may be worth a try.

Again, the last hope instructions I posted above would revert you back to July 3 at 4:19 AM like it was reverting to on your previous EE case, and you would have to restore exchange and run the possibility of losing anything that you hadn't backed up. Try both methods recommended by other experts first before reverting back to that old time. But make sure the methods you try do indeed get rid of ALL of the .vmdk files on the datastore except for those comprising the three base disks.

Good Luck

Top Expert 2010

Commented:
One other comment - while you are trying to commit the snapshots BE PATIENT! It can and often does take a fair amount of time so don't interrupt it. I see that one of your virtual disks will have nearly 3 GB of delta to apply and that will take nearly forever :)

And let me stress to recheck your disk configuration to ensure it hasn't changed back to independent non-persistent. If you are able to successfully commit all of your snapshots your configuration (.vmx) file may also be restored to a previous level that may point back to the non-persistent disk configuration. If that isn't corrected early you may be back where you started!

Author

Commented:
Wow... What a mess.  The first link I posted to you is a ton of steps and the second link makes a little more sense.  Should I try the second link first?  It seems straight to the point and shouldn't take long.  I'll backup Exchange and my VM files prior to doing so.  Also, once done I plan to check the disks before powering on.  Also, when you say committing the large delta file will take a long time...  Do you think maybe 3 hours or so???  If it's under 6 hours I don't think it will be a problem.

Author

Commented:
Of course I'll chase the CIDs to see if they're not being used.  I can chase these before making a backup, correct?  I'll wait to delete anything until I make a backup.

Author

Commented:
Also, I'm looking at the delta and redo files and those match the exact time I ran a VEEAM backup.  I guess this is good news in regards to the chain not pointing back to these files, right?  If so I might be able to simply delete them once I prove they're not in the chain.
Top Expert 2010

Commented:
You can chase the chain anytime to see if they belong and delete them if they don't. But that still leaves the issue of why the current snapshot volume doesn't show up in the gui to delete.

I would say 6 hours should be more than enough time, what I meant was sometimes people get impatient and a half hour may not be enough time, although it seems like forever when you are doing it.

You can see if it is working by observing the contents of the folder while the delete/commit is running, there will be a file in there that is growing as the changes are merged. Once complete some deleting and renaming goes on as well as .vmx edits to point to the new .vmdk file.

I can't see that anyone has mentioned it, but be sure you have plenty of free space on the datastore to commit your snapshot. As a rule of thumb I like to have double the size of the base disk free. Not certain if you need all of that, but you will need at least the size of the base disk free.
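
The rule of thumb above boils down to a simple comparison, sketched here. `enough_room` is a hypothetical helper; both sizes are whole numbers in the same unit (GB is a reasonable choice), read from wherever is convenient, such as the datastore summary in the vSphere Client. The example numbers in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Hypothetical check: is there room to commit?
# $1 = free space on the datastore, $2 = size of the base disk (same unit),
# $3 = safety factor (default 1; use 2 for the "double the base disk" rule).
enough_room() {
    [ "$1" -ge $(( $2 * ${3:-1} )) ]
}

# Usage (example numbers only):
#   enough_room 2048 300 2 && echo "plenty of room" || echo "free up space first"
```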

Here is another article that may work for you about getting rid of orphaned snapshots, sounds a bunch easier than some of the other options if it works....

http://virtualandy.wordpress.com/2009/04/24/esxi-snapshots-not-showing-in-vi-client/

I know I am giving you a lot to think about... but so long as you backup before starting you shouldn't lose anything.

Good Luck

Author

Commented:
Thanks for the info and I have 2 TB's free of fault tolerant storage. I'll check out the new link and might even do a little tracing tonight.
Top Expert 2010

Commented:
Did you have any luck with your tracing?

Author

Commented:
I was going to try your link first..  I was going to create a snapshot from the snapshot manager and then try the delete all command...  If that didn't work I was going to try the tracing and then manually delete the old snapshot files assuming they're not linked.  I'm going to try tonight, for Friday evening no one will be on the server and if anything messes up I have all weekend to work on it.

Author

Commented:
I think the snapshot issue might be fixed.  I backed up my Exchange db and then stopped the MTA service.  I powered down the VM and tried to copy the VM files.  VEEAM SCP would copy a few and then stop...  The vSphere client was way too slow...  So, I skipped copying the VM files, for the step I was about to do didn't require them to be backed up.  So, with the VM powered off I created a snapshot via the snapshot manager.  I then did the delete all command.  It mentioned consolidating snapshots and then said it would delete them.  Based on the file system it seems that all snapshots are now gone.  I'll post the snapshot of the file system tomorrow.  I'm also about to test a snapshot backup via VEEAM.

Author

Commented:
I just powered on the VM and noticed that it reverted back to 7/3.  I guess when consolidating the snapshots it went back in time.  This is the day I turned off the non-persistent disk mode.  So, I changed the disk back and then performed a restore of my Exchange db.  So, this will leave my VM functioning again.  Now, I'm about to attach a snapshot of the file system for I still see some snapshot files.  The redo and delta files are gone though.

Author

Commented:
datastore folder of VM as of 7/10
backend1-VM-datastore-folder-7-1.jpg
Top Expert 2010

Commented:
I still see snapshot files in your screenshot. Are those ones created by Veeam? Do any snapshots show up in snapshot manager? Do the disks still have Independent unchecked?

Author

Commented:
When I created a snapshot in the snapshot manager on Friday it consolidated the snapshots.  I then deleted all of them.  I then powered on the VM.  Since it consolidated the snapshots it reverted back in time.  It reverted back to the day when the snapshots were created, which was 7/3.  On 7/3 I still had independent non-persistent enabled so it rolled back to that.  Due to this I powered down the VM, changed the disk back to Normal, and restored my Exchange db.  Now I have my VM with the current Exchange db and Normal mode enabled on the disk.  Keep in mind when it rolled back to 7/3 it also ran the VEEAM job that crashes the server.  The redo and delta files are now gone, however, the other snapshots are still there.

Author

Commented:
This explains why the current snapshot files are dated 7/3, for when it committed the snapshots it went back to 7/3.  When the VM boots on 7/3 it runs the job that crashes the server.  I need to find a way to simply delete these snapshots and not have it go back in time.  When going back in time it's an endless loop.  Going back in time results in non-persistent being enabled and the VEEAM job running, which results in messed up snapshots due to non-persistent being enabled.  Non-persistent is not supported by VEEAM and obviously seems to cause issues.  I need to delete the snapshots and avoid going back in time.
Top Expert 2010

Commented:
Well, if you are willing to go back to 7/3 one more time you can follow my hack workaround; it is my first post in this thread. It is basically: back up, shut down the vm, remove (don't delete) the vm from inventory, edit the vmx file with vi, change the scsi0:n entries to point to the base disks, manually delete (rm) all other .vmdk files that are not your base images, add the vm back to inventory, ensure the disk configurations do not have independent checked, power up, restore, done.

Actually, it looks like your base image files are all dated 7/10 now. If the vmx file is already pointing to the base image you may only have to delete what is left over.
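
To see "what is left over" without deleting anything, a listing along these lines might help. `leftover_vmdks` is a hypothetical helper. One point worth stating: when the folder is viewed over SSH rather than in the datastore browser, each base disk on VMFS shows up as a small descriptor (X.vmdk) plus its data extent (X-flat.vmdk), and both belong to the base disk, so the sketch keeps both. Change tracking (-ctk) files will be listed too; whether those are safe to remove depends on the ctkEnabled setting, so treat the output as a list to review, not a delete list.

```shell
#!/bin/sh
# Hypothetical helper: list files with "vmdk" in the name in directory $1
# that do not belong to any base disk named in the remaining arguments
# (base names are given without the .vmdk extension).
# For each base "X", X.vmdk (descriptor) and X-flat.vmdk (data extent) are kept.
# This only prints names; it never deletes anything.
leftover_vmdks() {
    dir=$1; shift
    for f in "$dir"/*.vmdk*; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        keep=no
        for base in "$@"; do
            if [ "$name" = "$base.vmdk" ] || [ "$name" = "$base-flat.vmdk" ]; then
                keep=yes
            fi
        done
        [ "$keep" = yes ] || echo "$name"
    done
}

# Usage on the host (nothing below is executed by this sketch):
#   leftover_vmdks /vmfs/volumes/<UUID>/backend1 backend1 backend1_1 backend1_2
```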

After that check your datastore with the machine running; you should see no snapshot files. Run your Veeam backup; if you check the datastore while it is running I expect you will see snapshots, but they should go away after it completes. I am not certain that Veeam deletes them as I don't use it, but most backup products work by quiescing the I/O to get a consistent disk image, taking a snapshot to log changes to while the backup is running, backing up the consistent disk images, then removing the snapshot and consolidating the changes back to the base disk.

I am still concerned that I see snapshot images out there with the 7/3 date, we need to make those go away.

Author

Commented:
When you say backup... Are you referring to the VM files?  If so, I couldn't get VEEAM to copy them.  The vsphere client said it would take like 3 days?!  The VEEAM software started the copy but files would stop copying.  I let it run for 3 hours and it didn't progress anymore after the first initial 5 files.

Author

Commented:
My paid VEEAM install has a VM copy feature.  I'll use it.. So, I need to backup all of the VM files while it's powered off, correct?
Top Expert 2010

Commented:
When I said backup I just meant whatever is important you not lose - such as exchange db. You won't lose any more than you did when it was reverting. Maybe even less because of the newer date on the base disks.

Author

Commented:
Also,

I don't see how to edit the .VMX inside of the vSphere Client.  Can I simply download it and edit it with WordPad?  Notepad doesn't read the file easily.
Top Expert 2010

Commented:
Log in to the console or an ssh session as root on the ESXi server hosting your vm and use vi (a unix-style editor).

Author

Commented:
I downloaded the .vmx file to my local hard disk.  And I see in SSH how I can edit via vi if needed.. Check out the below section of my .vmx file.  It's showing persistent as the mode, however, independent is NOT checked in my server settings.  I disabled persistent mode before powering on.  I guess it will revert back next time but then after that it will stop reverting back...  it's going to revert back anyway but I can work around that with my Exchange db backups.

scsi0:0.fileName = "backend1.vmdk"
scsi0:0.mode = "persistent"
scsi0:0.ctkEnabled = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.redo = ""
scsi0:1.fileName = "backend1_1.vmdk"
scsi0:1.mode = "persistent"
scsi0:1.deviceType = "scsi-hardDisk"
scsi0:1.present = "TRUE"
scsi0:2.fileName = "backend1_2.vmdk"
scsi0:2.mode = "persistent"
scsi0:2.deviceType = "scsi-hardDisk"
scsi0:2.present = "TRUE"

Author

Commented:
So,

My SCSI drives are NOT pointing to the snapshot files.. So..  I can stop mail flow, back up Exchange, power down the VM, back up the VM files, delete the snapshot files, import the .vmx file, make sure drives are set to normal, then power on the server?  Then restore if needed?  Instead of using SSH can't I simply delete the files from vSphere client?  Since my .vmx file looks correct I don't see why I can't simply delete the files from vSphere client.. I probably don't even need to use SSH.  The extra snapshot files are simply there due to my drives going back in time when consolidating the snapshots.

Author

Commented:
Also,

Do I have to backup my VM files?  It seems that it's not really needed.  Of course I'll backup my Exchange db.
Top Expert 2010

Commented:
it says mode="persistent" that should be fine. It is probably left over from when it was set independent. It really isn't necessary though - if you want you can take those lines out. Here is a sample config from one of my vms.

scsi0:0.present = "TRUE"
scsi0:0.fileName = "ARR.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"

That is all it has to identify the disk.
The ctkEnabled setting I don't have anywhere, I gather it is for block level incremental backup so may be something Veeam has set up.
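
Since the disk mode keeps coming up, here is a quick sketch for checking it from the shell rather than the GUI. `nonpersistent_disks` is a hypothetical helper; it assumes a non-persistent disk is recorded with a mode value containing "nonpersistent" (for example "independent-nonpersistent"), while a plain "persistent" value is fine.

```shell
#!/bin/sh
# Hypothetical helper: print every scsiN:M.mode line in the .vmx file $1
# whose value contains "nonpersistent".
# Exit status is 0 if any such line was found and 1 if none were.
nonpersistent_disks() {
    grep -i '^scsi[0-9]*:[0-9]*\.mode *= *".*nonpersistent' "$1"
}

# Usage on the host (nothing below is executed by this sketch):
#   nonpersistent_disks backend1.vmx && echo "fix the disk mode before powering on"
```

Running it before each power-on and again after a backup attempt would show whether something is flipping the mode behind your back.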

Author

Commented:
I'll keep it how it is...  Do I need to run a backup of my VM files before performing the below steps?

stop mail flow, back up Exchange, power down the VM, delete the snapshot files, import the .vmx file, make sure drives are set to normal, then power on the server?  Then restore Exchange if needed?  Instead of using SSH can't I simply delete the files from vSphere client?  Since my .vmx file looks correct I don't see why I can't simply delete the files from vSphere client.. I probably don't even need to use SSH.  The extra snapshot files are simply there due to my drives going back in time when consolidating the snapshots
Top Expert 2010

Commented:
I wouldn't so long as you have a good exchange backup

Author

Commented:
How do I import the .vmx file?  I don't see the option in the vSphere client.

Author

Commented:
I see how to import...  It's imported and powered on.  I'll test veeam shortly

Author

Commented:
It seems to have worked... All snapshots are gone and I'm able to perform a full backup of the VM with VEEAM.  One of my other VM's is messing up when backing up now, however, I think it's due to the VEEAM db being reverted back in time and only confused.  The next time I perform a full backup it should work... I'll call VEEAM about that, however, it seems that this VM is backing up now.  I'll keep this ticket open until Monday and will update.

Thanks!

Author

Commented:
I spoke too soon.. Backups still failed... ouch... It gets a good bit into the backup, fails, and then reboots... At least now when it comes up it's not going back in time... I'll call VEEAM, because something aside from old snapshots is causing this.

Author

Commented:
I also just noticed something... When VEEAM fails it somehow reverts back to an old snapshot?! This is odd. My disks are set back to non-persistent. There must be an old snapshot out there in a different folder OR VEEAM is doing some type of restore?!

Author

Commented:
I also just noticed something... My VM did not go back in time. For some reason this recent backup set the drives into non-persistent mode. I triple checked before the backup and they were normal.. Also, my data is current. My email db is current and there are changes I made since 7/3 that are still there... For some reason when the job fails it sets the drives to persistent mode. I only noticed persistent mode and its symptoms after my VM crashed the FIRST time I ran VEEAM. I've rebooted and powered the VM on and off at least 3 times prior to my first VEEAM backup and all data was current... I have no idea why VEEAM triggers my VM to go into this mode but it's definitely something to do with VEEAM.

Author

Commented:
I'm able to replicate my symptoms from scratch... I'm able to look at the .vmx file and verify that the disks aren't using snapshots. I then power down the VM and delete all snapshots and chk files... I then power the VM on. All disks are set to Normal. I can reboot the VM and all disks are still set to Normal. I run the VEEAM backup, then it crashes when it gets to the second disk. When it crashes, the server reboots and non-persistent is set on all disks. There are then VEEAM temp snapshots in the Snapshot Manager, and they are labeled as such. I then use the Delete All command and those are deleted; however, what's left are the snapshots needed for persistent disks. When you look at the .vmx file, the 3 disks are pointing to the newly created snapshot files. When I change the disks to Normal, it no longer points to the snapshot files. So... I then boot the disks in Normal mode and then reboot to make sure it keeps the settings, and it does. From then on the VM is fine, assuming I don't run a backup.
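For anyone repeating the "is the .vmx pointing at snapshot files" check described above, here is a rough Python sketch. It assumes snapshot disks follow the usual `name-000001.vmdk` naming (a dash plus six digits before the extension); the file name `backend1-000001.vmdk` in the example is hypothetical, not copied from my datastore.

```python
import re

# Assumption: snapshot (delta) disk descriptors are named like backend1-000001.vmdk.
SNAPSHOT_DISK = re.compile(r'-\d{6}\.vmdk$', re.IGNORECASE)

def disks_on_snapshots(vmx_text):
    """Return {device: fileName} for disks whose fileName looks like a snapshot disk."""
    hits = {}
    for line in vmx_text.splitlines():
        match = re.match(r'\s*(scsi\d+:\d+)\.fileName\s*=\s*"(.*)"', line)
        if match and SNAPSHOT_DISK.search(match.group(2)):
            hits[match.group(1)] = match.group(2)
    return hits

# Hypothetical state after a failed backup: the first disk points at a snapshot.
after_crash = '''scsi0:0.fileName = "backend1-000001.vmdk"
scsi0:1.fileName = "backend1_1.vmdk"
'''
print(disks_on_snapshots(after_crash))
```

An empty result means every disk points at its base .vmdk, which is the state I want before running another backup.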
Top Expert 2010

Commented:
Wow - sorry, I was out of pocket for a while. Sounds like you have a good record of the symptoms; it might be time to contact Veeam support. I keep thinking about trying it, but might not now lol. I don't back up the entire VM disks in my shop -- I just run OS-level backups like I always did before they were virtualized. It is sounding like the Veeam backup crashing mid-job may be the root of your entire issue though. Persistent mode sounds good though - it was the independent non-persistent that we started with on your other case that has me puzzled how it got that way.

Post back here if you think there is anything else I can do.

Good Luck

Author

Commented:
VEEAM seems to be a commonly used solution for VM networks my size... It works great on all VMs but this one. To be honest... I had problems using the VCB method when I first set up my VM, but I thought that was due to volume space. The thing is... VEEAM simply sends commands to VMware, so the problem probably lies somewhere in the ESX host and/or the VMware OS. It's almost easier at this point to create a new VM... Make it a backend Exchange server and move mailboxes to it.... But.. I'm so anxious to figure this out... Now that I can reproduce everything it will help with troubleshooting... I'm just getting nervous having to lean on my backups so much... I'll harass VEEAM and will get back with you. At least I now know how in the world non-persistent got enabled! I knew I didn't do that.

Author

Commented:
Here's the response from VEEAM. They are blaming it on me backing up to a Linux file system when using virtual appliance mode. The problem is that I'm able to back up another VM using the exact same method on the same host. Both hosts use the same method and back up to the exact same appliance hardware model and volume type. Each host is able to successfully back up a VM... Only one VM is failing. See their response below...

This behavior will almost always happen if both of the following conditions are met:
• vStorage API "Virtual Appliance" mode is used.
• You are either replicating, or backing up to Linux server or another ESX host.

Seems that regular backups to locally attached storage or CIFS share are not affected.

In short, what happens is the following:
1. On job completion (or failure) Veeam Backup will issue a VMware API command to remove the snapshot.
2. VMware API will ask ESX to perform snapshot removal.
3. ESX will first clean up the snapshot record from the internal ESX configuration database (not sure why this is not done only after the snapshot is removed - design bug?)
4. ESX will then proceed to remove the actual snapshot file.
5. In most cases, the actual VMDK files will still be locked by the hot-add process at this time (VMDKs are still "un-hot-adding"). It appears that there is no inter-process synchronization in this scenario, and if ESX cannot remove the snapshot files immediately, it fails right away instead of waiting.
6. And so snapshot file removal fails.
7. But the snapshot record is already gone from the ESX configuration database, so the snapshot files remain orphaned.

That will help you moving forward with this issue, please contact VMware support in regards to a fix.
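To make the ordering problem in VEEAM's explanation easier to see, here is a toy Python simulation. It is only a model of the sequence they describe (record removed first, file removal fails on a lock, no retry), not ESX code; the sets and the `remove_snapshot` function are invented for illustration.

```python
class FileLocked(Exception):
    """Raised when the snapshot file is still held by another process."""

def remove_snapshot(config_db, datastore, locked_files, snapshot):
    """Mimic the described ordering: drop the record first, then the file."""
    config_db.discard(snapshot)       # the snapshot record is cleaned up first
    if snapshot in locked_files:      # hot-add still holds the lock on the file
        raise FileLocked(snapshot)    # fails right away instead of waiting
    datastore.discard(snapshot)       # only reached when the file is unlocked

config_db = {"veeam-temp-snapshot"}   # snapshots the host knows about
datastore = {"veeam-temp-snapshot"}   # snapshot files actually on disk
locked = {"veeam-temp-snapshot"}      # files still locked by hot-add

try:
    remove_snapshot(config_db, datastore, locked, "veeam-temp-snapshot")
except FileLocked:
    pass

# Files on disk that no record refers to any more.
orphans = datastore - config_db
print(orphans)
```

If the record were removed only after the file was gone, a failed removal would leave both in place and a later retry could clean up, which is the design question VEEAM raises.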
Top Expert 2010
Commented:
What is your backup target? Can you configure your backup to use vStorage API Network (NBD) mode and see if that helps?

Author

Commented:
I can use that option...  I can also use iSCSI.  I'll try both methods and see which works and has better performance...

Author

Commented:
Due to the symptoms and having to restore to backups I'll wait until the weekend.  Users are using the exchange db too much during the week.
Top Expert 2010

Commented:
Good Luck - keep me posted. This is turning into quite a long thread lol

Author

Commented:
Yea... VEEAM and VMWare will most likely just point their fingers at each other!  I'll try the different backup methods this weekend.

Author

Commented:
Check out these two .vmx files. Frontend is a VM that backs up properly and backend1 does not... Look at the mode statements in backend1's .vmx file... All of the disks on backend1 are still set to Normal, and when I reboot they still stay on Normal according to the settings menu; however, this .vmx file shows differently.

backend1 .vmx file
scsi0:0.fileName = "backend1.vmdk"
scsi0:0.mode = "persistent"
scsi0:0.ctkEnabled = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.redo = ""
scsi0:1.fileName = "backend1_1.vmdk"
scsi0:1.mode = "persistent"
scsi0:1.ctkEnabled = "TRUE"
scsi0:1.deviceType = "scsi-hardDisk"
scsi0:1.redo = ""
scsi0:2.fileName = "backend1_2.vmdk"
scsi0:2.mode = "persistent"
scsi0:2.ctkEnabled = "TRUE"
scsi0:2.deviceType = "scsi-hardDisk"
scsi0:2.redo = ""

frontend .vmx file
scsi0.present = "true"
scsi0.sharedBus = "none"
scsi0.virtualDev = "lsilogic"
memsize = "3072"
scsi0:0.present = "true"
scsi0:0.fileName = "frontend.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
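A quick way to spot the difference between the two files is to compare which settings each one carries for the same device. The Python sketch below does that for scsi0:0 using the excerpts above; `disk_keys` is a made-up helper, and the excerpts are only the lines I posted, not the complete files.

```python
import re

def disk_keys(vmx_text, device):
    """Collect the setting names present for one device, e.g. scsi0:0."""
    pattern = re.compile(r'^\s*' + re.escape(device) + r'\.(\w+)\s*=', re.MULTILINE)
    return set(pattern.findall(vmx_text))

backend1 = '''scsi0:0.fileName = "backend1.vmdk"
scsi0:0.mode = "persistent"
scsi0:0.ctkEnabled = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.redo = ""
'''
frontend = '''scsi0:0.present = "true"
scsi0:0.fileName = "frontend.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
'''

# Settings that only the failing VM has on its first disk.
only_backend1 = sorted(disk_keys(backend1, "scsi0:0") - disk_keys(frontend, "scsi0:0"))
print(only_backend1)
```

For these excerpts the extra settings on backend1 are ctkEnabled, mode, and redo.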
Top Expert 2010

Commented:
Was the backend1 machine ever configured for fault tolerance? I did run across a knowledge base article that mentions backup problems, and it recommends removing all of the entries containing "ctk".

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1013400

Take a look and see what you think

Author

Commented:
It was never configured for fault tolerance... I do utilize changed block tracking. I'm concerned about the line below. Why is this line there? Persistent mode will not work with VEEAM. I'll try removing the items mentioned in the KB this weekend.

scsi0:0.mode = "persistent"
Top Expert 2010

Commented:
I think you can safely delete those lines as they are likely left over from your experience with independent disks.

What I did was generate a new test VM. I looked at the .vmx file and there was no scsi0:0.mode parameter at all.

I then changed the disk configuration to independent-nonpersistent, the scsi0:0.mode="independent-nonpersistent" parameter was added to the vmx file.

Next I changed the disk configuration to independent-persistent, the scsi0:0.mode="independent-nonpersistent" parameter was changed to "independent-persistent"

Finally I unchecked the independent box (reverting back to the original normal config), and rather than remove the mode parameter completely it was changed to scsi0:0.mode = "persistent"

I would (again after having good backups just in case) delete those lines and see if it helps.
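If you would rather not hand-edit, here is a small Python sketch of the same change. It assumes you work on a copy of the .vmx with the VM powered off, and it removes only lines whose value is exactly "persistent"; `drop_persistent_mode` is a name I made up for the example.

```python
import re

# Matches lines such as: scsi0:1.mode = "persistent"
MODE_PERSISTENT = re.compile(r'^\s*scsi\d+:\d+\.mode\s*=\s*"persistent"\s*$')

def drop_persistent_mode(vmx_text):
    """Return the .vmx text without any scsiX:Y.mode = "persistent" lines."""
    kept = [line for line in vmx_text.splitlines()
            if not MODE_PERSISTENT.match(line)]
    return "\n".join(kept) + "\n"

before = '''scsi0:0.fileName = "backend1.vmdk"
scsi0:0.mode = "persistent"
scsi0:1.fileName = "backend1_1.vmdk"
scsi0:1.mode = "independent-nonpersistent"
'''
print(drop_persistent_mode(before))
```

Any independent-* mode line is deliberately left alone so the script cannot hide a disk that is still in independent mode.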

Author

Commented:
ok.. when you say delete those parameters, do you also mean delete the scsi0:0.mode = "persistent" parameter too? Also, when you refer to good backups I assume you're referring to Exchange backups, correct?
Top Expert 2010

Commented:
yes - delete the lines for scsi0:(0,1,2).mode="persistent"

when I said backup - I mean whatever data is important to you - In this case you have indicated the Exchange database is pretty much it on this machine.

I can't imagine that it would make a difference to ESX whether the parameter is there or not. It is the default option for normal disks. However, it may make enough of a difference to confuse the backup software. Suppose the backup software assumes that the mere presence of a scsi0:0.mode="whatever" parameter means the disk is in independent mode, and doesn't really process the value. Stranger things have happened.

Author

Commented:
Sounds good. I'll give it a shot Friday evening once the Exchange db is done backing up. I'll stop mail flow before backups start so I don't lose any emails.

I'll keep you posted.

Author

Commented:
Also,

What about the -ctk.vmdk files? Should I delete those too? They are listed below..

backend1_1-ctk.vmdk
backend1_2-ctk.vmdk
backend1-ctk.vmdk
Top Expert 2010

Commented:
You indicated earlier that you are using the change tracking - so those files are probably needed for that

Author

Commented:
ok great. I'll make the suggested changes Friday evening. We'll see what happens and go on from there. I'm really trying to avoid opening a ticket with VMware because tickets are so expensive.

Thanks.

Author

Commented:
I tried what the KB article said and it crashed again. So, I deleted the temporary snapshots VEEAM created and then changed the disks back to Normal. This had to be done to allow the disks to become Normal disks. I then let it reboot, because it booted to the snapshots it created when going into independent non-persistent mode. I looked in the .vmx file and saw that it was pointing to the snapshot files it created.. So.. I let it boot to those.. Then powered it down so this time it would boot the normal .vmdk file.. This is so nerve-racking... hahahaha... wow... Anyway.. I'm about to try the network with NBD option. If this doesn't work I'm opening a ticket with VEEAM unless you have something else to try.... The problem is that the network with NBD option is SO slow.
Top Expert 2010

Commented:
The only thing I have left to try is to clone the thing to a brand new vm.. cloning also will consolidate snapshots. If the new vm works then just go with it. Other than that I am about out of ideas.

Author

Commented:
The network backup option worked... It's slow, but it worked. I'm going to try the network option from another computer to see if it's faster. Usually when a server backs itself up it's slower; however, when backing up from another physical server there might be more network paths... It also seems that I can run two backup streams per computer. VEEAM suggests that the source and target be different per stream. I tried this last night and my backup speeds were just enough to fit my window, which is around 6 hours. If I can back up my entire network via full in 6 hours I'm happy.... I have about 800 GB of data. Anyway. I'll try a few variations of network mode and will update the ticket.

Author

Commented:
The backup of the trouble VM finished but had problems writing some type of .xml file... It's giving a bad username/password error. Of course contacting VEEAM is pointless, because they just say call VMware.... I guess this is why companies like CA and Veritas have been in business for so long: they actually troubleshoot problems, unlike VEEAM, who simply points the finger at VMware because they only send commands to VMware... I'm almost to the point of getting rid of VEEAM and using CA's VDDK option. It works and they actually support it, unlike VEEAM. VEEAM does very little support... Anyway... So, below is the error.

10 of 10 files processed

Total VM size: 168.00 GB
Processed size: 168.00 GB
Processing rate: 7 MB/s
Backup mode: NBD with changed block tracking
Start time: 7/16/2010 9:29:23 PM
End time: 7/17/2010 4:41:18 AM
Duration: 7:11:55

Finalizing target session
TextToTar failed
Client error: boost::filesystem::exists: Logon failure: unknown user name or bad password: "\\nas2\vmhost1backup\veeamvmhost1\backend1.vbk"
Failed to backup text locally. Backup: [veeamfs:0:2880ea0f-3157-4631-bf28-8f72aa28a83c (176)\summary.xml@\\nas2\vmhost1backup\veeamvmhost1\backend1.vbkVBK: 'veeamfs:0:2880ea0f-3157-4631-bf28-8f72aa28a83c (176)\summary.xml@\\nas2\vmhost1backup\veeamvmhost1\backend1.vbk'RBK: ''].

Server error: End of file

Any suggestions?

Author

Commented:
I emailed my buddies at VEEAM to see if they'll actually troubleshoot this. We'll see... It seems like the backup completed... it just had an issue writing the .xml summary file.
Top Expert 2010

Commented:
Good Luck, I may download a trial of Veeam sometime just to see what it does. So far I don't do full VM backups, still running traditional backups at the OS level just as I always have for a physical machine. That seems to fit my needs.

Author

Commented:
I ran a full job from a physical server using network mode and it worked just fine. It looks like I need to stick with network mode and not use the problem VM to back itself up or any other servers. I'm using my vCenter VM and a physical server as my backup agents. They will run two jobs each at the same time. This will allow my backups to get done in my 6-7 hour window on full... Network mode is actually running at the same speed as virtual appliance mode in my network.. So.. I'll just stick with that mode since it seems to be compatible with my environment. I'm running more tests and will update the ticket later.

Author

Commented:
After doing many tests on full backups and incremental differentials, it seems that all my jobs should run off of my physical server. My physical server and my VMs back up via the full method at the exact same speeds; however, when doing the incremental differentials (delta copies I think they call it) my physical server is almost three times faster... I'm getting around 200+ MB/sec, sometimes much higher, during this process versus my VMs getting about 80 MB/sec when they process the incremental differentials... I guess the speeds are fast in general due to the changed block tracking, compression, whitespace being ignored, and inline deduplication... My full backups only get about 10 MB/sec per job; however, I plan to run 4 jobs at once so I'll end up with right at 40 MB/sec, which is plenty fast enough for my environment. I guess I'll just stick with network mode on my backups and have my physical server run the jobs. This particular server is also my CA BrightStor server, so it makes sense to consolidate all backup services onto one box...
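As a rough sanity check on the window, here is the arithmetic in Python. It assumes about 800 GB of data in total (the figure I gave earlier), 1 GB = 1024 MB, and that 4 jobs at roughly 10 MB/sec each really do add up to 40 MB/sec; `hours_to_copy` is just a helper name for the example.

```python
def hours_to_copy(total_gb, jobs, mb_per_sec_per_job):
    """Estimate hours for a full backup at a steady combined rate."""
    total_mb = total_gb * 1024                          # assumes 1 GB = 1024 MB
    seconds = total_mb / (jobs * mb_per_sec_per_job)    # combined throughput
    return seconds / 3600

estimate = hours_to_copy(800, 4, 10)
print(round(estimate, 1))
```

That comes out a little under 6 hours, so it fits the 6-7 hour window only if the four jobs really sustain their rate in parallel.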

Author

Commented:
I've run a few more incremental differentials and everything seems to be okay. I guess I'm going to go ahead and close this ticket... This ticket is so long that it will be hard for anyone reading it to pick out which comment actually fixed it... But... Below is a summary and I hope no one has to go through what I went through... And many thanks to bgoering.

On a Friday evening after my Exchange backups completed, I created a job in VEEAM's Backup and Replication software. The VEEAM software was installed on a VM. The VM is a backend Exchange 2003 server. I set the job to use virtual appliance mode. When running a job for the first time it always takes a synthetic full backup. The job was able to back up the first disk just fine; however, it crashed on the second disk and caused the server to reboot. Now, at this moment in time VMware creates a snapshot on the VM's disk and sets the disk to Independent Non-Persistent mode. For some reason, crashing during backups via virtual appliance mode can do this. When your disks are set to Non-Persistent mode, any changes written to the disk are discarded when you reboot or power off the VM... Obviously not a good thing with an Exchange server... So... when the backups failed, the disks were put into this mode and then the server rebooted. The next time the server booted, the Exchange db didn't mount due to the log files being out of sync. I performed a restore, but then when rebooting in the future it happened again... When rebooting, it reverts the disks back in time. Keep in mind... I didn't realize what was going on until a few days into this...

So... this is what I had to do... Stop mail flow to the server, make a full backup, power down the VM, set the VM to Normal disk mode, power on the VM. Then power it off and see if the disks stayed in Normal mode. If so, create a snapshot via the Snapshot Manager, then use the Delete All command. Then power on the VM. After it powers on, look at your .vmx file. I suggest copying it to your local desktop and opening it in a text editor. Make sure the virtual drives are not pointing to snapshot files... If they are, reboot again. It should NOT be pointing to snapshot files. Then copy the existing snapshot files to your local hard disk and delete them from the datastore.

MAKE SURE the .vmx is not pointing to these files... They need to be on the datastore if the .vmx is pointing to them. Only when the .vmx is NOT pointing to them do you delete them. Then change your VEEAM job to use 'network' mode via the VMware API section... Now check your email db. If it's current then you're good; if not, restore... It might not even mount.. This is why you need a good FULL backup prior to doing this...
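The "only delete when the .vmx is NOT pointing to them" rule can be sketched in Python like this. It is a hedged illustration, not a cleanup tool: it assumes snapshot files are named like `backend1-000001.vmdk` and `backend1-000001-delta.vmdk`, the directory listing is hypothetical, and it only returns a list of candidates rather than deleting anything.

```python
import re

# Assumption: snapshot descriptors and their data files carry a -NNNNNN suffix.
SNAPSHOT_FILE = re.compile(r'-\d{6}(-delta)?\.vmdk$', re.IGNORECASE)

def snapshot_files_to_remove(vmx_text, datastore_files):
    """List leftover snapshot files, or nothing if the .vmx still uses one."""
    referenced = re.findall(r'\.fileName\s*=\s*"([^"]*)"', vmx_text)
    if any(SNAPSHOT_FILE.search(name) for name in referenced):
        return []  # a disk still points at a snapshot: leave every file in place
    return sorted(f for f in datastore_files if SNAPSHOT_FILE.search(f))

vmx_normal = 'scsi0:0.fileName = "backend1.vmdk"\nscsi0:1.fileName = "backend1_1.vmdk"\n'
vmx_on_snapshot = 'scsi0:0.fileName = "backend1-000001.vmdk"\n'

# Hypothetical datastore folder listing.
listing = ["backend1.vmdk", "backend1-flat.vmdk", "backend1-ctk.vmdk",
           "backend1-000001.vmdk", "backend1-000001-delta.vmdk"]

print(snapshot_files_to_remove(vmx_normal, listing))
print(snapshot_files_to_remove(vmx_on_snapshot, listing))
```

Copy the candidates off the datastore first, as described above, before removing anything.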

You should be good to go now.. The goal is to stop mail flow, back up the Exchange db or any data that's important via a FULL method, and have your disks set to Normal without reverting back. Make sure your .vmx is NOT pointing to snapshot files, remove old snapshot files, and most importantly set your VEEAM backups to use network mode via the VMware API section.

Below is VEEAM's statement in regards to these symptoms.

This behavior will almost always happen if both of the following conditions are met:
• vStorage API "Virtual Appliance" mode is used.
• You are either replicating, or backing up to Linux server or another ESX host.

I was backing up to an NFS share off of a Linux server. Now, this is the same appliance that I have VMs stored on; however, I'm using a different share for my VM appliances. I think the fact it was a Linux file system caused this.

Good Luck!

Author

Commented:
See my last statement for a summary.