zen shaw

Recreating vmdk snapshots broken chain

Hi All,
I had a server crash with the RAID controller unable to detect the hard drive pair. To recover I had to get into the BIOS and change the settings so that the disks were detected again.

After the reboot the disks were detected but the vmdks were disconnected from the virtual machines in the inventory.

I attached the vmdk back to the Windows 2008 server and restarted the VM and it worked fine. However, I had another VM running Windows 2000 with two snapshots. When I try to restart that VM with only the flat file it works fine, but when I try to restart it with the snapshot 0002, it throws an error.

I assume that the snapshot chain has broken between the snapshots and the base vmdk file.

How do I recover / rebuild the snapshot chain (preferably without needing to download the vmdks and manipulate them using Workstation or vConverter)?

a. How do I check / change the CID for the affected VM / VMDKs?

Another strange problem occurred when I tried to remove the affected VM from the inventory and, within the datastore, tried to rename its folder from ABC to ABC_ORIGINAL .... instead of renaming the directory, VMware automatically started moving files from ABC to ABC_ORIGINAL ... on the client it displayed a task MOVE FILE with cancel greyed out ...

Even this task failed after 3% .... so I don't know what state my VM is in now.

Could anyone help with it please?

Thanks
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

zen shaw

ASKER

I'll try to follow the link you sent.

I am not sure if the deltas are corrupted, but when I tried to download 0002 ... it failed to download to the local machine.

Meanwhile, I am attaching the data store jpeg ... let me know if you find any obvious anomaly.

Thanks
No screen attached ?

I am not sure if the deltas are corrupted, but when I tried to download 0002 ... it failed to download to the local machine.

Download via the datastore browser? If so, the datastore is corrupted, and so is the file that sits on the datastore.

If that is the case, I would *BACKUP NOW* ALL your VMs.

Check the hardware and RAID, and disks, and erase the current datastore, RAID array, and re-create.
Hi,
I am attaching the descriptors ... you can have a look and tell me the problem:

I assumed the chain would be

C drive Docobo_BE.vmdk------>Docobo_BE_0000001.vmdk------->DOCOBO_BE_0000002.vmdk
E drive Docobo_BE_1.vmdk------>Docobo_BE_1_0000001.vmdk------->DOCOBO_BE_1_0000002.vmdk

But based on the descriptors, for C drive, there is the base descriptor file missing - How can I generate this file?

For D: If we trace the CID and Parent CID, it seems the order of the chain is
E drive Docobo_BE_1.vmdk------>Docobo_BE_1_0000002.vmdk------->DOCOBO_BE_1_0000001.vmdk

Could the order of the snapshots have changed, perhaps through deleting or consolidating ....

How could I rebuild the chain ... I would appreciate it if you could look at the CID and Parent CID and suggest a solution.
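
The CID / Parent CID tracing described above can be sketched in a few lines of Python. This is only a rough sketch, not a VMware tool: it assumes each descriptor is a small text file containing `CID=`, `parentCID=` and (for delta disks) `parentFileNameHint=` lines, that the hint is a plain file name in the same folder, and that a base disk carries `parentCID=ffffffff`. The function names are made up for illustration.

```python
def parse_descriptor(text):
    """Extract the chain-related fields from VMDK descriptor text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # extent lines and most comments have no "=" and are skipped
            fields[key.strip()] = value.strip().strip('"')
    return {
        "cid": fields.get("CID", "").lower(),
        "parent_cid": fields.get("parentCID", "").lower(),
        "parent_hint": fields.get("parentFileNameHint"),
    }


def check_chain(descriptors, top):
    """Walk from the top-most delta down to the base and list any problems.

    descriptors maps a descriptor file name to its text content.
    """
    problems = []
    seen = set()
    name = top
    while True:
        if name in seen:
            problems.append(f"loop in chain at {name}")
            break
        seen.add(name)
        if name not in descriptors:
            problems.append(f"missing descriptor: {name}")
            break
        child = parse_descriptor(descriptors[name])
        parent_name = child["parent_hint"]
        if parent_name is None:
            # No parent hint: this is expected to be the base disk.
            if child["parent_cid"] != "ffffffff":
                problems.append(f"{name}: base disk should have parentCID=ffffffff")
            break
        if parent_name in descriptors:
            parent = parse_descriptor(descriptors[parent_name])
            if child["parent_cid"] != parent["cid"]:
                problems.append(
                    f"{name}: parentCID={child['parent_cid']} does not match "
                    f"CID={parent['cid']} of {parent_name}"
                )
        name = parent_name
    return problems
```

Calling `check_chain` with the text of every descriptor in the VM folder and the name of the disk referenced by the .vmx returns an empty list when each link lines up, or one message per broken link (including a missing base descriptor).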

Thanks
can I just query something, you refer to C: D: and E:

but there are ONLY two virtual disks, with two snapshots ?

can I also have the exact error message, when you tried to start the VM, or add the snapshot 0002 disk.

and was the error the same for both disks ?

Okay, I can see that you are missing a descriptor file for

DOCOBO_BE.vmdk

the first disk......

this is also a common issue, that it disappears...

so, you will need this article also

Recreating a missing virtual disk (VMDK) descriptor file for delta disks (1026353)

Recreating a missing virtual machine disk descriptor file (1002511)
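
One detail worth checking when a base descriptor is rebuilt is that its extent line matches the size of the surviving -flat.vmdk file. The Python sketch below only illustrates that arithmetic and does not replace the steps in the articles above: it assumes the usual descriptor convention of an extent line of the form `RW <sectors> VMFS "<name>-flat.vmdk"` with 512-byte sectors, and the function name and example file name are hypothetical.

```python
SECTOR_SIZE = 512  # VMDK extent sizes are counted in 512-byte sectors


def extent_line(flat_name, flat_size_bytes):
    """Build the extent line a base descriptor would need for a flat file."""
    if flat_size_bytes <= 0 or flat_size_bytes % SECTOR_SIZE:
        raise ValueError("flat file size must be a positive multiple of 512 bytes")
    sectors = flat_size_bytes // SECTOR_SIZE
    return f'RW {sectors} VMFS "{flat_name}"'


# Example: a 4 GiB flat file (size in bytes as reported by a directory listing).
print(extent_line("disk-flat.vmdk", 4 * 1024**3))
```

If the sector count in a recreated descriptor differs from the value computed from the real flat file, the descriptor does not describe that disk.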
Hi Andrew,
The error I received was NO OPERATING SYSTEM

Sorry for the confusion: There are only two drives, C: and E: --- No D:

I am attaching the original vmx file also in which for the C: drive .... the current vmdk is
scsi0:0.present = "true"
scsi0:0.fileName = "DOCOBO_BE-000001.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"

Based on the .vmx ... I am assuming that the .00002 vmdk was never used. Thus, I will have a missing base vmdk descriptor of the C: drive ....
DOCOBO-BE-Orig.vmx.txt
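
Reading the disk entries out of a .vmx file, as done by eye above, can also be scripted. The following is a minimal Python sketch that assumes the simple `key = "value"` layout shown in the excerpt; the function name is made up for illustration and only SCSI disks are considered.

```python
import re


def attached_disks(vmx_text):
    """Map each present SCSI slot (e.g. 'scsi0:0') to the VMDK it references."""
    settings = {}
    for line in vmx_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip().strip('"')

    disks = {}
    for key, value in settings.items():
        match = re.fullmatch(r"(scsi\d+:\d+)\.fileName", key)
        if match is None:
            continue
        slot = match.group(1)
        # Only report slots that are marked as present in the .vmx.
        if settings.get(slot + ".present", "").lower() == "true":
            disks[slot] = value
    return disks
```

Run against the three lines quoted above, this reports that `scsi0:0` points at `DOCOBO_BE-000001.vmdk`, which is the disk at the top of the chain the VM was last configured to use.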
Okay, so the virtual machine actually powered on with no VM error ? and the BIOS reported ?

NO OPERATING SYSTEM ?

and this was when you added the 0002 snapshot, connecting the first VMDK only without the snapshot BOOTED an OS?

If that is the original file, the 2nd snapshot was *NOT IN USE* at the time of crash!

Just add the 0001 snapshot to the VM and power on!
Yes, I am recreating the C: drive base descriptor file as that file is missing .... please refer to the screenshot I attached earlier....

I am going to create that base descriptor and point the 00001 descriptor to base file (newly created) and start the server.

I'll skip 0002 on both C and E drives.....

Am I doing it right?
Yes, that's correct.

So all that's wrong, is the missing descriptor.

If after adding the 0001 snapshot, the OS does not boot or is corrupted, the 0001 snapshot is corrupted and you will have to discard all snapshots, and just use the parent.
I corrected the chain as BASE ---> 00002 ----> 00001 and I am sure this is the right sequence. However, when I start the VM, I get NO OPERATING SYSTEM Detected at Bios....

When I go to BIOS .... I do not see any disk available under Primary / Secondary slave.

Note: The OS on C is windows 2000

Any idea?
discard the second snapshot.....0002, it's not required, it's not been in use....

the chain is 00002 ----- 00001 ------ BASE

but from your original VMX, 00002 was not in use so do not use.
1. I recreated the chain as 00001 -----> 00002 ---> Base
2. Rebooted the machine and changed the boot sequence to hard disk - no luck - cannot detect OS
3. Changed the controller from BusLogic to LSI Logic - it gave a warning saying the drives were created for BusLogic, so I cancelled the conversion by clicking No instead of Yes and continued
4. Created a new VM and attached the two VMDKs, pointing to 0001 for both drives.
5. System booted correctly
6. Checked the event log and it showed 2012 entries.

I definitely changed the vmx entry to 0001 ...

Has my chain worked, or is it defaulting to the base file and ignoring the changes held in the snapshots?
When you power on the VM, it checks the snapshot chain is correct, or it will give an error message and HALT!

It would seem the 00002 snapshot is corrupted, and adding it causes the issue, which is normal for corrupt snapshot disks.
If they are all working, get RID of the snapshot state!

If the current disk is running on snapshot 0001 and the parent disk (base), this is all you can do now. Then use DELETE ALL to merge the snapshot, to ensure the changes are committed to the parent disk (base), and the snapshot is deleted and gone!
So are you saying that the 0002 was never used? And the chain was 0001 ---> Base?

Is that a conclusion based on the original vmx file?

In that case, are you suggesting I use 0001 ---> Base for both C: and E: and reboot?

how do I get the RID?
Do you have the original, unmodified VMX file?

Yes or No ?

does the VMX file include any reference to the 2nd snapshot file ? (-00002.vmdk)

Yes or No ?

Did you know that this VM was running on a snapshot ?
Do you have the original, unmodified VMX file?

I was passed a VMX file and told it was the original, which I forwarded to you. I assume the engineer before me did not change anything in it and it is indeed the original.

does the VMX file include any reference to the 2nd snapshot file ? (-00002.vmdk)

Assuming the above is true- No there is no reference.

Did you know that this VM was running on a snapshot ?

I don't know - I am told that there were no snapshots visible in the Snapshot manager earlier.
Okay, assuming the VMX file DID NOT include any reference to the 2nd snapshot file (-00002.vmdk):

DO NOT USE IT!

Just use the parent and 00001.vmdk, does this make any difference?

Does the VM BOOT ?
I tried just 0001 ---> Base but it did not work.

It complains that the parent disk has changed and that it can't open the disk.

I think I have to restore the backup of C: drive and start tinkering again.

I restored the service from a standby system and, praise the Lord, it is working on that front at least.

I will stop now as the Director is not in a mood to do further troubleshooting but once off the network, I'll continue to recover it from where we left today.

Thank you so much for all your input. I really appreciate your help.
It complains that the parent disk has changed and that it can't open the disk.

This is normal if the CIDs mismatch. Match the CIDs (the snapshot's parentCID to the parent's CID), and this will align the snapshot to the parent.

whether this forced match will give you a working VM, is another issue.
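
Matching the CIDs comes down to a one-line edit in the snapshot's descriptor. The Python sketch below shows the idea on descriptor text only; it assumes the descriptors contain lines of the form `CID=xxxxxxxx` and `parentCID=xxxxxxxx`, the function name is hypothetical, and it should only ever be tried on copies of the descriptor files, since a forced match says nothing about whether the delta data is still consistent with the parent.

```python
import re

# Anchoring at line start keeps "CID=" from matching inside "parentCID=".
CID_LINE = re.compile(r"^CID[ \t]*=[ \t]*([0-9a-fA-F]{8})\b", re.MULTILINE)
PARENT_CID_LINE = re.compile(r"^parentCID[ \t]*=[ \t]*[0-9a-fA-F]{8}\b", re.MULTILINE)


def align_parent_cid(child_text, parent_text):
    """Return child descriptor text with parentCID set to the parent's CID."""
    parent_cid = CID_LINE.search(parent_text)
    if parent_cid is None:
        raise ValueError("no CID line found in parent descriptor")
    new_text, count = PARENT_CID_LINE.subn(
        f"parentCID={parent_cid.group(1)}", child_text
    )
    if count != 1:
        raise ValueError("expected exactly one parentCID line in child descriptor")
    return new_text
```

The function leaves every other line of the snapshot descriptor untouched and raises an error rather than guessing if either descriptor does not look as expected.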

As stated in the first post, VMs running on a snapshot, when the server fails or crashes, can cause corruption in the VM.

This is why snapshots are dangerous. DO NOT rely on SNAPSHOT Manager to tell you if you have a VM running on a snapshot; it should be a daily task for your VMware Admin to check each VM....

see my EE Article how to check

HOW TO: VMware Snapshots :- Be Patient

Snapshots are evil, and cause issues all the time!
It complains that the parent disk has changed and that it can't open the disk.

I assume this is occurring because I created another virtual machine and attached the .VMDK files to the new machine. While doing so, only the base file must have been attached, which changed the file hash/properties or whatever VMware uses to know whether the base file is intact or not.

1. With the base vmdk attached, the machine struggles at boot with NO OPERATING SYSTEM found.
This seems to be an issue with the scsi controller - buslogic and thus recreating a new VM with buslogic scsi controller and attaching the base files resolved the issue and machine was able to boot.

2. Thus if the machine booting is a problem, I have to then recreate the snapshot chain as 00001 ---> Base and then edit the VMX of the new machine with VMDK as 00001, in which case I hope the controller problem and the up to date data could be achieved.

I need to try this once the machine is off network.

@ Andrew thank you for your timely help.

Snapshots are evil, and cause issues all the time!
Here comes my related question, then.

1. Popular backup solutions use snapshots to make backups of virtual machines - do these often cause the same issues? And, as you suggested, the Administrator has to check each day whether any orphan snapshots are left lying around after a backup. Can we automate this check?

2. What is the best strategy to backup applications and data?
Traditional: Backing up code (tar / zip / sync) and data backups (.sql / .bak)
or
Block-level backup of VMs (VMware Data Protection / Veeam / BackupExec)

3. How do you compare (VMware Data Protection / Veeam / BackupExec)
Is there a reason to choose Veeam over DP and BackupExec, especially when DP is included in the VMware license?

4. Best practices and deployment design / strategy to implement backups of VMs (any doc / video please)

Thanks
Zen
I assume this is occurring because I created another virtual machine and attached the .VMDK files to the new machine. While doing so, only the base file must have been attached, which changed the file hash/properties or whatever VMware uses to know whether the base file is intact or not.

normal behaviour.

1. With the base vmdk attached, the machine struggles at boot with NO OPERATING SYSTEM found.
This seems to be an issue with the scsi controller - buslogic and thus recreating a new VM with buslogic scsi controller and attaching the base files resolved the issue and machine was able to boot.

Okay, so the parent file looks to be okay.

2. Thus if the machine booting is a problem, I have to then recreate the snapshot chain as 00001 ---> Base and then edit the VMX of the new machine with VMDK as 00001, in which case I hope the controller problem and the up to date data could be achieved.

yes, or the snapshots are corrupted.

1. Popular backup solutions use snapshots to make backups of virtual machines - do these often cause the same issues? And, as you suggested, the Administrator has to check each day whether any orphan snapshots are left lying around after a backup. Can we automate this check?

Yes, they cause this issue. You can either check manually each day after backups, or set vCenter alarms, or run automated scripts.
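
As one rough idea of what such an automated script could look like, the Python sketch below flags leftover snapshot disks in a list of datastore file names. It relies on an assumption about naming only: that snapshot delta disks follow the `<disk>-00000N.vmdk` / `<disk>-00000N-delta.vmdk` pattern seen in the .vmx excerpt earlier in this thread. How the file list is obtained (datastore browser export, directory listing, etc.) is left open, and the function name is made up.

```python
import re

# Snapshot disks are assumed to end in "-<six digits>.vmdk" or
# "-<six digits>-delta.vmdk"; base disks and -flat files do not match.
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta)?\.vmdk$", re.IGNORECASE)


def find_snapshot_disks(file_names):
    """Return the file names that look like snapshot (delta) disks, sorted."""
    return sorted(name for name in file_names if SNAPSHOT_DISK.search(name))
```

A non-empty result after the backup window has closed suggests a VM is still running on a snapshot and is worth investigating by hand.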

2. What is the best strategy to backup applications and data?
Traditional: Backing up code (tar / zip / sync) and data backups (.sql / .bak)
or
Block-level backup of VMs (VMware Data Protection / Veeam / BackupExec)

Block Level, and SQL Backups as an option.

3. How do you compare (VMware Data Protection / Veeam / BackupExec)
Is there a reason to choose Veeam over DP and BackupExec, especially when DP is included in the VMware license?

No contest here, Veeam is the world leader.

4. Best practices and deployment design / strategy to implement backups of VMs (any doc / video please)

Select Veeam.

If you require more information, on the above, this really needs a new question for myself or other experts to answer, about VMware Backups.
I'll update this question once I get any further with the recovery of snapshots ...  

Thanks Andrew ...