Andrew Davis
asked on
corrupt redo log. VMware ESXi
Host: VMware ESXi, 5.0.0, 469512
Client shows error while booting up:
The redo log of {Servername}-000002.vmdk is corrupted. Power off the virtual machine, if the problem still persists, discard the redo log.
Actions:
Just about every article I have found regarding this says to consolidate the snapshots, so I deleted the snapshots (vSphere Client - Snapshot Manager), which then told me that it wanted to consolidate the drives. When I tell it to consolidate I get:
Consolidate virtual machine disk files
A general system error occurred: Input/output error
Following http://dunfraggin.blogspot.com.au/2012/09/virtual-machine-disk-consolidation.html I suspect there may be locked files. The server reports that it is powered off, but it did come up with an error during the Power Off task.
The article above leads me to http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=10051
If I look at the vmware.log file I am not seeing anything unexpected at the end of the file.
However, if I try to use vmkfstools I get the following:
login as: itsupport
Using keyboard-interactive authentication.
Password:
The time and date of this login have been sent to the system logs.
VMware offers supported, powerful system administration tools. Please
see www.vmware.com/go/sysadmintools for details.
The ESXi Shell can be disabled by an administrative user. See the
vSphere Security documentation for more information.
~ $ su
Password:
~ # vmkfstools
sh: vmkfstools: not found
~ #
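One possible explanation for the "not found" above, offered as an assumption rather than a diagnosis: a plain su may keep the unprivileged user's PATH, which might not include the sbin directories. On the ESXi builds I'm aware of, vmkfstools lives under /sbin, but that should be verified on the host itself. The helper name find_tool and the directory list in the example are hypothetical, purely for illustration.

```shell
# find_tool NAME DIR... : print the first DIR/NAME that is executable.
# Returns non-zero if NAME is not found in any of the directories.
find_tool() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example (the directory list is an assumption; adjust for your host):
# find_tool vmkfstools /sbin /bin /usr/sbin /usr/bin
```

If it prints a path, calling the tool by that full path, or switching user with "su -" so that root's login environment is loaded, may get past the error.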
So not really sure where to go from here ;)
Additional notes:
If this helps at all (Vmid 2 is the one in question)
There is only the one host, so no other host could possibly have a lock.
~ # vim-cmd vmsvc/getallvms
Vmid   Name         File                                   Guest OS                Version   Annotation
1      SBS2011      [datastore1] SBS2011/SBS2011.vmx       windows7Server64Guest   vmx-08
2      2008Leap     [datastore2] 2008Leap/2008Leap.vmx     windows7Server64Guest   vmx-08
3      2008Leap-2   [NAS01] 2008Leap-2/2008Leap-2.vmx      windows7Server64Guest   vmx-08
~ # vim-cmd vmsvc/power.getstate 2
Retrieved runtime info
Powered off
~ #
Have rebooted the host:- no change.
Cheers
Andrew
ASKER
Thanks for your time ;)
Datastore out of space? No, 50% headroom.
Host Server restart? Kind of. There was a power outage around the time of the crash; the UPS should have seen to a clean shutdown.
Files in directory:-
# cd /vmfs/volumes/datastore2/2008Leap
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap # ls
2008Leap-000001-delta.vmdk  2008Leap-flat.vmdk  2008Leap.vmxf                 2008Leap_1-flat.vmdk  vmmcores-5.gz  vmware-13.log  vmware.log
2008Leap-000001.vmdk        2008Leap.nvram      2008Leap_1-000001-delta.vmdk  2008Leap_1.vmdk       vmmcores-6.gz  vmware-14.log  vmx-2008Leap-3315273081-2.vswp
2008Leap-000002-delta.vmdk  2008Leap.vmdk       2008Leap_1-000001.vmdk        vmmcores-2.gz         vmmcores-7.gz  vmware-15.log  vmx-zdump.001
2008Leap-000002.vmdk        2008Leap.vmsd       2008Leap_1-000002-delta.vmdk  vmmcores-3.gz         vmmcores.gz    vmware-16.log  vmx-zdump.002
2008Leap-aux.xml            2008Leap.vmx        2008Leap_1-000002.vmdk        vmmcores-4.gz         vmware-12.log  vmware-17.log  vmx-zdump.003
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap #
Note "2008Leap.vmx" is green in colour.
Cheers
okay, this is the reason for failure...
The redo log to which the error refers is the 2nd snapshot delta disk. Snapshot deltas do become corrupted, and when the chain is corrupted, the VM will not start.
e.g. parent disk (vmdk) + 1st snapshot disk + 2nd snapshot disk = VM VMDK
The chain above must be intact. If the host server crashed or restarted, or the datastore ran out of space, the 2nd snapshot disk can get corrupted. ALL the information in the 2nd snapshot disk then needs to be discarded to get the VM started, which means data loss.
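To make the chain idea concrete, here is a minimal sketch that walks it from the newest descriptor back to the parent. It assumes each snapshot descriptor (the small .vmdk text file, not the -delta file) contains a parentFileNameHint="..." line naming its parent in the same directory. The function name print_chain is hypothetical, and this only shows whether the links resolve; it does not detect corruption inside a delta.

```shell
# print_chain DESCRIPTOR : print DESCRIPTOR and each ancestor, newest first.
# Assumes every snapshot descriptor holds a line of the form
#   parentFileNameHint="<parent descriptor>"
# with the parent in the same directory (absolute hints are not handled).
print_chain() {
    dir=$(dirname "$1")
    cur=$(basename "$1")
    depth=0
    # Cap the walk at 32 links so a looping hint cannot run forever.
    while [ -n "$cur" ] && [ "$depth" -lt 32 ]; do
        if [ ! -f "$dir/$cur" ]; then
            echo "MISSING: $cur"
            return 1
        fi
        echo "$cur"
        cur=$(sed -n 's/^parentFileNameHint="\(.*\)"$/\1/p' "$dir/$cur" | head -n 1)
        depth=$((depth + 1))
    done
}

# Example: print_chain /vmfs/volumes/datastore2/2008Leap/2008Leap-000002.vmdk
```

A "MISSING:" line would point at a descriptor that a child refers to but that is no longer in the folder.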
ASKER
In preparation for the worst case, I have also:-
1. Copied the server's directory to another datastore (in case I need to get anything back).
2. Created a new VM (2008Leap-2) and done a successful restore to an NFS share on NAS01. I have not connected it to the network as yet, as I just wanted to prove the disaster recovery, and cannot leave it sitting on the NAS (tooooo slow).
(Just saw another post on this question come in so this comment will be out of order)
Cheers
ASKER
So go with disaster recovery?
I have a ShadowProtect backup that will have all the data, as the office was closed (thank you, holiday period).
Cheers
Andrew
okay, so this is a two disk virtual machine....with disks...
2008Leap-flat.vmdk
2008Leap-000001.vmdk
2008Leap-000001-delta.vmdk
2008Leap-000002-delta.vmdk
2008Leap_1-flat.vmdk
2008Leap_1-000001-delta.vmdk
2008Leap_1-000002-delta.vmdk
both disks have two snapshots.
Can you try the following:-
1. Take a Snapshot of the current VM
2. Wait 60 seconds
3. Delete the Snapshot
report any errors, and then check the disk if snapshots have gone. (the delta files!)
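For the "have the delta files gone" check, a small sketch along these lines could be run in the VM's folder before and after deleting the snapshot. The function name list_deltas is hypothetical, and it only looks at file names, so it says nothing about whether a merge actually succeeded.

```shell
# list_deltas DIR : print every *-delta.vmdk in DIR, one per line,
# followed by a count line.
list_deltas() {
    count=0
    for f in "$1"/*-delta.vmdk; do
        [ -e "$f" ] || continue   # the glob matched nothing
        basename "$f"
        count=$((count + 1))
    done
    echo "delta files: $count"
}

# Example: list_deltas /vmfs/volumes/datastore2/2008Leap
```

A count of zero after "Delete All" would suggest the snapshots were merged back into the base disks.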
ASKER
With it powered off?
I'm afraid this file
{Servername}-000002.vmdk
is corrupt, and would have to be excluded from the VM configuration to get the VM started. The VM would then be as old as the time it spent running on this snapshot disk, e.g. if it ran on it for 12 days, the VM would be 12 days old.
If you have backups, it's time to restore, then monitor those snapshots daily and check your VMs are not running on snapshots.
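As a rough way to check whether a VM is running on snapshots, the sketch below scans a .vmx file for disk entries that point at a snapshot descriptor. It assumes the usual naming, where snapshot descriptors end in a six-digit suffix such as -000003.vmdk and disks appear as fileName = "..." lines. The function name disks_on_snapshot is hypothetical, and the match is only a heuristic based on the file name.

```shell
# disks_on_snapshot VMX : print the disk entries in VMX that reference a
# snapshot descriptor (a name ending in -NNNNNN.vmdk).
# Exit status is 0 if at least one was found, 1 if none were.
disks_on_snapshot() {
    grep -E 'fileName *= *".*-[0-9]{6}\.vmdk"' "$1"
}

# Example:
# disks_on_snapshot /vmfs/volumes/datastore2/2008Leap/2008Leap.vmx \
#     || echo "no disks on snapshots"
```

Run daily against each VM's .vmx, any output would be a prompt to look at that VM's snapshots before they grow.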
Yes, Powered OFF.
Is the VM ON?
ASKER
Everything looked fine with no errors.
It did report that it needs consolidation now.
I haven't done the consolidation yet, but it is not showing the deltas gone (as expected, I now have another).
Cheers
ASKER
Yes, Powered OFF.
Is the VM ON?
Sorry, that was a stupid question on my part ;)
The consolidation message appears because it detects snapshots, but it's not intelligent enough to know whether they are corrupted and cannot be merged or discarded.
did you do the above test?
ASKER
Sorry, just also noticed:
(also try logging in as root, to try vmkfstools!)
Root is denied direct login; I have to SSH in as another user and then su to root, as I don't have physical access to this server.
Cheers
Andrew
ASKER
Yes, I have done the above. Do you want me to try to restart it?
Sorry, as you obviously know more than me, i am not presuming anything ;)
Cheers
Andrew
can you get a new listing of the folder for me? (with sizes)
ls -al
ASKER
listing
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap # ls -al
drwxr-xr-x 1 root root 5740 Dec 31 02:23 .
drwxr-xr-t 1 root root 1400 Dec 30 07:13 ..
-rw------- 1 root root 30098534400 Dec 30 05:43 2008Leap-000001-delta.vmdk
-rw------- 1 root root 320 Dec 30 05:41 2008Leap-000001.vmdk
-rw------- 1 root root 16986112 Dec 31 01:01 2008Leap-000002-delta.vmdk
-rw------- 1 root root 327 Dec 31 00:59 2008Leap-000002.vmdk
-rw------- 1 root root 208896 Dec 31 02:22 2008Leap-000003-delta.vmdk
-rw------- 1 root root 327 Dec 31 02:22 2008Leap-000003.vmdk
-rw-r--r-- 1 root root 13 Dec 31 02:23 2008Leap-aux.xml
-rw------- 1 root root 107374182400 Dec 31 02:23 2008Leap-flat.vmdk
-rw------- 1 root root 8684 Dec 31 01:00 2008Leap.nvram
-rw------- 1 root root 523 Dec 31 02:23 2008Leap.vmdk
-rw-r--r-- 1 root root 77 Dec 31 02:23 2008Leap.vmsd
-rwxr-xr-x 1 root root 3573 Dec 31 02:22 2008Leap.vmx
-rw-r--r-- 1 root root 263 Dec 30 08:02 2008Leap.vmxf
-rw------- 1 root root 43923165184 Dec 30 05:41 2008Leap_1-000001-delta.vmdk
-rw------- 1 root root 324 Mar 22 2013 2008Leap_1-000001.vmdk
-rw------- 1 root root 17190912 Dec 30 05:46 2008Leap_1-000002-delta.vmdk
-rw------- 1 root root 331 Dec 30 05:45 2008Leap_1-000002.vmdk
-rw------- 1 root root 413696 Dec 31 02:22 2008Leap_1-000003-delta.vmdk
-rw------- 1 root root 331 Dec 31 02:22 2008Leap_1-000003.vmdk
-rw------- 1 root root 214748364800 Nov 11 2012 2008Leap_1-flat.vmdk
-rw------- 1 root root 525 Jul 8 2012 2008Leap_1.vmdk
-rw-r--r-- 1 root root 5416529 Dec 30 05:46 vmmcores-2.gz
-rw-r--r-- 1 root root 4948934 Dec 30 05:54 vmmcores-3.gz
-rw-r--r-- 1 root root 5739430 Dec 30 06:53 vmmcores-4.gz
-rw-r--r-- 1 root root 5664529 Dec 30 07:03 vmmcores-5.gz
-rw-r--r-- 1 root root 5828171 Dec 30 23:59 vmmcores-6.gz
-rw-r--r-- 1 root root 5686604 Dec 31 00:10 vmmcores-7.gz
-rw-r--r-- 1 root root 5733245 Dec 31 01:01 vmmcores.gz
-rw-r--r-- 1 root root 164211 Dec 30 05:46 vmware-12.log
-rw-r--r-- 1 root root 164471 Dec 30 05:55 vmware-13.log
-rw-r--r-- 1 root root 158944 Dec 30 06:53 vmware-14.log
-rw-r--r-- 1 root root 157904 Dec 30 07:03 vmware-15.log
-rw-r--r-- 1 root root 163543 Dec 30 23:59 vmware-16.log
-rw-r--r-- 1 root root 164108 Dec 31 00:10 vmware-17.log
-rw-r--r-- 1 root root 164443 Dec 31 01:01 vmware.log
-rw------- 1 root root 52428800 Jul 8 2012 vmx-2008Leap-3315273081-2.vswp
-r-------- 1 root root 5042176 Dec 30 23:59 vmx-zdump.001
-r-------- 1 root root 4980736 Dec 31 00:10 vmx-zdump.002
-r-------- 1 root root 5001216 Dec 31 01:01 vmx-zdump.003
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap #
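To see at a glance how much data is sitting in the snapshot deltas, the listing above can be piped through a short awk filter. This is only a sketch: the function name sum_deltas is hypothetical, and it assumes the "ls -al" layout shown here, with the size in the fifth column and the file name last.

```shell
# sum_deltas : read `ls -al` output on stdin, print each *-delta.vmdk with
# its size in bytes, then the combined total.
sum_deltas() {
    awk '$NF ~ /-delta\.vmdk$/ { print $NF, $5; sum += $5 }
         END { printf "total %.0f\n", sum }'
}

# Example: ls -al | sum_deltas
```

Fed the listing above, the total comes to 74056499200 bytes (about 74 GB), nearly all of it in the two 000001 deltas.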
did you delete the snapshot?
because it's created the third.....
then select DELETE ALL!
ASKER
if screen shot is easier
vm1.JPG
ASKER
Selecting DELETE ALL should have removed all the snapshots. When you used Take Snapshot, did it appear in the list?
Can you also confirm: if you look at the VM settings (right-click, Edit VM) and check the disks, you'll see whether the VM is running on the snapshot disks, because it will be using 000003 etc.
can you confirm?
also what size are these disks?
ASKER
Can you power on the VM now?
ASKER
As expected. same problem. see screen shot.
Redo log corrupted.
Cheers
Andrew
In preparation i have started moving the backup recovery from the NAS to the datastore2 ;)
vm5.JPG
ASKER
If it matters (I don't think it does, but hey, I could be wrong): when monitoring the console, it does show the usual 2008 R2 startup screens up to the point of "loading windows", then dies.
Cheers
how very odd, it usually will NOT add another snapshot if the chain is incorrect or corrupt.
try at the console
vim-cmd vmsvc/snapshot.removeall 2 (this is a consolidation task!)
this will try and remove and merge all the snapshots, but if there is corruption it will fail.
ASKER
After the error above, when I do a "Power off" of the VM it hangs at 95% and then comes up with:
The attempted operation cannot be performed in the current state (Powered off).
Cheers
vm6.JPG
ASKER
try at the console
vim-cmd vmsvc/snapshot.removeall 2 (this is a consolidation task!)
It looks like it asks a question but doesn't pause for an answer.
If I then do an ls -al, nothing has changed.
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap # vim-cmd vmsvc/snapshot.removeall 2
Remove All Snapshots:
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap #
ASKER CERTIFIED SOLUTION
ASKER
hehe....
Kinda thought that was coming at some point ;)
On a side note, is there any great benefit in upgrading my version of ESXi 5.0?
If so, do you have a good guide on how to?
Cheers
Andrew
You should really be on the latest version of 5.0 (U3) for security reasons, e.g. 5.0 to 5.0 U3.
It's done similarly to this:-
HOW TO: Upgrade from VMware vSphere Hypervisor ESXi 5.1 to VMware vSphere Hypervisor ESXi 5.5 for FREE
As for whether you should be on 5.1 or 5.5, that depends on your requirements....
HOW TO: What's New in VMware vSphere Hypervisor 5.5 (ESXi 5.5)
All the Best, Happy New Year, must get some Zzzzzzs now.
ASKER
No Problem, will have a read.
Thanks for all your help, now go get some sleep ;)
Hope you had a merry Christmas, and have a Happy New Year.
Cheers
Andrew
Australia.
It's possible damage or corruption has already occurred to the snapshot disk, and you would have to discard this delta to start the VM, resulting in a corrupt or outdated VM.
Can you get me a list of the current files in the folder? I can then work with you to see if we can get the VM started. Please do not try to fiddle, because you could cause more damage.
HOW TO: VMware Snapshots :- Be Patient
Also, can you tell me which events happened before this?
Datastore out of space?
Host Server restart?
Host Server crash?
(I'm on GMT UK Time, so just about to get some Zzzzzs!). But I'll hang in here for 30 mins.
(also try logging in as root, to try vmkfstools!)