Andrew Davis (Australia) asked:

corrupt redo log. VMware ESXi

Host: VMware ESXi, 5.0.0, 469512
Client shows error while booting up:
The redo log of {Servername}-000002.vmdk is corrupted. Power off the virtual machine, if the problem still persists, discard the redo log.

Actions:
Just about every article I have found on this says to consolidate the snapshots, so I deleted the snapshots (vSphere Client, Snapshot Manager), which then told me that it wanted to consolidate the drives. When I tell it to consolidate I get:
Consolidate virtual machine disk files
A general system error occurred: Input/output error

Following http://dunfraggin.blogspot.com.au/2012/09/virtual-machine-disk-consolidation.html I suspect there may be locked files. The server reports that it is powered off, but it did come up with an error during the Power Off task.

The article above leads me to http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=10051
If I look at the vmware.log file I am not seeing anything unexpected at the end of the file.
However, if I try to use vmkfstools I get the following:

login as: itsupport
Using keyboard-interactive authentication.
Password:
The time and date of this login have been sent to the system logs.

VMware offers supported, powerful system administration tools.  Please
see www.vmware.com/go/sysadmintools for details.

The ESXi Shell can be disabled by an administrative user. See the
vSphere Security documentation for more information.
~ $ su
Password:
~ # vmkfstools
sh: vmkfstools: not found
~ #

So not really sure where to go from here ;)

Additional notes:
If this helps at all (Vmid 2 is the one in question)

~ # vim-cmd vmsvc/getallvms
Vmid      Name                     File                        Guest OS          Version   Annotation
1      SBS2011      [datastore1] SBS2011/SBS2011.vmx     windows7Server64Guest   vmx-08
2      2008Leap     [datastore2] 2008Leap/2008Leap.vmx   windows7Server64Guest   vmx-08
3      2008Leap-2   [NAS01] 2008Leap-2/2008Leap-2.vmx    windows7Server64Guest   vmx-08
~ # vim-cmd vmsvc/power.getstate 2
Retrieved runtime info
Powered off
~ #
There is only the one host, so no other host could possibly have a lock.
Have rebooted the host:- no change.
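
The Vmid lookup above can be scripted rather than read off the table by eye. This is only a minimal sketch: `vmid_for_name` is a hypothetical helper (not part of ESXi), and it assumes VM names contain no spaces, as is the case for the three VMs listed here.

```shell
#!/bin/sh
# Sketch: resolve a VM name to its Vmid from `vim-cmd vmsvc/getallvms` output.
# vmid_for_name is a hypothetical helper name; it assumes the VM name is a
# single word, so that it is always the second whitespace-separated column.

vmid_for_name() {
    # $1 = VM name; the getallvms table is read from stdin.
    # NR > 1 skips the header row.
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# On the host (not run here):
#   id=$(vim-cmd vmsvc/getallvms | vmid_for_name 2008Leap)
#   vim-cmd vmsvc/power.getstate "$id"
```

With the table shown above, the lookup for 2008Leap gives Vmid 2, which matches the id used in the power.getstate call.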

Cheers
Andrew
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

The redo log the error refers to is the 2nd snapshot delta disk. Snapshot deltas do become corrupted, and when the chain is corrupted, the VM will not start.

e.g. parent disk (vmdk) + 1st snapshot disk + 2nd snapshot disk = VM VMDK

The chain above must be intact. If the host server crashed or restarted, or the datastore ran out of space, the 2nd snapshot disk can become corrupted. ALL the information in the 2nd snapshot disk then needs to be discarded to get the VM started, which means data loss.
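
The parent/child links of such a chain are recorded in the small text descriptor files (the .vmdk files of a few hundred bytes, not the -delta or -flat extents). A minimal sketch of walking those links read-only follows. It assumes each child descriptor carries a `parentFileNameHint="..."` line naming a descriptor in the same directory, plus `CID=` and `parentCID=` lines; `field_of` and `walk_chain` are hypothetical helper names. A consistent descriptor chain does not prove the delta data itself is readable, so this cannot rule out the corruption reported in the error.

```shell
#!/bin/sh
# Sketch: follow a snapshot chain through the VMDK descriptor files.
# Read-only; it never modifies any file. Helper names are hypothetical.

field_of() {
    # $1 = key, $2 = descriptor file; prints the value with quotes removed.
    sed -n "s/^$1=//p" "$2" | tr -d '"'
}

walk_chain() {
    # $1 = VM directory, $2 = descriptor to start from (the newest snapshot).
    dir=$1
    cur=$2
    hops=0
    if [ ! -f "$dir/$cur" ]; then
        echo "MISSING $cur"
        return 1
    fi
    # The hop limit guards against a descriptor loop.
    while [ "$hops" -lt 32 ]; do
        parent=$(field_of parentFileNameHint "$dir/$cur")
        if [ -z "$parent" ]; then
            echo "BASE $cur"
            return 0
        fi
        if [ ! -f "$dir/$parent" ]; then
            echo "MISSING $parent"
            return 1
        fi
        # A child's parentCID is expected to equal its parent's CID.
        want=$(field_of parentCID "$dir/$cur")
        have=$(field_of CID "$dir/$parent")
        if [ "$want" = "$have" ]; then status=OK; else status=MISMATCH; fi
        echo "$status $cur -> $parent"
        cur=$parent
        hops=$((hops + 1))
    done
    return 1
}

# On the host (not run here), from the VM's folder on the datastore:
#   walk_chain . 2008Leap-000002.vmdk
```

Each output line is one link of the chain, newest first, ending in a BASE line when the parent disk is reached.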

It's possible damage or corruption has already occurred to the snapshot disk, and you would have to discard this delta to start the VM, resulting in a corrupt or outdated VM.

Can you get me a list of the current files in the folder? I can then work with you to see if we can get the VM started. Please do not try to fiddle, because you could cause more damage.

HOW TO: VMware Snapshots :- Be Patient

Also, can you reply with the events which happened before this?

Datastore out of space?

Host Server restart?

Host Server crash?

(I'm on GMT UK Time, so just about to get some Zzzzzs!). But I'll hang in here for 30 mins.

(also try logging in as root, to try vmkfstools!)
Andrew Davis (ASKER)

Thanks for your time ;)
Datastore out of space? No, 50% headroom.
Host Server restart? Kinda. There was a power outage around the time of the crash. UPS should have seen to a clean shutdown.
Files in directory:-
# cd /vmfs/volumes/datastore2/2008Leap
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap # ls
2008Leap-000001-delta.vmdk      2008Leap-flat.vmdk              2008Leap.vmxf                   2008Leap_1-flat.vmdk            vmmcores-5.gz                   vmware-13.log                   vmware.log
2008Leap-000001.vmdk            2008Leap.nvram                  2008Leap_1-000001-delta.vmdk    2008Leap_1.vmdk                 vmmcores-6.gz                   vmware-14.log                   vmx-2008Leap-3315273081-2.vswp
2008Leap-000002-delta.vmdk      2008Leap.vmdk                   2008Leap_1-000001.vmdk          vmmcores-2.gz                   vmmcores-7.gz                   vmware-15.log                   vmx-zdump.001
2008Leap-000002.vmdk            2008Leap.vmsd                   2008Leap_1-000002-delta.vmdk    vmmcores-3.gz                   vmmcores.gz                     vmware-16.log                   vmx-zdump.002
2008Leap-aux.xml                2008Leap.vmx                    2008Leap_1-000002.vmdk          vmmcores-4.gz                   vmware-12.log                   vmware-17.log                   vmx-zdump.003
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap #
Note "2008Leap.vmx" is green in colour.
Cheers
Okay, this is the reason for the failure...

The redo log the error refers to is the 2nd snapshot delta disk. Snapshot deltas do become corrupted, and when the chain is corrupted, the VM will not start.

e.g. parent disk (vmdk) + 1st snapshot disk + 2nd snapshot disk = VM VMDK

The chain above must be intact. If the host server crashed or restarted, or the datastore ran out of space, the 2nd snapshot disk can become corrupted. ALL the information in the 2nd snapshot disk then needs to be discarded to get the VM started, which means data loss.
In preparation for the worst case, I have also:-

1. Copied the server's directory to another datastore (in case I need to get anything back).
2. Created a new VM (2008Leap-2) and done a successful restore to an NFS share on NAS01. I have not connected it to the network yet, as I just wanted to prove the disaster recovery, and I cannot leave it sitting on the NAS (too slow).

(Just saw another post on this question come in so this comment will be out of order)

Cheers
So go with disaster recovery?

I have a ShadowProtect backup that will have all the data, as the office was closed (thank you, holiday period).


Cheers
Andrew
okay, so this is a two disk virtual machine....with disks...

2008Leap-flat.vmdk      
2008Leap-000001.vmdk  
2008Leap-000001-delta.vmdk
2008Leap-000002-delta.vmdk

2008Leap_1-flat.vmdk  
2008Leap_1-000001-delta.vmdk
2008Leap_1-000002-delta.vmdk

both disks have two snapshots.

Can you try the following:-

1. Take a Snapshot of the current VM
2. Wait 60 seconds
3. Delete the Snapshot

Report any errors, and then check the folder to see whether the snapshots (the delta files!) have gone.
With it powered off?
I'm afraid this file

{Servername}-000002.vmdk

is corrupt and would have to be excluded from the VM configuration to get the VM started. How out of date the VM would be depends on how long the machine was running on this snapshot disk, e.g. after 12 days on the snapshot, the VM would be 12 days old.

If you have backups, it is time to restore. Afterwards, monitor snapshots daily and check your VMs are not running on snapshots.
Everything looked fine with no errors.
It did report that it needs consolidation now.
I haven't done the consolidation yet, but it is not showing the deltas gone (as expected, I now have another).

Cheers
Yes, Powered OFF.

Is the VM ON?

Sorry was a stupid question on my behalf;)
The consolidation message appears because it detects snapshots, but it is not intelligent enough to know whether they are corrupted and cannot be merged or discarded.

did you do the above test?
sorry just also noticed
(also try logging in as root, to try vmkfstools!)

Root is denied direct login; I have to SSH in as another user and then su to root, as I don't have physical access to this server.

Cheers
Andrew
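
A "not found" after a plain su may only mean that the directory holding the tool is not on the PATH inherited from the non-root login, rather than that the tool is absent. That is an assumption, not something confirmed in this thread. A minimal sketch for checking it: `find_tool` is a hypothetical helper, and the directories in the usage comment are the usual system locations, not verified on this host.

```shell
#!/bin/sh
# Sketch: look for a command in a list of directories and print its full path.
# find_tool is a hypothetical helper name.

find_tool() {
    # $1 = command name; remaining arguments = directories to search, in order.
    name=$1
    shift
    for d in "$@"; do
        # Executable and not a directory.
        if [ -x "$d/$name" ] && [ ! -d "$d/$name" ]; then
            echo "$d/$name"
            return 0
        fi
    done
    echo "$name not found in: $*" >&2
    return 1
}

# On the host (not run here):
#   echo "$PATH"
#   find_tool vmkfstools /sbin /bin /usr/sbin /usr/bin
```

If a path is printed, the tool can be called by that full path regardless of what PATH contains.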
Yes i have done the above do you want me to try to restart it?

Sorry, as you obviously know more than me, i am not presuming anything ;)

Cheers
Andrew
can you get a new listing of the folder for me? (with sizes)

ls -al
listing
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap # ls -al
drwxr-xr-x    1 root     root               5740 Dec 31 02:23 .
drwxr-xr-t    1 root     root               1400 Dec 30 07:13 ..
-rw-------    1 root     root        30098534400 Dec 30 05:43 2008Leap-000001-delta.vmdk
-rw-------    1 root     root                320 Dec 30 05:41 2008Leap-000001.vmdk
-rw-------    1 root     root           16986112 Dec 31 01:01 2008Leap-000002-delta.vmdk
-rw-------    1 root     root                327 Dec 31 00:59 2008Leap-000002.vmdk
-rw-------    1 root     root             208896 Dec 31 02:22 2008Leap-000003-delta.vmdk
-rw-------    1 root     root                327 Dec 31 02:22 2008Leap-000003.vmdk
-rw-r--r--    1 root     root                 13 Dec 31 02:23 2008Leap-aux.xml
-rw-------    1 root     root       107374182400 Dec 31 02:23 2008Leap-flat.vmdk
-rw-------    1 root     root               8684 Dec 31 01:00 2008Leap.nvram
-rw-------    1 root     root                523 Dec 31 02:23 2008Leap.vmdk
-rw-r--r--    1 root     root                 77 Dec 31 02:23 2008Leap.vmsd
-rwxr-xr-x    1 root     root               3573 Dec 31 02:22 2008Leap.vmx
-rw-r--r--    1 root     root                263 Dec 30 08:02 2008Leap.vmxf
-rw-------    1 root     root        43923165184 Dec 30 05:41 2008Leap_1-000001-delta.vmdk
-rw-------    1 root     root                324 Mar 22  2013 2008Leap_1-000001.vmdk
-rw-------    1 root     root           17190912 Dec 30 05:46 2008Leap_1-000002-delta.vmdk
-rw-------    1 root     root                331 Dec 30 05:45 2008Leap_1-000002.vmdk
-rw-------    1 root     root             413696 Dec 31 02:22 2008Leap_1-000003-delta.vmdk
-rw-------    1 root     root                331 Dec 31 02:22 2008Leap_1-000003.vmdk
-rw-------    1 root     root       214748364800 Nov 11  2012 2008Leap_1-flat.vmdk
-rw-------    1 root     root                525 Jul  8  2012 2008Leap_1.vmdk
-rw-r--r--    1 root     root            5416529 Dec 30 05:46 vmmcores-2.gz
-rw-r--r--    1 root     root            4948934 Dec 30 05:54 vmmcores-3.gz
-rw-r--r--    1 root     root            5739430 Dec 30 06:53 vmmcores-4.gz
-rw-r--r--    1 root     root            5664529 Dec 30 07:03 vmmcores-5.gz
-rw-r--r--    1 root     root            5828171 Dec 30 23:59 vmmcores-6.gz
-rw-r--r--    1 root     root            5686604 Dec 31 00:10 vmmcores-7.gz
-rw-r--r--    1 root     root            5733245 Dec 31 01:01 vmmcores.gz
-rw-r--r--    1 root     root             164211 Dec 30 05:46 vmware-12.log
-rw-r--r--    1 root     root             164471 Dec 30 05:55 vmware-13.log
-rw-r--r--    1 root     root             158944 Dec 30 06:53 vmware-14.log
-rw-r--r--    1 root     root             157904 Dec 30 07:03 vmware-15.log
-rw-r--r--    1 root     root             163543 Dec 30 23:59 vmware-16.log
-rw-r--r--    1 root     root             164108 Dec 31 00:10 vmware-17.log
-rw-r--r--    1 root     root             164443 Dec 31 01:01 vmware.log
-rw-------    1 root     root           52428800 Jul  8  2012 vmx-2008Leap-3315273081-2.vswp
-r--------    1 root     root            5042176 Dec 30 23:59 vmx-zdump.001
-r--------    1 root     root            4980736 Dec 31 00:10 vmx-zdump.002
-r--------    1 root     root            5001216 Dec 31 01:01 vmx-zdump.003
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap #
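
The snapshot delta extents can be picked out of a long listing like this and sized at a glance. A minimal sketch: `delta_sizes` is a hypothetical helper, and it assumes file names without spaces and the column layout shown above (size in the fifth column). The size is only a rough hint of how much was written while the VM ran on that snapshot; it says nothing about whether the delta is readable.

```shell
#!/bin/sh
# Sketch: list snapshot delta extents and their sizes from `ls -al` output.
# delta_sizes is a hypothetical helper name.

delta_sizes() {
    # Reads `ls -al` output on stdin.
    # Prints "<file name> <size in MiB>" for every *-delta.vmdk entry.
    awk '$NF ~ /-delta\.vmdk$/ { printf "%s %.1f\n", $NF, $5 / 1048576 }'
}

# On the host (not run here), from the VM's folder:
#   ls -al | delta_sizes
```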
did you delete the snapshot?

because it's created the third.....

then select DELETE ALL!
if screen shot is easier
vm1.JPG
did you delete the snapshot?

Sure did. as per screen shot. I didnt do the consolidate.
vm2.JPG
Selecting DELETE ALL should have removed all the snapshots. When you used Take Snapshot, did it appear in the list?

Can you also confirm which disks the VM is running on? Look at the VM settings (Right Click, Edit VM) and check the disks; if it is on a snapshot it will be using 000003 etc.

can you confirm?

also what size are these disks?
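
Which descriptor each virtual disk points at is also recorded in the .vmx file, so it can be checked from the shell as well as from the settings dialog. A minimal sketch, assuming the usual `scsi0:0.fileName = "..."` form of .vmx entries; `vmx_disks` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: show which .vmdk descriptor each virtual disk in a .vmx points at.
# vmx_disks is a hypothetical helper name. Read-only.

vmx_disks() {
    # $1 = path to the .vmx file.
    # Prints "<device> <descriptor>" for each fileName entry ending in .vmdk,
    # which skips CD-ROM and other non-disk devices.
    sed -n 's/^\([a-z]*[0-9]*:[0-9]*\)\.fileName = "\(.*\.vmdk\)"$/\1 \2/p' "$1"
}

# On the host (not run here):
#   vmx_disks /vmfs/volumes/datastore2/2008Leap/2008Leap.vmx
```

A descriptor name ending in a six-digit suffix such as -000003.vmdk means that disk is currently running on a snapshot.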
Yes everything appeared correctly. could see the snapshots listed in the manager, then they were gone (after delete)

vm3.jpg shows 388.55 Gig free of 838Gig drive.

vm4.jpg shows settings. it reports that it is running from 000003.vmdk snapshot.

Cheers
Andrew
vm3.JPG
vm4.JPG
As expected. same problem. see screen shot.

Redo log corrupted.


Cheers
Andrew

In preparation i have started moving the backup recovery from the NAS to the datastore2 ;)
vm5.JPG
If it matters (I don't think it does, but hey, I could be wrong), when monitoring the console it does show the usual 2008 R2 startup screens, to the point of "loading windows", then dies.

Cheers
How very odd; it usually will NOT add another snapshot if the chain is incorrect or corrupt.

try at the console

vim-cmd vmsvc/snapshot.removeall 2 (this is a consolidation task!)

this will try and remove and merge all the snapshots, but if there is corruption it will fail.
After the error above, when I do a "Power off" of the VM it hangs at 95% and then comes up with:
The attempted operation cannot be performed in the current state (Powered off).

Cheers
vm6.JPG
try at the console

vim-cmd vmsvc/snapshot.removeall 2 (this is a consolidation task!)
It looks like it asks a question but doesn't pause for an answer.

If I then do an ls -al, nothing has changed.

/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap # vim-cmd vmsvc/snapshot.removeall 2
Remove All Snapshots:
/vmfs/volumes/4ff54135-0181f04a-d355-00215e6eae71/2008Leap #


ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

This solution is only available to members.
hehe....

Kinda thought that was coming at some point ;)

On a side note, is there any great benefit in upgrading my version of ESXi 5.0?
If so, do you have a good guide on how to?

Cheers
Andrew
You should really be on the latest version of 5.0, i.e. 5.0 U3, for security reasons (e.g. 5.0 to 5.0 U3).

It's done in a similar way to this:-

HOW TO: Upgrade from VMware vSphere Hypervisor ESXi 5.1 to VMware vSphere Hypervisor ESXi 5.5 for FREE

As for whether you should be on 5.1 or 5.5, that depends on your requirements....

HOW TO: What's New in VMware vSphere Hypervisor 5.5 (ESXi 5.5)

All the Best, Happy New Year, must get some Zzzzzzs now.
No Problem, will have a read.

Thanks for all your help, now go get some sleep ;)

Hope you had a merry Christmas, and have a Happy New Year.

Cheers
Andrew
Australia.