projects

ESXi 5.5 - Cannot boot any vms after crash

After a crash on local RAID5 storage (which it turned out is a known problem on this IBM), the vms will no longer boot. All are acting as if there is no storage to boot from.

This leads me to believe that the vms are listed in vsphere but that the storage has in fact become inaccessible.
Linux VMs show only GRUB, or the F2 VMware BIOS screen pops up, goes away, pops up, goes away, and nothing else happens.

I cannot export because esxi says it cannot find the storage. I am able to copy the files from the datastore onto remote storage but the copies have the same problem once added back into inventory and trying to start them.

I checked the datastore using voma and some other commands and see 'Found Stale Lock' on the storage but then the report also says Total Errors Found: 0.

I'm not sure what caused the locks but at one point, I ran the same command and those were gone.  
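
For reference, a VOMA metadata check is normally run from the ESXi shell against the partition that backs the datastore, with the VMs on it powered off. The sketch below wraps that call; `check_vmfs` and the log path are illustrative assumptions, and the example device path is the one reported later in this thread. Check mode only reads the metadata and reports what it finds.

```shell
# Run a read-only VOMA metadata check on the partition backing a VMFS datastore.
# check_vmfs and the default log path are illustrative names, not VMware tooling.
check_vmfs() {
  device=$1                       # e.g. /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0:3
  log=${2:-/tmp/voma-check.log}   # assumed location for the saved report
  # -m vmfs: VMFS module, -f check: report only, -d: device:partition, -s: save output
  voma -m vmfs -f check -d "$device" -s "$log"
}
```

Usage would be `check_vmfs /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0:3`, after which the saved report can be compared between runs to see whether the stale-lock messages persist.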

I'm at a total loss on how to fix this problem and sure hope someone can offer some insight so I can get these VMs back up.

esxi 5.5
server; ibm 3655 with local SAS storage
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Firstly, has there been any corruption to the RAID volume that supports the datastore (VMFS) where the VMs are stored?

A few questions...

1. Where is ESXi installed ? Same disks RAID array as VMFS storage ?

2. Does ESXi boot?

3. Are VMs registered in the Inventory ?
projects

ASKER

1;
VMFS:  /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0:3

ls -la /vmfs/volumes/
 datastore1 -> 555d93fc-101c4070-ff0e-00145e5b014a

Strangely, df leads to this error, which it didn't earlier.
# df
Traceback (most recent call last):
  File "/bin/df", line 101, in <module>
    sys.exit(main(sys.argv))
  File "/bin/df", line 55, in main
    o = eval(output)
  File "<string>", line 1
    Errors:
          ^
SyntaxError: invalid syntax

2: Yes, esxi boots
3: yes, all VMs are registered and showing.

As mentioned, I can see the vms, I can see their data files in the storage but no matter local or copies, none boot.
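
Regarding the `df` traceback above: it comes from ESXi's `/bin/df`, which the trace shows is a Python script that evaluates the output of another command, and here that output began with `Errors:` instead of the expected data. The same information can usually be queried more directly from the ESXi shell, which may also surface the underlying error text. A minimal sketch, where `datastore_summary` is an illustrative name and `datastore1` is taken from the listing above:

```shell
# Query datastore state without going through the /bin/df wrapper.
# datastore_summary is an illustrative helper name.
datastore_summary() {
  volume=${1:-/vmfs/volumes/datastore1}
  # Lists every mounted filesystem with its UUID, type, size and free space.
  esxcli storage filesystem list
  # Prints VMFS version, capacity and the backing partition for one volume.
  vmkfstools -P -h "$volume"
}
```

If either command reports errors for the volume, that is a further sign the VMFS metadata itself is affected and not just the files stored on it.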
This points to corruption in the VMFS datastore, caused by the RAID or storage controller.

If all VMs are showing the same symptoms, the virtual disks are corrupted or data is missing.

I would suggest

1. Restore VMs from backup, after carefully looking at the server.

2. If there are no backups, consult a Data Recovery Specialist ASAP, and turn off the server to prevent further damage.

We can recommend Kroll Ontrack

http://www.krollontrack.com/
No support on this so we're on our own. Backups are too old since they had not done any during a storage migration.

Might you know if there is a way of doing the following?

For example, copying all of the files to another storage, already did that.
Replacing only the files which cause the vm to boot, keeping the main data files? Any chance of doing something like that?
before you say no... we've nothing to lose at this point.
how are your VMs split ?

the data for the VM is contained in the VMDK, that is the only file you need.

but if it's not intact and corrupted the data is gone.

the VMDK can be added to another VM, or it can be mounted in Windows.
My thought is, import a working backup.
Then, shut it down, make a copy of what ever file I'm going to change, then copy the most current file in place.
See if it'll boot. If everything looks intact, then make a new backup asap.
If it doesn't work out, shut it down, overwrite the new file with the old backup and fire it back up.

This is the directory structure for this vm for example, on datastore1. In this case, the vm has two drives.
I've nothing to lose by copying which ever file I need.

zim-192-63_C624.vmx 2.87 KB Virtual Machine [datastore1] zim-192-63_C624 8/7/2015 12:46:53 PM
zim-192-63_C624.vmdk 12,582,910.00 KB Virtual Disk [datastore1] zim-192-63_C624 6/16/2015 4:28:55 PM
zim-192-63_C624_1.vmdk 16,777,220.00 KB Virtual Disk [datastore1] zim-192-63_C624 6/16/2015 4:29:05 PM
zim-192-63_C624.nvram 8.48 KB Non-volatile memory file [datastore1] zim-192-63_C624 8/7/2015 12:46:53 PM
vmware-8.log 28.74 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 10:40:41 PM
vmware-7.log 198.36 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 7:11:44 PM
vmware-10.log 1,767.88 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/7/2015 12:09:16 PM
zim-192-63_C624-000001.vmdk 1,262,592.00 KB 12,582,910.00 KB Virtual Disk [datastore1] zim-192-63_C624 8/7/2015 12:13:26 PM
vmware.log 240.38 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/7/2015 12:46:53 PM
zim-192-63_C624_1-000001.vmdk 1,655,808.00 KB 16,777,220.00 KB Virtual Disk [datastore1] zim-192-63_C624 8/7/2015 12:13:26 PM
zim-192-63_C624-Snapshot1.vmsn 4,195,572.00 KB Snapshot file [datastore1] zim-192-63_C624 7/29/2015 11:12:25 AM
vmware-6.log 125.12 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 12:19:39 PM
vmware-5.log 109.34 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/5/2015 2:29:23 PM
vmware-9.log 762.22 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 11:17:02 PM
zim-192-63_C624.vmxf 0.36 KB File [datastore1] zim-192-63_C624 8/7/2015 12:13:26 PM
zim-192-63_C624.vmsd 0.48 KB File [datastore1] zim-192-63_C624 7/29/2015 11:09:21 AM
all you need is the VMDKs, they contain the data.

you can re-create the configuration file and add the disks.
Sorry?

>you can re-create the configuration file and add the disks.

Can you explain this a little more please? I'm not sure I understand.
Select New Virtual Machine, and create a new VM, but then add the existing disks you have.

Your data is in the VMDKs; the other files are just configuration files. When VMs are registered in the inventory and power on but do not boot, the cause is almost always corrupted virtual disks (VMDKs). If the configuration files were the problem, the VMs would not register or start at all.
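
One caveat when attaching existing disks: the file listing earlier in the thread shows `-000001.vmdk` files and a `.vmsn`, so this VM appears to have been running on a snapshot. In that case the base disk only holds data up to the snapshot, and the newer writes live in the `-000001` delta, which points back to its parent through its descriptor. From the ESXi shell, where each disk is a small text descriptor plus a `-flat` or `-delta` extent, the links can be inspected with a sketch like the following. The function name is an assumption; `CID`, `parentCID` and `parentFileNameHint` are the standard descriptor fields.

```shell
# Print the snapshot-chain fields from every VMDK descriptor in a VM directory.
# show_vmdk_links is an illustrative helper name.
show_vmdk_links() {
  for f in "$1"/*.vmdk; do
    [ -f "$f" ] || continue
    case "$f" in
      # Skip the data extents and change-tracking files; only descriptors are text.
      *-flat.vmdk|*-delta.vmdk|*-ctk.vmdk) continue ;;
    esac
    echo "== $f"
    grep -E '^(CID|parentCID|parentFileNameHint)=' "$f" ||
      echo "   (no descriptor fields found; not a text descriptor?)"
  done
}
```

A child whose `parentCID` does not match its parent's `CID` suggests a broken chain. Attaching the `-000001.vmdk` descriptor, not the base disk, is what brings the post-snapshot data along.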
For example, in my old backup, I've got;

05/26/2015  11:49 PM     4,499,921,408 zim-192-63_C624-disk1.vmdk
05/27/2015  12:49 AM     5,745,056,256 zim-192-63_C624-disk2.vmdk
05/27/2015  12:49 AM               221 zim-192-63_C624.mf
05/27/2015  12:49 AM             8,936 zim-192-63_C624.ovf

I was thinking of copying the most current vmdk files into this directory, then importing to see if it would boot and have the data. If it doesn't work, I'll copy the old vmdk files back.
if those files are from an EXPORT, you will need to IMPORT them back and not just copy.

see my EE Article here

Part 10: HOW TO: Backup (Export) and Restore (Import) virtual machines to VMware vSphere Hypervisor 5.1 for FREE

also if your VMFS datastore is corrupted, you may not be able to write to it correctly, or import them.

e.g. it's not normal for df to crash!
Since they don't boot, there is little chance this will work then. I'll try, just for the heck of it, nothing to lose :)
PS, the files are not from an export, they were copied directly from the datastore to another location.
Therefore, if I understand, you're saying...

Copying them to an exported vm, then importing is not going to work.
Instead, create a new vm, stop it, copy the saved vmdk file over the new vmdk.
I just noticed two things with the files you listed....


There is an OVF file and an MF file, and the VMDKs are not what I would expect on a datastore: there should be two files per disk, and these files are compressed compared with your originals.

They look just like an Export!

So to me, they look like you need to Import them, certainly as they are, they will not work....

I'm about to disappear to bed here in the UK, as time is 1.30am, I can continue with you over the weekend, I'll be around....
I see two.

zim-192-63_C624.vmx 2.87 KB Virtual Machine [datastore1] zim-192-63_C624 8/7/2015 12:46:53 PM
zim-192-63_C624.vmdk 12,582,910.00 KB Virtual Disk [datastore1] zim-192-63_C624 6/16/2015 4:28:55 PM
zim-192-63_C624_1.vmdk 16,777,220.00 KB Virtual Disk [datastore1] zim-192-63_C624 6/16/2015 4:29:05 PM
zim-192-63_C624.nvram 8.48 KB Non-volatile memory file [datastore1] zim-192-63_C624 8/7/2015 12:46:53 PM
vmware-8.log 28.74 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 10:40:41 PM
vmware-7.log 198.36 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 7:11:44 PM
vmware-10.log 1,767.88 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/7/2015 12:09:16 PM
zim-192-63_C624-000001.vmdk 1,262,592.00 KB 12,582,910.00 KB Virtual Disk [datastore1] zim-192-63_C624 8/7/2015 12:13:26 PM
vmware.log 240.38 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/7/2015 12:46:53 PM
zim-192-63_C624_1-000001.vmdk 1,655,808.00 KB 16,777,220.00 KB Virtual Disk [datastore1] zim-192-63_C624 8/7/2015 12:13:26 PM
zim-192-63_C624-Snapshot1.vmsn 4,195,572.00 KB Snapshot file [datastore1] zim-192-63_C624 7/29/2015 11:12:25 AM
vmware-6.log 125.12 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 12:19:39 PM
vmware-5.log 109.34 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/5/2015 2:29:23 PM
vmware-9.log 762.22 KB Virtual Machine log file [datastore1] zim-192-63_C624 8/6/2015 11:17:02 PM
zim-192-63_C624.vmxf 0.36 KB File [datastore1] zim-192-63_C624 8/7/2015 12:13:26 PM
zim-192-63_C624.vmsd 0.48 KB

I tried as you suggested, creating a new vm, then copying the vmdk over the new one. Same deal, boot problems.
Can the vmdk files be opened and maybe some of their contents saved, assuming they aren't 100% trashed?
Can't get it to work on windows but the files are accessible via Linux nfs shares so am trying from a centos box.

# kpartx -av dev-192-58-64bit-flat.vmdk; mount -o /dev/mapper/loop0p1 /mnt/vmdk/
mount: can't find /mnt/vmdk in /etc/fstab or /etc/mtab

# ls /dev/mapper/
control  VolGroup00-LogVol00  VolGroup00-LogVol01

Any idea how I can mount the image? Maybe I can get at some of the data this way.
It isn't 100% clear to me but I think that kpartx is when you have more than one partition in the vmdk. Since it's a Linux os, it definitely has multiples.
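
A note on the error above: `mount -o` expects an option list as its next argument, so in `mount -o /dev/mapper/loop0p1 /mnt/vmdk/` the device name is consumed as the option string and `mount` is left with a single argument, which it then looks up in `/etc/fstab`. That produces the "can't find /mnt/vmdk in /etc/fstab" message regardless of the state of the image. A minimal sketch of the intended sequence, assuming the `-flat.vmdk` is a raw thick disk and that `kpartx` reports its mappings in the usual `add map loopNpM ...` form; the function name is hypothetical:

```shell
# Map the partitions of a raw disk image and mount the first one read-only.
# mount_first_partition is an illustrative helper name.
mount_first_partition() {
  img=$1
  mnt=$2
  # kpartx prints one "add map loopNpM ..." line per partition it maps.
  part=$(kpartx -av "$img" | awk '$1 == "add" && $2 == "map" { print $3; exit }')
  if [ -z "$part" ]; then
    echo "kpartx created no partition mappings for $img" >&2
    return 1
  fi
  # Options follow -o; the device and the mount point come after them.
  mount -o ro "/dev/mapper/$part" "$mnt"
}
```

Called as `mount_first_partition dev-192-58-64bit-flat.vmdk /mnt/vmdk`. If `kpartx` creates no mappings at all, the partition table in the image is unreadable, which is a more meaningful sign of damage than the `mount` message. On a typical CentOS layout the first partition is `/boot`, and the second is an LVM physical volume that has to be activated with `vgchange -ay` instead of being mounted directly.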
you should be able to mount the partitions to an existing mount point
Sorry? Not sure what that means. I'm trying to find the command/method of mounting the vmdk but have yet to find enough info or an example which works.

I'm on a centos 6.x box with the vmdk file copied there.
your error suggests you do not have the folder /mnt/vmware to mount to.
The folder is called /mnt/vmdk and yup, it exists.
You're saying this should simply mount? I've shown the errors, I can't get past them.
no I was simply responding to the error, which says it cannot find /mnt/vmdk
I understand that you were saying the mount point was missing and I was confirming that it was not. That is not the cause of the error since the mount point exists. I created it before trying to mount the file. Something else is causing the error which is why I posted it because so far, I've not found the reason.
corrupted VMDK, there is no error checking on the VMDK for the mount to succeed or fail.
That's what I figured sadly.
I've read lots of posts from people asking if there is a way of running a disk checker on the files but never read any definitive answers.

I've often been able to access messed up hardware disks but virtual files, I don't know if there are any recovery methods/programs.
That part that gave me hope is that all of the vms log, xml and other files can be read. They aren't trashed so I figured the vmdk files have a slight chance of being recoverable.
But the VMDK is also much larger than those files, so it has a far higher probability of being damaged, as it spans more clusters/sectors of the VMFS datastore and the underlying hardware RAID storage.
Yes, very true.
The thing is, here is the last log showing that the vm seemed to have been shut down gracefully, giving me hope that what ever happened, the vms seem to go down cleanly.

2015-08-07T23:51:37.775Z| vmx| I120: WORKER: asyncOps=2 maxActiveOps=1 maxPending=0 maxCompleted=1
2015-08-07T23:51:37.842Z| vmx| I120: Vix: [81477 mainDispatch.c:3870]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2015-08-07T23:51:37.842Z| vmx| I120: Vix: [81477 mainDispatch.c:3889]: VMAutomation: Ignoring ReportPowerOpFinished because the VMX is shutting down.
2015-08-07T23:51:37.925Z| vmx| I120: Vix: [81477 mainDispatch.c:3870]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2015-08-07T23:51:37.926Z| vmx| I120: Vix: [81477 mainDispatch.c:3889]: VMAutomation: Ignoring ReportPowerOpFinished because the VMX is shutting down.
2015-08-07T23:51:37.926Z| vmx| I120: Transitioned vmx/execState/val to poweredOff
2015-08-07T23:51:37.926Z| vmx| I120: VMX idle exit
2015-08-07T23:51:37.926Z| vmx| I120: VMIOP: Exit
2015-08-07T23:51:37.954Z| vmx| I120: Vix: [81477 mainDispatch.c:861]: VMAutomation_LateShutdown()
2015-08-07T23:51:37.954Z| vmx| I120: Vix: [81477 mainDispatch.c:811]: VMAutomationCloseListenerSocket. Closing listener socket.
2015-08-07T23:51:37.961Z| vmx| I120: Flushing VMX VMDB connections
2015-08-07T23:51:37.962Z| vmx| I120: VmdbDbRemoveCnx: Removing Cnx from Db for '/db/connection/#1/'
2015-08-07T23:51:37.962Z| vmx| I120: VmdbCnxDisconnect: Disconnect: closed pipe for pub cnx '/db/connection/#1/' (0)
2015-08-07T23:51:37.968Z| vmx| I120: VMX exit (0).
2015-08-07T23:51:37.968Z| vmx| I120: AIOMGR-S : stat o=1 r=3 w=0 i=0 br=49152 bw=0
2015-08-07T23:51:37.968Z| vmx| I120: OBJLIB-LIB : ObjLib cleanup done.
2015-08-07T23:51:37.968Z| vmx| W110: VMX has left the building: 0.

Trying the vmware mount tool

C:\Program Files\VMware\VMware DiskMount Utility>vmware-mount.exe Z: dev58.vmdk
Unable to mount the virtual disk.  The disk may be in use by a virtual
machine or mounted under another drive letter.  If not, verify that the
disk is a virtual disk file, and that the disk file has not been corrupted.
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
This solution is only available to members.
Thanks. I wanted to make darn sure before having to agree.
Since they are all corrupted, I'll try some other tools since I have nothing to lose.
It's not over yet.
I found that you can log into the esxi host and run some local commands against the files.

http://blog.asiantuntijakaveri.fi/2012/01/fixing-broken-vmware-vsphere-5-vmdk.html

# vmkfstools --fix check dev-192-58-64bit.vmdk
Disk is error free

# vmkfstools -i dev-192-58-64bit.vmdk new-disk.vmdk
Destination disk format: VMFS zeroedthick
Cloning disk 'dev-192-58-64bit.vmdk'...
Clone: 100% done.

I then removed the first virtual drive and added this one. Interestingly, this new, supposedly zero error file won't boot either, only shows GRUB at the boot.

so, next I'm making a copy of this working file over to another esxi host and will see if I can mount it as a second drive. I'll first add it into the vm  inventory, boot the vm and see if I can deal with it as any other drive.  Will update.
the fact that you can clone it, is odd, because it would usually fail, if there was an error in the vmdk!

the vmdk may be error free, but are the contents inside correct, as they should be?

e.g. can you create a new Linux virtual machine, and add these disks, and mount and check the partitions ?
That's exactly what I'm trying to find out. If I can find a way of accessing the file, then I can find out how its contents are. However, there is something about every vmdk that is messed up, so I've not been able to access them.

The next thing I tried was to take this supposedly good copy of the file above and trying to add it as a second disk to a vm. I first tried simply pointing to the repaired, existing file but that didn't work. I then created a new disk, saved it, then copied my old file over the new one but that too didn't work, the vm complaining and not booting.

I'll next try what you mentioned, creating a new vm using the repaired disk.
So this is odd.
I used a different vm this time and added one of the vmdk's from the crashed host.
This time, the vm boots and I can see the second disk (sdb, sdb1 and sdb2) which is positive but trying to mount it leads to;

# mount -t ext3 /dev/sdb2 /mnt
mount: /dev/sdb2 already mounted or /mnt busy
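
`already mounted or /mnt busy` is also what `mount` reports when the partition is not a plain filesystem but an LVM physical volume that device-mapper has already claimed, which is a common layout for the second partition on CentOS installs. A quick way to tell, sketched with a hypothetical helper around `blkid`:

```shell
# Report what a partition contains before trying to mount it.
# describe_partition is an illustrative helper name.
describe_partition() {
  dev=$1
  # blkid prints only the TYPE value; empty output means no known signature.
  fstype=$(blkid -o value -s TYPE "$dev" || true)
  case "$fstype" in
    LVM2_member)
      echo "$dev is an LVM physical volume: activate its volume group and mount a logical volume instead" ;;
    "")
      echo "$dev: no recognisable signature" ;;
    *)
      echo "$dev: filesystem type $fstype" ;;
  esac
}
```

For `/dev/sdb2` here, an `LVM2_member` result would mean the data sits inside logical volumes, to be reached with `vgchange -ay` and the `/dev/<vg>/<lv>` paths.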
Some form of corruption. Have you checked with the OS tools?

Linux tools in the OS will have a much better chance of fixing the corruption if possible.
So far, the volume looks good.

# lvdisplay /dev/VolGroup00
  --- Logical volume ---
  LV Path                /dev/VolGroup00/LogVol00
  LV Name                LogVol00
  VG Name                VolGroup00
  LV UUID                v0Y8nF-oujb-hBf2-frzZ-Jtxh-yP2U-YmE8wM
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                7.12 GiB
  Current LE             228
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Logical volume ---
  LV Path                /dev/VolGroup00/LogVol01
  LV Name                LogVol01
  VG Name                VolGroup00
  LV UUID                rpj5Nk-4x5A-pZga-FHb3-Xslx-Wchl-ISONQx
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                768.00 MiB
  Current LE             24
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

As I understand it, all I need to do now is to mount a LVM volume? But, I can't seem to get past this now.

# mount /dev/VolGroup01/LogVol00 /mnt
mount: you must specify the filesystem type

# vgchange -ay VolGroup00
  2 logical volume(s) in volume group "VolGroup00" now active

# lvs
  LV       VG         Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  LogVol00 VolGroup00 -wi-a-----   7.12g
  LogVol01 VolGroup00 -wi-a----- 768.00m
  lv_root  vg_stor65  -wi-ao----   3.57g
  lv_swap  vg_stor65  -wi-ao----   3.94g

# mount /dev/VolGroup00/LogVol00 /mnt -o ro,user
mount: wrong fs type, bad option, bad superblock on /dev/mapper/VolGroup00-LogVol00,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

EXT3-fs (sdb1): error: no journal found
EXT3-fs (dm-2): error: no journal found
EXT3-fs (dm-2): recovery required on readonly filesystem
EXT3-fs (dm-2): write access will be enabled during recovery
EXT3-fs (dm-2): error: no journal found
EXT3-fs (dm-2): recovery required on readonly filesystem
EXT3-fs (dm-2): write access will be enabled during recovery
EXT3-fs (dm-2): error: no journal found

# tune2fs -j /dev/VolGroup00/LogVol00
tune2fs 1.41.12 (17-May-2010)
The filesystem already has a journal.

# fsck /dev/VolGroup00/LogVol00
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Superblock has an invalid journal (inode 8).
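
When `e2fsck` reports `Superblock has an invalid journal (inode 8)`, it normally goes on to offer clearing the journal and then checks the filesystem as ext2. Because that rewrites metadata, one cautious approach is to image the logical volume first and let `e2fsck` work on the image. The sketch below assumes there is enough free space for the image; the function name and paths are illustrative.

```shell
# Copy a block device to an image file and run a forced e2fsck on the copy.
# check_on_copy is an illustrative helper name.
check_on_copy() {
  src=$1
  img=$2
  dd if="$src" of="$img" bs=1M || return 1
  # -f: check even if marked clean, -y: answer yes to every repair prompt.
  rc=0
  e2fsck -f -y "$img" || rc=$?
  # e2fsck exit codes below 4 mean "clean" or "errors corrected".
  if [ "$rc" -ge 4 ]; then
    echo "e2fsck left errors uncorrected (exit $rc)" >&2
    return 1
  fi
  echo "$img checked (e2fsck exit $rc)"
}
```

For example, `check_on_copy /dev/VolGroup00/LogVol00 /data/LogVol00.img`, followed by `mount -o loop,ro /data/LogVol00.img /mnt` to look at whatever was recovered. The original logical volume stays untouched if the repair makes things worse.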
Finally rebuilt the drive and found nothing useful on it. Too many directories and stuff to look through but it was a positive.

I'll post another question about where I'm at now and will be using one of the important vmdk files.