Solved

VMware ESXi 5.1 - Datastore suddenly missing

Posted on 2015-02-23
Last Modified: 2015-03-04
Hi,
I am running a VMware ESXi server with two RAID controllers, each hosting one datastore (A and B).

I had a few virtual machines on Datastore A and a few on Datastore B. One of my clients complained that a VM was unreachable, so I checked the vSphere Client for any obvious issues.

When I connected to the VM running on Datastore B, it showed a yellow banner saying the VM config file is not accessible. To troubleshoot, I browsed Datastore B and searched for the VM files, but it was completely empty.

I did a datastore Refresh / Rescan, after which the datastore disappeared from the list. The devices list still showed both LocalAdaptecDisk devices, 1 TB (PrimaryRAID) and 3 TB (SecondaryRAID), but only Datastore A (PrimaryRAID) was visible and Datastore B (SecondaryRAID) was missing. After some time, even the second adapter device was gone.

The VMs on Datastore A are fine, and I can still connect to the server over SSH (WinSCP / PuTTY).

Is my data safe, or have I lost it?
Is there a way to rediscover Datastore B without rebooting the server?

Output of ls /dev/disks/:

mpx.vmhba2:C0:T0:L0                     vml.0000000000766d686261323a303a30
mpx.vmhba2:C0:T0:L0:1                   vml.0000000000766d686261323a303a30:1
mpx.vmhba2:C0:T1:L0                     vml.0000000000766d686261323a313a30
mpx.vmhba2:C0:T1:L0:1                   vml.0000000000766d686261323a313a30:1
mpx.vmhba32:C0:T0:L0                    vml.0000000000766d68626133323a303a30
mpx.vmhba32:C0:T0:L0:1                  vml.0000000000766d68626133323a303a30:1
mpx.vmhba32:C0:T0:L0:5                  vml.0000000000766d68626133323a303a30:5
mpx.vmhba32:C0:T0:L0:6                  vml.0000000000766d68626133323a303a30:6
mpx.vmhba32:C0:T0:L0:7                  vml.0000000000766d68626133323a303a30:7
mpx.vmhba32:C0:T0:L0:8                  vml.0000000000766d68626133323a303a30:8
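
A note on reading this listing: the hex string in each vml name appears to be the ASCII encoding of the matching mpx runtime name. The following is a minimal Python sketch of that observation, not a documented VMware format. It assumes the run of leading zeros is padding and that the remainder encodes "adapter:target:lun"; the helper name decode_vml is hypothetical.

```python
def decode_vml(name):
    """Hypothetical helper: decode a vml name into (path, partition).

    Assumption (based only on the names in this listing): after "vml."
    comes zero padding, then hex-encoded ASCII of "adapter:target:lun",
    optionally followed by ":<partition number>".
    """
    body = name[len("vml."):]
    hex_part, _, partition = body.partition(":")
    # ASCII letters never start with a 0 nibble, so stripping leading
    # zeros removes only the padding.
    text = bytes.fromhex(hex_part.lstrip("0")).decode("ascii")
    return text, (int(partition) if partition else None)


# Three names copied from the listing above.
listing = [
    "vml.0000000000766d686261323a303a30",
    "vml.0000000000766d686261323a313a30:1",
    "vml.0000000000766d68626133323a303a30:5",
]

for entry in listing:
    print(entry, "->", decode_vml(entry))
```

Read this way, the listing still contains two targets on vmhba2 (0 and 1), each with a partition 1, plus a separate vmhba32 device with partitions 1 and 5 through 8.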

If I click Add Storage, it shows the disconnected device, but it does not let me remount the missing datastore. If I continue, it may create a new partition and wipe the whole disk.

Hence, I cancelled the Add Storage step.

/vmfs/volumes # ls -al
drwxr-xr-x    1 root     root           512 Feb 23 18:42 .
drwxr-xr-x    1 root     root           512 Jan 18 16:04 ..
drwxr-xr-x    1 root     root             8 Jan  1  1970 343f43eb-cebf069a-2e3a-34165eb1baac
drwxr-xr-t    1 root     root          1680 Jun 27  2014 4dfa32f8-d53e6d8e-541d-001b215e0514
drwxr-xr-x    1 root     root             8 Jan  1  1970 50acdf0a-49686666-09f8-6c626daf4bfc
lrwxr-xr-x    1 root     root            35 Feb 23 18:42 PrimaryRAID1 -> 4dfa32f8-d53e6d8e-541d-001b215e0514
drwxr-xr-x    1 root     root             8 Jan  1  1970 c66e35e1-e51874c0-8f69-58451bad876f

As you can see there is no SecondaryRAID listed at all.
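
The same check can be made mechanically. The sketch below is illustrative only: it parses a pasted copy of the listing above in Python and reports which datastore labels (symlinks) exist and which UUID directories have no label. The function names are hypothetical, and it assumes the ls -al layout shown here, with the entry name in the last column.

```python
LISTING = """\
drwxr-xr-x    1 root     root           512 Feb 23 18:42 .
drwxr-xr-x    1 root     root           512 Jan 18 16:04 ..
drwxr-xr-x    1 root     root             8 Jan  1  1970 343f43eb-cebf069a-2e3a-34165eb1baac
drwxr-xr-t    1 root     root          1680 Jun 27  2014 4dfa32f8-d53e6d8e-541d-001b215e0514
drwxr-xr-x    1 root     root             8 Jan  1  1970 50acdf0a-49686666-09f8-6c626daf4bfc
lrwxr-xr-x    1 root     root            35 Feb 23 18:42 PrimaryRAID1 -> 4dfa32f8-d53e6d8e-541d-001b215e0514
drwxr-xr-x    1 root     root             8 Jan  1  1970 c66e35e1-e51874c0-8f69-58451bad876f
"""


def datastore_labels(ls_output):
    """Map each symlink label to the UUID directory it points at."""
    labels = {}
    for line in ls_output.splitlines():
        if line.startswith("l") and " -> " in line:
            left, target = line.rsplit(" -> ", 1)
            labels[left.split()[-1]] = target.strip()
    return labels


def unlabeled_volumes(ls_output):
    """List UUID directories that no symlink label points at."""
    targets = set(datastore_labels(ls_output).values())
    names = [
        line.split()[-1]
        for line in ls_output.splitlines()
        if line.startswith("d")
    ]
    return [n for n in names if n not in (".", "..") and n not in targets]


print(datastore_labels(LISTING))
print(unlabeled_volumes(LISTING))
```

For this listing the only label is PrimaryRAID1. Three UUID directories have no label, and none of them is linked to a SecondaryRAID name.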

Regards,
Zen
Question by:zen shaw
7 Comments
 
LVL 62

Expert Comment

by:gheist
ID: 40627529
What happened recently?
 

Author Comment

by:zen shaw
ID: 40627967
Hi,
The VM was running fine, and I assume the database engineer was running a large, read-heavy query at that time.

When I spoke to him, he said the server was running short of memory (peaking at 95%) and must have tried to dump memory to disk.

I have not restarted the server yet, because the other RAID controller and its datastore are fine and running live applications. I could see the failed disk, but not the volume/datastore on it. How could a RAID controller crash make the volume go missing?

Do you want any diagnostic information?

Regards,
Zen
 
LVL 62

Accepted Solution

by:gheist (earned 2000 total points)
ID: 40628706
I am afraid you need to power everything off and then back on.
If you still have the chance, set the disk queue depth to something reasonable:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008113
That will prevent the RAID controllers from locking up and reduce total I/O wait for the guests.

Author Comment

by:zen shaw
ID: 40629470
Thanks. I have no problem shutting the server down; at most it would be offline for some time.

But I am worried about the volume / datastore and the VMDKs on it.

I'll reboot tomorrow and post an update. Meanwhile, if you have any other hints or tips, please share them.

Regards,
Zen
 
LVL 62

Expert Comment

by:gheist
ID: 40645113
A power-off with the cable unplugged is a bit stronger than a restart, since modern hardware has 10 service guard processors running all the time.
 

Author Comment

by:zen shaw
ID: 40645834
Update: I rebooted the machine. After the reboot, the datastores on Primary RAID 1 and Secondary RAID 1 both reappeared, but after some time the Secondary RAID 1 datastore disappeared again.

This makes me conclude that the RAID controller is fine, as the Primary RAID does not disappear. So I assume the problem is with the disks used for Secondary RAID 1.

Now, in this situation:
a) How would I recover the .VMDK or the SQL Server data file on it?
b) Since Primary RAID 1 is OK and still running a live server, can I turn the server off for some time and remove the Secondary RAID 1 disks (non-hot-swappable)? If so, I am not sure whether the Adaptec RAID controller would let me do that without deleting the RAID configuration, or whether it has a feature to disable RAID detection for some time.
 
LVL 62

Expert Comment

by:gheist
ID: 40646218
I'd boot an Ubuntu live disk, check the SMART status of the disks, and just pull out the bad one (RAID 1 can handle that).