Solved

need advice - recovering  raid 5 array member disk

Posted on 2013-05-19
Last Modified: 2016-12-08
Hi All -
For everyone who wants to point out the obvious, I'll start with that.  I'm an idiot, I'm a bonehead.  I thought it could never happen to me.  I have no excuses and I'm trying to dig myself out of a hole of my own making.

Status:

At this point I'm waiting to hear from www.krollontrack.com as to the options they can provide.

Equipment:
ESXi hypervisor 4.0 U1  

3ware 9650SE-4LPML controller
RAID 5 - 4 drives, 1 TB each
VMFS3 file system
1 important VM - Windows 2003 SBS, 2 vmdk files (207 GB and 465 GB); ~850 GB of information total on the drive array
All files except the vmdk have been recovered


Situation -
1 drive in the array failed and the array went into recovery mode.  Recovery was not successful: it paused at 6%, 9%, and 62%, in that order.
The array shows the drive as OK, though it moves it to WARNING or DEVICE-ERROR during recovery of the RAID or attempts to copy data.


Datastore and File Directory visible by:
- Mounting the data store in ESXi 5.0
- Mounting datastore using tw_cli 3ware command line tool and Open Source VMFS Driver can see the file structure and was able to download all files except the vmdks
- SMART data says all drives are OK.  The drives are Seagate Barracuda 7200 RPM; I believe the SMART data is suspect


IMHO - 1 drive on the remaining array is suspect.  I believe if the drive can be recovered, the data can be recovered.  There are potentially issues with the RAID controller as well.


Attempted, in order (array rebuilds and file copies through the Open Source VMFS Driver were retried throughout):
- Rebuild of the array from degraded mode via the tw_cli rebuild process, by adding a new 4th drive.
- Recovery of data from within ESXi by copying the vmdk from the datastore. The copy of the vmdk halts. Attempt abandoned.
- Mounting the datastore using the tw_cli 3ware command line tool and the Open Source VMFS Driver: can see the file structure and was able to download all files except the vmdks.
- Recovery of the RAID via RAID Reconstructor.  Automatic settings could not determine the RAID layout; it said manual configuration was needed. Attempt abandoned.
- Repair of the drive with SpinRite. SpinRite would not attempt it; it says the partition size reported by the BIOS is different from the one reported by the drive. Attempt abandoned.
- Clone of the drive via Clonezilla with VMFS support. Clonezilla would not attempt it. Attempt halted.
- Recovery of the VMDK files via DiskInternals VMFS Recovery.  After 6 hours it was still running and showing CPU activity, but the progress bar had halted.
Question by:dpedersen13
9 Comments
 
LVL 47

Accepted Solution

by:
dlethe earned 500 total points
ID: 39178729
Here is the deal.
1.  It isn't that data is suspect on at least some of the surviving disks; it is that the disks are in deep recovery.  But that recovery is working, because the rebuild isn't failing.  So good news.
2. SpinRite absolutely will result in at least partial data corruption in the event it recovers any blocks that were already marked bad.  NEVER run that on disks in a RAID array.
3.  You have made things worse by what you have attempted.
4. Those 'cuda drives aren't enterprise class, which is the root cause. They are consumer disks designed to go into deep recovery on a bad block. An enterprise disk gives up in 1-2 seconds so that the RAID controller can reconstruct the correct data, reallocate the bad block, rewrite it and move on.
5. Those 'cuda drives CAUSED the predicament you are in right now, because deep recovery causes drives to lock up, and the 3ware firmware failed the disk.
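
To make points 4 and 5 concrete, here is a minimal sketch of the timing mismatch being described. It is a toy model, not 3ware firmware logic: the `Drive` class, the `controller_verdict` function, and the 60-second and 8-second figures are assumptions chosen for illustration; only the "1-2 seconds" figure comes from the post above.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    # Longest time the drive may spend retrying one bad sector (assumed value).
    max_recovery_seconds: float


def controller_verdict(drive: Drive, controller_timeout_seconds: float) -> str:
    """Toy model of what a RAID controller does when a drive hits a bad sector."""
    if drive.max_recovery_seconds <= controller_timeout_seconds:
        # The drive reports the read error in time, so the controller can
        # rebuild the block from the other members and rewrite it.
        return "repair block from parity"
    # The drive is still retrying when the controller stops waiting.
    return "drop drive from array"


# Illustrative numbers only, not taken from any data sheet.
consumer = Drive("consumer", max_recovery_seconds=60.0)
enterprise = Drive("enterprise", max_recovery_seconds=2.0)
TIMEOUT = 8.0  # hypothetical controller timeout

for d in (consumer, enterprise):
    print(d.name, "->", controller_verdict(d, TIMEOUT))
```

In this model the disk that keeps retrying is the one that gets failed out of the array, even though it may eventually have returned the data.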

Solution
1. If you don't want this to happen any more, you are just going to have to replace ALL disks with enterprise class, or go to RAID10 (still an evil with these disks, but a lesser one: RAID10 is lower risk than RAID5).

2. Pay a pro to get this data back. You do not have the software to recover from partial rebuilds, and such software is not available retail anyway.  

3. If you are willing to live with partial recovery, and risk 100% data loss due to continued use of the disks, then just copy what you can while it is degraded and prioritize your most important files first.
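
If option 3 is the route taken, the "most important files first" idea could be sketched roughly as below. This is an illustration only: `copy_by_priority` is a hypothetical helper, the priorities are whatever you assign, and it does nothing to reduce the risk of further disk damage described above.

```python
import shutil
from pathlib import Path


def copy_by_priority(files, dest_dir):
    """Copy (priority, path) pairs, lowest priority number first.

    A failed copy is recorded and skipped so one unreadable file does not
    block the rest. Returns (copied, failed) lists of source paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied, failed = [], []
    for _, src in sorted(files, key=lambda item: item[0]):
        src = Path(src)
        try:
            shutil.copy2(src, dest / src.name)
            copied.append(src)
        except OSError:
            # Covers read errors and missing files; move on to the next file.
            failed.append(src)
    return copied, failed
```

The destination should be on a different physical device than the degraded array, so the copy adds no writes to the failing disks.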

Bottom line, I am not going to sugar coat it: you are in over your head and do not have the experience and the software necessary to recover data, or to assess the likelihood that the act of recovering with your plan would cause irrevocable data loss.
 
LVL 21

Expert Comment

by:CompProbSolv
ID: 39178832
I have successfully used Raid Reconstructor from runtime.org:
http://runtime.org/raid.htm

The process is that you install the RAID-5 drives in a non-RAID controller and analyze them there.  You can do them all at once or individually.  It attempts to reconstruct the proper order of drives and stripe size.

If you have enough disk space, the next step is to have the program make an image of each of the drives to your local hard drive.  From those images, the program will allow you to recover data using the organizational information from the first step.

What I think is especially important to your issue is the step of copying the drives individually.  If a second drive is failing (as you suspect), you will find out during the copy process.
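
As a rough illustration of why the per-drive copy reveals a failing disk, the sketch below images a source chunk by chunk and records which chunks could not be read. It is not how Raid Reconstructor works internally; `image_with_error_log` and its parameters are hypothetical, and `src` and `dst` are assumed to be seekable binary file objects.

```python
def image_with_error_log(src, dst, total_size, chunk_size=64 * 1024):
    """Copy total_size bytes from src to dst one chunk at a time.

    Unreadable chunks are zero-filled in dst, so the image keeps the same
    layout as the source, and their starting offsets are returned.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        try:
            src.seek(offset)
            data = src.read(length)
            if len(data) != length:
                raise OSError("short read")
        except OSError:
            # Keep going: note the bad region and pad the image with zeros.
            data = b"\x00" * length
            bad_offsets.append(offset)
        dst.write(data)
        offset += length
    return bad_offsets
```

An empty return value means every chunk was read; a growing list of offsets is the sign that a second drive is failing.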

There is a free trial version and the full program runs about $100.

As always with hard drive problems, be aware that the longer you work on the drive attempting to recover data, the more likely the failure will become worse and be more difficult (=more expensive) for a professional recovery shop to recover your data.

Depending on the value of the data, you may want to shut it all off and have the pros deal with it.  Otherwise, I'd try Raid Reconstructor.
 
LVL 2

Author Closing Comment

by:dpedersen13
ID: 39178844
Thanks for the detail and help on understanding what is happening right now.  Especially the details on the 'cuda drives and the reason for the RAID card lockup.  It makes sense; trying to figure out that root cause was really bothering me.  I've stopped all attempts and I am impatiently awaiting a call back from Ontrack.

In the meantime I'm recovering data files from other sources.  Fortunately most of the data is in week-old backups.

Thanks again,
Dylan
 
LVL 47

Expert Comment

by:dlethe
ID: 39179333
No prob, hope I didn't rub you the wrong way.  If I did, accept my apology.  Some drives are totally unacceptable for RAID5, and those disks are among them. The firmware is effectively incompatible.

In fact, dig up the WD data sheets and they even tell you that those disks are not warrantied for use on RAID5.
 
LVL 47

Expert Comment

by:dlethe
ID: 39179337
P.S. RAID reconstructor would not have helped in this scenario, and would have likely made things worse.
 
LVL 21

Expert Comment

by:CompProbSolv
ID: 39179619
"would have made things worse"
What I suggested should have been read-only.  How would it have made things worse?
 
LVL 47

Expert Comment

by:dlethe
ID: 39179646
When you read an HDD you run the risk of stressing the drive to the point of failure, as well as turning what a controller thinks is an unrecoverable read error into a recovered block.

Depending on the specifics, you could corrupt data by turning a known bad block which is already being handled into a repaired block which would invalidate all data on a stripe.
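
One way to read that concern is in terms of RAID 5 parity arithmetic. The sketch below uses made-up two-byte blocks on a hypothetical 4-disk stripe; `xor_blocks` is an illustrative helper, not part of any controller. It shows that a rebuild is only as good as the surviving blocks: if a survivor hands back changed bytes instead of a read error, the rebuilt block is silently wrong. Whether a particular drive or controller actually behaves this way is not established here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 4-disk RAID 5: three data blocks plus one parity block.
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_blocks([d0, d1, d2])

# The disk holding d2 is missing; rebuild it from the survivors.
rebuilt = xor_blocks([d0, d1, parity])
print(rebuilt == d2)  # True

# Same rebuild, but d1's disk returns changed bytes instead of a read error.
d1_wrong = b"\x33\x45"
rebuilt_wrong = xor_blocks([d0, d1_wrong, parity])
print(rebuilt_wrong == d2)  # False
```

The XOR itself cannot tell good input from bad, which is why a block that stops reporting an error but no longer holds the original bytes is worse than one that keeps failing.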
 
LVL 2

Author Comment

by:dpedersen13
ID: 39180139
Let's hope I didn't do that during the recovery process on the controller.  Disks go out tomorrow.  Thanks again!

Dylan
 
LVL 21

Expert Comment

by:CompProbSolv
ID: 39191287
Thanks for the info.  It is appreciated.

As far as "run the risk of stressing..." I did try to cover that with my comment about "..  the more likely the failure will become worse".

I had not considered the conversion of the unrecoverable read error into a recovered block and am curious about this.  I recognize how the enterprise drives have shorter timeouts for proper use with RAID configurations, but am not aware of differences in reallocation of bad blocks.  If I read your comments correctly, an enterprise drive will not do the reallocation but will rely on the RAID controller to perform that function.  Do I have this correct?

If that is correct, does this mean that an enterprise drive used with a non-RAID controller will never reallocate bad blocks or is it only when instructed not to do so by the RAID controller?

If you had any references I could use to get better educated on this I'd appreciate links to them.


Question has a verified solution.
