Solved

Need advice - recovering a RAID 5 array member disk

Posted on 2013-05-19
Last Modified: 2013-05-23
Hi All -
For everyone who wants to point out the obvious, I'll start with that.  I'm an idiot, I'm a bonehead.  I thought it could never happen to me.  I have no excuses and I'm trying to dig myself out of a hole of my own making.

Situation:

At this point I'm waiting to hear from www.krollontrack.com as to the options they can provide.

Equipment:
ESXi hypervisor 4.0 U1  

3ware 9650SE-4LPML controller
RAID 5 - 4 drives, 1 TB each
VMFS3 file system
1 important VM - Windows SBS 2003, 2 VMDKs (207 GB and 465 GB); ~850 GB of data in total on the drive array
All files except the VMDKs have been recovered


Situation -
1 drive in the array failed and the array went into recovery mode.  Recovery was not successful: successive attempts paused at 6%, 9%, and 62%, in that order.
The array shows the drive as OK, but moves it to WARNING or DEVICE-ERROR during recovery of the RAID or attempts to copy data.


Datastore and file directory are visible by:
- Mounting the datastore in ESXi 5.0
- Mounting the datastore using the tw_cli 3ware command line tool and the Open Source VMFS Driver; I can see the file structure and was able to download all files except the VMDKs
- SMART data says all drives are OK.  The drives are Seagate Barracuda 7200 RPM; I believe the SMART data is suspect


IMHO - 1 drive in the remaining array is suspect.  I believe that if the drive can be recovered, the data can be recovered.  There are potentially issues with the RAID controller as well.


Attempted, in order (rebuilding the array and copying files through the Open Source VMFS Driver were attempted throughout):
- Rebuild of the array from degraded mode through the array rebuild process via tw_cli, by adding a new 4th drive.
- Recovery of data from within ESXi by copying the VMDK from the datastore. The copy of the VMDK halts. Attempt abandoned.
- Mounting the datastore using the tw_cli 3ware command line tool and the Open Source VMFS Driver: can see the file structure and was able to download all files except the VMDKs.
- Recovery of the RAID via Raid Reconstructor. Automatic settings could not determine the RAID layout; it said manual configuration was needed. Attempt abandoned.
- Repair of the drive with SpinRite. SpinRite would not attempt it; it says the partition size reported by the BIOS is different from the one reported by the drive. Attempt abandoned.
- Clone of the drive via Clonezilla with VMFS support. Clonezilla would not attempt it. Attempt halted.
- Recovery of the VMDK / files via DiskInternals VMFS Recovery. While it was still running and still showing CPU activity after 6 hours, the progress bar had halted.
Question by:dpedersen13

Accepted Solution

by:dlethe
Here is the deal.
1.  It isn't that the data is suspect on at least some of the surviving disks; it is that the disks are in deep recovery.   But that recovery is working, because the rebuild isn't failing.  So, good news.
2. SpinRite absolutely will result in at least partial data corruption in the event it recovers any blocks that were already marked bad.  NEVER run it on disks in a RAID array.
3.  You have made things worse by what you have attempted.
4. Those 'cuda drives aren't enterprise class, which is the root cause. They are consumer disks that are designed to go into deep recovery rather than give up in 1-2 seconds so that the RAID controller can extrapolate the correct data, resector the bad block, rewrite, and move on.
5. Those 'cuda drives CAUSED the predicament you are in right now, because deep recovery causes the drives to lock up, and the 3ware firmware then failed the disk.
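
For anyone wondering what "extrapolate the correct data" means in point 4, here is a rough Python sketch of the RAID 5 parity idea. It is a toy model only: the block contents, the 3-data-plus-1-parity layout, and the function names are assumptions made for illustration, not the 3ware firmware's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity for one stripe is the XOR of its data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block from all the other blocks in the stripe
    (the remaining data blocks plus the parity block)."""
    return xor_blocks(surviving_blocks)


# One stripe on a hypothetical 4-drive RAID 5: three data blocks, one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = make_parity([d0, d1, d2])

# Pretend the drive holding d1 gave up on the read quickly:
# the controller can recompute d1 from the other three blocks.
recovered_d1 = rebuild_missing([d0, d2, parity])
```

This only works if the drive reports the failed read promptly; a drive that stalls in deep recovery never gives the controller the chance to do this.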

Solution
1. If you don't want this to happen any more, you are just going to have to replace ALL disks with enterprise class, or go to RAID10 (the lesser evil compared with RAID5, since RAID10 is at least lower risk).

2. Pay a pro to get this data back. You do not have the software to recover from partial rebuilds, and such software is not available retail anyway.  

3. If you are willing to live with partial recovery, and risk 100% data loss due to continued use of the disks, then just copy what you can while it is degraded and prioritize your most important files first.

Bottom line, I am not going to sugar coat it: you are in over your head and do not have the experience and the software necessary to recover data, or to assess the likelihood that the act of recovering with your plan would cause irrevocable data loss.

Expert Comment

by:CompProbSolv
I have successfully used Raid Reconstructor from runtime.org:
http://runtime.org/raid.htm

The process is that you install the RAID-5 drives in a non-RAID controller and analyze them there.  You can do them all at once or individually.  It attempts to reconstruct the proper order of drives and stripe size.

If you have enough disk space, the next step is to have the program make an image of each of the drives to your local hard drive.  From those images, the program will allow you to recover data using the organizational information from the first step.

What I think is especially important to your issue is the step of copying the drives individually.  If a second drive is failing (as you suspect), you will find out during the copy process.
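
To make the "copy each drive individually" step concrete, here is a minimal Python sketch of a read-only, chunk-by-chunk image copy that records the regions it could not read. This is not how Raid Reconstructor works internally; the function name, the chunk size, and the zero-fill behaviour are my own assumptions for illustration, and a purpose-built imaging tool is the better choice on a failing drive.

```python
import os


def image_drive(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path chunk by chunk, opening the source read-only.

    Unreadable chunks are zero-filled in the image, and their
    (offset, length) pairs are returned so you know where the gaps are.
    Assumes seeking to the end of the source reports its size.
    """
    bad_ranges = []
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(want)
            except OSError:
                data = b""  # treat the whole chunk as unreadable
            if len(data) < want:
                # Record the gap and pad so later offsets stay aligned.
                bad_ranges.append((offset + len(data), want - len(data)))
                data += bytes(want - len(data))
            dst.write(data)
            offset += want
    return bad_ranges


# Hypothetical usage (the paths are placeholders, not real device names):
# gaps = image_drive("/path/to/source-drive", "/path/to/drive0.img")
```

A non-empty list of gaps for a drive you thought was healthy is the warning sign that a second drive is failing.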

There is a free trial version and the full program runs about $100.

As always with hard drive problems, be aware that the longer you work on the drive attempting to recover data, the more likely the failure will become worse and be more difficult (=more expensive) for a professional recovery shop to recover your data.

Depending on the value of the data, you may want to shut it all off and have the pros deal with it.  Otherwise, I'd try Raid Reconstructor.

Author Closing Comment

by:dpedersen13
Thanks for the detail and help on understanding what is happening right now, especially the details on the 'cuda drives and the reason for the RAID card lockup.  It makes sense; trying to figure out that root cause was really bothering me.  I've stopped all attempts and I am impatiently awaiting a call back from Ontrack.

In the meantime I'm recovering data files from other sources.  Fortunately most of the data is in week-old backups.

Thanks again,
Dylan

Expert Comment

by:dlethe
No problem; I hope I didn't rub you the wrong way.  If I did, please accept my apology.  Some drives are totally unacceptable for RAID5, and those disks are among them. The firmware is effectively incompatible.

In fact, dig up the WD data sheets and they even tell you that those disks are not warrantied for use on RAID5.

Expert Comment

by:dlethe
P.S. RAID reconstructor would not have helped in this scenario, and would have likely made things worse.

Expert Comment

by:CompProbSolv
"would have made things worse"
What I suggested should have been read-only.  How would it have made things worse?

Expert Comment

by:dlethe
When you read an HDD you run the risk of stressing the drive to the point of failure, as well as turning what a controller thinks is an unrecoverable read error into a recovered block.

Depending on the specifics, you could corrupt data by turning a known bad block, which is already being handled, into a repaired block, which would invalidate all data on a stripe.
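
A toy Python illustration of the stripe problem, under the simplifying assumption that a "repaired" block comes back with contents that differ from what the parity was originally computed over. The byte values and function names are made up for the example; real controller behaviour depends on the firmware.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def stripe_is_consistent(data_blocks, parity):
    """A RAID 5 stripe is consistent when data XOR parity is all zeros."""
    return xor_blocks(list(data_blocks) + [parity]) == bytes(len(parity))


# Hypothetical stripe: three data blocks plus the parity written with them.
d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
parity = xor_blocks([d0, d1, d2])

# Later, the drive holding d1 returns different bytes and reports no error.
d1_altered = b"\x22\x22\x22\x99"

consistent_before = stripe_is_consistent([d0, d1, d2], parity)
consistent_after = stripe_is_consistent([d0, d1_altered, d2], parity)

# Rebuilding d2 from d0, the altered d1, and the old parity gives wrong data.
rebuilt_d2 = xor_blocks([d0, d1_altered, parity])
```

In this model the damage is not confined to the altered block: any other block in the stripe that is later rebuilt from it comes out wrong too.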

Author Comment

by:dpedersen13
Let's hope I didn't do that during the recovery process on the controller.  Disks go out tomorrow.  Thanks again!

Dylan

Expert Comment

by:CompProbSolv
Thanks for the info.  It is appreciated.

As far as "run the risk of stressing..." I did try to cover that with my comment about "..  the more likely the failure will become worse".

I had not considered the conversion of the unrecoverable read error into a recovered block and am curious about this.  I recognize how the enterprise drives have shorter timeouts for proper use with RAID configurations, but am not aware of differences in reallocation of bad blocks.  If I read your comments correctly, an enterprise drive will not do the reallocation but will rely on the RAID controller to perform that function.  Do I have this correct?

If that is correct, does this mean that an enterprise drive used with a non-RAID controller will never reallocate bad blocks or is it only when instructed not to do so by the RAID controller?

If you had any references I could use to get better educated on this I'd appreciate links to them.