dpedersen13

asked on

Need advice - recovering RAID 5 array member disk

Hi All -
For everyone who wants to point out the obvious, I'll start with that.  I'm an idiot, I'm a bonehead.  I thought it could never happen to me.  I have no excuses and I'm trying to dig myself out of a hole of my own making.

Current status:

At this point I'm waiting to hear from www.krollontrack.com as to the options they can provide.

Equipment:
ESXi hypervisor 4.0 U1  

3ware 9650SE-4LPML controller
RAID 5 - 4 drives, 1 TB each
VMFS3 file system
1 important VM - Windows 2003 SBS, 2 VMDKs (207 GB and 465 GB); ~850 GB of information total on the drive array
All files except the VMDKs have been recovered


Situation -
One drive in the array failed and the array went into recovery mode. Recovery was not successful; successive attempts paused at 6%, 9%, and 62%.
The array shows the drive as OK, though it moves it to WARNING or DEVICE-ERROR during RAID recovery or attempts to copy data.


Datastore and file directory visible by:
- Mounting the datastore in ESXi 5.0
- Mounting the datastore using the tw_cli 3ware command-line tool and the Open Source VMFS Driver; the file structure is visible and I was able to download all files except the VMDKs
- SMART data says all drives are OK. The drives are Seagate Barracuda 7200 RPM; I believe the SMART data is suspect (see the sketch after this list)
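For anyone reading along: the overall SMART PASSED/FAILED verdict rarely flags a marginal drive; the per-attribute raw counters are more telling. A minimal Python sketch of the kind of check I mean, assuming smartctl is installed and the drives are visible on a plain HBA (device paths are hypothetical; behind the 3ware card you would need smartctl's -d 3ware,N device type instead):

```python
import subprocess

# Raw values of these attributes are more telling than the overall verdict:
# 5 = Reallocated Sector Count, 197 = Current Pending Sector Count,
# 198 = Offline Uncorrectable. Nonzero raw values on any member disk are a
# red flag even when SMART reports overall health as OK.
SUSPECT_ATTRS = {"5", "197", "198"}

def check_drive(dev):
    """Print the raw values of the suspect attributes from smartctl -A."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] in SUSPECT_ATTRS:
            # The last column of the attribute table is the raw value.
            print(f"{dev}  {fields[1]} (attr {fields[0]}): raw={fields[-1]}")

for dev in ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]:  # hypothetical paths
    check_drive(dev)
```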


IMHO - one drive in the remaining array is suspect. I believe that if that drive can be recovered, the data can be recovered. There are potentially also issues with the RAID controller.
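To make the reasoning explicit: RAID 5 parity is just a byte-wise XOR across each stripe row, so any one missing member is recomputable, but a second member returning bad data in the same stripe silently corrupts the result. A toy Python illustration (block contents are made up):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR across equal-length blocks, which is all RAID 5 parity is."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Toy 4-member stripe row: three data blocks plus their parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Any single missing member is recomputable from the other three ...
assert xor_blocks([d0, d2, parity]) == d1

# ... but a second member returning wrong bytes in the same stripe row
# makes the reconstruction silently wrong.
d0_flaky = b"AAAZ"
assert xor_blocks([d0_flaky, d2, parity]) != d1
```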


Attempted, in order (rebuilding the array and copying files through the Open Source VMFS Driver were attempted throughout):
- Rebuild of the array from degraded mode through the array rebuild process via tw_cli, by adding a new 4th drive.
- Recovery of data from within ESXi by copying the VMDK from the datastore. The copy of the VMDK halts. Attempt abandoned.
- Mounting the datastore via tw_cli and the Open Source VMFS Driver (as above); all files except the VMDKs were downloaded.
- Recovery of the RAID via RAID Reconstructor. Automatic settings could not determine the RAID layout; it said manual configuration was needed. Attempt abandoned.
- Repair of the drive with SpinRite. SpinRite would not attempt it; it says the partition size reported by the BIOS differs from the one reported by the drive. Attempt abandoned.
- Clone of the drive via Clonezilla with VMFS support. Clonezilla would not attempt the clone. Attempt halted.
- Recovery of the VMDKs/files via DiskInternals VMFS Recovery. After 6 hours it was still running and showing CPU activity, but the progress bar had halted.
ASKER CERTIFIED SOLUTION
David
I have successfully used Raid Reconstructor from runtime.org:
http://runtime.org/raid.htm

The process is that you attach the RAID 5 drives to a non-RAID controller and analyze them there. You can do them all at once or individually. It attempts to reconstruct the proper drive order and stripe size.

If you have enough disk space, the next step is to have the program make an image of each of the drives to your local hard drive.  From those images, the program will allow you to recover data using the organizational information from the first step.

What I think is especially important to your issue is the step of copying the drives individually.  If a second drive is failing (as you suspect), you will find out during the copy process.
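ddrescue is the usual tool for this kind of per-drive imaging, but purely to illustrate the idea, here is a rough Python sketch of an error-tolerant copy (paths are hypothetical; a real tool also handles short reads, retries, and reverse passes):

```python
import os

CHUNK = 1 << 20   # 1 MiB reads; fall back to single sectors on error
SECTOR = 512

def image_drive(src, dst_path, log_path):
    """Copy src front to back; on a read error, retry sector by sector,
    zero-fill whatever still fails, and log the bad offsets so you know
    which regions of the member are suspect."""
    fd = os.open(src, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    with open(dst_path, "wb") as dst, open(log_path, "w") as log:
        pos = 0
        while pos < size:
            want = min(CHUNK, size - pos)
            try:
                data = os.pread(fd, want, pos)
            except OSError:
                # The big read failed; probe the chunk one sector at a time.
                data = b""
                for off in range(pos, pos + want, SECTOR):
                    n = min(SECTOR, pos + want - off)
                    try:
                        data += os.pread(fd, n, off)
                    except OSError:
                        data += b"\x00" * n
                        log.write(f"bad sector at byte offset {off}\n")
            dst.write(data)
            pos += want
    os.close(fd)

# Hypothetical paths -- image one member at a time, e.g.:
# image_drive("/dev/sdb", "/mnt/scratch/member1.img", "/mnt/scratch/member1.bad")
```

If a second drive is marginal, it will show up here as a growing bad-offset log rather than as a failed array rebuild.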

There is a free trial version and the full program runs about $100.

As always with hard drive problems, be aware that the longer you work on the drive attempting to recover data, the more likely the failure will become worse and be more difficult (=more expensive) for a professional recovery shop to recover your data.

Depending on the value of the data, you may want to shut it all off and have the pros deal with it.  Otherwise, I'd try Raid Reconstructor.
dpedersen13 (ASKER)

Thanks for the detail and help in understanding what is happening right now, especially the details on the 'cuda drives and the reason for the RAID card lockup. It makes sense; trying to figure out that root cause was really bothering me. I've stopped all attempts and I am impatiently awaiting a call back from Ontrack.

In the meantime I'm recovering data files from other sources. Fortunately, most of the data is in week-old backups.

Thanks again,
Dylan
No prob, hope I didn't rub you the wrong way. If I did, accept my apology. Some drives are totally unacceptable for RAID 5, and those disks are one of them. The firmware is effectively incompatible.

In fact, dig up the manufacturer's data sheets and they even tell you that those disks are not warrantied for use on RAID 5.
P.S. RAID Reconstructor would not have helped in this scenario, and would likely have made things worse.
"would have made things worse"
What I suggested should have been read-only.  How would it have made things worse?
When you read an HDD you run the risk of stressing the drive to the point of failure, as well as turning what the controller thinks is an unrecoverable read error into a recovered block.

Depending on the specifics, you could corrupt data by turning a known bad block which is already being handled into a "repaired" block, which would invalidate all data on a stripe.
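One way to check, after the fact and only on images made on a non-RAID controller, whether any stripes were invalidated this way: for a consistent RAID 5 set, the byte-wise XOR of all members at any offset is zero, because one member in each stripe row holds the parity of the others. A rough Python sketch, assuming equal-sized raw images of each member (mismatches near the very start or end of the disks may just be controller metadata):

```python
from functools import reduce

BLOCK = 64 * 1024  # scan granularity; parity is byte-wise, so any size works

def parity_scan(image_paths):
    """XOR all member images together, block by block. For a consistent
    RAID 5 set the result is zero everywhere. Nonzero regions are stripes
    where one member no longer matches parity -- exactly what a silently
    'repaired' block produces -- and a rebuild would regenerate garbage there."""
    files = [open(p, "rb") for p in image_paths]
    try:
        offset = 0
        while True:
            chunks = [f.read(BLOCK) for f in files]
            if not chunks[0]:
                break
            mixed = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)
            if any(mixed):
                print(f"parity mismatch near byte offset {offset}")
            offset += BLOCK
    finally:
        for f in files:
            f.close()

# Hypothetical image files made earlier, one per member disk:
# parity_scan(["member1.img", "member2.img", "member3.img", "member4.img"])
```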
Let's hope I didn't do that during the recovery process on the controller. Disks go out tomorrow. Thanks again!

Dylan
Thanks for the info.  It is appreciated.

As far as "run the risk of stressing ..." goes, I did try to cover that with my comment about "... the more likely the failure will become worse".

I had not considered the conversion of an unrecoverable read error into a recovered block and am curious about this. I understand that enterprise drives have shorter timeouts for proper use in RAID configurations, but I was not aware of differences in the reallocation of bad blocks. If I read your comments correctly, an enterprise drive will not do the reallocation itself but will rely on the RAID controller to perform that function. Do I have this correct?

If that is correct, does this mean that an enterprise drive used with a non-RAID controller will never reallocate bad blocks, or is it only when instructed not to do so by the RAID controller?

If you have any references I could use to get better educated on this, I'd appreciate links to them.
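One concrete thing worth checking while waiting for references is the error-recovery-timeout half of this: SCT Error Recovery Control (Seagate's ERC, WD's TLER) caps how long a drive retries a bad read before reporting the error to the controller. A sketch of querying it with smartctl from Python; device paths are hypothetical, and many desktop drives simply don't support the command:

```python
import subprocess

# Query SCT Error Recovery Control. Enterprise/RAID-rated drives report
# short timeouts such as 7.0 s; many desktop drives report it disabled or
# don't support the command at all, which is why they can retry internally
# for minutes while a RAID controller times out and drops them from the array.
def show_erc(dev):
    result = subprocess.run(["smartctl", "-l", "scterc", dev],
                            capture_output=True, text=True)
    print(f"--- {dev} ---")
    print(result.stdout.strip() or result.stderr.strip())

for dev in ["/dev/sda", "/dev/sdb"]:  # hypothetical paths, drives on a plain HBA
    show_erc(dev)
```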