kalliasx

asked on

Lost RAID5 on Openfiler

Hi, first let me tell you that I am a newbie when it comes to Linux :(. I've installed an OpenFiler NAS, set up 4 drives in RAID 5 (4 x 1.5 TB), and all was good; I was VERY happy with my setup. That is, until today. At one point I couldn't see one of my volumes, the one where the 4.5 TB RAID 5 is. If I go to Volumes > Software RAID, I see the state "Active & degraded" in orange, synchronization not started.

I've looked in forums and here, and saw a post similar to mine, where the solution by arnold was: "mdadm -A --force dev1 dev2 dev3 dev4 dev?, where dev? is the one you think failed last out of the two."

I've tried that, but I get: "device /dev/md0 already active - cannot assemble it".

When I do mdadm /dev/md0 --examine
I get: "No md superblock detected on /dev/md0".
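
(I am wondering if I should be running --examine against the member partitions rather than against md0 itself, and whether I need to stop the array before retrying the forced assemble. Something like this is what I have in mind, if I understand the man page correctly, so please correct me if not:)

mdadm --examine /dev/sdb1     # repeat for /dev/sdc1, /dev/sdd1 and /dev/sde1
mdadm --stop /dev/md0         # stop the half-assembled array before trying mdadm -A --force again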

All I want is to bring the RAID back up, in whatever state, so I can recover some of the data.

None of the drives appear to be faulty... again, I'm a newbie here (which makes me think trusting OpenFiler with blind faith was a terrible mistake :().

I've got about 3.5 TB of data, some of which I have nowhere else... not too much, but still, I would like to try to access it. I know I should do backups, and most of my data I do have somewhere else, just not all of it, and certainly not the last week's worth of work.

Any help or step-by-step procedure I could follow would be much appreciated (please type FULL Linux commands, not abbreviations). My array is md0, and the disks are sdb1 (removed), sdc1 (removed), sdd1 (Member) and sde1 (Member).

Thanks in advance
All the best,

Chris
BigSchmuh

I guess the 1.5 TB drives used are the cheap "Desktop class" kind, with an Unrecoverable Bit Error (UBE) rate of 1 sector per 10^14 bits read...

"Desktop class" drives has TWO very serious problems in a RAID array:
-Their sector recovery process may last for more than 8 seconds ... but almost all RAID controller will consider the drive dropped after 8 seconds

-Their UBE of 1 unreadable sector per 10e14 bits read means that when you rebuild an array, you have about 13% probability of a unreadable sector error per drive...which means a 4 drives array using 3 drives to rebuild a 4th one has about 33% of not being rebuild
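
Roughly, this is where those percentages come from (back-of-the-envelope only, assuming the whole of each surviving drive has to be read during the rebuild):

# ~1.5e12 bytes per drive * 8 = ~1.2e13 bits read per drive during a rebuild
# P(unreadable sector on one drive) = 1.2e13 / 1e14 = ~0.12-0.13
# P(at least one of the 3 surviving drives hits one) = 1 - (1 - 0.12)^3 = ~0.32
echo '1 - (1 - 1.2*10^13 / 10^14)^3' | bc -l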

Now, regarding OpenFiler, I am very sorry but I am not able to help you any further...
kalliasx

ASKER

Hi BigSchmuh,

Thanks for your answer,

Indeed, the drives are desktop-class Samsung 1.5 TB drives, HD154UI models with 32 MB of cache. As for the RAID itself, it's a software RAID, not a hardware one.

I've started testing the drives with an Ultimate Boot CD (HDD32 4.6), but it takes 6 hours per drive, so I have only tested two of my 4 drives so far (not knowing which ones are out of the RAID). They came back with only 5-6 warnings each, no bad blocks, about 5 reads under 500 ms and 1 over 500 ms. I'm not sure what to make of these results.

What I need is assistance on how to use mdadm to try to bring my array back, even corrupted and only temporarily, so I can attempt to salvage a few critical files (<10 GB out of 4 TB). Here is what I think I should do, although I don't understand the commands very well:

-Use mdadm to verify the event counts / metadata versions of my RAID array md0, as I suspect that when I did the "add members" in the UI right after the crash, I might have either re-created another md0 or erased some of the metadata needed to rebuild the array itself. Is this reversible? When I reboot, it always comes back with 2 drives attached and 2 not... (my guess at the commands for this is below)
-I have tried mdadm --add /dev/md0 /dev/sdb1 (too much time had passed for me to just sit there and wait), and the same for the other drive. The md0 array appears, all OK and synced, but there is no volume. The web interface now proposes to create a volume on it. A reboot lands me in the same situation, 2/4 drives only.

So, is there a command I can run after I add the drives (if that wasn't a fatal mistake already) that allows me to rebuild the volume, even FORCING it? I would try to salvage a few files while it is up, and I wouldn't care if doing so finished this volume off for good afterwards.
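
For the first point above, is it something like this? I am guessing at the syntax here, so please correct me if it is wrong:

mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep -i event
# then, if I understand correctly, stop the array and force-assemble it from the
# three members with the highest event counts, rather than re-adding drives with --add:
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdd1 /dev/sde1 /dev/sdX1    # sdX1 = whichever of sdb1/sdc1 dropped out last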

Otherwise, if you know of software that has a good success rate at recovering data from an XFS software RAID array, I would like to try that route, that is, if it's not already too late (I doubt it's an untouched RAID array anymore after my "add members" in the web interface plus the --add attempt yesterday).

Let me know,
Thanks in advance

Chris

BigSchmuh

According to Samsung
   http://www.samsung.com/global/system/business/hdd/prdmodel/2009/1/29/397366f2eg_rev0.2.pdf
their SpinPoint EcoGreen F2 HD154UI drives have a "Non-recoverable Read Error" rate of "1 sector in 10^15 bits", which is not bad at all; in fact, it is very good for a 5400 rpm drive.

...but sorry, I still can't help regarding OpenFiler disaster recovery processes...
kalliasx

ASKER

Alright, some news. Using --examine | grep Event, I have identified that my sdb1 disk actually "died" a loooooong time ago, as it has a very old event count, while the last disk to get ejected from my md0 RAID 5 is only a few revisions behind in event count. So I tried assembling with /dev/sdc1 included (-A) and, oh miracle, the RAID is now showing a volume group (the one I created in the first place when I installed the RAID 5).
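
For reference, this is roughly how I have been checking the array state since then (typed from memory):

cat /proc/mdstat             # shows md0, which members it has, and whether a resync is running
mdadm --detail /dev/md0      # shows the array state (clean/active/degraded) and each member's role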

However, I don't see partitions or folders. So maybe I need to send a command to rebuild or resync the RAID 5? I understand that doing so on 3 HDDs is risky, but I just want a few hours (minutes, even) of access to recover vital data that I know is lost otherwise.

So what should my next step be? Remember, I'm a Linux newbie. I'm starting to get used to mdadm now, but that's it; that, basic vi editing and a few commands are all I know.

Thanks in advance to anyone who can answer me...
Chris
kalliasx

ASKER

OK, more news. Maybe someone can help, because I really need it.

I've done this :

mdadm --add /dev/md0 /dev/sdc1
mdadm --run /dev/md0

In OpenFiler it says the RAID is clean and degraded, synchronization not started (do I need to start the sync manually? If so, what's the command?).
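
If I do need to start it by hand, is it something like this? I have not dared to run the echo yet, so please confirm before I do:

cat /sys/block/md0/md/sync_action                 # currently reports "idle", I believe
# echo repair > /sys/block/md0/md/sync_action     # my guess at how to kick off a resync manually (not run yet)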

The volume group shows up now, but volume size, free space and used space all show as "unknown".
So I did:

xfs_repair -n /dev/md0

(-n is for read-only; no writes are performed)

Many times in between it said "found candidate superblock, unable to verify superblock, continuing...".

The command is still running (about 9 hours in so far).

The idea is to see the end result of this command and then run xfs_repair /dev/md0 (without -n) once it's done, in the hope that it will bring the filesystem back and that I will be able to access my data (even momentarily). I will do that tomorrow; hopefully the command will have run its course by then. I take it that an xfs_repair on 4.5 TB of data simply takes a long time; I just hope I won't run out of memory (2 GB).
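
One thing I am not sure about: since OpenFiler puts LVM on top of md0, maybe the XFS filesystem actually lives on a logical volume rather than on /dev/md0 directly, which could explain why xfs_repair cannot verify a superblock there. If that is the case, I am guessing the check should look more like this (the volume group / logical volume names below are placeholders for whatever mine turn out to be called):

vgscan                                   # rescan for volume groups
vgchange -ay                             # activate them
lvdisplay                                # list the logical volumes and their device paths
xfs_repair -n /dev/<vgname>/<lvname>     # read-only check against the logical volume instead of md0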

Let me know what you think,
Thanks in advance

Chris
ASKER CERTIFIED SOLUTION
kalliasx

The reason is that I solved my own problem; there were very few posts to help me do otherwise (though I don't blame anyone, as I know OpenFiler is a proprietary solution and maybe not that widely used... though it's based on RedHat).