RAID-5 Multiple Drive Failures

I have a Dell PV220S, populated with (14) 72GB drives, running a split-controller setup off two channels of a Perc 4DC controller. It is configured as RAID-5, with two of the 14 drives (one on each channel) set up as hot spares.
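For reference, the usable capacity of this layout works out as follows (a rough sketch, assuming the 12 non-spare drives form a single RAID-5 set; the two hot spares hold no data until a rebuild):

```python
# RAID-5 usable-capacity sketch (assumption: one 12-drive RAID-5 set;
# hot spares are idle until a rebuild kicks off).
drive_gb = 72
total_drives = 14
hot_spares = 2

data_drives = total_drives - hot_spares       # 12 drives in the RAID set
usable_gb = (data_drives - 1) * drive_gb      # one drive's worth goes to parity
print(usable_gb)  # 792
```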

Today I suddenly found data missing from this volume. On closer inspection, yellow lights were flashing on drives 0, 1, 2, 3, 4, and 8. Dell OMSA reports that on that channel (channel 0), drives 0, 1, 2, 3, 4, and 8 (physical drive 8 is the hot spare) have failed.

I have powered everything down and checked my cables, connectors, and power; everything seems OK. The drives spin up just fine, but error out when reading the RAID configuration.

I would think it is highly unlikely for six drives to fail at the same time (no errors were reported in the event logs last night or this morning).

I would suspect the controller or something in the PowerVault, but OMSA still shows one of the drives on that channel (physical drive 5) working just fine.


Duncan MeyersCommented:
You'll be thrilled to hear, I'm sure, that the PV220S was notorious for this sort of nonsense. You are safe to force the drives online as outlined above. The issue is unlikely to recur, but having said that, bad power (noise, marginal line voltage, etc.) is known to cause this problem.
Duncan MeyersCommented:
Provided all the drives failed simultaneously (and given the symptoms, that's highly likely), you can force the drives back online. If they did not fail simultaneously, forcing them all online will corrupt your data. The first thing to do is look at the RAID controller logs in OMSA and see if you can determine what happened. If you see a message that the hot spare is rebuilding (or similar), you'll need to identify the drive that failed first. Once you know what happened, proceed as follows:

All drives failed at once:
You can go into the RAID controller BIOS (it's safer and, paradoxically, more intuitive from the BIOS), find the drives marked OFFLINE, select each drive, and choose Force Online. Once all the drives are back online, reboot the server and you should be good to go.

Drives failed in dribs and drabs:
Use the procedure outlined above, but do not force the first failed drive online. Once you reboot the server, the array should start rebuilding to the hot spare. You can then take steps to replace the failed drive.
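The decision above hinges on whether the failure timestamps in the logs cluster together or are spread out. A minimal sketch of that check (the log format here is hypothetical and simplified; real OMSA alert-log entries are far more verbose, but the decision logic is the same: failures inside one short window suggest a bus/enclosure event, so force online; anything earlier is a genuinely bad drive):

```python
from datetime import datetime, timedelta

def classify_failures(events, window=timedelta(minutes=1)):
    """Return 'simultaneous' if every failure falls within `window`
    of the first one, else 'staggered'. `events` is a list of
    (timestamp, disk_id) tuples pulled from the controller log."""
    times = sorted(t for t, _ in events)
    if times[-1] - times[0] <= window:
        return "simultaneous"
    return "staggered"

# Hypothetical log entries: five drives dropping within seconds of
# each other, as in the symptoms described above.
events = [
    (datetime(2009, 5, 1, 3, 12, 5), "0:0"),
    (datetime(2009, 5, 1, 3, 12, 6), "0:1"),
    (datetime(2009, 5, 1, 3, 12, 6), "0:2"),
]
print(classify_failures(events))  # simultaneous
```

If the result is "staggered", the earliest-failed drive is the one to leave offline and replace.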
ThePhreakshowAuthor Commented:
OMSA logs show that last Friday, physical disks 0, 1, 2, 3, and 4 failed.
They also show that physical disks 0, 1, 2, 3, 4, and 8 failed today.
The logs show nothing about the array ever having started to rebuild.

ThePhreakshowAuthor Commented:
There is also no evidence of drive controller failure or failure of one of the redundant power supplies, and no power failure was detected on the network UPS. No configuration changes have taken place, and the system had been up in its (former) running state for nearly a year.
ThePhreakshowAuthor Commented:
There are also two strange errors from last Friday: the logs show two entries saying that physical disk 0:0 and physical disk 0:1 were removed and that redundancy was lost.

Nobody was in the server room, and none of these drives were ever physically detached or removed from the storage enclosure.
ThePhreakshowAuthor Commented:
Well, after forcing the drives back online and four hours' worth of CHKDSK index repairs and SID resets, the system is back up 100%... at least it was seven hours ago when I left.

A health check this morning shows that on the same channel 0, drives 1, 3, and 4 all asserted a fault state at the same time, and the array is down again.

Obviously something is wrong with the PV220S, but I have no indication of any power problems in the UPS logs; voltage and temperatures have been rock steady for as far back as the logs go.

What could the source of the problem be?

Again, no sign of the array rebuilding or hot spares coming online. Just a simultaneous failure of three drives this time, instead of yesterday's five.
ThePhreakshowAuthor Commented:
I noticed another strange happening in the OMSA logs: a couple of other errors occur right around the same time the PV220S fails.

BMC Intrusion Sensor detects intrusion then returns to normal
BMC Planar Temp Sensor detects warning then returns to normal
BMC Riser Temp Sensor detects warning then returns to normal
BMC ROMB Battery sensor de-asserts and then re-asserts

Does this point toward a problem with the Perc 4DC?
ThePhreakshowAuthor Commented:
Now everything connected to Connector 0 of the Perc 4DC shows failed in OMSA: EMMs, fans, PS1 and PS2, temperature probes...
ThePhreakshowAuthor Commented:
Hard to say what the problem could have been: EMMs in the 220S, external cables, BMC, riser, Perc 4DC... For now, I have removed the only thing different from the configuration a month ago: a 1394 FireWire card I had installed in this machine. Perhaps it was causing problems on the riser. There are so many variables, and I have read quite a few other rants about the PV220S doing this same thing that could never be tracked down to one specific device or failed part. Nature of the beast, I guess. Thanks.
Question has a verified solution.
