
Solved

RAID-5 Multiple Drive Failures

Posted on 2009-02-23
Medium Priority
3,037 Views
Last Modified: 2012-05-06
I have a Dell PV220S populated with (14) 72GB drives, running in a split-controller setup off two channels of a Perc 4DC controller. It is configured as a RAID-5 with two of the 14 drives (one on each channel) set up as hot spares.

Today I suddenly found data missing from this volume. Upon closer inspection, yellow lights were flashing on drives 0, 1, 2, 3, 4 and 8. Dell OMSA reports that on channel 0, drives 0, 1, 2, 3, 4 and 8 (physical drive 8 is the hot spare) have failed.

I have powered everything down and checked my cables, connectors and power, and everything seems OK. The drives spin up just fine, but error out when reading the RAID configuration.

I would think it is highly unlikely for six drives to fail at the same time (no errors were reported in the event logs last night or this morning).

I would suspect the controller or something in the PowerVault, but it still shows that one of the drives on that channel (physical drive 5) is working just fine.

TIA

Question by:ThePhreakshow
9 Comments
 
LVL 30

Expert Comment

by:Duncan Meyers
ID: 23716684
Provided all the drives failed simultaneously (and given the symptoms, that's highly likely), you can force the drives back online. If they did not fail simultaneously, you'll corrupt all your data by doing that. The first thing to do is take a look at the RAID controller logs in OMSA and see if you can determine what happened. If you see a message that the hot spare is rebuilding (or similar), you'll need to identify the drive that originally failed. Once you know what happened, proceed as follows:
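
If the OMSA command-line tools are installed, commands along these lines should pull up the controller log and the current physical disk states. Controller ID 0 is an assumption on my part; the first command will list the real IDs:

    omreport storage controller
    omreport storage pdisk controller=0
    omreport system alertlog

The alert log should contain the drive failure (or removal) events with timestamps, which is what you need to tell the two scenarios below apart.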

All drives failed at once:
You can go into the RAID controller configuration from the BIOS (it's safer and, paradoxically, more intuitive from the BIOS), find the drives marked as OFFLINE, select each drive and choose Force Online. Once all the drives are back online, reboot the server and you should be good to go.

Drives failed in dribs and drabs:
Use the procedure outlined above, but do not force the originally failed drive online. Once you reboot the server, the array should start rebuilding to the hot spare. You can then take steps to replace the failed drive.
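
Either way, once the server is back up, the virtual disk state (and any rebuild progress) should be visible from the OMSA CLI as well, again assuming controller 0:

    omreport storage vdisk controller=0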
LVL 3

Author Comment

by:ThePhreakshow
ID: 23717099
OMSA logs show that last Friday, physical disks 0, 1, 2, 3 and 4 failed.
They also show that physical disks 0, 1, 2, 3, 4 and 8 failed today.
The logs show nothing about the array ever having started to rebuild.
LVL 3

Author Comment

by:ThePhreakshow
ID: 23717116
There is also no evidence of drive controller failure or of failure of one of the redundant power supplies, and no power failure was detected on the network UPS. No configuration changes have taken place, and the system had been up in its (former) running state for nearly a year.
LVL 3

Author Comment

by:ThePhreakshow
ID: 23717133
There are two strange errors from last Friday... The logs show two entries stating that physical disk 0:0 and physical disk 0:1 were removed and that redundancy was lost.

Nobody was in the server room, and none of these drives were ever physically detached or removed from the storage enclosure.
LVL 30

Accepted Solution

by:
Duncan Meyers earned 2000 total points
ID: 23717146
You'll be thrilled to hear, I'm sure, that the PV220S was notorious for this sort of nonsense. You are safe to force the drives online as outlined above. The issue is unlikely to recur, but having said that, bad power (noise, marginal line voltage, etc.) is known to cause this problem.
LVL 3

Author Comment

by:ThePhreakshow
ID: 23721001
Well, after forcing the drives back online and four hours' worth of CHKDSK index repairs and SID resets, the system is back up 100%... well, at least it was seven hours ago when I left.
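
(For anyone hitting the same thing, a typical CHKDSK repair pass looks something like the following, with D: standing in for whatever letter the RAID volume is mounted as:

    chkdsk D: /f

The /f switch fixes filesystem errors; /r would also scan for bad sectors, but takes considerably longer.)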

A health check this morning shows that on the same channel 0, drives 1, 3 and 4 all asserted a fault state at the same time, and the array is down again.

Obviously there is something wrong with the PV220S, but I have no indication of any power problems from the UPS logs; voltage and temperature have been rock steady for as far back as the logs go.

What could this indicate about the source of the problem?

Again, no sign of the array rebuilding or the hot spares coming online. Just a simultaneous failure of three drives this time, instead of yesterday's five.
LVL 3

Author Comment

by:ThePhreakshow
ID: 23722016
I noticed another strange thing in the OMSA logs.
There are a couple of other errors that occur right around the same time the PV220S fails:

BMC Intrusion Sensor detects intrusion then returns to normal
BMC Planar Temp Sensor detects warning then returns to normal
BMC Riser Temp Sensor detects warning then returns to normal
BMC ROMB Battery sensor de-asserts and then re-asserts

Does this point toward a problem with the Perc 4DC?
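
(For reference, those BMC sensor events should also appear in the hardware/ESM log, which can be dumped from the OMSA CLI with:

    omreport system esmlog

Correlating its timestamps with the array failures would show whether the riser/BMC warnings consistently precede the drive drops.)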
LVL 3

Author Comment

by:ThePhreakshow
ID: 23722045
Now everything connected to Connector 0 of the Perc 4DC shows as failed in OMSA: EMMs, fans, PS1 and PS2, temperature probes...
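
(For reference, the enclosure-side status of those components can be listed from the OMSA CLI with something like the following, again assuming controller ID 0:

    omreport storage enclosure controller=0

which should report the state of each EMM, fan, power supply and temperature probe the controller can see.)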
LVL 3

Author Closing Comment

by:ThePhreakshow
ID: 31550697
Hard to say what the problem could have been... EMMs in the 220S, external cables, BMC, riser, Perc 4DC... For now, I have just removed the only thing that was different from the configuration a month ago: a 1394 FireWire card I had installed in this machine... Perhaps it was causing some problems on the riser... There are so many variables, and I have read quite a few other rants about the PV220S doing this same thing that could not be tracked down to one specific device or item that failed. Nature of the beast, I guess. Thanks.