Solved

Reseating RAID5 Failed Drive

Posted on 2014-07-25
Last Modified: 2014-07-28
We seem to be having two schools of thought around the office and I am having trouble finding any manufacturer's documentation or other facts to support arguments on either side.

We had a failed drive in a RAID5 array. Now, first of all, I understand there's a lot of discussion out there about whether RAID5 is even necessary nowadays, but it is in this particular server, so that is not what this is about. The question is about what to do with the failed drive:

Past experience for some has been that tech support for the manufacturer recommended first testing and then reseating said failed drive to see if the problem was due to a controller glitch, drive being loose, etc.

Others say that you shouldn't reseat a failed drive because you risk corrupting the rest of the array during the rebuild process.

- I can understand if the drive is under warranty, you might as well replace it anyway, but that could take longer than rebuilding the failed drive.
- I also understand there is risk to another disk failing during a rebuild of the array, but is that somehow made worse if the drive you are rebuilding has previously failed?

I know there are certain rules on EE regarding third party website links, but if you could provide support to your arguments, especially with manufacturer's technical documents, that would be great. Not that this is my opinion, but I am especially interested in anything that specifically says why you shouldn't reseat an already failed drive.
Question by:Brian B
12 Comments
 
LVL 87

Assisted Solution

by:rindi
rindi earned 84 total points
ID: 40219395
The first thing to do is make sure your last backup is 100% OK. If you aren't sure, make another full backup (I would do that even when the previous backup is OK). Once that is done, check via your RAID controller's management utility to see what it says about the connected disks. If it tells you a disk is bad, remove it and connect it to another PC without a RAID controller, so you can run the disk manufacturer's diagnostic utility on it. If the tool tells you the disk is bad, get a replacement. If it says the disk is OK, you can return it to the server and try getting the array rebuilt. This will stress the system, and if the other disks are in not so good a condition, it could cause the array to fail, but then you can restore from the backup you just made.

Personally I'd suggest you always add an additional disk as a hot spare, which takes over automatically whenever a disk fails. That reduces the time during which the system is at risk of data loss. Also check the system's health regularly.
 
LVL 47

Expert Comment

by:dlethe
ID: 40219466
If they are telling you to reseat the drive then you're talking to an amateur. They should have asked to see controller logs first. Yanking out a drive degrades an array and guarantees data loss if you have a bad block elsewhere. It guarantees 100% data loss if another drive fails during the rebuild.

The most stressful thing you can do to an array is rebuild it.  Their advice requires you to put it at extreme risk.

Back up first. Frankly I would also consider doing an online migration from RAID5->RAID6 if the controller allows it. Then you can sleep at night.
 
LVL 23

Author Comment

by:Brian B
ID: 40219538
> Yanking out a drive degrades an array and guarantees data loss if you have a bad block elsewhere.
But wouldn't I also be doing that when I replace it with a new drive? Or do you mean if a block goes bad during the rebuild? Admittedly I would potentially be pulling a drive twice: once to see if it would recover, then a second time if the drive has to be replaced. So double the odds, I suppose.

As far as the whole hot spare or RAID6 argument goes, like I said it's not possible. Other servers in this shop do have hot spares.

As far as the reason against reusing the drive, the concern is that if there is corrupt data on the failed drive it might spread to the rest of the array when it rebuilds. The counterargument is if a drive has been pulled, would the data on it not be considered garbage? I wouldn't think a controller would read the data on the drive at all if it was being rebuilt.
 
LVL 32

Assisted Solution

by:PowerEdgeTech
PowerEdgeTech earned 166 total points
ID: 40219736
> the reason against reusing the drive, the concern is that if there is corrupt data on the failed drive it might spread to the rest of the array when it rebuilds. The counterargument is if a drive has been pulled, would the data on it not be considered garbage?

The controller doesn't care what is on the "offline" drive.  As long as the drive is inserted/replaced "hot", it will completely ignore the data on that drive, overwriting it with data to match the existing RAID 5 during the rebuild.
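
To illustrate why the old contents don't matter, here is a minimal sketch of single-parity (XOR) reconstruction for one stripe. It assumes a hypothetical 4-drive RAID 5 with made-up 4-byte blocks, and it is a simplified model of the idea, not how any particular controller's firmware is written. The names `xor_blocks`, `stale`, and `survivors` are invented for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of a hypothetical 4-drive RAID 5: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Drive 1 goes offline and is later reseated with whatever it still holds.
failed = 1
stale = b"????"  # leftover contents of the reseated drive; never read below

# The rebuild reads only the surviving drives for this stripe.
survivors = [block for i, block in enumerate(stripe) if i != failed]

# XOR of the survivors yields the missing block, which overwrites the stale one.
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[failed])
```

Note that `stale` never feeds into the computation: in this model the rebuilt block depends only on the other drives, which is the same whether the target is a reseated drive or a brand new one.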

Many controller, backplane, and drive firmware updates address issues that can cause drives to go offline, even if they are healthy, so it is important to keep firmware up to date, but dlethe is absolutely right ... there is no need to blindly reseat (or replace) a drive that goes offline when there is a controller log present that can be used to determine the reason for the failure.  It is a good idea to investigate any drive failure.  The drive can also be tested before rebuilding, so if the drive fails diagnostics, there is no point in even trying to rebuild it.

You didn't say which system or controller you are using, so this is general information ... for example, not all controllers have logs for review, in which case you are left in kind of a guessing game.  Since RAID is not a backup solution, no important system (RAID or not) should be without a solid backup plan.
 
LVL 47

Expert Comment

by:dlethe
ID: 40219835
If the controller doesn't have any logging mechanism, then you can bet it is a dumb controller and probably puts you at great risk to begin with.  If your O/S supports host-based software RAID then you are often much better off going RAID1 or RAID10.

(If Solaris, then you're better off with software RAID and ZFS than any PCI-based RAID anyway.) If Windows, the native software RAID is quite good and even has read load balancing and full event logging. Just don't use it for anything other than RAID1/RAID10. With software RAID1/10 you get much better read performance and more caching.
 
LVL 23

Author Comment

by:Brian B
ID: 40219953
The server is a new premium brand server, so I'll look and see if it has a controller log.

So is the theory that trying to reuse the failed drive could cause corruption of the rest of the data false? Again I'm trying to look for documentation to back all of this up. Others seem to have this theory going and claim they have seen it happen in their previous workplaces. From what you experts have been telling me, I could see corrupt data, and also a failed drive, being caused by a bad controller or something else in the write process, but not by reusing a failed disk (that was never bad in the first place).
 
LVL 2

Assisted Solution

by:Jim_Nim
Jim_Nim earned 167 total points
ID: 40219959
If a drive in a RAID5 has failed, then the RAID set is already "degraded". There is no inherent danger in the action of removing/reseating the drive, as the data it contains is already invalid.

There is danger in rebuilding to this disk though, especially when you're using RAID5. Performing a rebuild operation involves reading data from the entire capacity of each remaining online drive, which may place a significantly higher level of stress on each of those disks than your normal day-to-day I/O does. This in turn can increase your risk of another drive failing due to mechanical errors.

The risk of a "double fault" (encountering a bad block while reading from all drives to calculate what data to write for the rebuild) exists whether you're rebuilding to a reseated drive or a replacement drive. Once a drive in a RAID5 has failed, the risk of a double fault is already in place, and you're almost equally likely to be affected by it whether you perform a backup or rebuild with either of the aforementioned methods.
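
As a rough back-of-the-envelope sketch of that double-fault risk, the snippet below estimates the chance of hitting at least one unrecoverable read error (URE) while reading every surviving drive in full. The figures are assumptions for illustration only: a URE rate of 1 per 10^14 bits (a number often quoted on consumer drive spec sheets), a hypothetical 4-drive array of 2 TB disks, and independent errors, which real drives do not strictly obey. `p_unreadable` is a made-up helper name.

```python
import math

def p_unreadable(bytes_read, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error over bytes_read bytes.

    Assumes each bit fails independently with probability ure_per_bit.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)**bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12

# Hypothetical 4-drive RAID 5 of 2 TB disks: a rebuild reads the 3 survivors in full.
surviving_drives = 3
drive_size = 2 * TB
p = p_unreadable(surviving_drives * drive_size)
print(f"Estimated chance of a URE during the rebuild: {p:.1%}")
```

The rebuild target does not appear anywhere in the calculation, which matches the point above: under this model the exposure comes from reading the surviving drives, not from whether the target is reseated or new.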

The best course of action is to already have a backup of the data before a drive failure occurs, and to avoid using RAID5 for mission-critical or irreplaceable data - as others have mentioned, RAID6 is a much better option, with exponentially lower risks of double-fault occurrence after a single drive failure.

Source: Personal experience supporting enterprise storage solutions on a daily basis
Web searches for documentation: "raid double fault", "raid URE"
 
LVL 32

Assisted Solution

by:PowerEdgeTech
PowerEdgeTech earned 166 total points
ID: 40220021
> So is the theory that trying to reuse the failed drive could cause corruption of the rest of the data false?
Yes. One way it could cause corruption is if there was some kind of power malfunction with the drive that caused the other drives to go offline, but you would know there was an issue right away. The other way is STRICTLY user error: forcing the drive online instead of rebuilding it will corrupt the entire array. There is otherwise no danger of the failed drive corrupting the array data like you describe. A healthy disk that was marked failed is no different from a healthy replacement disk when it comes to rebuilding the array with it.
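
The difference between forcing a drive online and rebuilding it can be sketched with one stripe of XOR parity. This is a simplified model under stated assumptions (a hypothetical 4-drive RAID 5, made-up 4-byte blocks, one write landing while the array is degraded), not a description of what any specific controller does internally. All names are invented for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Stripe contents at the moment drive 1 drops out (hypothetical values).
d0, d1_old, d2 = b"AAAA", b"BBBB", b"CCCC"

# While degraded, the array accepts a write to the block that lives on drive 1.
# The drive is offline, so the new value is recorded only through parity.
d1_new = b"XXXX"
parity = xor_blocks([d0, d1_new, d2])

# Option 1: force the drive online. Its stale block is trusted as-is.
forced_read = d1_old
stripe_consistent = xor_blocks([d0, forced_read, d2, parity]) == bytes(4)

# Option 2: rebuild. The block is recomputed from the other drives.
rebuilt = xor_blocks([d0, d2, parity])

print(forced_read == d1_new, stripe_consistent, rebuilt == d1_new)
```

In this model the forced-online drive returns old data and leaves the stripe's parity inconsistent, while the rebuild recovers the value written during the degraded period.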

As far as documentation goes, I've been at this too long to know what documentation I may have absorbed along the way. I worked hundreds of cases as a level 2 support agent for Dell servers, and have worked with hundreds more Dell servers since then. You seem reluctant to name your server/controller, but the make/model can make a difference. Every controller model has its own glitches, features, and processes, and without knowing the specifics of your colleagues' experiences, it's hard to draw a hard line for how ALL controllers will behave or how gracefully they handle failure.
 
LVL 47

Assisted Solution

by:dlethe
dlethe earned 83 total points
ID: 40220052
No, there are failure scenarios. A drive fails for a reason. If the drive failed because there are no more spare sectors and you reuse it, then the disk can't be trusted to write the correct data at block X. If another drive then fails, all data on that stripe is lost.
 
LVL 2

Accepted Solution

by:Jim_Nim
Jim_Nim earned 167 total points
ID: 40220183
In regards to this drive failure being on a new server... brand new drives do show up with immediate failures from time to time. It's very possible that the drive failed due to legitimate problems, but it's also possible that it was due to an error on the controller. As previously suggested, a controller log is likely to give more information about what types of problems occurred on that drive slot. Typically you'll have SCSI Sense Key data reported just before the drive failed, and an experienced support tech would be able to tell you whether the particular errors seen are indicative of a drive problem, or just a fluke or controller communication problem that you can ignore in this instance.
 
LVL 23

Author Comment

by:Brian B
ID: 40220330
<edit> It's an HP P420i controller, but <end edit> I'm trying to keep the issue general and not specific to a certain kind of hardware. I'll see if I can find any sort of controller logs. Oddly enough, when I called support they just sent a replacement drive (well, a replacement drive and a tech to install it for me). Never asked any questions about logs or anything of the sort.
 
LVL 23

Author Closing Comment

by:Brian B
ID: 40225185
Thanks for helping me understand the problem better.