Reseating RAID5 Failed Drive

We seem to be having two schools of thought around the office and I am having trouble finding any manufacturer's documentation or other facts to support arguments on either side.

We had a failed drive in a RAID5 array. Now first of all, I understand there's a lot of discussion out there about whether RAID5 is even necessary nowadays, but it is in this particular server. So that is not what this is about. The question is about what to do with the failed drive:

Past experience for some has been that tech support for the manufacturer recommended first testing and then reseating said failed drive to see if the problem was due to a controller glitch, drive being loose, etc.

Others say that you shouldn't reseat a failed drive because you risk corrupting the rest of the array during the rebuild process.

- I can understand if the drive is under warranty, you might as well replace it anyway, but that could take longer than rebuilding the failed drive.
- I also understand there is risk to another disk failing during a rebuild of the array, but is that somehow made worse if the drive you are rebuilding has previously failed?

I know there are certain rules on EE regarding third party website links, but if you could provide support to your arguments, especially with manufacturer's technical documents, that would be great. Not that this is my opinion, but I am especially interested in anything that specifically says why you shouldn't reseat an already failed drive.
Brian BEE Topic Advisor, Independant Technology ProfessionalAsked:

The first thing to do is make sure your last backup is 100% OK. If you aren't sure, make another full backup (I would do that even when the previous backup is OK). Once that is done, check via your RAID controller's management utility to see what it says about the connected disks. If it tells you a disk is bad, remove it and connect it to another PC without a RAID controller, so you can run the disk manufacturer's diagnostic utility on it. If the tool tells you the disk is bad, get a replacement; if it says it is OK, you can return it to the server and try getting the array rebuilt. The rebuild will stress the system, and if the other disks are not in good condition it could cause the array to fail, but then you can restore from the backup you just made.

Personally I'd suggest you always add an additional disk as a hot spare, which takes over automatically whenever a disk fails. That reduces the time during which the system is at risk of data loss. Also check the system's health regularly.
If they are telling you to reseat the drive then you're talking to an amateur. They should have asked to see controller logs first. Yanking out a drive degrades an array and guarantees data loss if you have a bad block elsewhere. It guarantees 100% data loss if another drive fails during the rebuild.

The most stressful thing you can do to an array is rebuild it.  Their advice requires you to put it at extreme risk.

Back up first.   Frankly I would also consider doing an online migration from RAID5->RAID6 if the controller allows it.  Then you can sleep at night.
Brian BEE Topic Advisor, Independant Technology ProfessionalAuthor Commented:
> Yanking out a drive degrades an array and guarantees data loss if you have a bad block elsewhere.
But wouldn't I also be doing that when I replace it with a new drive? Or do you mean if a block goes bad during the rebuild? Admittedly I would potentially be pulling a drive twice: once to see if it would recover, then a second time if the drive has to be replaced. So double the odds, I suppose.

As far as the whole hot spare or RAID6 argument goes, like I said it's not possible. Other servers in this shop do have hot spares.

As far as the reason against reusing the drive, the concern is that if there is corrupt data on the failed drive it might spread to the rest of the array when it rebuilds. The counterargument is: if a drive has been pulled, would the data on it not be considered garbage? I wouldn't think a controller would read the data on the drive at all if it was being rebuilt.
PowerEdgeTechIT ConsultantCommented:
> the reason against reusing the drive, the concern is that if there is corrupt data on the failed drive it might spread to the rest of the array when it rebuilds. The counterargument is: if a drive has been pulled, would the data on it not be considered garbage?

The controller doesn't care what is on the "offline" drive.  As long as the drive is inserted/replaced "hot", it will completely ignore the data on that drive, overwriting it with data to match the existing RAID 5 during the rebuild.
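To illustrate why the old contents of the reinserted drive don't matter, here is a minimal Python sketch of how a RAID 5 stripe is reconstructed by XOR. It is a toy model, not any particular controller's firmware; the 4-drive layout, the block contents, and the helper names `xor_blocks` and `rebuild_member` are all made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_member(stripe, target):
    """Recompute the block at index `target` from every other block in the stripe.

    The current contents of stripe[target] are never read.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != target]
    return xor_blocks(survivors)

# Hypothetical 4-drive stripe: three data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Drive 1 drops offline and is reinserted holding arbitrary leftover bytes.
stripe[1] = b"\xde\xad\xbe\xef"

# The rebuild overwrites it using only the surviving members.
stripe[1] = rebuild_member(stripe, target=1)
print(stripe[1])  # b'BBBB'
```

Whatever was on the reinserted drive is simply replaced; the result depends only on the other members of the stripe, which is also why a bad block on one of those members is what actually hurts a rebuild.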

Many controller, backplane, and drive firmware updates address issues that can cause drives to go offline, even if they are healthy, so it is important to keep firmware up to date, but dlethe is absolutely right ... there is no need to blindly reseat (or replace) a drive that goes offline when there is a controller log present that can be used to determine the reason for the failure.  It is a good idea to investigate any drive failure.  The drive can also be tested before rebuilding, so if the drive fails diagnostics, there is no point in even trying to rebuild it.

You didn't say which system or controller you are using, so this is general information ... for example, not all controllers have logs for review, in which case you are left in kind of a guessing game.  Since RAID is not a backup solution, no important system (RAID or not) should be without a solid backup plan.
If the controller doesn't have any logging mechanism, then you can bet it is a dumb controller and probably puts you at great risk to begin with.  If your O/S supports host-based software RAID then you are often much better off going RAID1 or RAID10.

(If Solaris, then you're better off with software RAID and ZFS than any PCI-based RAID anyway.) If Windows, the native software RAID is quite good and even has read load balancing and full event logging. Just don't use it for anything other than RAID1/RAID10. With software RAID1/10 you get much better read performance and more caching.
Brian BEE Topic Advisor, Independant Technology ProfessionalAuthor Commented:
The server is a new premium brand server, so I'll look and see if it has a controller log.

So is the theory that trying to reuse the failed drive could cause corruption of the rest of the data false? Again, I'm trying to look for documentation to back all of this up. Others seem to have this theory going and claim they have seen it happen in their previous workplaces. From what you experts have been telling me, I could see corrupt data and also a failed drive being caused by a bad controller, or something else in the write process, but not by reusing a failed disk (that was never bad in the first place).
Jim_NimSenior EngineerCommented:
If a drive in a RAID5 has failed, then the RAID set is already "degraded". There is no inherent danger in the action of removing/reseating the drive, as the data it contains is already invalid.

There is danger in rebuilding to this disk though, especially when you're using RAID5. Performing a rebuild operation involves reading data from the entire capacity of each remaining online drive, which may place a significantly higher level of stress on each of those disks than your normal day-to-day I/O does. This in turn can increase your risk of another drive failing due to mechanical errors.

The risk of a "double fault" (encountering a bad block while reading from all drives to calculate what data to write for the rebuild) exists whether you're rebuilding to a reseated drive or a replacement drive. Once a drive in a RAID5 has failed, the risk of a double fault is already in place, and you're almost equally likely to be affected by it whether you perform a backup or rebuild with either of the aforementioned methods.

The best course of action is to already have a backup of the data before a drive failure occurs, and to avoid using RAID5 for mission-critical or irreplaceable data - as others have mentioned, RAID6 is a much better option, with exponentially lower risks of double-fault occurrence after a single drive failure.

Source: Personal experience supporting enterprise storage solutions on a daily basis
Web searches for documentation: "raid double fault", "raid URE"
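For a rough feel for the double-fault risk described above, here is a back-of-the-envelope Python sketch. It assumes unrecoverable read errors (UREs) are independent and occur at a fixed rate per bit read; the default of 1e-14 is a figure commonly quoted on drive datasheets, not something taken from this thread, and the drive counts and sizes are hypothetical. Real drives often behave better or worse than this simple model.

```python
import math

def p_unreadable_during_rebuild(surviving_drives, drive_bytes, ure_per_bit=1e-14):
    """Chance of hitting at least one URE while reading every surviving drive in full.

    Computes 1 - (1 - p)^n in a numerically stable way, where n is the
    number of bits that must be read to complete the rebuild.
    """
    bits_read = surviving_drives * drive_bytes * 8
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# Hypothetical 4-drive RAID5 (3 surviving members) at a few drive sizes.
for terabytes in (1, 2, 4):
    p = p_unreadable_during_rebuild(3, terabytes * 10**12)
    print(f"{terabytes} TB drives: {p:.1%} chance of at least one URE")
```

Under these assumptions the 2 TB case comes out at roughly 38%. The number is the same whether the rebuild target is the reseated drive or a brand-new one, because only the surviving members are read.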
PowerEdgeTechIT ConsultantCommented:
> So is the theory that trying to reuse the failed drive could cause corruption of the rest of the data false?
Yes.  The only way it could cause corruption is if there was some kind of power malfunction with the drive that caused the other drives to go offline, but you would know there was an issue right away.  Another way it could cause corruption is STRICTLY user error - forcing the drive online instead of rebuilding it will corrupt the entire array.  There is otherwise no danger of the failed drive corrupting the array data like you describe.  A failed healthy disk is no different than a healthy replacement disk when it comes to rebuilding the array with it.
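The difference between forcing a drive online and rebuilding it can be sketched with the same XOR arithmetic RAID 5 uses. This is a toy model with made-up block contents and a hypothetical 4-drive stripe, not a description of any specific controller: a block is rewritten while the array is degraded, so the offline drive still holds the old version.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe before the failure: three data blocks plus parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Drive 1 goes offline still holding the old block.
stale_on_drive1 = d1

# While degraded, the host rewrites the logical block that lived on drive 1.
# Drive 1 cannot be written, so the new data is captured in the parity instead.
new_d1 = b"XXXX"
parity = xor_blocks([d0, new_d1, d2])

# Forced online: the array trusts the drive's contents as-is.
forced_online_read = stale_on_drive1

# Rebuilt: the block is recomputed from the surviving members.
rebuilt = xor_blocks([d0, d2, parity])

print(forced_online_read)  # b'BBBB' (old data returned silently)
print(rebuilt)             # b'XXXX' (matches what the host last wrote)
```

In this model the corruption comes entirely from trusting stale contents, not from the drive having previously failed.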

As far as documentation goes, I've been at this too long to know what documentation I may have absorbed along the way. I worked hundreds of cases as a level 2 support agent for Dell servers, and have worked with hundreds more Dell servers since then. You seem reluctant to name your server/controller, but the make/model can make a difference. Every controller model has its own glitches, features, and processes, and without knowing the specifics of their experiences, it's hard to draw a hard line for how ALL controllers will behave or how gracefully they handle failure.
No, there are failure scenarios. A drive fails for a reason. If the drive failed because it has no spare sectors left and you reuse it, then the disk can't be trusted to write the correct data at block X. If another drive then fails, all data on that stripe is lost.
Jim_NimSenior EngineerCommented:
In regards to this drive failure being on a new server... brand new drives do show up with immediate failures from time to time. It's very possible that the drive failed due to legitimate problems, but it's also possible that it was due to an error on the controller. As previously suggested, a controller log is likely to give more information about what types of problems occurred on that drive slot. Typically you'll have SCSI Sense Key data reported just before the drive failed, and an experienced support tech would be able to tell you whether the particular errors seen are indicative of a drive problem, or just a fluke or controller communication problem that you can ignore in this instance.
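As a small aid for reading such logs, here is a Python lookup of the standard SCSI sense key values. The key names follow the SCSI standard; the one-line triage hints are only a rough rule of thumb added for illustration (an assumption, not vendor guidance), and the function name `describe_sense_key` is made up. How a given controller prints sense data in its log varies, so this only covers the 4-bit sense key itself.

```python
# Standard SCSI sense key names; values not listed are treated as unlisted here.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Rough triage hints (illustrative assumption, not from any vendor document).
HINTS = {
    0x1: "drive needed retries or correction; worth watching",
    0x3: "unreadable or unwritable media; points at the drive",
    0x4: "drive reported an internal hardware fault",
    0xB: "command was aborted; can be a cabling, backplane, or controller issue",
}

def describe_sense_key(sense_key):
    """Return (name, hint) for the low 4 bits of a sense key value."""
    key = sense_key & 0x0F
    name = SENSE_KEYS.get(key, "UNLISTED")
    hint = HINTS.get(key, "no specific hint; check the additional sense codes")
    return name, hint

for key in (0x3, 0xB, 0x6):
    name, hint = describe_sense_key(key)
    print(f"0x{key:X} {name}: {hint}")
```

The additional sense code and qualifier that accompany the key carry most of the detail, so treat the key alone as a first pass.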

Brian BEE Topic Advisor, Independant Technology ProfessionalAuthor Commented:
<edit> It's an HP P420i controller, but <end edit> I'm trying to keep the issue general and not specific to a certain kind of hardware. I'll see if I can find any sort of controller logs. Oddly enough, when I called support they just sent a replacement drive (well, a replacement drive and a tech to install it for me). Never asked any questions about logs or anything of the sort.
Brian BEE Topic Advisor, Independant Technology ProfessionalAuthor Commented:
Thanks for helping me understand the problem better.

Question has a verified solution.