Two RAID 5 disks have died. Is my disk controller bad?

The history of today's problem is complicated so I will start with the following oversimplification.

I have a 3-disk RAID 5 array.
During a recent ShadowProtect system restore, I got an amber light on hot-swap Disk #1.
The Dell OpenManage utility showed:
    Drive #1: offline (?), predicted failure = yes
    Drive #2: online, failure = no
    Drive #3: online, failure = no

I swapped in a brand new disk, and all 3 disks started flashing as the server automatically started the rebuild process.  I immediately noted that Disk #2 was alternating between green and amber.

The rebuild finished, and we now have a working server with a degraded array.

Before I replace the second faulty disk, I want to examine the controller to verify that it is working properly.
I suspect the controller because two drives failing within 2 days of each other is probably not a coincidence.

I am hoping someone in this forum can help me troubleshoot the issue. Dell tech support says the server is out of warranty and too old to simply buy an extension; a repair would be time and materials.

------  more painful details follow --------.

The above oversimplification may be all you experts need to give me advice.  Nonetheless, there are lots of factors that might be helpful in resolving the issue.

* The problem is on the server named "ServerInAttic", which is kept in my attic in case we need disaster recovery. The production server is named "ServerProd" and is working fine. This means I am in no hurry to resolve the issue, nor do I need to recover the data in the array. This is a wonderful luxury when it comes to a damaged server.
* My server is a Dell PowerEdge T300 running Small Business Server 2003.
* Our Backup software is ShadowProtect
* The RAID 5 array originally had 3 Seagate 222 GB disks.
* That exact model of disk is no longer available, so the replacement is the 500 GB disk that Dell recommended. Perhaps the RAID controller is not properly handling the different disk sizes?
* ServerProd has a more recent BIOS than ServerInAttic. Perhaps that somehow caused a problem when I used ShadowProtect to restore ServerProd data to ServerInAttic?

==> The following things happened before the amber light came on, so I do not think they have anything to do with the failure.  I present them so you can get a complete picture.

Unlikely * I had physically moved the server by car about a week before the problem. No special packing was done.  Perhaps a bumpy car ride led to a loose internal cable? (That seems unlikely because the server booted normally several times after the move.)

Unlikely * Before the problem occurred, I ran the Dell SUU (Server Update Utility) on ServerInAttic in an attempt to upgrade the BIOS. I abandoned that effort when I discovered it required an 8 GB media download. Perhaps the SUU somehow damaged the RAID?

Unlikely * I was intentionally increasing the size of the server's 4 primary partitions. I used Windows Disk Management to delete and then recreate the D:, E:, and F: volumes. I had intended to do a quick format on all 3 volumes, but accidentally did a full format on the D: volume. Perhaps a full format could have damaged the RAID?

Unlikely * The paging file had originally been spread across C: and E:, which is contrary to RAID 5 "best practice". I reconfigured it to be only on the C: drive. I rebooted and the RAID still worked fine, so that seems unlikely to be the root cause of my problem.
===> Here is where things went wrong.
* After reconfiguring the partitions, I booted ServerInAttic from a ShadowProtect DVD and attempted to restore C:, D:, E:, and F: with yesterday's data from ServerProd. All 4 restores aborted.
* I rebooted and the amber light came on for the first time. During the boot, I got the following messages.

    Foreign configurations found on adapter
    Press any key to continue, or C to load the configuration utility,
    or F to import foreign configurations and continue

    There are offline or missing virtual drives with preserved cache.
    Please check the cables and ensure that all drives are present.
    Press any key to enter the configuration utility.

* I rebooted again and took the F option and got the following messages during bootup.

    Foreign configurations found on adapter
    Press any key to continue, or C to load the configuration utility,
    or F to import foreign configurations and continue

    1 virtual drive found on the host adapter
    1 virtual drive degraded
    1 virtual drive handled by BIOS

* I was then able to get ShadowProtect to restore C:, E:, and F:. (I am intentionally skipping D: because it is very time-consuming to restore.)

* I left the server running and purchased the new disk.
* I swapped in the new disk, and all 3 disks started flashing as the server automatically started the rebuild process. I immediately noted that Disk #2 was alternating between green and amber.

* After the rebuild completed, the light remained amber,  so I tried reseating the drive.  A few minutes later the server rebooted all by itself with the following messages.

    Foreign configurations found on adapter
    Press any key to continue, or C to load the configuration utility,
    or F to import foreign configurations and continue

    1 virtual drive found on the host adapter
    1 virtual drive degraded
    1 virtual drive handled by BIOS

* OpenManage shows the following:
    0.0.0  Online, failure = no
    0.0.1  Foreign, failure = yes
    0.0.2  Online, failure = no

* The only tasks available on 0.0.1 are "Blink" and "Unblink".  I tried both, but they did not seem useful.

Sorry to be so long-winded. I hope someone can help me figure out if the controller is bad.


serialband commented:
They have RAID 6 now because having 2 drives fail within a week of one another is actually a more frequent occurrence than you think.  You bought the RAID and drives at the same time.  The disks are likely from the same batch.  The RAID companies now try to mix up batches if they can, but you can't completely avoid that.

It's less likely that your drive controller went bad and more likely that your bad drives are from the same batch. You should have a cold spare around if you're only using RAID 5 or RAID 3.  Which disk did you first replace?

Also, you moved your RAID.  It was on 24/7 and you shut it off.  How long was it off?  How long was it in operation? If it was in use for more than 2-3 years and you had it off more than 24 hours, then your disks could easily have failed.  Unless your system was relatively new, it's better to just buy a new RAID instead of shutting off the old one for more than 24 hours and moving it.

I had an old linux server running 24/7 for 6 years straight.  When I had to shut it down for a move, the disks died.  Prior to that, I would periodically shut it down for maintenance and dust cleaning and it would come up.
andyalder commented:
Having two drives fail within days of each other is not uncommon. This is your backup server, so it has very little disk work to do; then one disk fails and the other two have to work twice as hard until you replace the failed disk. When you do replace it, the other two have to work flat-out until the rebuild is complete. It's like running for a bus today when you've only walked to the car every other day this year.

>After the rebuild completed, the light remained amber,
Flashing or solid?
rberke (Consultant, Author) commented:
I forgot to mention that my dead drives have been working hard for 5 years, so Andy's hypothesis is not related to my situation.

ServerInMyAttic  had been running 24/7 as the production server since 2009.  A month ago I moved it from the office to my attic as part of the partition resizing project.  That was why it went for its car ride.

serialband suggests the disks came from the same batch, so they wear out at the same time, which fits my situation perfectly.
andyalder adds a novel hypothesis -- that a disk that previously did light duty is somehow more fragile when a heavy load comes along. That does not match my situation at all, but Andy couldn't have known that.

Since both drives came from the same batch, and both had the same amount of heavy usage, I suspect I just got unlucky.

After the rebuild completed, the light remained FLASHING amber/green.  When I take the Blink action, it turns to flashing green, but the drive remains "Foreign, Fail = Yes". When I take the Unblink action, it returns to flashing amber/green.
rberke (Consultant, Author) commented:
By the way, after ServerInMyAttic went for the car ride, it was off for 2 weeks before I reconfigured it.

So, what is the lesson to be learned?
1) Every time you turn off an old server, you should be prepared for it not to come back on?
2) People should always have 3 spare hard drives handy, just in case?
3) People should have rock-solid restore documentation, including RAID reconfiguration documentation?

All of the above?
Your disks have been on 24x7, but they haven't been running full tilt, or you would be complaining that as soon as one failed, ShadowProtect started erroring because it couldn't keep up. Maybe you did say ShadowProtect started erroring with one disk down; it's hard to tell with all the extra waffle.
The remaining 2 disks in a 3 disk RAID do not do more work when one disk fails.  They work the same.  They all run at the same time in a RAID and they're synchronized in a RAID 3.  One disk is a checksum and having one fail just means you lack checksum protection.  It does not write more data to the remaining disks.  The RAID controller manages the data and access.  During a rebuild, all disks experience heavier activity, but they're not doing more work than they would otherwise in a normal rebuild.

They also don't have to be running full tilt to fail after 5 years. They just have to be powered and spinning. As long as the disk is powered and spinning, it is wearing out. The disks remain spinning 24x7, and shutting them down for an extended period after 5 years causes the problems; that's when his first disk failed. If your system is constantly powered on and off, such as with most office desktops, the disks may last a bit longer.

The lesson should be that disks and RAID running 24x7 need to be replaced in 3-5 years regardless of how much usage they experience. A disk will likely fail in that time; plus, you'll have better, larger-volume disks by that time. RAID is for uptime and speed. RAID is not backup. << -- I keep telling people this and they still don't believe me. They run RAID without backups until it fails. It was never designed for backup. Backup is orthogonal to what RAID is. A backup is essentially another copy, not a checksum (even if you use a RAID 1 mirror). You can use RAID to back up another RAID, but a single RAID by itself is not a backup solution. You use RAID for storage and speed, and add more disks to fill the SCSI/SATA/SAS channel to get the necessary data throughput.
Thomas Rush commented:
I absolutely agree that RAID is not backup.
RAID (other than RAID 0) protects against a disk failure, but that's it.  There's no protection against a virus, controller failure, Cryptolocker, a flood, an embezzler cooking the books, and on and on.  You still need a backup.

You can greatly decrease the chance of a RAID failure (other than RAID 0!) by having an online spare, if supported by your controller and disk enclosure.   To implement this, you buy one "extra" disk and load it into your enclosure, but instead of making it part of the RAID set, you tell the controller to use it as a hot spare.  Now, as soon as any drive in the array fails, the controller will rebuild its data onto the hot spare.  You're still going to lose all your data from a RAID 5 set if a second disk fails before that rebuild is complete, but the chance of that happening is significantly less than if you had to wait to notice the failure, and spend a few hours or days to get the failing disk replaced.  Note: This is still not backup!

Oh -- I agree with the others; this sounds like normal disk mortality, and not like a controller failure.
rberke (Consultant, Author) commented:
I replaced the second failing disk and the rebuild went smoothly. Now all 3 drives are online, with fail=no.  
So, I will close this question shortly.

A minor point -- "One disk is a checksum" is true for RAID 4, but not in RAID 5. In RAID 5, all disks hold both parity and data. This causes the physical disk usage to be spread evenly between the disks. Over a 5-year period, all 3 disks will get approximately the same amount of I/O, which makes the likelihood of a double failure far greater than I initially expected.
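Since the parity point came up, the underlying math is easy to demonstrate. Here is a minimal, purely illustrative Python sketch (real controllers do this on whole stripes in firmware, not like this) showing that XOR parity lets any one lost member of a stripe be rebuilt from the survivors:

```python
# Illustrative only: RAID 5 parity is a bytewise XOR across the stripe.
# With one disk lost, XOR-ing the surviving members recovers its contents.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# A toy 3-disk stripe: two data blocks plus their parity block.
data1 = b"hello world 1234"
data2 = b"sixteen bytes ok"
parity = xor_blocks([data1, data2])

# Simulate losing any one disk: the other two rebuild it.
assert xor_blocks([data2, parity]) == data1   # disk 1 lost
assert xor_blocks([data1, parity]) == data2   # disk 2 lost
assert xor_blocks([data1, data2]) == parity   # parity block lost
```

The same XOR relationship is why a rebuild has to read every surviving disk end to end: each missing block is recomputed from the corresponding blocks on all the others.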

So, I now completely agree that a double failure is quite likely when 3 disks from the same batch are continuously run for 5 years on a RAID 5 server. It is also intuitive that having the server turned off for a week may be an aggravating condition.

I have one remaining question.

What would have happened if a 4th disk had been installed 5 years ago and configured as a hot spare?

Would disks 1 to 4 all be nearing end of life, OR would disks 1 to 3 be old, while disk 4 was only middle age?

Wear and tear on disks occurs whenever a disk is spinning. Presumably more wear and tear is added when there are I/Os against the disk. But I have no idea if there are guidelines that quantify the two types of wear and tear.

Anyhow, it sounds like the following strategy might be pretty good for a new server:
1) Configure a hot spare when the server is first installed.
2) Every 2 years, decommission the oldest disk, promote the hot spare into the array, and install a newly purchased disk as the new hot spare.

In a future question, I may ask for details on how to accomplish that.
Of course the other disks in a RAID 5 with one disk down have to work harder; any read that should come from the failed disk comes from the other ones instead. I've seen servers where the users have had to stop work until rebuilding is finished because performance has been so poor with a failed disk. Since you're not a storage engineer it doesn't matter much that you're wrong, since you aren't going to be passing that misinformation to others. Bye, I'm unsubbing.
They only work harder during the rebuild. They do not work harder otherwise. They're always reading and writing that checksum anyway. The rebuild is accessing all disks. You can run with one failed disk and it's not going to be slower until you replace the failed disk and the rebuild starts. If the RAID slowed down when it wasn't doing a rebuild, then you were using a low-end RAID. I've also had my mid-to-high-end RAIDs rebuild without significant slowdowns.
rberke (Consultant, Author) commented:
It's been several years since I studied this, but I am pretty sure Andy is correct about read operations. With RAID 5, data is normally available on two disks, and the controller gets the data from the "closest" disk. For instance, disk 1 might have to move the head 2 cylinders, while disk 2 might have to move its head 5 cylinders.

If one of the 3 disks is down, the other 2 disks will each work 50% harder than normal for read operations. And half of those reads will be slower than normal because the heads will have to move farther.

But, in a way, serialband is also correct. When a new disk is hot-swapped in, all 3 disks will start working full blast for a couple of hours. During that time they will all be working WAAYYY more than 50% harder, and it would not be surprising if that is when an old disk crashes. Nor would it be surprising if users had to stop work while the rebuild was in progress.
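To put rough numbers on the degraded-mode load: under the standard striped-parity model (no mirroring), each survivor's read load in a 3-disk array roughly doubles. The following toy Python calculation makes that explicit; the assumptions (uniform random single-block reads, no caching) are simplifications, so treat the factor as illustrative rather than exact:

```python
# Toy model of per-disk read load in a 3-disk RAID 5, healthy vs. degraded.
# Assumptions (illustrative only): uniform random single-block reads,
# no caching; a read landing on the failed disk is served by reading the
# corresponding block from BOTH survivors and XOR-ing them together.

N_DISKS = 3
READS = 9000  # total client block reads, spread evenly across the array

# Healthy array: each disk serves its own share directly.
healthy_per_disk = READS // N_DISKS                 # 3000 reads per disk

# Degraded array (1 disk down):
direct_per_survivor = (READS * 2 // N_DISKS) // 2   # reads it serves directly
reconstructed = READS // N_DISKS                    # reads aimed at the dead disk;
                                                    # each touches BOTH survivors
degraded_per_disk = direct_per_survivor + reconstructed

print(healthy_per_disk)   # 3000
print(degraded_per_disk)  # 6000 -- roughly double, in this toy model
```

With more members in the array the multiplier changes, and writes, caching, and controller behavior change the picture again, which is presumably why different people see such different degraded-mode performance.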

I may not be a storage engineer, but I think that 95% of what I have said here is correct, and uncontroversial.

I looked that up and it appears I was wrong. I've worked with RAID systems with many more disks, so the slowdown was never noticeable, which is why I made an incorrect assumption. I also work with higher-end RAID where I don't see a significant slowdown during a rebuild.
rberke (Consultant, Author) commented:
Thanks to all