Robert Berke asked:

Two Raid 5 disks have died. Is my disk controller bad?

The history of today's problem is complicated so I will start with the following oversimplification.


I have a 3 disk RAID 5 array.
During a recent ShadowProtect system restore, I got an amber light on hot swap Disk #1.
The Dell OpenManage utility showed:
    Drive #1 offline(?), predicted failure = yes
    Drive #2 online, failure = no
    Drive #3 online, failure = no
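
For reference, the same status can also be pulled from the OMSA command line. Here is a rough Python sketch; it assumes the omreport CLI is installed and that the PERC is controller 0, and the field labels can differ a little between OMSA versions:

    # Rough sketch: list physical disk state via Dell OMSA's "omreport" CLI.
    # Assumptions: omreport is on the PATH and the PERC is controller 0.
    import subprocess

    def pdisk_report(controller=0):
        out = subprocess.run(
            ["omreport", "storage", "pdisk", f"controller={controller}"],
            capture_output=True, text=True, check=True,
        ).stdout
        # Keep only the fields of interest; exact labels vary by OMSA version.
        for line in out.splitlines():
            if line.startswith(("ID", "Status", "State", "Failure Predicted")):
                print(line.strip())

    if __name__ == "__main__":
        pdisk_report()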

I swapped in a brand new disk, and all 3 disks started flashing as the server automatically started the rebuild process.  I immediately noted that Disk #2 was alternating between green and amber.

The rebuild finished, and we now have a working server with a degraded array.

Before I replace the second faulty disk, I want to examine the controller to verify that it is working properly. I suspect the controller because having two drives fail within 2 days of each other is probably not a coincidence.

I am hoping someone in this forum can help me troubleshoot the issue. Dell tech support says that the server is out of warranty and is too old to simply buy an extension; a repair would be time and materials.
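
One way to gather evidence either way is to export the controller's own log and check whether the recorded errors are media/sense errors against specific drives (which points at the disks) or controller-level faults and resets. A rough sketch, assuming OMSA's omconfig CLI is installed, the PERC is controller 0, and the exported log lands in the Windows directory under an lsi_*.log name (location and file name vary by OMSA version):

    # Rough sketch: export the PERC controller log with OMSA and scan it.
    # Assumptions: omconfig is on the PATH, the PERC is controller 0, and the
    # export lands in C:\Windows as lsi_*.log -- adjust LOG_GLOB if it differs.
    import glob
    import subprocess

    CONTROLLER = 0
    LOG_GLOB = r"C:\Windows\lsi_*.log"

    subprocess.run(
        ["omconfig", "storage", "controller",
         "action=exportlog", f"controller={CONTROLLER}"],
        check=True,
    )

    for path in glob.glob(LOG_GLOB):
        with open(path, errors="ignore") as fh:
            for line in fh:
                # Media problems tend to show up as medium/sense errors on a
                # specific physical disk; repeated controller resets or
                # firmware faults tell a different story.
                if any(k in line.lower() for k in ("medium error", "sense", "reset", "fault")):
                    print(path, line.rstrip())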


------  more painful details follow --------.

The above oversimplification may be all you experts need to give me advice.  Nonetheless, there are lots of factors that might be helpful in resolving the issue.

* The problem is on the server named "ServerInAttic", which is kept in my attic in case we need disaster recovery. The production server is named "ServerProd" and is working fine. This means I am in no hurry to resolve the issue, nor do I need to recover the data in the arrays. This is a wonderful luxury when it comes to a damaged server.
* My server is a Dell PowerEdge T300 running Small Business Server 2003.
* Our backup software is ShadowProtect 3.3.0.16.
* The RAID 5 array originally had 3 Seagate 222 GB disks.
* That exact model disk is no longer available, so the replacement is the 500 GB disk that Dell recommended. Perhaps the RAID controller is not properly handling the different disk sizes?
* ServerProd has a more recent BIOS than ServerInAttic. Perhaps that somehow caused a problem when I used ShadowProtect to restore ServerProd data to ServerInAttic?

==> The following things happened before the amber light came on, so I do not think they have anything to do with the failure.  I present them so you can get a complete picture.

Unlikely * I had physically moved the server by car about a week before the problem. No special packing was done.  Perhaps a bumpy car ride led to a loose internal cable? (That seems unlikely because the server booted normally several times after the move.)

Unlikely * Before the problem occurred I ran Dell SUU (server update utility) on ServerInAttic in an attempt to upgrade the BIOS.  I abandoned that effort when I discovered it required an 8 gig media download.  Perhaps the SUU somehow damaged the Raid?  

Unlikely * I was intentionally increasing the size of the server's 4 primary partitions. I used Windows Disk Manager to delete and then recreate the D:, E: and F: volumes. I had intended to do a quick format on all 3 volumes, but accidentally did a full format on the D: volume. Perhaps a full format could have damaged the RAID?

Unlikely * The paging file had originally been spread across C: and E:, which is contrary to RAID 5 "best practice". I reconfigured it to be only on the C: drive. I rebooted and the RAID still worked fine, so that seems unlikely to be a root cause of my problem.
===> Here is where things went wrong.
* After reconfiguring the partitions, I booted ServerInAttic from a ShadowProtect DVD and attempted to restore C:, D:, E: and F: with yesterday's data from ServerProd. All 4 restores aborted.
* I rebooted and the amber light came on for the first time. During the boot, I got the following messages.

    Foreign configurations found on adapter
    Press any key to continue, or C to load the configuration utility,
    or F to import foreign configurations and continue

    There are offline or missing virtual drives with preserved cache.
    Please check the cables and ensure that all drives are present.
    Press any key to enter the configuration utility.

* I rebooted again and took the F option and got the following messages during bootup.

    Foreign configurations found on adapter
    Press any key to continue, or C to load the configuration utility,
    or F to import foreign configurations and continue

    1 virtual drive found on the host adapter
    1 virtual drive degraded
    1 virtual drive handled by BIOS

* I was then able to get ShadowProtect to restore C:, E: and F:. (I am intentionally skipping D: because it is very time consuming to restore.)

* I left the server running and purchased the new disk.
* I swapped in the new disk, and all 3 disks started flashing as the server automatically started the rebuild process.  I immediately noted that Disk #2 was alternating between green and amber.

* After the rebuild completed, the light remained amber,  so I tried reseating the drive.  A few minutes later the server rebooted all by itself with the following messages.

    Foreign configurations found on adapter
    Press any key to continue, or C to load the configuration utility,
    or F to import foreign configurations and continue

    1 virtual drive found on the host adapter
    1 virtual drive degraded
    1 virtual drive handled by BIOS

* OpenManage shows the following:
    0.0.0  Online, failure = no
    0.0.1  Foreign, failure = yes
    0.0.2  Online, failure = no

* The only tasks available on 0.0.1 are "Blink" and "Unblink".  I tried both, but they did not seem useful.
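
For what it's worth, OMSA can also act on a foreign configuration from the command line instead of the boot-time utility, although with the drive also predicting failure, replacing it is probably the right move anyway. A rough sketch, assuming controller 0 and a current OMSA version (command names may differ on older releases):

    # Rough sketch: inspect, then import or clear a foreign configuration
    # with OMSA's omconfig.  Assumptions: the PERC is controller 0 and this
    # OMSA version supports the importforeignconfig/clearforeignconfig actions.
    import subprocess

    CONTROLLER = 0

    def foreign_config(action):
        # action: "importforeignconfig" or "clearforeignconfig"
        subprocess.run(
            ["omconfig", "storage", "controller",
             f"action={action}", f"controller={CONTROLLER}"],
            check=True,
        )

    # Look at the controller state first, then decide.
    subprocess.run(["omreport", "storage", "controller"], check=True)
    # foreign_config("importforeignconfig")  # adopt the foreign metadata
    # foreign_config("clearforeignconfig")   # or wipe it before replacing the drive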


Sorry to be so long-winded. I hope someone can help me figure out if the controller is bad.

rberke



ASKER CERTIFIED SOLUTION
serialband
(solution text available to Experts Exchange members only)

SOLUTION
(solution text available to Experts Exchange members only)
Robert Berke (Asker):

I forgot to mention that my dead drives have been working hard for 5 years, so Andy's hypothesis is not related to my situation.

ServerInMyAttic  had been running 24/7 as the production server since 2009.  A month ago I moved it from the office to my attic as part of the partition resizing project.  That was why it went for its car ride.

serialband suggests the disks came from the same batch, so they wear out at the same time, which fits my situation perfectly.
andyalder adds a novel hypothesis -- that a disk that previously did light duty is somehow more fragile when a heavy load comes along. That does not match my situation at all, but Andy couldn't have known that.

Since both drives came from the same batch, and both had the same amount of heavy usage, I suspect I just got unlucky.

After the rebuild completed, the light remained FLASHING amber/green.  When I take the Blink action, it turns to flashing green, but the drive remains "Foreign, Fail = Yes". When I take the Unblink action, it returns to flashing amber/green.
By the way, after ServerInMyAttic went for the car ride, it was off for 2 weeks before I reconfigured it.

So, what is the lesson to be learned?
1) Every time you turn off an old server you should be prepared for it not to come back on?
2) People should always have 3 hard drives handy, just in case?
3) People should have rock solid restore documentation, including RAID reconfiguration documentation?

All of the above?
Member_2_231077:

Your disks have been on 24x7, but they haven't been running full tilt, or you would be complaining that as soon as one failed ShadowProtect started erroring because it couldn't keep up. Maybe you did say ShadowProtect started erroring with one disk down; it's hard to tell with all the extra waffle.
@andyalder
The remaining 2 disks in a 3 disk RAID do not do more work when one disk fails.  They work the same.  They all run at the same time in a RAID and they're synchronized in a RAID 3.  One disk is a checksum and having one fail just means you lack checksum protection.  It does not write more data to the remaining disks.  The RAID controller manages the data and access.  During a rebuild, all disks experience heavier activity, but they're not doing more work than they would otherwise in a normal rebuild.
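
To make the parity idea concrete: within each stripe, the parity block is just the XOR of the data blocks, so any single missing block can be recomputed from the ones that remain. A toy sketch in Python:

    # Toy illustration of RAID parity: parity = XOR of the data blocks in a
    # stripe, so any one missing block can be rebuilt from the others.
    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d1 = b"hello is"                  # data block on disk 1
    d2 = b" anybody"                  # data block on disk 2
    parity = xor_blocks(d1, d2)       # parity block on disk 3

    # Disk 1 dies: rebuild its block from disk 2 plus the parity block.
    recovered = xor_blocks(d2, parity)
    assert recovered == d1
    print(recovered)                  # b'hello is'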

They also don't have to be running full tilt to fail after 5 years. They just have to be powered and spinning. As long as a disk is powered and spinning, it is wearing out. These disks had been spinning 24x7, and shutting them down for an extended period after 5 years is when the problems show up; that's when his first disk failed. If your system is constantly powered on and off, such as with most office desktops, the disks may last a bit longer.


@rberke
The lesson should be that disks in a RAID running 24x7 need to be replaced every 3-5 years regardless of how much usage they experience. A disk will likely fail in that time, and you'll have better, larger-volume disks by then anyway. RAID is for uptime and speed. RAID is not backup. << -- I keep telling people this and they still don't believe me. They run RAID without backups until it fails. It was never designed for backup. Backup is orthogonal to what RAID is. A backup is essentially another copy, not a checksum (even if you use a RAID 1 mirror). You can use RAID to back up another RAID, but a single RAID by itself is not a backup solution. You use RAID for storage and speed, and add more disks to fill the SCSI/SATA/SAS channel to get the necessary data throughput.
SOLUTION
(solution text available to Experts Exchange members only)
I replaced the second failing disk and the rebuild went smoothly. Now all 3 drives are online, with fail=no.  
So, I will close this question shortly.


A minor point -- "One disk is a checksum" is true for RAID 4, but not RAID 5. In RAID 5 all disks hold both parity and data. This causes the physical disk usage to be spread evenly across the disks. Over a 5 year period all 3 disks will get approximately the same amount of I/O, which makes the likelihood of a double failure far greater than I initially expected.

So, I now completely agree that a double failure is quite likely when 3 disks from the same batch are continuously run for 5 years on a server with Raid 5.   It is also intuitive that having the server turned off for a week may be an aggravating condition.
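
To make the even-spread point concrete, here is a toy sketch of how parity placement rotates from stripe to stripe in a 3-disk RAID 5. The exact rotation order depends on the controller's layout, but the effect is the same: no single disk is "the checksum disk", and I/O is spread roughly evenly:

    # Toy sketch of rotating parity placement in a 3-disk RAID 5.
    # The rotation order here is an assumption; real controllers use various
    # layouts, but all of them spread the parity blocks across the disks.
    N_DISKS = 3

    def parity_disk(stripe):
        return (N_DISKS - 1) - (stripe % N_DISKS)

    counts = {d: 0 for d in range(N_DISKS)}
    for stripe in range(9):
        p = parity_disk(stripe)
        data = [d for d in range(N_DISKS) if d != p]
        counts[p] += 1
        print(f"stripe {stripe}: data on disks {data}, parity on disk {p}")

    print("parity blocks per disk:", counts)   # 3 each over 9 stripes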

I have one remaining question.

What would have happened if a 4th disk had been installed 5 years ago and configured as a hot spare?

Would disks 1 to 4 all be nearing end of life, OR would disks 1 to 3 be old, while disk 4 was only middle age?

Wear and tear on a disk occurs whenever the disk is spinning. Presumably more wear and tear is added when there are I/Os against the disk. But I have no idea if there are guidelines that quantify the two types of wear and tear.

Anyhow, it sounds like the following strategy might be pretty good for a new server:
1) Configure a hot spare when the server is first installed.
2) Every 2 years, decommission whichever disk is oldest, promote the hot spare into the array, and install a newly purchased disk as the new hot spare.
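
As a rough illustration of how that rotation would play out -- assuming a spinning hot spare wears at the same rate as an active member (which is exactly the open question above) and that swaps happen exactly on schedule:

    # Rough simulation of the proposed rotation: 3 active members plus 1 hot
    # spare; every 2 years retire the oldest member, promote the spare, and
    # install a brand-new spare.  Assumes an idle-but-spinning spare ages
    # like an active disk and that nothing fails off-schedule.
    active = [0, 0, 0]   # ages (in years) of the array members
    spare = 0            # age of the hot spare

    for year in range(2, 11, 2):
        active = [a + 2 for a in active]
        spare += 2
        oldest = active.index(max(active))
        active[oldest], spare = spare, 0   # promote spare, install new spare
        print(f"year {year}: member ages {sorted(active, reverse=True)}, new spare age {spare}")

    # The ages settle into a staggered 6/4/2 pattern, so the members no longer
    # all hit end-of-life in the same year -- though if a spinning spare really
    # does wear like an active disk, the very first swap gains nothing.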

In a future question, I may ask for details on how to accomplish that.
Of course the other disks in a RAID 5 with one disk down have to work harder; any read that should come from the failed disk has to be served by the other ones instead. I've seen servers where the users have had to stop work until rebuilding is finished because performance has been so poor with a failed disk. Since you're not a storage engineer it doesn't matter much that you're wrong, since you aren't going to be passing that misinformation to others. Bye, I'm unsubbing.
It only works harder during the rebuild.  It does not work harder otherwise.  They're always reading and writing that checksum anyway.  The rebuild is accessing all disks.  You can run with one failed disk and it's not going to be slower until you replace the failed disk and the rebuild starts.  If the RAID slowed down when it wasn't doing a rebuild, then you were using a low end RAID.  I've also had my mid-high end RAIDs rebuild without significant slowdowns.
It's been several years since I studied this, but I am pretty sure Andy is correct about read operations. With RAID 5, a block that lived on the failed disk is not stored anywhere else; the controller has to read the corresponding blocks from both surviving disks and XOR them together to reconstruct it.

So if one of the 3 disks is down, the other 2 disks each have to serve noticeably more read I/O than normal, and every read aimed at the failed disk is slower because two disks have to be touched instead of one.

But, in a way, serialband is also correct. When a new disk is hot swapped in, all 3 disks start working full blast for a couple of hours. During that time they are working far harder still, and it would not be surprising if that is when an old disk crashes. Nor would it be surprising if users had to stop work while the rebuild was in progress.
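
For a rough sense of the numbers, here is a toy model of random reads against a 3-disk RAID 5 with one member down. It simply counts read operations under a uniform random-read assumption and ignores controller caching, so the real-world impact will vary:

    # Toy model: read load per surviving disk in a 3-disk RAID 5, healthy
    # versus degraded.  Assumes uniformly random small reads and no caching;
    # a read of a block that lived on the failed disk must touch both
    # survivors and XOR their blocks together.
    import random

    N_READS = 100_000
    DISKS = [0, 1, 2]
    FAILED = 2

    healthy = {d: 0 for d in DISKS}
    degraded = {d: 0 for d in DISKS if d != FAILED}

    for _ in range(N_READS):
        target = random.choice(DISKS)      # disk holding the requested block
        healthy[target] += 1
        if target == FAILED:
            for d in degraded:             # reconstruct from both survivors
                degraded[d] += 1
        else:
            degraded[target] += 1

    print("healthy reads per disk:    ", healthy)
    print("degraded reads per survivor:", degraded)
    # Each survivor ends up handling roughly twice the reads it saw when the
    # array was healthy.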

I may not be a storage engineer, but I think that 95% of what I have said here is correct, and uncontroversial.

Bob
I looked that up and it appears I was wrong. I've worked with RAID systems with many more disks, where the slowdown is never really noticeable, which is why I made the incorrect assumption. I also work with higher-end RAID, where I don't see a significant slowdown during a rebuild.
Thanks to all