2 HDDs failed. How to identify the cause?

Hi experts,
A nightmare happened here Saturday night: 2 of the 3 HDDs in our RAID 5 array failed. The array was impossible to rebuild and data recovery failed. Fortunately we had backups, so we're up again on a single HDD (still scary). Before doing anything (buying HDDs and/or a RAID controller and/or a brand new server, since this one is getting old), we would like to know why it happened. Can you help me with all the possibilities? Thank you (sorry for my bad English).

3 MAXTOR 18GB ATLAS 15K ULTRA320 SCSI 68PIN 3.5" LP
1  Adaptec SCSI RAID 2100S
1 disk was "optimal" the other 2 were "failed"
Array = Dead - multiple drives failure.
Asked by: Tito_Mahawk
 
Hypercat (Deb) commented:
It's very hard to say what might have happened.  If these are 18GB HDs then I have to assume they are pretty old (i.e., at least 3-4 years old if not more).  It's one of the known weaknesses of RAID arrays that, because you have 3 (or more) identical hard drives, they are more prone to multiple failures due to age.  In other words, all the drives will tend to get to their end-of-life condition at around the same time.  So, that's a possibility.  Were there any other environmental factors that might have played a part?  These would be things like excessive heat or dust/dirt in the server room, electrical brownouts or surges on a server that's not properly protected from such things, repeated server crashes due to other software or hardware failures, etc.

It's very doubtful that the controller has anything to do with it - these controllers aren't that prone to failures.  Also, if you had any array diagnostics running you would have seen problems earlier if the controller was at fault.
 
svs commented:
Maybe a tool like smartctl (http://smartmontools.sf.net) will give you some insight...
 
Tito_Mahawk (author) commented:
Thanks Hypercat. You're right about the HDDs' age. The server room is clean. Maybe a blackout? We have a lot of electrical storms here. The log file didn't show any unexpected shutdown (the server is behind an APC Smart-UPS). Do you think brownouts or surges could still reach the server even with the surge protector and battery backup?

svs: I guess a tool like that is most likely useless AFTER your disk has failed? Tell me if I'm wrong.
 
svs commented:
It's definitely better than guesswork.  Plus, you will probably use it to monitor disks that are still alive.
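
For the monitoring side, here is a minimal sketch of how the health check could be scripted in Python. It assumes smartctl is installed and that the drives are visible to the OS; disks behind a hardware RAID controller may not be directly reachable this way. The function names `parse_health` and `check_drive` are made up for illustration, and the parser only looks for the two health summary lines that `smartctl -H` normally prints for SCSI and ATA drives.

```python
import subprocess


def parse_health(output: str) -> str:
    """Classify the text printed by `smartctl -H` as 'ok', 'failing', or 'unknown'."""
    for line in output.splitlines():
        line = line.strip()
        # SCSI drives typically report: "SMART Health Status: OK"
        if line.startswith("SMART Health Status:"):
            status = line.split(":", 1)[1].strip()
            return "ok" if status == "OK" else "failing"
        # ATA drives typically report:
        # "SMART overall-health self-assessment test result: PASSED"
        if line.startswith("SMART overall-health self-assessment test result:"):
            status = line.split(":", 1)[1].strip()
            return "ok" if status == "PASSED" else "failing"
    # No recognizable health line (drive not reachable, SMART unsupported, etc.)
    return "unknown"


def check_drive(device: str) -> str:
    """Run `smartctl -H` on one device (usually needs root) and classify the result."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return parse_health(result.stdout)
```

Something like `check_drive("/dev/sda")` run from a daily cron job, with a mail sent on anything other than "ok", would be one way to use it. The device path is only an example.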
 
Hypercat (Deb) commented:
It's possible that even with surge protection and battery backup, something could have happened.  A lot of it depends on the load on the battery backup and the state of the battery itself - i.e., have you replaced it regularly? Also, in a blackout, for example, if there was a power outage for long enough that the battery backup died, the server could still shut down unexpectedly.  Some of that is to be expected, but over time it can cause stress on the hard drives.  I really think it's probably mostly just age.  

As they say, live and learn. Next time around you might want to start replacing the drives at about 3 years old or so, even if none of them has failed.  You do know, I hope, that although it's recommended, it is not absolutely required that all of the drives in your array be exactly the same.
 
RDAdams commented:
If you know when the equipment was purchased you could do a quick cost analysis on a new server.  We just went through this replacing 5 of 8 servers that were over 5 years old.  The other 3 are not critical so we are not concerned with them failing.   The others though ..... it was a quick sell to management to replace them.  At least your backups worked.

I would try and get approval for a new server asap.  Save yourself the future trouble.
 
mastoo commented:
Do you monitor the drives regularly?  It could be that one drive failed, or was marked at risk, some time ago.  The array still functions in that state, so it may not be obvious there is a problem.  Then, at some later date, another drive fails and takes the whole array down.
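
To show why the second failure is the fatal one, here is a simplified sketch of RAID 5 parity in Python. It is an illustration only: a real controller works on stripes and rotates parity across the drives, and the block contents and the `xor_blocks` helper are invented for the example.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


# One stripe on a 3-drive RAID 5: two data blocks plus one parity block.
data0 = b"ACCOUNTS"
data1 = b"PAYROLL!"
parity = xor_blocks(data0, data1)

# One drive lost (the one holding data1): the missing block is the XOR
# of everything that survived, so the array keeps running in degraded mode.
rebuilt = xor_blocks(data0, parity)
print(rebuilt == data1)

# Two drives lost: only one block of the stripe is left. There is nothing
# to XOR it against, so the other two blocks cannot be reconstructed.
```

With one drive gone the array is running with no redundancy at all, which is why a degraded array that goes unnoticed is so dangerous.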
 
Tito_Mahawk (author) commented:
Thanks Hypercat. I'm new here, so I'll check how old the battery is.

RDAdams: I've been pushing for a new server for some time... No money, they say. But I'm the one having nightmares...

mastoo: The audible alarm when a disk fails works perfectly, believe me :) It looks like a SWAT team coming down from helicopters...
 
sifuedition commented:
I know the Dell hard drives of a similar model have firmware updates for the drives.  Those firmware updates are usually identified and provided by the original vendor.  You should check to see if Maxtor or the manufacturer of the server have released a firmware update for the drives.  The one for Dell was specific to drive response times which could cause the controller to knock the drives offline as a potential "risk" to the data.
 
sifuedition commented:
Also, did you try the ctrl+r option in the Adaptec BIOS to recover the array?  If you know which drive failed first, you can do the ctrl+r and then pull the drive that failed first.  That may allow you to get the array going again if you haven't already.
 
Tito_Mahawk (author) commented:
Pretty interesting, sifuedition. It's been a while, but I haven't touched the HDDs since then. I'll look into that idea.
Question has a verified solution.
