Solved

2 HDD Failed. How to identify the cause?

Posted on 2007-11-20
Last Modified: 2013-11-14
Hi experts,
A nightmare happened Saturday night here: 2 of 3 HDDs (RAID 5) failed. The array was impossible to rebuild and data recovery failed. Fortunately we had backups, so we're up again on a single HDD (still scary)... Before doing anything (buying HDDs and/or a RAID controller and/or a brand-new server, since this one is getting old), we would like to know why it happened. Can you help me with all the possibilities? Thank you (sorry for my bad English).

3 MAXTOR 18GB ATLAS 15K ULTRA320 SCSI 68PIN 3.5" LP
1  Adaptec SCSI RAID 2100S
1 disk was "optimal" the other 2 were "failed"
Array = Dead - multiple drives failure.
Question by:Tito_Mahawk
11 Comments
 
LVL 38

Accepted Solution

by:
Hypercat (Deb) earned 45 total points
It's very hard to say what might have happened.  If these are 18GB HDs then I have to assume they are pretty old (i.e., at least 3-4 years old if not more).  It's one of the known weaknesses of RAID arrays that, because you have 3 (or more) identical hard drives, they are more prone to multiple failures due to age.  In other words, all the drives will tend to get to their end-of-life condition at around the same time.  So, that's a possibility.  Were there any other environmental factors that might have played a part?  These would be things like excessive heat or dust/dirt in the server room, electrical brownouts or surges on a server that's not properly protected from such things, repeated server crashes due to other software or hardware failures, etc.

It's very doubtful that the controller has anything to do with it - these controllers aren't that prone to failures.  Also, if you had any array diagnostics running you would have seen problems earlier if the controller was at fault.
 
LVL 9

Assisted Solution

by:svs
svs earned 20 total points
Maybe a tool like smartctl (http://smartmontools.sf.net) will give you some insight...
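For the drives that are still alive, a periodic health check could look roughly like the sketch below. It only wraps `smartctl -H` and reads the one-line verdict; the device paths are hypothetical, and it assumes the operating system can see the disks directly. Behind a hardware RAID controller such as the 2100S the member disks may not be reachable this way, in which case the controller's own management tool would be the place to look.

```python
import shutil
import subprocess


def parse_health(smartctl_output: str) -> str:
    """Return 'ok', 'failing', or 'unknown' from the text printed by `smartctl -H`."""
    for line in smartctl_output.splitlines():
        lowered = line.lower()
        # SCSI drives print a line such as "SMART Health Status: OK".
        if lowered.startswith("smart health status:"):
            verdict = lowered.split(":", 1)[1].strip()
            return "ok" if verdict == "ok" else "failing"
        # ATA drives print
        # "SMART overall-health self-assessment test result: PASSED".
        if "overall-health self-assessment test result:" in lowered:
            verdict = lowered.split(":", 1)[1].strip()
            return "ok" if verdict == "passed" else "failing"
    return "unknown"


def check_drive(device: str) -> str:
    """Run `smartctl -H` on one device and classify the result."""
    if shutil.which("smartctl") is None:
        return "unknown"  # smartmontools is not installed
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_health(result.stdout)


# Example use (device names are hypothetical, and root rights are usually needed):
#     for dev in ["/dev/sda", "/dev/sdb"]:
#         print(dev, check_drive(dev))
```

Run from a scheduled job, anything other than "ok" would be a reason to look at the drive before a second one goes.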
 

Author Comment

by:Tito_Mahawk
Thanks Hypercat. You're right about the HDDs' age. The server room is clean. Maybe a blackout? We have a lot of electrical storms here. The log file didn't show any unexpected shutdown (the server is behind an APC Smart-UPS). Do you think it's possible that, even with the surge protector and battery backup, brownouts or surges could still reach the server?

svs: I guess a tool like that is most likely useless AFTER the disk has failed? Tell me if I'm wrong.
 
LVL 9

Expert Comment

by:svs
It's definitely better than guesswork.  Plus, you will probably use it to monitor disks that are still alive.
 
LVL 38

Expert Comment

by:Hypercat (Deb)
It's possible that even with surge protection and battery backup, something could have happened.  A lot of it depends on the load on the battery backup and the state of the battery itself - i.e., have you replaced it regularly? Also, in a blackout, for example, if there was a power outage for long enough that the battery backup died, the server could still shut down unexpectedly.  Some of that is to be expected, but over time it can cause stress on the hard drives.  I really think it's probably mostly just age.  

As they say, live and learn. Next time around you might want to start replacing the drives at about 3 years old or so, even if none of them has failed. You do know, I hope, that although it's recommended, it is not absolutely required that all of the drives in your array be exactly the same.
 
LVL 17

Assisted Solution

by:RDAdams
RDAdams earned 20 total points
If you know when the equipment was purchased you could do a quick cost analysis on a new server.  We just went through this replacing 5 of 8 servers that were over 5 years old.  The other 3 are not critical so we are not concerned with them failing.   The others though ..... it was a quick sell to management to replace them.  At least your backups worked.

I would try and get approval for a new server asap.  Save yourself the future trouble.
 
LVL 21

Assisted Solution

by:mastoo
mastoo earned 20 total points
Do you monitor the drives regularly?  It could be that one drive failed, or was marked at risk, some time ago.  The array still functions in that state, so it may not be obvious there is a problem.  Then at some later date another drive fails, and the whole array fails with it.
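To see why the second failure is the fatal one, here is a minimal sketch of the XOR parity idea behind RAID 5. It is an illustration only, not how the Adaptec firmware is actually written: the block contents and function names are made up, and a real array rotates parity across the drives stripe by stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe):
    """Rebuild one stripe in which lost blocks are marked None.

    Returns the complete stripe if at most one block is missing,
    or None if two or more blocks are gone.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        return None  # parity alone cannot recover two lost blocks
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # The lost block equals the XOR of every surviving block in the stripe.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# One stripe on a 3-drive array: two data blocks plus their parity block.
d0, d1 = b"\x12\x34", b"\xab\xcd"
parity = xor_blocks([d0, d1])

print(rebuild([d0, None, parity]))    # one drive lost: the stripe comes back
print(rebuild([None, None, parity]))  # two drives lost: None
```

With one block missing, the survivors are enough to recompute it, which is why a degraded array keeps running. With two missing there is nothing left to XOR against, which matches the "Dead - multiple drives failure" state.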
 

Author Comment

by:Tito_Mahawk
Thanks hypercat. I'm new here, so I'll check how old the battery is.

RDAdams: I've been pushing for a new server for some time... No money, they say. But I'm the one having nightmares...

mastoo: The audible alarm when a disk fails works perfectly, believe me :) It looks like a SWAT team coming down from helicopters...
 
LVL 6

Assisted Solution

by:sifuedition
sifuedition earned 20 total points
I know the Dell hard drives of a similar model have firmware updates for the drives.  Those firmware updates are usually identified and provided by the original vendor.  You should check to see if Maxtor or the manufacturer of the server have released a firmware update for the drives.  The one for Dell was specific to drive response times which could cause the controller to knock the drives offline as a potential "risk" to the data.
 
LVL 6

Expert Comment

by:sifuedition
Also, did you try the ctrl+r option in the Adaptec BIOS to recover the array?  If you know which drive failed first, you can do the ctrl+r and then pull the drive that failed first.  That may allow you to get the array going again if you haven't already.
 

Author Comment

by:Tito_Mahawk
Pretty interesting, sifuedition. It's been a while, but I haven't touched the HDDs since then. I'll look into that idea.
