Solved

2 HDDs failed. How to identify the cause?

Posted on 2007-11-20
Last Modified: 2013-11-14
Hi experts,
A nightmare happened here Saturday night: 2 of 3 HDDs (RAID 5) failed. The array was impossible to rebuild, and data recovery failed. Fortunately we had backups, so we're up again on a single HDD (still scary)... Before doing anything (buying HDDs and/or a RAID controller and/or a brand new server, since this one is getting old), we would like to know why it happened. Can you help me with all the possibilities? Thank you (sorry for my bad English).

3 MAXTOR 18GB ATLAS 15K ULTRA320 SCSI 68PIN 3.5" LP
1  Adaptec SCSI RAID 2100S
1 disk was "optimal" the other 2 were "failed"
Array = Dead - multiple drives failure.
Question by:Tito_Mahawk
11 Comments
 
LVL 38

Accepted Solution

by:
Hypercat (Deb) earned 135 total points
ID: 20322661
It's very hard to say what might have happened.  If these are 18GB HDs then I have to assume they are pretty old (i.e., at least 3-4 years old if not more).  It's one of the known weaknesses of RAID arrays that, because you have 3 (or more) identical hard drives, they are more prone to multiple failures due to age.  In other words, all the drives will tend to get to their end-of-life condition at around the same time.  So, that's a possibility.  Were there any other environmental factors that might have played a part?  These would be things like excessive heat or dust/dirt in the server room, electrical brownouts or surges on a server that's not properly protected from such things, repeated server crashes due to other software or hardware failures, etc.

It's very doubtful that the controller has anything to do with it - these controllers aren't that prone to failures.  Also, if you had any array diagnostics running you would have seen problems earlier if the controller was at fault.
 
LVL 9

Assisted Solution

by:svs
svs earned 60 total points
ID: 20323373
Maybe a tool like smartctl (http://smartmontools.sf.net) will give you some insight...
 

Author Comment

by:Tito_Mahawk
ID: 20323560
Thanks Hypercat. You're right about the HDDs' age. The server room is clean. Maybe a blackout? We have a lot of electrical storms here. The log file didn't show any unexpected shutdown (the server is behind an APC Smart-UPS). Do you think it's possible that, even with the surge protector and battery backup, electrical brownouts or surges could still have reached the server?

SVS: I guess a tool like that is most likely useless AFTER your disk has failed? Tell me if I'm wrong.

 
LVL 9

Expert Comment

by:svs
ID: 20323646
It's definitely better than guesswork.  Plus, you will probably use it to monitor disks that are still alive.
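As a rough illustration of that kind of ongoing monitoring, here is a minimal Python sketch that wraps `smartctl -H` and classifies the result. It assumes smartmontools is installed and that the drives are visible to the operating system; the function names and the device paths are hypothetical. Drives sitting behind a hardware RAID controller are often not directly reachable by smartctl without controller-specific options, so treat this as a starting point only.

```python
import shutil
import subprocess


def parse_health(output: str) -> str:
    """Classify the text printed by `smartctl -H` as 'ok', 'failing', or 'unknown'."""
    for line in output.splitlines():
        lowered = line.lower().rstrip()
        if "smart health status:" in lowered:           # wording used for SCSI drives
            return "ok" if lowered.endswith("ok") else "failing"
        if "self-assessment test result:" in lowered:   # wording used for ATA drives
            return "ok" if "passed" in lowered else "failing"
    return "unknown"


def check_drive(device: str) -> str:
    """Run `smartctl -H` on one device; return 'unknown' if smartctl is not installed."""
    if shutil.which("smartctl") is None:
        return "unknown"
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to report problems
    )
    return parse_health(result.stdout)


# Example use (device names are assumptions, adjust to your system):
#   for dev in ["/dev/sda", "/dev/sdb"]:
#       print(dev, check_drive(dev))
```

Run from a scheduled job, a check like this could raise an alert when a drive stops reporting a healthy status, which is the point svs makes about watching the disks that are still alive.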
 
LVL 38

Expert Comment

by:Hypercat (Deb)
ID: 20323761
It's possible that even with surge protection and battery backup, something could have happened.  A lot of it depends on the load on the battery backup and the state of the battery itself - i.e., have you replaced it regularly? Also, in a blackout, for example, if there was a power outage for long enough that the battery backup died, the server could still shut down unexpectedly.  Some of that is to be expected, but over time it can cause stress on the hard drives.  I really think it's probably mostly just age.  

As they say, live and learn - next time around you might want to start replacing the drives at about 3 years old or so, even if none of them has failed.  You do know, I hope, that although it's recommended it is not absolutely required that all of the drives in your array be exactly the same.
 
LVL 17

Assisted Solution

by:RDAdams
RDAdams earned 60 total points
ID: 20323912
If you know when the equipment was purchased you could do a quick cost analysis on a new server.  We just went through this replacing 5 of 8 servers that were over 5 years old.  The other 3 are not critical so we are not concerned with them failing.   The others though ..... it was a quick sell to management to replace them.  At least your backups worked.

I would try and get approval for a new server asap.  Save yourself the future trouble.
 
LVL 21

Assisted Solution

by:mastoo
mastoo earned 60 total points
ID: 20328334
Do you monitor the drives regularly?  It could be one failed and was marked at risk for some time.  The array still functions so it may not be obvious there is a problem.  Then at some future date, another drive fails causing the array to fail.
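To make the "one failure is survivable, two are not" point concrete, here is a small Python sketch of the XOR parity idea behind RAID 5 on a 3-drive array. The block contents and names are made up for illustration, and a real controller rotates parity across drives and works on much larger stripes.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# One stripe on a 3-drive array: two data blocks plus one parity block.
data0 = b"ACCT"                      # hypothetical contents of drive 0
data1 = b"2007"                      # hypothetical contents of drive 1
parity = xor_blocks(data0, data1)    # stored on drive 2

# Drive 1 fails: the array is degraded, but its block can be rebuilt
# from the two surviving drives.
rebuilt_data1 = xor_blocks(data0, parity)

# If drive 0 fails as well before the rebuild, only `parity` is left.
# One block is not enough to recover two unknown blocks, so the array is dead.
```

This is why a single failed drive that goes unnoticed is so dangerous: the array keeps working with no redundancy left until the next failure.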
 

Author Comment

by:Tito_Mahawk
ID: 20329574
Thanks Hypercat. I'm new here, so I'll check how old the battery is.

RDAdams: I've been pushing for a new server for some time... No money, they say. But I'm the one having nightmares...

mastoo: The audible alarm when a disk fails works perfectly, believe me :) It sounds like a SWAT team coming down from helicopters...
 
LVL 6

Assisted Solution

by:sifuedition
sifuedition earned 60 total points
ID: 20678916
I know the Dell hard drives of a similar model have firmware updates for the drives.  Those firmware updates are usually identified and provided by the original vendor.  You should check to see if Maxtor or the manufacturer of the server have released a firmware update for the drives.  The one for Dell was specific to drive response times which could cause the controller to knock the drives offline as a potential "risk" to the data.
 
LVL 6

Expert Comment

by:sifuedition
ID: 20678925
Also, did you try the ctrl+r option in the Adaptec BIOS to recover the array?  If you know which drive failed first, you can do the ctrl+r and then pull the drive that failed first.  That may allow you to get the array going again if you haven't already.
 

Author Comment

by:Tito_Mahawk
ID: 20714698
Pretty interesting, sifuedition. It's been a while, but I haven't touched the HDDs since then. I'll look into that idea.

Question has a verified solution.
