Solved

2 HDDs failed. How to identify the cause?

Posted on 2007-11-20
Last Modified: 2013-11-14
Hi experts,
A nightmare happened Saturday night here: 2 of the 3 HDDs in our RAID 5 array failed. The array was impossible to rebuild and data recovery failed. Fortunately we had backups, so we're up again, on a single HDD (still scary)... Before doing anything (buying HDDs, a RAID controller, or a brand-new server, since this one is getting old), we would like to know why it happened. Can you help me with all the possibilities? Thank you (sorry for my bad English).

3 MAXTOR 18GB ATLAS 15K ULTRA320 SCSI 68PIN 3.5" LP
1  Adaptec SCSI RAID 2100S
1 disk was "optimal" the other 2 were "failed"
Array = Dead - multiple drives failure.
Question by:Tito_Mahawk
11 Comments
 
LVL 38

Accepted Solution

by:
Hypercat (Deb) earned 45 total points
ID: 20322661
It's very hard to say what might have happened.  If these are 18GB HDs then I have to assume they are pretty old (i.e., at least 3-4 years old if not more).  It's one of the known weaknesses of RAID arrays that, because you have 3 (or more) identical hard drives, they are more prone to multiple failures due to age.  In other words, all the drives will tend to get to their end-of-life condition at around the same time.  So, that's a possibility.  Were there any other environmental factors that might have played a part?  These would be things like excessive heat or dust/dirt in the server room, electrical brownouts or surges on a server that's not properly protected from such things, repeated server crashes due to other software or hardware failures, etc.

It's very doubtful that the controller has anything to do with it - these controllers aren't that prone to failures.  Also, if you had any array diagnostics running you would have seen problems earlier if the controller was at fault.
 
LVL 9

Assisted Solution

by:svs
svs earned 20 total points
ID: 20323373
Maybe a tool like smartctl (http://smartmontools.sf.net) will give you some insight...
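As a rough illustration of that suggestion, here is a minimal Python sketch that runs `smartctl -H` and reduces its report to a simple verdict. The function names `parse_health` and `check_disk` are hypothetical, and the matched result strings ("PASSED" for ATA disks, "OK" for SCSI disks) are assumptions about smartctl's usual output. Disks sitting behind a hardware RAID controller may not be directly reachable this way.

```python
import re
import subprocess

# Assumed health lines: ATA disks report "... test result: PASSED",
# SCSI disks report "SMART Health Status: OK". Anything else is treated
# as a warning sign.
HEALTH_LINE = re.compile(
    r"^SMART (?:overall-health self-assessment test result|Health Status):\s*(.+)$",
    re.MULTILINE,
)


def parse_health(report: str) -> str:
    """Return 'healthy', 'failing', or 'unknown' from `smartctl -H` text."""
    match = HEALTH_LINE.search(report)
    if match is None:
        return "unknown"
    verdict = match.group(1).strip().upper()
    return "healthy" if verdict in {"PASSED", "OK"} else "failing"


def check_disk(device: str) -> str:
    """Run `smartctl -H` on one device (needs root and smartmontools)."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag problems
    )
    return parse_health(result.stdout)
```

Calling `check_disk("/dev/sda")` from a scheduled job would be one way to watch the surviving disk; the device path is only an example.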
 

Author Comment

by:Tito_Mahawk
ID: 20323560
Thanks Hypercat. You're right about the HDDs' age. The server room is clean. Maybe a blackout? We have a lot of electrical storms here. The log file didn't show any unexpected shutdown (the server is behind an APC Smart-UPS). Do you think it's possible that, even with the surge protector and battery backup, brownouts or surges could still reach the server?

SVS: I guess a tool like that is most likely useless AFTER your disk has failed? Tell me if I'm wrong.
 
LVL 9

Expert Comment

by:svs
ID: 20323646
It's definitely better than guesswork.  Plus, you will probably use it to monitor disks that are still alive.
 
LVL 38

Expert Comment

by:Hypercat (Deb)
ID: 20323761
It's possible that even with surge protection and battery backup, something could have happened.  A lot of it depends on the load on the battery backup and the state of the battery itself - i.e., have you replaced it regularly? Also, in a blackout, for example, if there was a power outage for long enough that the battery backup died, the server could still shut down unexpectedly.  Some of that is to be expected, but over time it can cause stress on the hard drives.  I really think it's probably mostly just age.  

As they say, live and learn - next time around you might want to start replacing the drives at about 3 years old or so, even if none of them has failed.  You do know, I hope, that although it's recommended it is not absolutely required that all of the drives in your array be exactly the same.

 
LVL 17

Assisted Solution

by:RDAdams
RDAdams earned 20 total points
ID: 20323912
If you know when the equipment was purchased you could do a quick cost analysis on a new server.  We just went through this replacing 5 of 8 servers that were over 5 years old.  The other 3 are not critical so we are not concerned with them failing.   The others though ..... it was a quick sell to management to replace them.  At least your backups worked.

I would try and get approval for a new server asap.  Save yourself the future trouble.
 
LVL 21

Assisted Solution

by:mastoo
mastoo earned 20 total points
ID: 20328334
Do you monitor the drives regularly?  It could be one failed and was marked at risk for some time.  The array still functions so it may not be obvious there is a problem.  Then at some future date, another drive fails causing the array to fail.
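To make that scenario concrete, here is a small Python sketch of how a RAID 5 array's state follows from the number of failed members. It assumes the standard RAID 5 property that the array survives exactly one failed drive; the function name `raid5_state` and the status strings are hypothetical, loosely modelled on the "optimal" / "failed" wording in the question.

```python
from typing import Sequence


def raid5_state(drive_status: Sequence[str]) -> str:
    """Classify a RAID 5 array from its per-drive status strings.

    Any status other than "optimal" is counted as a failed drive.
    """
    failed = sum(1 for status in drive_status if status != "optimal")
    if failed == 0:
        return "optimal"
    if failed == 1:
        # Still serving data, but with no redundancy left.
        return "degraded"
    # Two or more failures exceed what single parity can reconstruct.
    return "dead"
```

The "degraded" state is the one that can go unnoticed: the array keeps working, so a second failure later takes it straight to "dead", which matches the status reported in the question.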
 

Author Comment

by:Tito_Mahawk
ID: 20329574
Thanks Hypercat. I'm new here, so I'll check how old the battery is.

RDAdams: I've been pushing for a new server for some time... No money, they say. But I'm the one having nightmares...

mastoo: The audible alarm when a disk fails works perfectly, believe me :) It sounds like a SWAT team coming down from helicopters...
 
LVL 6

Assisted Solution

by:sifuedition
sifuedition earned 20 total points
ID: 20678916
I know the Dell hard drives of a similar model have firmware updates for the drives.  Those firmware updates are usually identified and provided by the original vendor.  You should check to see if Maxtor or the manufacturer of the server have released a firmware update for the drives.  The one for Dell was specific to drive response times which could cause the controller to knock the drives offline as a potential "risk" to the data.
 
LVL 6

Expert Comment

by:sifuedition
ID: 20678925
Also, did you try the ctrl+r option in the Adaptec BIOS to recover the array?  If you know which drive failed first, you can do the ctrl+r and then pull the drive that failed first.  That may allow you to get the array going again if you haven't already.
 

Author Comment

by:Tito_Mahawk
ID: 20714698
Pretty interesting, sifuedition. It's been a while, but I haven't touched the HDDs since then. I'll look into that idea.