Solved

2 HDDs failed. How to identify the cause?

Posted on 2007-11-20
Last Modified: 2013-11-14
Hi experts,
A nightmare happened Saturday night here: 2 of the 3 HDDs in our RAID 5 array failed. The array could not be rebuilt, and data recovery failed. Fortunately we had backups, so we're up again on a single HDD (still scary). Before doing anything (buying HDDs and/or a RAID controller and/or a brand-new server, since this one is getting old), we would like to know why it happened. Can you help me with all the possibilities? Thank you (sorry for my bad English).

3 MAXTOR 18GB ATLAS 15K ULTRA320 SCSI 68PIN 3.5" LP
1  Adaptec SCSI RAID 2100S
1 disk was "optimal" the other 2 were "failed"
Array = Dead - multiple drives failure.
Question by:Tito_Mahawk
11 Comments
 
LVL 38

Accepted Solution

by:
Hypercat (Deb) earned 45 total points
ID: 20322661
It's very hard to say what might have happened.  If these are 18GB HDs then I have to assume they are pretty old (i.e., at least 3-4 years old if not more).  It's one of the known weaknesses of RAID arrays that, because you have 3 (or more) identical hard drives, they are more prone to multiple failures due to age.  In other words, all the drives will tend to get to their end-of-life condition at around the same time.  So, that's a possibility.  Were there any other environmental factors that might have played a part?  These would be things like excessive heat or dust/dirt in the server room, electrical brownouts or surges on a server that's not properly protected from such things, repeated server crashes due to other software or hardware failures, etc.

It's very doubtful that the controller has anything to do with it - these controllers aren't that prone to failures.  Also, if you had any array diagnostics running you would have seen problems earlier if the controller was at fault.
0
 
LVL 9

Assisted Solution

by:svs
svs earned 20 total points
ID: 20323373
Maybe a tool like smartctl (http://smartmontools.sf.net) will give you some insight...
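
As a rough illustration of that suggestion, here is a minimal Python sketch that asks smartctl for a drive's overall health verdict and extracts the result line. Assumptions: smartmontools is installed, the helper names `parse_health` and `check_health` are made up for this example, and the device path you pass in is visible to the operating system. Drives sitting behind a hardware RAID controller are often not directly visible, so this may not work as-is on the Adaptec setup described above.

```python
import subprocess


def parse_health(output: str) -> str:
    """Extract the health verdict from `smartctl -H` output.

    SCSI drives report a line starting with 'SMART Health Status:',
    ATA drives one starting with 'SMART overall-health'.
    """
    for line in output.splitlines():
        if line.startswith(("SMART Health Status:", "SMART overall-health")):
            return line.split(":", 1)[1].strip()
    return "unknown"


def check_health(device: str) -> str:
    """Run `smartctl -H` on a device path (hypothetical, e.g. /dev/sda)."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is
    # not treated as an error here; only the printed verdict is parsed.
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_health(result.stdout)


# Parsing a sample SCSI-style result line (no real drive needed):
sample = "SMART Health Status: OK\n"
print(parse_health(sample))
```

Run periodically against the surviving and replacement drives, a check like this can flag a degrading disk before it drops out of the array.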
0
 

Author Comment

by:Tito_Mahawk
ID: 20323560
Thanks, Hypercat. You're right about the HDDs' age. The server room is clean. Maybe a blackout? We have a lot of electrical storms here. The log file didn't show any unexpected shutdown (the server is behind an APC Smart-UPS). Do you think that even with the surge protector and battery backup, brownouts or surges could still reach the server?

SVS: I guess a tool like that is most likely useless AFTER the disk has failed? Tell me if I'm wrong.
0
 
LVL 9

Expert Comment

by:svs
ID: 20323646
It's definitely better than guesswork.  Plus, you will probably use it to monitor disks that are still alive.
0
 
LVL 38

Expert Comment

by:Hypercat (Deb)
ID: 20323761
It's possible that even with surge protection and battery backup, something could have happened.  A lot of it depends on the load on the battery backup and the state of the battery itself - i.e., have you replaced it regularly? Also, in a blackout, for example, if there was a power outage for long enough that the battery backup died, the server could still shut down unexpectedly.  Some of that is to be expected, but over time it can cause stress on the hard drives.  I really think it's probably mostly just age.  

As they say, live and learn - next time around you might want to start replacing the drives at about 3 years old or so, even if none of them has failed.  You do know, I hope, that although it's recommended it is not absolutely required that all of the drives in your array be exactly the same.
0
 
LVL 17

Assisted Solution

by:RDAdams
RDAdams earned 20 total points
ID: 20323912
If you know when the equipment was purchased you could do a quick cost analysis on a new server.  We just went through this replacing 5 of 8 servers that were over 5 years old.  The other 3 are not critical so we are not concerned with them failing.   The others though ..... it was a quick sell to management to replace them.  At least your backups worked.

I would try and get approval for a new server asap.  Save yourself the future trouble.
0
 
LVL 21

Assisted Solution

by:mastoo
mastoo earned 20 total points
ID: 20328334
Do you monitor the drives regularly?  It could be one failed and was marked at risk for some time.  The array still functions so it may not be obvious there is a problem.  Then at some future date, another drive fails causing the array to fail.
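
To make that failure sequence concrete, here is a small Python sketch of RAID 5 parity on a single stripe. It is a simplified model, not how the Adaptec controller is actually implemented: real arrays rotate parity across disks and work on much larger blocks, and the block contents and the `xor_blocks` helper are invented for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-disk RAID 5 set: two data blocks plus one parity block.
d0 = b"ACCOUNTS"
d1 = b"PAYROLL!"
parity = xor_blocks(d0, d1)

# One disk lost (here the one holding d1): XOR of the two survivors
# rebuilds the missing block, so the array keeps running in degraded mode.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1 == d1)

# Two disks lost: only one block of the stripe remains, and a single block
# carries no information about the other two, so nothing can be rebuilt.
```

This is why a three-disk RAID 5 array survives the first failed drive but is lost as soon as a second drive fails before the first has been replaced and rebuilt.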
0
 

Author Comment

by:Tito_Mahawk
ID: 20329574
Thanks, Hypercat. I'm new here, so I'll check how old the battery is.

RDAdams: I've been pushing for a new server for some time... No money, they say. But I'm the one having nightmares...

mastoo: The audible alarm when a disk fails works perfectly, believe me :) It's like a SWAT team coming down from helicopters...
0
 
LVL 6

Assisted Solution

by:sifuedition
sifuedition earned 20 total points
ID: 20678916
I know the Dell hard drives of a similar model have firmware updates for the drives.  Those firmware updates are usually identified and provided by the original vendor.  You should check to see if Maxtor or the manufacturer of the server have released a firmware update for the drives.  The one for Dell was specific to drive response times which could cause the controller to knock the drives offline as a potential "risk" to the data.
0
 
LVL 6

Expert Comment

by:sifuedition
ID: 20678925
Also, did you try the Ctrl+R option in the Adaptec BIOS to recover the array?  If you know which drive failed first, you can do the Ctrl+R and then pull the drive that failed first.  That may allow you to get the array going again, if you haven't done so already.
0
 

Author Comment

by:Tito_Mahawk
ID: 20714698
Pretty interesting, sifuedition. It's been a while, but I haven't touched the HDDs since then. I'll look into that idea.
0
