
Multiple failed hard drives

Hi everyone.  

I have a 1U Supermicro server that holds four HDs. I purchased the machine 10 months ago and have repeatedly had HDs fail in it. Four HDs have died in the 10 months we have had the server, but only in slots 3 and 4 (two died in each slot). We originally assumed it was just a bad set of HDs because they all had similar serial numbers, but I am now unsure that is the case. In my opinion it just seems like too many failures to blame on the HDs. The HDs are Western Digital 7200 RPM 320 GB drives.

So what do you guys think?  What would cause so many HDs to fail on a machine in so little time?

Thanks in advance for your ideas.

Asked by Caliguian
 
zephyr_hex (Megan), Developer, commented:
Power spikes are the #1 cause of drives going bad. Is this computer on a dedicated power source, with a UPS?
 
David_Fong commented:
If it's the same bays each time then it may be a cooling problem. Ribbon cables are notorious for blocking airflow in 1U servers.
 
rindi commented:
I've had problems with inadequate cables. Have you tried replacing the cables for these bays? The drives themselves might still be good.
 
scrathcyboy commented:
If the hard disks were NOT WD, I would have said the controller is at fault, giving them the wrong signal voltage and causing other issues that burn out drives. But since they are WD drives, I suspect your original guess was right. Whole lots of thousands of WD drives can ship with bad seek mechanisms. They often run for about the 1-year warranty period or less, then die, and they die hard. You can hear the seek head clicking endlessly because of poor-quality seek components.

If you want reliable drives, buy IBM-Hitachi and give up on WD. Hitachi drives are not foolproof, but their typical life is 3-5 years, they are usually very reliable, and they cost no more than WD.
 
ISoul commented:
This is interesting. We've purchased four 2U and 1U Supermicro servers in the last 5 months or so, and we've had hard drive issues with two of them in the past month. We also have 12 Dell servers that haven't had any problems. The hard drives in ours are Maxtors.
 
rindi commented:
The only drives I've had problems with are Maxtors, but there were many of those. The most reliable drives I've used so far are Samsung; Seagate is also good, as those generally come with a 5-year warranty.
 
knoxzoo commented:
After supporting several businesses using 1U and 2U units, I learned to add cooling capacity to any of them before putting them in place.  They simply don't have the airflow necessary to cool drives of any significant size, like the ones you've described.  

If you can place the units far enough apart, you can simply pull the top and bottom covers, drill a few rows of 1/8" holes across the bottom and a bunch of 1/8" holes in 3" circles above the drive locations, mount standard case fans over the holes on the outside of the top cover, get hold of an old AT-style power supply to power the fans, and you're off to the races.

If you can't get the spacing to do that, you have to get a bit more creative using sheet metal and hose tubing attached to the back of the case, or liquid cool the drives.

I read a case study some time ago showing that the largest drive a standard 1U case's cooling capacity could handle was a 7200 RPM 80 GB SATA drive, which ran just below critical failure temperatures, i.e., way too hot.
 
JimsZ commented:
Heat. You probably don't have enough cooling from the fans. Try moving the drives further apart, perhaps into different drive bays, or even mount one of them at the lower end of the bay. Two drives right next to each other produce a tremendous amount of heat, and heat is one of the biggest problems for hard drives, especially faster ones.
 
David_Fong commented:
We really need a photo of the rack and the surrounding area. You start off as a server engineer and end up as a plumber just to make computer rooms work properly. I like the idea of drilling lots of holes in a 1U server so that it takes up 3U; it's easier on the ears.
 
f-king commented:
Hi, a bad power supply can also cause the drives to malfunction by supplying them with bad current.
It also looks like the board itself could be delivering bad power. And, as other members suggested, hard drives mounted too close together can overheat.
 
Caliguian (author) commented:
Hey guys, thanks for all of the responses. I talked to the guy who is in charge of taking care of the servers (and had him read your responses), but he is reluctant to believe that heat could be the problem, because he says there are six identical machines with the same setup as this one and they aren't having any problems. So basically he will only believe that the HDs are from a bad batch, and no preventive action will be taken. :(

I would appreciate any other comments any of you might have, so that if he does start to open up a bit, I at least have something else for him to read that gives him the opinions of others who are in charge of hardware maintenance.


Thanks!
 
knoxzoo commented:
Air flow in one machine is not equal to air flow in another.  One might be producing enough airflow to keep things working, while another isn't.  Or, one might be getting enough air going to each drive while another has a cable, or some other obstacle, keeping sufficient airflow from getting to a couple of the drives.

His reasoning is the same as "My Dad's truck has never broken down, therefore my truck never will either."  No two machines are created equal.  Subtle variances can make for huge differences.
 
rindi commented:
Has he changed the cables already?
 
pweegar commented:
Sounds like this guy in charge of the servers needs a good talking-to, especially if the server stores critical or sensitive data. The server obviously has issues that need to be addressed. You might want to consider going over this guy's head until the server is fixed properly.
 
georgecooldude commented:
Are you running them in RAID? Sounds like a bad RAID card to me. Try replacing it.
 
Caliguian (author) commented:
Yes, they are running RAID 5.