Faulty H/W components on Unix/Linux servers?

Posted on 2011-10-09
Last Modified: 2012-05-12
Friends, I hope all are doing well. Could you please describe the risks to day-to-day server operation posed by the following faulty hardware components on Unix/Linux servers?

Please provide detailed information on each and how they are related. I am willing to open another question for more points.

Faulty Hard Disk
Faulty CPU  
Faulty Memory DIMM  
Faulty Power supply
Faulty CPU Fan
Faulty SAN Card (HBA) replacement
Faulty NIC card  

I am relatively new to this area and am trying to understand the what and the why.

Thanks OM
Question by:Oramcle
    LVL 89

    Assisted Solution

    by:John Hurst
    Assuming your server is business-critical, keep it under a maintenance contract and have skilled technicians service it and replace parts. Replacement parts should normally be exact matches.

    For a faulty hard disk, use RAID 5. Assuming you use a solid RAID configuration and have good hardware, a bad disk will show up either as errors in RAID management or, if serious enough, as a red light on the drive. Either way, you get an exact replacement, and the drives are normally user serviceable. Pull out the ONE bad drive, replace it, and rebuild the array. Keep good backups, because RAID 5 cannot survive two drives failing at once.

    ... Thinkpads_User

    Author Comment

    Thank you, Thinkpads_User. I am looking for a list of risk factors for each hardware component.
    LVL 89

    Expert Comment

    by:John Hurst
    I don't think there is much in the way of risk factors for each component. These all have mean time between failures (MTBF) numbers, or should have. See the manufacturers' specifications for them. But these are just statistics and say little about an individual part, which can fail at any time without warning. ... Thinkpads_User
    LVL 8

    Accepted Solution

    Generally, things with moving parts have higher failure rates than things without moving parts.  But that is, as said, a generality.  A hard drive designed for server applications may have much longer MTBF than a power supply targeted for cheap desktops and built with marginal parts.

    Risk factors:
    Hard drives wear out over time.  Noise and vibration can result in damage to the disk surface.  (Use SMART monitoring to track HD condition).  
    Power supplies have fans and capacitors which fail, especially if stressed close to their specifications.
    Motherboards or added cards can have capacitors which fail if overstressed or poor quality.
    Semiconductors (CPUs, DIMMs, etc.) have very long MTBF, unless overstressed by heat or higher voltage.  This can be caused by other components which become marginal without failing, for example, a fan which doesn't move enough air over the CPU or filters which become clogged by dust.  Bad capacitors can allow voltage spikes to reach sensitive components which eventually cause them to fail.  
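    As a rough illustration of the SMART monitoring mentioned above, here is a minimal Python sketch that scans an attribute table in the layout printed by `smartctl -A` (from smartmontools) and flags a few attributes with non-zero raw values. The watched attribute list, the `failing_attributes` helper, and the sample text are assumptions made for illustration, not output from a real drive or an official list of failure indicators.

```python
# Attributes often treated as early warnings of disk trouble.
# This selection is an illustrative assumption, not an official list.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def failing_attributes(smartctl_output: str) -> dict:
    """Return watched attributes whose raw value is greater than zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                flagged[fields[1]] = int(raw)
    return flagged


# Hypothetical sample text in the `smartctl -A` table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(failing_attributes(SAMPLE))
```

    In practice the text would come from running `smartctl -A` against a device rather than from a hard-coded string, and the smartmontools daemon (`smartd`) can do this kind of watching on its own.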

    On a day-to-day basis, a well-designed server from a reputable manufacturer has an MTBF in the tens of thousands of hours. You can probably expect more failures in hard drives and fans than in the other components. As @Thinkpads_User suggests, putting your hard drives in a RAID configuration will compensate for this and allow faster recovery. Other than hard drives, I think most failures will appear to be random. A company like Google, which has perhaps a million servers, will see failure patterns. For an individual server, there's too much individual variability.
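    To make the MTBF point concrete, here is a small Python sketch that converts an MTBF figure into a yearly failure probability for one unit and an expected failure count for a fleet. It assumes a constant failure rate (an exponential lifetime model), which is a simplification: real parts fail more often when very new or worn out. The function names and the 50,000-hour figure are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one unit fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours: float, units: int) -> float:
    """Average number of failures per year across a fleet of identical units."""
    return units * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    mtbf = 50_000  # hypothetical MTBF in hours
    print(f"One unit:    {annual_failure_probability(mtbf):.1%} chance of failing in a year")
    print(f"1000 units:  about {expected_failures_per_year(mtbf, 1000):.0f} failures per year")
```

    This shows why a large fleet sees steady, predictable failure counts while a single server's failures look random: the per-unit probability is modest, but it says nothing about when any one part will go.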

    If you require high availability, you can use a server with multiple power supplies, multiple fans, redundant NICs, CPUs configured to fail over, etc.
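    The benefit of redundant components can be estimated with a short calculation. The Python sketch below assumes that the redundant units fail independently and that any one working unit is enough, which is optimistic when failures share a cause such as heat or a power surge. The function name and the 99% figure are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760


def redundant_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` parallel components where any one suffices.

    Assumes failures are independent of each other.
    """
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one power supply
    for n in (1, 2, 3):
        a = redundant_availability(single, n)
        downtime = (1.0 - a) * HOURS_PER_YEAR
        print(f"{n} unit(s): availability {a:.6f}, about {downtime:.2f} hours down per year")
```

    Under these assumptions, a second unit cuts expected downtime by a factor of a hundred, which is why dual power supplies and NICs are common on servers that must stay up.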

    Author Closing Comment

    Thank you all.
