Faulty H/W components on Unix/Linux servers?

Friends, hope all is well. Could you please explain the risks that the following faulty hardware components pose to day-to-day operation of Unix/Linux servers?

Please provide detailed information on each component and how its failure affects operations. I am willing to open another question for more points.


Faulty hard disk
Faulty CPU
Faulty memory DIMM
Faulty power supply
Faulty CPU fan
Faulty SAN card (HBA)
Faulty NIC

I am relatively new to this area and am trying to understand the what and the why.


Thanks OM
Oramcle Asked:
 
eager Commented:
Generally, components with moving parts have higher failure rates than components without moving parts. But that is, as stated, a generality. A hard drive designed for server applications may have a much longer MTBF than a power supply targeted at cheap desktops and built with marginal parts.

Risk factors:
Hard drives wear out over time. Noise and vibration can damage the disk surface. (Use SMART monitoring to track drive health; a minimal check is sketched after this list.)
Power supplies have fans and capacitors which fail, especially if stressed close to their specifications.
Motherboards and add-in cards can have capacitors which fail if overstressed or of poor quality.
Semiconductors (CPUs, DIMMs, etc.) have very long MTBFs unless overstressed by heat or excess voltage. This can be caused by other components which become marginal without failing outright, for example a fan which no longer moves enough air over the CPU, or filters which become clogged with dust. Bad capacitors can allow voltage spikes to reach sensitive components and eventually cause them to fail.
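
Here is a minimal sketch of that SMART check, assuming smartmontools and Python 3 are installed; the device name /dev/sda is only an example, and disks behind a hardware RAID controller usually need extra smartctl options (such as -d) to be reached.

    #!/usr/bin/env python3
    # Minimal SMART health check using smartctl (from the smartmontools package).
    # The device name below is an example only; adjust it for your system.
    import subprocess
    import sys

    DEVICE = "/dev/sda"  # example device; disks behind hardware RAID may need smartctl -d options

    try:
        result = subprocess.run(["smartctl", "-H", DEVICE], capture_output=True, text=True)
    except FileNotFoundError:
        sys.exit("smartctl not found; install the smartmontools package")

    # ATA disks print: "SMART overall-health self-assessment test result: PASSED"
    # SCSI disks print: "SMART Health Status: OK"
    for line in result.stdout.splitlines():
        if "overall-health" in line or "Health Status" in line:
            print(line.strip())
            if "PASSED" not in line and "OK" not in line:
                sys.exit(1)  # non-zero exit so cron or a monitor can alert on it
            break

Run it regularly (from cron, for example) so a degrading drive is noticed before it fails outright.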

On a day-to-day basis, a well-designed server from a reputable manufacturer has an MTBF in the tens of thousands of hours. You can probably expect more failures in hard drives and fans than in the other components. As @Thinkpads_User suggests, putting your hard drives in a RAID array will compensate for this and allow faster recovery. Other than hard drives, I think most failures will appear to be random. A company like Google, which has perhaps a million servers, will see failure patterns; for an individual server there is too much variability. A quick way to watch temperatures and fan speeds is sketched below.
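
A rough sketch of that temperature and fan survey, reading the Linux hwmon sysfs interface; the 70 °C threshold is an arbitrary example rather than a vendor figure, and which sensors exist varies from machine to machine.

    #!/usr/bin/env python3
    # Rough temperature and fan survey via the Linux hwmon sysfs interface.
    # The warning threshold below is an arbitrary example, not a vendor figure.
    import glob
    import os

    TEMP_WARN_MILLIDEG = 70_000  # 70 C, example threshold only

    def read_int(path):
        """Return the integer in a sysfs file, or None if it cannot be read."""
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None

    for hwmon in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
        try:
            with open(os.path.join(hwmon, "name")) as f:
                chip = f.read().strip()
        except OSError:
            continue

        # Temperatures are reported in millidegrees Celsius.
        for temp_file in sorted(glob.glob(os.path.join(hwmon, "temp*_input"))):
            millideg = read_int(temp_file)
            if millideg is None:
                continue
            flag = "  <-- check cooling" if millideg > TEMP_WARN_MILLIDEG else ""
            print(f"{chip} {os.path.basename(temp_file)}: {millideg / 1000:.1f} C{flag}")

        # Fan speeds are reported in RPM; 0 RPM on a fan that should spin is suspicious.
        for fan_file in sorted(glob.glob(os.path.join(hwmon, "fan*_input"))):
            rpm = read_int(fan_file)
            if rpm is None:
                continue
            print(f"{chip} {os.path.basename(fan_file)}: {rpm} RPM")

A fan stuck at 0 RPM or a temperature creeping upward over weeks is exactly the kind of marginal condition that later shows up as a "random" CPU or DIMM failure.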

If you require high availability (see http://en.wikipedia.org/wiki/High_availability), you can use a server with multiple power supplies, multiple fans, redundant NICs, CPUs configured to fail over, and so on.
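
As one concrete example of checking that redundancy, here is a sketch that reads Linux NIC bonding status from /proc/net/bonding; it assumes the bonding driver is configured and that bond0 (an example name) is the bond interface.

    #!/usr/bin/env python3
    # Check a Linux bonded (redundant) NIC pair via /proc/net/bonding.
    # Assumption: the bonding driver is configured and "bond0" is the bond name.
    import os

    BOND = "bond0"  # example interface name; adjust as needed
    path = f"/proc/net/bonding/{BOND}"

    if not os.path.exists(path):
        raise SystemExit(f"{path} not found; is NIC bonding configured?")

    down_links = 0
    current_iface = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Slave Interface:"):
                current_iface = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current_iface:
                status = line.split(":", 1)[1].strip()
                print(f"{current_iface}: {status}")
                if status != "up":
                    down_links += 1
                current_iface = None

    if down_links:
        print(f"WARNING: {down_links} bond member(s) down -- redundancy is degraded")

The point of a check like this is that redundancy hides failures: the server keeps running on one link, so without monitoring you only discover the dead NIC when the second one fails too.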
0
 
John (Business Consultant, Owner) Commented:
Assuming your server is business-critical, keep it under a maintenance contract and have skilled technicians service it and replace parts. Replacement parts should normally be exact matches.

For a faulty hard disk, use RAID 5. Assuming you have a solid RAID configuration and good hardware, a bad disk will show up either in the RAID management software with errors or, if serious enough, as a red light on the drive. Either way, you get an exact replacement, and the drives are normally user-serviceable: pull out the ONE bad drive, replace it, and rebuild the array. Let us hope you have good backups and do not lose two drives at once. A quick software-RAID status check is sketched below.
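
A quick sketch of that "RAID management with errors" check, assuming Linux software RAID (mdadm); hardware RAID controllers have their own vendor tools instead. In /proc/mdstat a healthy array shows something like [UU], and a failed member replaces a U with an underscore.

    #!/usr/bin/env python3
    # Quick Linux software-RAID (mdadm) health check via /proc/mdstat.
    # Hardware RAID controllers need their vendor tools instead (not covered here).
    import re
    import sys

    try:
        with open("/proc/mdstat") as f:
            mdstat = f.read()
    except OSError:
        raise SystemExit("/proc/mdstat not found; no Linux software RAID here")

    print(mdstat)

    # A healthy array shows e.g. [UU]; a failed member shows an underscore, e.g. [U_].
    degraded = re.findall(r"\[[U_]*_[U_]*\]", mdstat)
    if degraded:
        print(f"WARNING: degraded array status {degraded} -- replace the failed disk")
        sys.exit(1)
    print("All md arrays report all members up.")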

... Thinkpads_User
0
 
Oramcle (Author) Commented:
Thank you, Thinkpads_User. I am looking for a list of risk factors for each hardware component.
0
 
John (Business Consultant, Owner) Commented:
I don't think there is much in the way of distinct risk factors for each component. They all have mean time between failure numbers (or should have); see the manufacturers' specifications for MTBF figures. But these are just statistics and mean nothing for individual parts, which can fail at any time and cannot be predicted. ... Thinkpads_User
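
To put those MTBF statistics in perspective, here is a back-of-the-envelope calculation assuming a constant (exponential) failure rate; the 1,000,000-hour MTBF is an illustrative number, not from any datasheet. It shows why MTBF says little about any single part yet still predicts roughly how many failures a large fleet will see per year.

    #!/usr/bin/env python3
    # Back-of-the-envelope failure probability from MTBF, assuming a constant
    # (exponential) failure rate. The MTBF figure below is illustrative only.
    import math

    mtbf_hours = 1_000_000          # example MTBF, not from any datasheet
    hours_per_year = 24 * 365

    # Probability that one part fails within a year: 1 - e^(-t/MTBF)
    p_fail_year = 1 - math.exp(-hours_per_year / mtbf_hours)
    print(f"Single part, one year: {p_fail_year:.2%} chance of failure")

    # The same rate across a large fleet gives a fairly predictable count.
    fleet_size = 1000
    expected_failures = fleet_size * p_fail_year
    print(f"Fleet of {fleet_size}: roughly {expected_failures:.0f} failures per year")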
0
 
Oramcle (Author) Commented:
Thank you all.
0