Disk drive reliability overview

DavidPresident
CERTIFIED EXPERT
Don't buy such-and-such disk drives, they are not reliable! Boy, how many times have we read a post from a well-meaning person who had some bad experiences with a drive and now exclaims that the entire company's products can't be trusted? First, consider annual shipment volumes. The major manufacturers each shipped between 50 and 100 million drives in 2009. [Update: Seagate alone shipped over 500,000,000 disk drives from April 2008 to Oct 2010.] Ask yourself whether the blogger could possibly have first-hand experience with more than a vanishingly small fraction of any particular manufacturer's output. If not, how could such statements be statistically relevant?

All these well-intentioned people are correct in one respect.  All disks are unreliable.  Anything that has 100% probability of eventual failure is, by definition, unreliable, is it not?  

What should you do?  First, you can count on drive failure, so invest in RAID1/5/6 technology.  Perform backups, and take backups offsite.  Remember that RAID is not an excuse for not performing backups. All it takes is one stupid typo and a RAID1 array can delete both copies of your precious data in a fraction of a second.  Backups protect against human and computer errors. Archiving (taking backup copies to your parents' house or to a safety deposit box) protects against natural disasters such as fires, floods, and tornadoes (or earthquakes if you live in California instead of Texas).

Next, one should look at drive specifications.  Don't base a buying decision just on speeds and feeds.  Ever notice "enterprise class" or "consumer class" drives when reading the specs?  The bottom line is that enterprise drives are designed for 365x24x7, have 100X more error-correcting circuitry, and are designed with much faster error recovery/remapping circuitry so they play nice behind RAID controllers.   The disks also cost 2-3X more money.

If you read the fine print on consumer drives, you will usually discover that they are rated for a whole 2400 hours of annual use.   Do the math. If you run a computer 24x7, then you exceed specifications unless you turn the computer off in April.  The reduced number of ECC bits means that if you have a 2TB disk and you copy the entire disk back and forth a few times, then you should expect to lose data. An unrecoverable read error rate of 1 bit in 10^14 sounds really good until you consider that 10^14 is only about 6 times more bits than you have on the disk to begin with.
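The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the 2TB disk is counted in decimal terabytes, and the "chance of at least one error" figure assumes errors arrive independently at exactly the quoted worst-case rate, which real drives do not promise.

```python
from datetime import date, timedelta
import math

RATED_HOURS_PER_YEAR = 2400   # consumer duty rating quoted in the article
BITS_PER_ERROR = 1e14         # one unrecoverable read error per 10^14 bits read
DISK_BYTES = 2e12             # 2 TB, decimal (assumption: vendor-style terabytes)


def rated_hours_exhausted_on(year: int) -> date:
    """Date on which a drive run 24x7 from January 1 has used up its rated hours."""
    days = RATED_HOURS_PER_YEAR / 24
    return date(year, 1, 1) + timedelta(days=days)


def expected_read_errors(full_passes: float) -> float:
    """Average number of unrecoverable errors after reading the whole disk N times."""
    bits_read = DISK_BYTES * 8 * full_passes
    return bits_read / BITS_PER_ERROR


def chance_of_at_least_one_error(full_passes: float) -> float:
    """Poisson estimate (assumption: independent errors at the quoted rate)."""
    return 1 - math.exp(-expected_read_errors(full_passes))


if __name__ == "__main__":
    print(rated_hours_exhausted_on(2010))            # 2010-04-11
    print(BITS_PER_ERROR / (DISK_BYTES * 8))         # 6.25 full-disk reads per expected error
    print(round(chance_of_at_least_one_error(1), 2)) # 0.15
```

Under these assumptions a single full read of the disk already carries roughly a 15% chance of hitting an unrecoverable error, and after about six full reads you expect one on average.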

Not a day goes by where I don't hear horror stories from people with overpriced, pimped-up PCs, who paid for pretty lights, unnecessary cores, or extra RAM that won't increase overall performance on anything other than benchmarks.  Ask those same people if they have an external USB HDD for backup and you get funny looks.  My advice to all is to consider whether or not your data is worth the $100 insurance policy it costs to get an external drive for archiving purposes.  Or get an online backup product.  Some are even free. Just back up and put the data offsite.

As for error recovery/reallocation timing: without getting too deep into it, the reason why all the NAS, SAN, appliance vendors, and server manufacturers ship enterprise class drives with their servers that have RAID controllers is not JUST to make more money off of you. Sure, they like the money, but more importantly, they hate returns. Manufacturers spend millions of dollars qualifying hardware. Part of the process involves choosing disks that meet requirements for error recovery timings.

The consumer class disks take too long to recover when they pick up bad blocks.  While enterprise class drives remap bad sectors (use a spare reserved area of disk to replace one that failed) in a few seconds, it is not unusual to have a consumer class disk deal with bad sectors by not responding for 30 seconds or longer.  Some RAID controllers, when presented with a disk drive that is taking 30 seconds to respond to a write request, think that the disk failed, so they kill it.
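To make the timing problem concrete, here is a toy model in Python. It is a sketch, not any vendor's firmware: the Drive class, the controller_verdict function, and the 8-second timeout are all invented for illustration. Only the "few seconds" versus "30 seconds" recovery times come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    recovery_seconds: float  # how long the drive goes silent while remapping a bad sector


def controller_verdict(drive: Drive, timeout_seconds: float) -> str:
    """Toy controller: any drive that stays silent past the timeout is declared dead."""
    if drive.recovery_seconds > timeout_seconds:
        return "failed"
    return "online"


if __name__ == "__main__":
    timeout = 8.0  # hypothetical controller timeout, in seconds
    for drive in (Drive("enterprise", 3.0), Drive("consumer", 30.0)):
        print(drive.name, controller_verdict(drive, timeout))
```

Both drives in this model are healthy and would finish remapping on their own. The only difference is whether they finish before the controller runs out of patience.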

The result is that if you use consumer class drives behind many RAID controllers, the controller will "fail" a perfectly good disk drive just because it took too long to recover.  If you have multi-terabyte configurations, then you risk 100% data loss should a bad block be discovered while the array is in degraded mode as it rebuilds.   This is why you should invest in enterprise disks if using RAID1/10/5.  If your hardware supports RAID levels with 2 redundant disks (RAID6, RAIDZ2), then statistically you can get away with using consumer class drives, just as long as you perform frequent consistency checks.

Also take time to read an HCL (hardware compatibility list).  If the vendor took the time to say XYZ disk with firmware revision whatever is certified, then my gosh, give them the benefit of the doubt that they know more about what disk will work than you do!   Have you seen the source code for the controller firmware, or have you (or your buddy who said it was OK to use consumer drives) seen the SEV-1/2 bug list for firmware rev xxx?  If not, then you have no idea whether or not certain disk errors can have a cascade effect and result in data loss.  Remember, pretty much everything works just fine UNTIL something unexpected happens.

While you are at it, keep up on RAID controller firmware, drivers, and management software.   Always upgrade them as they are released.  Read release notes.  I once saved somebody on EE who was upgrading firmware by telling him to read the release notes first.  He wrote back and thanked me profusely, as the release notes clearly stated that the firmware update should be preceded by a full backup because it would reformat the logical disks due to metadata changes.  Had the user just upgraded the firmware, he would have lost everything.

Also NEVER update firmware if a RAID system is under stress (rebuilding, in degraded mode, or throwing errors).   Never upgrade unless you have a UPS.  You may brick the controller in the event of power loss.  Murphy's laws come into play here.

Sometimes a disk fails HCL because of firmware bugs too.  Unfortunately I am on strict NDA so cannot comment on specific disks and firmware bugs that are only known to large OEM suppliers.  Suffice it to say that there are a heck of a lot of SEVERITY-1 bugs out there.  Update firmware when it is made available (but if using RAID controllers, check with the RAID manufacturer first and do what they tell you to do).

If a data consistency check/rebuild is an unfamiliar term, and you are using a RAID controller, then you should first stop reading, go to your RAID controller, and kick one off.  Now. Then come back.

Consistency checks are vital, and should be run every weekend.  They ensure that if you lose a disk then you won't also lose data.  You see, all disk drives get bad blocks from time to time. That is why they have thousands or tens of thousands of spare blocks built into them. When you tell the RAID controller to do the consistency check, it reads all blocks from all disks, and in the event it discovers an unreadable block of data, it calculates what is supposed to go there from the redundant data (the XOR parity, or the mirror copy on RAID1/10) and "fixes" it.  When the process completes, every block has been read and, where necessary, repaired.    The problem is that if you do not do this, and you have a bad block on disk "A", and disk "B" dies (assuming RAID1/10/5), then there is no way for the controller to reconstruct the bad block on disk "A", so you end up with partial data loss at best.
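For readers who have never seen how parity "fixes" a block, here is a minimal Python sketch of a single-parity (RAID5-style) stripe. The function names are invented for this example, and None stands in for a block the drive could not read. Real controllers work on sectors and rotate parity across disks, which this sketch ignores.

```python
from functools import reduce
from typing import List, Optional

Block = Optional[bytes]  # None marks a block the drive could not read


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: List[bytes]) -> List[Block]:
    """Data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def scrub_stripe(stripe: List[Block]) -> List[Block]:
    """Rebuild at most one unreadable block from the surviving blocks."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if not missing:
        return list(stripe)
    if len(missing) > 1:
        raise ValueError("two unreadable blocks: single parity cannot rebuild this stripe")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
    damaged = list(stripe)
    damaged[1] = None                       # disk "A" has a bad block
    print(scrub_stripe(damaged) == stripe)  # True
```

The ValueError branch is the scenario the paragraph warns about: an unscrubbed bad block on one disk plus the loss of a second disk leaves two holes in the same stripe, and a single parity block cannot fill both.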

At worst, when your controller kicks off the rebuild, it will halt and you end up with 100% data loss because it can't handle this situation.  (That is why you should purchase premium RAID controllers that have the intelligence to deal with such things.)

RAID controllers on motherboards should be avoided except for RAID1.  They simply do not have the intelligence to deal with many common error recovery scenarios.  As a former RAID firmware architect, I can tell you that 75% of the firmware in a controller deals with error situations.  Spend $500+ for a controller and you get one that can pretty much deal with anything and recover gracefully.   Spend $10 for a controller (like what you end up with on a motherboard), and error recovery isn't much more than the pseudocode, "IF IO_SUCCESS_STATUS != TRUE THEN KILL DISK".
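The contrast can be sketched in Python. The naive policy below is a direct translation of the pseudocode above. The patient policy is only my guess at the simplest possible improvement (retry, then fall back to redundancy); it is not taken from any real controller firmware, and every name in it is hypothetical.

```python
from typing import Callable


def naive_policy(io_succeeded: bool) -> str:
    """The $10 controller: one failed I/O and the disk is gone."""
    return "online" if io_succeeded else "kill disk"


def patient_policy(read_block: Callable[[], bool], max_retries: int = 3) -> str:
    """Hypothetical gentler policy: retry first, then repair from redundancy."""
    for _ in range(1 + max_retries):
        if read_block():
            return "online"
    return "rebuild block from redundancy and remap"


def flaky_reader(failures_before_success: int) -> Callable[[], bool]:
    """Test helper: a read that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def read_block() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return read_block


if __name__ == "__main__":
    print(naive_policy(False))              # kill disk
    print(patient_policy(flaky_reader(2)))  # online
```

Even this small step keeps a disk with one transient read error in the array. Real firmware has to handle far more cases than this, which is where the 75% figure comes from.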

Finally, a note about "data recovery software".  This is the software you run that "recovers" bad blocks and fixes drives.  I will write something more in depth on it in the future, but the quick sound bite is that all it does is read and re-read and re-read and re-read all the blocks on the disk in an attempt to "recover" the data.  The software cannot help you if the drive won't appear in the BIOS.  It won't help if the disk is making funny noises.  In fact, it will make things much worse.  If the disk had a media or head crash, then every moment the disk is powered up could result in further data loss.  The bottom line on drive failures is that software recovery is good for potentially recovering from an "error reading block XYZ" problem.  For any other issue, you need the guys in the bunny suits and a check for $500-$1000.







Comments (28)

Rob HutchinsonTech Lead, Desktop Support
CERTIFIED EXPERT

Commented:
Good article, but I still remember the Hitachi DeathStar (DeskStar) laptop drives and avoid them like the plague. Maybe these drives had a higher than normal failure rate, but whenever we got called to help a user with a bad laptop hard drive a few years ago, it was almost always this model. Hmm, or maybe it was because this was the drive that shipped most at the time with these laptops? Nah, it was the evil Hitachi DeathStars!

Commented:
Very Nice article and very informative, thank you, a BIG yes from me.
DavidPresident
CERTIFIED EXPERT
Top Expert 2010

Author

Commented:
Here is the Google temperature vs. failure graph.  Google uses standard desktop drives also.

It supports my statements, at least until the HDD gets to 45 degrees C, then it starts trending up slowly...
[Figure: temperature vs. failure rate chart, from the Google reliability study; data copyright google.com]
Here is the location of the entire paper that contains the chart above: http://labs.google.com/papers/disk_failures.pdf

Actually the issue with the IBM deskstars is that the 65GXP series (before they became Hitachi) had some serious QC issues.  I remember this because I used IBM drives almost exclusively especially on client desktops (except some Quantum, Maxtor and Seagate drives on other higher end applications.)

After IBM sold the drive operations to Hitachi, as far as I could tell, a huge issue came up with drives being sold to Taiwanese clients having some sort of spyware embedded in their firmware or some other such issue. I remember reading a myriad of articles, but it has been a LONG time. (I'd like to say 2006, but it really is just fuzzy memories at this point.)

Sure, we all miss the days of Quantum and Maxtor being in business on their own, but hey, that's the price of doing business these days.  People age, people decide to quit the biz, people on the board want to retire, and a sell-off seems more profitable than another heart attack.  All sorts of things come up.  We live in a world populated by human beings, each with individual needs and wants, and there comes dlethe's comment that each of us is by necessity defined by our experiences.  Never mind that many of our experiences are limited to a certain set of knowns, and framed by a much larger set of beliefs... beliefs by definition being that which we are told, or we assume but do not know, for sure.  (For example, I did not live in Taiwan and take part in the traffic shaping/verification efforts which resulted in the articles I read back then.  I do not know for sure they are even true.)

Overall I find that most people whining about a certain consumer grade drive being bad don't realize that drives are like ammunition.  A manufacturer makes millions of units of product, be it drives or bullets (as dlethe mentions above), and just like ammunition (which is a far simpler product which nonetheless, in certain failure modes, can kill or severely injure the user) these come in batches.  Occasionally a bad batch comes out.  Some of you may remember the old desktop greencore (aluminum trace) Athlons which were volcanic piles of ash with astonishing frequency, or the early purple cores (copper trace), of which certain batches were overclockable to the limits of one's cooling capability (we pushed one in the lab to more than twice its clock speed running a refrigerated cooler).  Some of you may remember explicitly ordering those specific processors for your own home performance desktop.

This same issue arises over and over with various devices, firmware revisions, batch numbers, lot numbers, you name it.  Many times the whining on places such as Newegg or Tiger Direct starts because a certain bad batch showed up and had serious issues; sometimes it's people with basic or nonexistent knowledge rating themselves as a 5/5 tech expert.  Sometimes a certain troublesome employee wasn't packing them right, or was mishandling product, or you name it, it's been done.  Sometimes the users didn't read that a particular product line (Western Digital Red NAS drives, or WD Caviar/Scorpio Black drives) was not intended to be installed into surveillance DVRs despite being rated for 365/24/7 uptime.  (Having a drive hang while it deals with a bad bit instead of simply dropping a frame and moving on during a break-in can result in the surveillance system failing in its primary mission, surveilling a particular event.)  Meanwhile, having a surveillance drive in your desktop because it was cheaper can likewise cause all sorts of nastiness when you put important data on said drive.

I can't begin to say how many articles I read just on Newegg where people rate themselves tech experts and yet fail simple scope consistency checks.  As in, plugged the wrong drive into the wrong system for the wrong reasons (cheap, didn't read drive description, didn't know what it actually did, etc.)  I have had one system on my bench recently into which someone plugged an AV drive.  Then again, I live and work in an area where people make buying decisions solely based on price.  I've got two CAD guys on contract ever since we swapped out their desktop drives in their workstations for actual enterprise class drives.  Happy customers now.
DavidPresident
CERTIFIED EXPERT
Top Expert 2010

Author

Commented:
NorthPlateau - Your use of jargon indicates you're an old-timer like me (old as in experience in the industry) with insider knowledge.  Thanks for the war stories.

Remember the stiction problem, or when SATA disks first came out and they were 'only' going to be used for near line storage?  Nobody imagined people would be silly enough to use them as primary storage.
