
Disk drive reliability overview

Awarded: Editor's Choice, Community Pick
Don't buy such-and-such disk drives, they are not reliable! Boy, how many times have we read a post from a well-meaning person who had some bad experiences with a drive and now exclaims that the entire company's products can't be trusted? First, consider annual shipment volumes. The major manufacturers each shipped between 50 and 100 million drives in 2009. [Update: Seagate alone shipped over 500,000,000 disk drives from April 2008 to Oct 2010.] Ask yourself whether the blogger could possibly have first-hand experience with more than a minuscule fraction of any particular manufacturer's output. Even fifty drives would be only 0.0001 percent of 50 million, so how could such statements be statistically relevant?

All these well-intentioned people are correct in one respect.  All disks are unreliable.  Anything that has 100% probability of eventual failure is, by definition, unreliable, is it not?  

What should you do? First, you can count on drive failure, so invest in RAID1/5/6 technology. Perform backups, and take backups offsite. Remember that RAID is not an excuse for not performing backups. All it takes is one stupid typo and a RAID1 array will delete both copies of your precious data in a fraction of a second. Backups protect against human and computer errors. Archiving (taking backup copies to your parents' house or to a safety deposit box) protects against natural disasters such as fires, floods, and tornadoes (or earthquakes if you live in California instead of Texas).

Next, look at drive specifications. Don't base a buying decision just on speeds and feeds. Ever notice "enterprise class" or "consumer class" drives when reading the specs? The bottom line is that enterprise drives are designed for 24x7x365 operation, have 100X more error-correcting circuitry, and are designed with much faster error recovery/remapping circuitry so they play nicely behind RAID controllers. The disks also cost 2-3X more money.

If you read the fine print on consumer drives, you will usually discover that they are rated for a whole 2400 hours of annual use. Do the math. If you run a computer 24x7, then you exceed specifications unless you turn the computer off in April. The reduced number of ECC bits means that if you have a 2TB disk and you copy the entire disk back and forth a few times, you should expect to lose data. A bit error rate of 1 unrecoverable error per 10^14 bits read sounds like a really big number until you consider that this is only about 6 times more bits than you have on the disk to begin with.
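For anyone who wants to check the math, here is a quick sketch. It assumes decimal terabytes (1 TB = 10^12 bytes) and reads the error spec as one unrecoverable error per 10^14 bits read; the variable names are mine, not from any datasheet.

```python
# Back-of-the-envelope check of the duty-cycle and bit-error figures.
# Assumptions: 1 TB = 10**12 bytes, and a spec of one unrecoverable
# read error per 10**14 bits read.

RATED_HOURS_PER_YEAR = 2400
DISK_BYTES = 2 * 10**12        # a 2TB disk
BITS_PER_ERROR = 10**14

# Running 24 hours a day, how many days until the rated hours are used up?
days_until_rated_hours_used = RATED_HOURS_PER_YEAR / 24

# How many full passes over the disk add up to 10**14 bits?
disk_bits = DISK_BYTES * 8
full_reads_per_expected_error = BITS_PER_ERROR / disk_bits

print(days_until_rated_hours_used)    # 100.0 days, i.e. around April 10
print(full_reads_per_expected_error)  # 6.25 full passes over the disk
```

Note that copying a disk "back and forth" reads every bit once per pass, so a handful of full copies is all it takes to reach the rated figure.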

Not a day goes by where I don't hear horror stories from people with overpriced, pimped-up PCs who paid for pretty lights, unnecessary cores, or extra RAM that won't increase overall performance on anything other than benchmarks. Ask those same people if they have an external USB HDD for backup and you get funny looks. My advice to all is to consider whether or not your data is worth the $100 insurance policy it costs to get an external drive for archiving purposes. Or get an online backup product. Some are even free. Just back up and put the data offsite.

As for error recovery/reallocation timing ... without getting too deep into it, the reason why all the NAS, SAN, appliance vendors, and server manufacturers ship enterprise class drives with their servers that have RAID controllers is not JUST to make more money off of you. Sure, they like the money, but more importantly, they hate returns. Manufacturers spend millions of dollars qualifying hardware. Part of the process involves choosing disks that meet requirements for error recovery timings.

The consumer class disks take too long to recover when they pick up bad blocks.  While enterprise class drives remap bad sectors (use a spare reserved area of disk to replace one that failed) in a few seconds, it is not unusual to have a consumer class disk deal with bad sectors by not responding for 30 seconds or longer.  Some RAID controllers, when presented with a disk drive that is taking 30 seconds to respond to a write request, think that the disk failed, so they kill it.

The result is that if you use consumer class drives behind many RAID controllers, the controller will "fail" a perfectly good disk drive just because it took too long to recover. If you have multi-terabyte configurations, then you risk 100% data loss should a bad block be discovered while the array is in degraded mode as it rebuilds. This is why you should invest in enterprise disks if using RAID1/10/5. If your hardware supports RAID levels with 2 redundant disks (RAID6, RAIDZ2), then statistically, you can get away with using consumer class drives ... just as long as you perform frequent consistency checks.
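To put a rough number on the degraded-mode risk, here is a small sketch. It is a simplification, not a vendor formula: it assumes every bit read fails independently at the rated rate (one unrecoverable error per 10^14 bits for a consumer disk), and that a rebuild reads every surviving disk end to end. The function name and parameters are hypothetical.

```python
import math

def rebuild_ure_probability(surviving_disks: int, disk_tb: float,
                            bits_per_error: float = 1e14) -> float:
    """Chance of at least one unrecoverable read error during a rebuild.

    Simplified model (an assumption): independent bit errors at the
    rated rate, and every surviving disk is read in full.
    """
    bits_read = surviving_disks * disk_tb * 1e12 * 8
    p_bit = 1.0 / bits_per_error
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-p_bit))

# RAID5 with four 2TB consumer disks: one died, three must be read in full.
print(round(rebuild_ure_probability(3, 2.0), 2))
# Same array with disks rated 100X better (hypothetical 1 per 10^16 bits).
print(round(rebuild_ure_probability(3, 2.0, bits_per_error=1e16), 4))
```

Under these assumptions the consumer array comes out at better than a one-in-three chance of tripping over a bad block mid-rebuild, while the better-rated disks are well under one percent. Real drives often beat their spec sheet, so treat this as a worst-case illustration.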

Also take time to read an HCL (hardware compatibility list). If the vendor took the time to say XYZ disk with firmware revision whatever is certified, then my gosh, give them the benefit of the doubt that they know more about what disk will work than you do! Have you seen source code for the controller firmware, or have you (or your buddy who said it was OK to use consumer drives) seen the SEV-1/2 bug list for firmware rev xxx? If not, then you have no idea whether or not certain disk errors can have a cascade effect and result in data loss. Remember, pretty much everything works just fine UNTIL something unexpected happens.

While you are at it, keep up on RAID controller firmware, drivers, and management software. Upgrade them as they are released, but read the release notes first. I once saved somebody on EE who was upgrading firmware by telling him to read the release notes first. He wrote back and thanked me profusely, as the release notes clearly stated that the firmware update should be preceded by a full backup because it would reformat the logical disks due to metadata changes. Had the user just upgraded the firmware, he would have lost everything.

Also, NEVER update firmware if a RAID system is under stress (rebuilding, in degraded mode, or throwing errors). Never upgrade unless you have a UPS. You may brick the controller in the event of power loss. Murphy's law comes into play here.

Sometimes a disk fails HCL qualification because of firmware bugs too. Unfortunately I am under strict NDA, so I cannot comment on specific disks and firmware bugs that are only known to large OEM suppliers. Suffice it to say that there are a heck of a lot of SEVERITY-1 bugs out there. Update firmware when it is made available (but if using RAID controllers, check with the RAID manufacturer first ... do what they tell you to do).

If a data consistency check/rebuild is an unfamiliar term, and you are using a RAID controller, then you should first stop reading, go to your RAID controller, and kick one off.  Now. Then come back.

Consistency checks are vital, and should be run every weekend. They ensure that if you lose a disk then you won't also lose data. You see, all disk drives get bad blocks from time to time. That is why they have thousands or tens of thousands of spare blocks built into them. When you tell the RAID controller to do the consistency check, it reads all blocks from all disks, and in the event it discovers an unreadable block of data, it calculates what is supposed to go there from the redundant data (the mirror copy or the XOR parity) and "fixes" it. When the process completes, every block on every disk has been read and, where necessary, repaired. The problem is that if you do not do this, and you have a bad block on disk "A", and disk "B" dies (assuming RAID1/10/5), then there is no way for a controller to fix the bad block on disk "A", so you end up with partial data loss at best.

At worst, when your controller kicks off the rebuild, it will halt and you end up with 100% data loss because it can't handle this situation. (That is why you should purchase premium RAID controllers that have the intelligence to deal with such things.)
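If the XOR parity part sounds like magic, here is a toy sketch of the idea. It is not controller firmware: the three 4-byte "blocks" and the helper name xor_blocks are made up for illustration, and a real array works on full stripes with rotating parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks (one per disk) plus the parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# A consistency check finds that the block on disk 1 is unreadable.
# XOR of everything that is still readable gives the missing block back.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

The catch is that this only works with exactly one missing piece per stripe. If disk "B" is dead and disk "A" also has an unreadable block in the same stripe, there are two unknowns and one equation, which is the data loss scenario described above.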

RAID controllers on motherboards should be avoided except for RAID1. They simply do not have the intelligence to deal with many common error recovery scenarios. As a former RAID firmware architect, I can tell you that 75% of the firmware in a controller deals with error situations. Spend $500+ for a controller and you get one that can pretty much deal with anything and recover gracefully. Spend $10 for a controller (like what you end up with on a motherboard), and error recovery isn't much more than the pseudocode, "IF IO_SUCCESS_STATUS !=TRUE THEN KILL DISK".
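To make the contrast concrete, here is a deliberately simplified sketch of the two attitudes. Everything in it is hypothetical (the status names, the action strings, the retry count); no real controller is this simple, and this is not taken from any vendor's firmware.

```python
from enum import Enum

class IoStatus(Enum):
    OK = "ok"
    TIMEOUT = "timeout"          # disk went quiet, e.g. busy remapping a sector
    MEDIA_ERROR = "media_error"  # disk reported an unreadable block

def naive_policy(status: IoStatus) -> str:
    # The $10 approach: anything other than success fails the whole disk.
    return "keep" if status is IoStatus.OK else "kill_disk"

def tolerant_policy(status: IoStatus, retries_left: int) -> str:
    # A more forgiving approach: repair or retry before giving up on the disk.
    if status is IoStatus.OK:
        return "keep"
    if status is IoStatus.MEDIA_ERROR:
        # One bad block is not a dead disk: rewrite it from mirror/parity.
        return "rebuild_block_from_redundancy"
    if retries_left > 0:
        return "retry"
    return "kill_disk"

print(naive_policy(IoStatus.TIMEOUT))        # kill_disk
print(tolerant_policy(IoStatus.TIMEOUT, 2))  # retry
```

The real thing has to cover far more cases than these three statuses, which is where that 75% of the firmware goes.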

Finally, a note about "data recovery software". This is the software you run that "recovers" bad blocks and fixes drives. I will write something more in depth on it in the future, but the quick sound bite is that all it does is read and re-read and re-read and re-read all the blocks on the disk in an attempt to "recover" the data. The software cannot help you if the drive won't appear in the BIOS. It won't help if the disk is making funny noises. In fact, it can make things much worse. If the disk had a media or head crash, then every moment the disk is powered up could result in further data loss. The bottom line on drive failures is that software recovery is good for potentially recovering from an "error reading block XYZ" problem. Any other issue and you need the guys in the bunny suits and a check for $500-$1000.





Author: David