Disk drive reliability overview

  • By dlethe
  • Type: General
  • Posted on 2010-03-27 at 05:09:35
Awards
  • Community Pick
  • Experts Exchange Approved
  • Editor's Choice
"Don't buy such-and-such disk drives, they are not reliable!" Boy, how many times have we read a post from a well-meaning person who had a bad experience with a drive and now proclaims that the entire company's products can't be trusted? First, consider annual shipment volumes. The major manufacturers each shipped between 50 and 100 million drives in 2009. [Update: Seagate alone shipped over 500,000,000 disk drives from April 2008 to Oct 2010.] Ask yourself whether the blogger could possibly have first-hand experience with more than a minuscule fraction of a percent of any particular manufacturer's output. How could such statements be statistically relevant?
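To get a feel for the scale, here is a minimal sketch in Python. The count of 20 drives handled by one person is an assumption made up for illustration; the 50 million figure is the low end of the 2009 shipment range quoted above.

```python
def sample_percent(drives_seen: int, drives_shipped: int) -> float:
    """Percentage of a manufacturer's shipments that one person has handled."""
    return 100.0 * drives_seen / drives_shipped


# Assumed: 20 drives seen first-hand, 50 million shipped in one year.
print(f"{sample_percent(20, 50_000_000):.5f} percent")  # 0.00004 percent
```

Even a busy technician who has handled a few thousand drives is still looking at a sample of well under one hundredth of a percent.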

All these well-intentioned people are correct in one respect.  All disks are unreliable.  Anything that has 100% probability of eventual failure is, by definition, unreliable, is it not?  

What should you do? First, you can count on drive failure, so invest in RAID1/5/6 technology. Perform backups, and take backups offsite. Remember that RAID is not an excuse for skipping backups. All it takes is one stupid typo, and a RAID1 array will delete both copies of your precious data in a fraction of a second. Backups protect against human and computer errors. Archiving (taking backup copies to your parents' house or to a safety deposit box) protects against natural disasters such as fires, floods, and tornadoes (or earthquakes if you live in California instead of Texas).

Next, look at drive specifications. Don't base a buying decision just on speeds and feeds. Ever notice "enterprise class" or "consumer class" when reading the specs? The bottom line is that enterprise drives are designed to run 24x7x365, have 100X more error-correcting circuitry, and have much faster error recovery/remapping circuitry so they play nice behind RAID controllers. They also cost 2-3X more.

If you read the fine print on consumer drives, you will usually discover that they are rated for a whole 2400 hours of annual use. Do the math: 2400 hours is 100 days, so if you run a computer 24x7 you exceed the specification unless you turn the computer off in April. The reduced number of ECC bits means that if you have a 2TB disk and you copy the entire disk back and forth a few times, you can expect to lose data. A rating of one unrecoverable read error per 1 x 10^14 bits sounds like a really big number until you consider that this is only about 6 times more bits than you have on the disk to begin with.
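The arithmetic is easy to check. The sketch below uses the figures from the paragraph above (2400 rated hours, a 2TB disk, one error per 10^14 bits); the function names are made up for illustration, and the disk size is taken as decimal terabytes.

```python
from datetime import date, timedelta

RATED_HOURS_PER_YEAR = 2400   # consumer duty rating quoted in the article
DISK_BYTES = 2 * 10**12       # 2 TB disk, decimal terabytes (assumption)
BITS_PER_ERROR = 10**14       # one unrecoverable read error per 1e14 bits


def rating_exhausted_on(year: int) -> date:
    """Date on which a drive running 24x7 from 1 January uses up its rated hours."""
    return date(year, 1, 1) + timedelta(hours=RATED_HOURS_PER_YEAR)


def full_reads_per_expected_error() -> float:
    """Number of complete passes over the disk that add up to one expected error."""
    return BITS_PER_ERROR / (DISK_BYTES * 8)


print(rating_exhausted_on(2010))        # 2010-04-11
print(full_reads_per_expected_error())  # 6.25
```

This is an expectation, not a guarantee: a given drive may do better or worse than its rated error rate.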

Not a day goes by that I don't hear horror stories from people with overpriced, pimped-up PCs, who paid for pretty lights, unnecessary cores, or extra RAM that won't increase performance on anything other than benchmarks. Ask those same people if they have an external USB HDD for backup and you get funny looks. My advice to all is to consider whether your data is worth the $100 insurance policy it costs to get an external drive for archiving purposes. Or get an online backup product. Some are even free. Just back up and put the data offsite.

As for error recovery/reallocation timing: without getting too deep into it, the reason all the NAS, SAN, appliance, and server vendors ship enterprise class drives with systems that have RAID controllers is not JUST to make more money off of you. Sure, they like the money, but more importantly, they hate returns. Manufacturers spend millions of dollars qualifying hardware. Part of the process involves choosing disks that meet requirements for error recovery timings.

The consumer class disks take too long to recover when they pick up bad blocks.  While enterprise class drives remap bad sectors (use a spare reserved area of disk to replace one that failed) in a few seconds, it is not unusual to have a consumer class disk deal with bad sectors by not responding for 30 seconds or longer.  Some RAID controllers, when presented with a disk drive that is taking 30 seconds to respond to a write request, think that the disk failed, so they kill it.

The result is that if you use consumer class drives behind many RAID controllers, the controller will "fail" a perfectly good disk drive just because it took too long to recover. If you have a multi-terabyte configuration, then you risk 100% data loss should a bad block be discovered while the array is in degraded mode or rebuilding. This is why you should invest in enterprise disks if using RAID1/10/5. If your hardware supports RAID levels with 2 redundant disks (RAID6, RAIDZ2), then statistically you can get away with using consumer class drives ... just as long as you perform frequent consistency checks.
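To put a rough number on the rebuild risk, here is a sketch in Python. It assumes bit errors are independent and occur at exactly the rated rate of one per 10^14 bits, which is a simplification, and the array layout (RAID5 of four 2TB disks) is made up for illustration.

```python
import math


def p_read_error(bytes_to_read: float, bits_per_error: float = 1e14) -> float:
    """Probability of at least one unrecoverable read error while reading
    bytes_to_read, assuming independent errors at the rated rate."""
    bits = bytes_to_read * 8
    # 1 - (1 - 1/bits_per_error) ** bits, computed without losing precision
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


# Assumed array: RAID5 of four 2 TB disks. After one disk dies,
# the rebuild must read all three survivors end to end.
surviving_bytes = 3 * 2e12
print(f"{p_read_error(surviving_bytes):.2f}")  # 0.38
```

Under these assumptions, more than one rebuild in three would hit an unreadable block with no redundancy left to repair it, which is the case for a second redundant disk and for frequent consistency checks.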

Also take time to read the HCL (hardware compatibility list). If the vendor took the time to say that XYZ disk with firmware revision whatever is certified, then my gosh, give them the benefit of the doubt that they know more about which disks will work than you do! Have you seen the source code for the controller firmware? Have you (or your buddy who said it was OK to use consumer drives) seen the SEV-1/2 bug list for firmware rev xxx? If not, then you have no idea whether certain disk errors can have a cascade effect and result in data loss. Remember, pretty much everything works just fine UNTIL something unexpected happens.

While you are at it, keep up on RAID controller firmware, drivers, and management software. Upgrade them as they are released, but read the release notes first. I once saved somebody on EE who was about to upgrade firmware by telling him to read the release notes first. He wrote back and thanked me profusely, as the release notes clearly stated that the firmware update should be preceded by a full backup because it would reformat the logical disks due to metadata changes. Had the user just upgraded the firmware, he would have lost everything.

Also NEVER update firmware if a RAID system is under stress (rebuilding, in degraded mode, or throwing errors). Never upgrade unless you have a UPS with battery backup. You may brick the controller in the event of power loss. Murphy's law comes into play here.
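The firmware rules above can be written down as a simple pre-flight checklist. This is only a sketch: the `ArrayState` fields are hypothetical names for things you would check by hand or through your controller's own management tool, not any real vendor API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArrayState:
    """Hypothetical snapshot of a RAID system; field names are illustrative."""
    rebuilding: bool
    degraded: bool
    throwing_errors: bool
    on_ups: bool
    release_notes_read: bool
    full_backup_done: bool


def firmware_update_blockers(state: ArrayState) -> List[str]:
    """Return the reasons not to flash firmware right now (empty means go)."""
    blockers = []
    if state.rebuilding:
        blockers.append("array is rebuilding")
    if state.degraded:
        blockers.append("array is in degraded mode")
    if state.throwing_errors:
        blockers.append("array is throwing errors")
    if not state.on_ups:
        blockers.append("no UPS with battery backup")
    if not state.release_notes_read:
        blockers.append("release notes not read")
    if not state.full_backup_done:
        blockers.append("no full backup taken")
    return blockers


healthy = ArrayState(False, False, False, True, True, True)
stressed = ArrayState(True, True, False, False, True, False)
print(firmware_update_blockers(healthy))   # []
print(firmware_update_blockers(stressed))
```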

Sometimes a disk fails HCL qualification because of firmware bugs too. Unfortunately I am under strict NDA, so I cannot comment on specific disks and firmware bugs that are known only to large OEM suppliers. Suffice it to say that there are a heck of a lot of SEVERITY-1 bugs out there. Update disk firmware when it is made available (but if using RAID controllers, check with the RAID manufacturer first ... do what they tell you to do).

If a data consistency check/rebuild is an unfamiliar term, and you are using a RAID controller, then you should first stop reading, go to your RAID controller, and kick one off.  Now. Then come back.

Consistency checks are vital, and should be run every weekend. They ensure that if you lose a disk you won't also lose data. You see, all disk drives get bad blocks from time to time. That is why they have thousands or tens of thousands of spare blocks built into them. When you tell the RAID controller to do a consistency check, it reads all blocks from all disks, and if it discovers an unreadable block, it calculates what is supposed to go there from the redundant data (the mirror copy or the XOR parity) and "fixes" it. When the process completes, you have 100% error-free data. The problem is that if you do not do this, and you have a bad block on disk "A", and disk "B" dies (assuming RAID1/10/5), then there is no way for the controller to reconstruct the bad block on "A", so you end up with partial data loss at best.
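Here is a toy illustration of the XOR repair step, in Python. It models one stripe of a single-parity array as a list of equal-sized blocks, with `None` standing in for a block the disk could not read. The function names are made up for this sketch; real controller firmware is far more involved.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(stripe: List[Optional[bytes]]) -> List[Optional[bytes]]:
    """Repair a stripe (data blocks plus one parity block).

    None marks an unreadable block. With single parity, exactly one
    missing block can be rebuilt by XOR-ing all the others together.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if not missing:
        return list(stripe)
    if len(missing) > 1:
        raise ValueError("two unreadable blocks: single parity cannot recover")
    readable = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(readable)
    return repaired


d0, d1 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks([d0, d1])

# Healthy array, one bad block on disk "A": the scrub repairs it.
print(scrub_stripe([None, d1, parity])[0] == d0)  # True

# Same bad block on "A", but disk "B" has died: nothing left to rebuild from.
try:
    scrub_stripe([None, None, parity])
except ValueError as err:
    print(err)
```

The second case is the scenario described above: the bad block was harmless while redundancy existed, and fatal once it did not.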

At worst, when your controller kicks off the rebuild, it will halt and you end up with 100% data loss because it can't handle this situation. (That is why you should purchase premium RAID controllers that have the intelligence to deal with such things.)

RAID controllers on motherboards should be avoided except for RAID1. They simply do not have the intelligence to deal with many common error recovery scenarios. As a former RAID firmware architect, I can tell you that 75% of the firmware in a controller deals with error situations. Spend $500+ for a controller and you get one that can pretty much deal with anything and recover gracefully. Spend $10 for a controller (like what you end up with on a motherboard), and error recovery isn't much more than the pseudocode, "IF IO_SUCCESS_STATUS != TRUE THEN KILL DISK".
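To make the contrast concrete, here is a sketch in Python of that one-line policy next to a slightly less naive one that retries first. Both functions and the simulated disk are invented for illustration; a real premium controller does much more than retry (remapping, rebuilding from parity, and so on).

```python
from typing import Callable


def naive_policy(io: Callable[[], bool]) -> str:
    """The $10 controller: any failed I/O kills the disk."""
    return "online" if io() else "failed"


def retrying_policy(io: Callable[[], bool], max_retries: int = 3) -> str:
    """Retry a failed I/O a few times before declaring the disk dead."""
    for _ in range(1 + max_retries):
        if io():
            return "online"
    return "failed"


def flaky_io(failures_before_success: int) -> Callable[[], bool]:
    """Simulated disk I/O that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def io() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return io


print(naive_policy(flaky_io(1)))     # failed
print(retrying_policy(flaky_io(1)))  # online
```

One transient hiccup is enough for the naive policy to drop a good disk and push the array into degraded mode.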

Finally, a note about "data recovery software". This is the software you run that "recovers" bad blocks and fixes drives. I will write something more in depth on it in the future, but the quick sound bite is that all it does is read and re-read and re-read and re-read the blocks on the disk in an attempt to "recover" the data. The software cannot help you if the drive won't appear in the BIOS. It won't help if the disk is making funny noises. In fact, it will make things much worse. If the disk had a media or head crash, then every moment the disk is powered up could result in further data loss. The bottom line on drive failures is that software recovery can potentially help with an "error reading block XYZ" problem. For any other issue you need the guys in the bunny suits and a check for $500-$1000.





Tags: disks, disk drives, reliability, failure, data recovery, RAID, disk errors, bad blocks

Topic: Hard Drives & Storage

Views: 5196

Comments

Expert Comment

by: evilrix on 2010-03-28 at 11:46:50ID: 12044

A must read for anyone who cares about their data. I voted yes!

Expert Comment

by: demazter on 2010-03-29 at 06:06:31ID: 12103

An excellent well written article.
You got my vote :)

>>Maxtor HDDs because he always found them "unreliable", and user should go with hitachi
Really? I would go with Maxtor over Hitachi any day :))

Just kidding! :)

Expert Comment

by: alanhardisty on 2010-03-29 at 06:11:02ID: 12104

Hmmm - I had issues with Maxtor - 3 drives dying within 2 weeks of each other - but I was buying the cheapo versions.  Ironically, I like the Seagate ones - made by the same people who produce the Maxtor drives.

Author Comment

by: dlethe on 2010-03-29 at 06:25:47ID: 12105

alan - I spent years working for a RAID vendor that purchased 500,000+ Maxtor drives PER MONTH for a while. Nothing wrong with the company as a whole, but they also purchased enterprise class.  As for drives dying within 2 weeks of each other. This is to be expected. As I have stated before ... they are probably same engineering run, same I/O, same temperature, same duty cycle.  When one disk in a RAID dies, the others may not be far behind because of that.

Expert Comment

by: alanhardisty on 2010-03-29 at 06:29:44ID: 12106

Indeed - I read your article with a guilty conscience - I too had been guilty of not favouring a particular manufacturer's drives (no clues as to who) but now have seen the light.

Thanks for correcting my mistake and for writing a great article.

Alan

Expert Comment

by: redrumkev on 2010-03-29 at 15:12:59ID: 12137

dlethe,

This is an outstanding article. You really put into perspective that what users see as a large company that doesn't care and puts out bad products, is really just a bad "luck of the draw".

On a side note, I think that I have "known" a lot of this stuff, but never really put it together. Thank you for taking the time to do just that!

Kevin

Author Comment

by: dlethe on 2010-03-29 at 16:36:02ID: 12160

Glad to help.  This is really a subset of things I learned over the years.  I am tempted to turn it into a bullet list that doesn't go into a whole lot of detail, to keep it clean, but that links to a shorter article on each item, many of which are worthy of a paper.

Other topics
 * S.M.A.R.T. - as I told the moderator, the vast majority of comments on S.M.A.R.T. are incorrect or misleading.
 * More on RAID, from the perspective of an architect: what will likely work, what will not work, and what has never been tested
 * All those data recovery programs: what they really do, how they work, and when you are wasting your time

Any good suggestions on a title or other topics? "David's Disk & RAID Truths" isn't very catchy

Author Comment

by: dlethe on 2010-03-29 at 16:37:57ID: 12161

P.S. Got LOTS of details on why consumer-class disks really fail, how temperature comes into play; burn-in importance; how busy and idle disks affect longevity; real-world MTBF
This is going to be either 2nd or 3rd article I do next

Expert Comment

by: alanhardisty on 2010-03-29 at 16:50:29ID: 12164

Belated Vote for Helpful from me - looking forward to the next ones already.

Alan

Expert Comment

by: redrumkev on 2010-03-29 at 19:13:08ID: 12172

David,

How about:

"RAID - A magic pill or proper planning?"

"The evolution of a Disk Junkie" (your life story, or progression of knowledge)

As for a topic, what would be the best solution for the average developer, home user? Example, two notebooks and 1 or 2 desktops in a home. The notebooks are wireless and the desktops are wired... what solution would exist that "does it all" but is "easy to use"... in terms of the hardware. Low cost (pros and cons), mid cost (pros and cons) and high cost (pros and cons), maybe a matrix graphic to illustrate, something like this:
http://www.ad-lister.co.uk/Shared/UserImages/35bc0e55-b13c-4b45-a7e7-bab423d29fa9/Img/microsoft/office2007_compare3.gif

Where the top row is one of the following:
In desktop raid 1
In desktop raid 5
External (dumb) drive
External (dumb) drive (with built in raid 1)
Managed storage (NAS/SAN) etc.

The side column could be:
Cost
Up-time
Rebuild time
Maintenance
Technical Knowledge Needed, etc.

Then just put a number in each box, or a check if those features are available. Or even something like:
NR (not recommended)
V (good)
VV (better)
VVV (best)

See attached.

Kevin

    Author Comment

    by: dlethe on 2010-03-29 at 19:41:53ID: 12176

    Suggestions on topics are good, but only if I break up Disk & RAID, and that probably wouldn't be a bad idea.  List maybe 25 "things you probably didn't know, but should know about RAID and DISKs"

    As for the "best solution".  Not going down there.  Too many variables, and the minute I post it, the doc would be obsolete ... just like everything else in this industry.  To do a decent matrix, I probably need 6 or 7 dimensions, not 2 :)

    Expert Comment

    by: redrumkev on 2010-03-29 at 20:49:31ID: 12177

    On that note, i think an article about "things you probably didn't know, but should know about RAID and DISKs", would be outstanding. From there, people can make their own decisions, but you are simply providing facts!

    I agree that an obsolete "king of all posts" would not be a good way to go!

    I look forward to this... love to learn!

    Kevin

    Expert Comment

    by: younghv on 2010-03-30 at 03:11:06ID: 12308

    Sign me up as another guilty one about "that" manufacturer...or "that other" one.
    Excellent piece and looking forward to more.

    Yes vote above.

    Expert Comment

    by: rizla7 on 2010-03-31 at 20:14:23ID: 12512

    you dont especially need guys in bunny suits. just a harddrive with an approximately similar controller board. rip the plates out, blow as much dust out as you can and copy the data before the new one fails ;p

    Author Comment

    by: dlethe on 2010-03-31 at 20:19:49ID: 12513

    Well it depends on the disk, with the 1-2TB SATA drives, the tolerance is so tight that it just isn't as easy as it used to be even a few years ago.  

    Expert Comment

    by: Jerkson on 2010-04-01 at 06:31:46ID: 12532

    dlethe,

    Great article, and thanks for saving my butt by making me read the release notes. I love to learn, and I will read (and re-read) all your future articles as well.

    Expert Comment

    by: Ryan_R on 2010-04-21 at 22:18:29ID: 13679

    Voted yes, thanks for the article.

    I've also heard and experienced bad things with Maxtor drives in years past. I tend to stick with WD at the moment, but also have a Samsung drive that seems ok.

    I have 4TB spread over 3 HDDs in my main PC at home, I only have the necessities within my profile backed up on a USB drive and again on my laptop, and that's about it.

    Expert Comment

    by: kamutzo on 2010-05-01 at 10:30:05ID: 14014

    nice article,
    i'm wishing for an in depth article on SSD's controller architecture and general build,
    people don't know what they buy mostly and there are so many things to understand and consider,
    especially secrets and facts on the inside, things that are not always being told by companies or their representatives.
    we were discussing a ROC capable SSD with dual PCB's or an extended single one for double bandwidth capacities,
    take two 32 or 64GB drives and combine them for a single running dual speed on a SATA3 interface,
    it goes off-topic from the article, yet if we deal with storage and you are dealing with RAID controllers, maybe we could merge this :).

    Expert Comment

    by: lucius_the on 2010-10-31 at 07:29:52ID: 20963

    A very informative article, helped a lot in my decisions on buying server-grade enterprise drives in the future.

    Expert Comment

    by: BigSchmuh on 2010-11-18 at 05:00:08ID: 21466

    Very good article, I would add the reference to Google "Failure Trends in a Large Disk Drive Population" study where the Google lab showed that:

    • models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.

    • temperature and activity levels were much less correlated with drive failures than previously reported

    Author Comment

    by: dlethe on 2010-11-18 at 05:11:10ID: 21467

    Thx - actually if you analyze the google study further, you will see that disks run much better at higher temperatures.   So those drive temperature coolers that some people swear by actually lower disk life :)

    Expert Comment

    by: BigSchmuh on 2010-11-18 at 05:31:45ID: 21468

    @dlethe : I just re-read the google study...but I am uncomfortable with your assertion stating that "disks run much better at higher temperature"
    Looking at p.6 figure 5 "AFR for average drive temperature", one can see that the all times lowest AFR (Annual Failure Rates) are in the 30-40°C range.
    ==> In most datacenters, the drives may be suffering from a lower temperature range...but home users (those who buy coolers) should raise their HDD life with a passive cooler (or simple fan) that allows the drive to come back down from its traditional 55°C context

    Expert Comment

    by: brettdj on 2011-01-10 at 02:29:02ID: 22753

    Thanks for this article, I've filed it away for future reference having finally got around to RAID and now thinking about a NAS

    Regards

    Dave

    Expert Comment

    by: WiReDNeT on 2011-02-25 at 15:06:45ID: 24114

    Good article, but I still remember the Hitachi DeathStar( DeskStar) laptop drives and avoid them like the plague--maybe these drives had a higher than normal failure rate, but whenever we got called to help a user with a bad laptop hard drive a few years ago it was almost always this model hard drive...hmm, or maybe it was because this was the drive that was shipped most at the time with these laptops?..nah, it was the evil Hitachi DeathStars!

    Expert Comment

    by: Netpro7 on 2011-04-11 at 06:30:29ID: 25674

    Very Nice article and very informative, thank you, a BIG yes from me.

    Author Comment

    by: dlethe on 2011-05-03 at 12:01:45ID: 26183

    Here is the google temperature vs failure graph.  Google uses standard desktop drives also.    

    It supports my statements, at least until the HDD gets to 45 degrees C, then it starts trending up slowly...
    [Attachment: AvgTempVsFailure.GIF - From Google reliability study - data copyright google.com]

    Here is the location of the entire paper that contains the chart above : http://labs.google.com/papers/disk_failures.pdf
