
Disk drive reliability overview

Awarded
Editor's Choice
Community Pick
Don't buy such-and-such disk drives, they are not reliable! Boy, how many times have we read a post from a well-meaning person who had some bad experiences with a drive and now exclaims that the entire company's products can't be trusted. First, consider annual shipment volumes. The major manufacturers each shipped between 50 and 100 million drives in 2009. [Update: Seagate alone shipped over 500,000,000 disk drives from April 2008 to Oct 2010] Ask yourself whether the blogger could possibly have first-hand experience with more than a tiny fraction of a percent of any particular manufacturer's output (ten drives out of 50 million is 0.00002 percent). How could such statements be statistically relevant?

All these well-intentioned people are correct in one respect.  All disks are unreliable.  Anything that has 100% probability of eventual failure is, by definition, unreliable, is it not?  

What should you do? First, you can count on drive failure, so invest in RAID1/5/6 technology. Perform backups, and take backups offsite. Remember that RAID is not an excuse for not performing backups. All it takes is one stupid typo and a RAID1 array can delete both copies of your precious data in a fraction of a second. Backups protect against human and computer errors. Archiving (taking backup copies to your parents' house or to a safety deposit box) protects against natural disasters such as fires, floods, and tornadoes (or earthquakes if you live in California instead of Texas).

Next, one should look at drive specifications. Don't base a buying decision just on speeds and feeds. Ever notice "enterprise class" or "consumer class" drives when reading the specs? The bottom line is that enterprise drives are designed for 24x7x365 operation, have 100X more error-correcting circuitry, and are designed with much faster error recovery/remapping circuitry so they play nice behind RAID controllers. The disks also cost 2-3X more money.

If you read the fine print on consumer drives, you will usually discover that they are rated for a whole 2400 hours of annual use. Do the math. If you run a computer 24x7 then you exceed specifications unless you turn the computer off in April. The reduced number of ECC bits means that if you have a 2TB disk and you copy the entire disk back and forth a few times, then you should expect to lose data. An unrecoverable error rate of 1 bit in 10^14 sounds really small until you consider that 10^14 is only about 6 times the number of bits you have on the disk to begin with.
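The arithmetic behind those two figures can be checked with a short Python sketch. It is only an illustration: it assumes a decimal 2TB disk, a drive powered on from January 1, and independent bit errors at exactly the rated rate, which real drives will not follow precisely. The function names are made up for this example.

```python
import math
from datetime import date, timedelta

RATED_HOURS_PER_YEAR = 2400   # figure quoted in the article
DISK_BYTES = 2 * 10**12       # a "2TB" disk, decimal terabytes (assumption)
BITS_PER_ERROR = 10**14       # one unrecoverable error per 1e14 bits read


def rated_hours_exhausted(start: date) -> date:
    """Date on which a drive running 24x7 from `start` uses up its rated hours."""
    return start + timedelta(hours=RATED_HOURS_PER_YEAR)


def disk_bits(disk_bytes: int = DISK_BYTES) -> int:
    """Number of bits stored on the disk."""
    return disk_bytes * 8


def full_reads_per_expected_error() -> float:
    """How many complete passes over the disk add up to 1e14 bits."""
    return BITS_PER_ERROR / disk_bits()


def p_at_least_one_error(full_reads: float) -> float:
    """Chance of one or more unreadable bits, assuming independent errors."""
    bits_read = full_reads * disk_bits()
    # 1 - (1 - 1/BITS_PER_ERROR) ** bits_read, computed in a numerically stable way
    return -math.expm1(bits_read * math.log1p(-1.0 / BITS_PER_ERROR))


if __name__ == "__main__":
    print("rated hours used up on:", rated_hours_exhausted(date(2010, 1, 1)))
    print("full reads per expected error:", full_reads_per_expected_error())
    for passes in (1, 3, 6):
        print(passes, "full reads ->", round(p_at_least_one_error(passes), 3))
```

Under these assumptions the chance of hitting at least one unreadable sector climbs quickly with every full pass over the disk, which is the point of the paragraph above.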

Not a day goes by where I don't hear horror stories from people who have overpriced pimped up PCs where people pay money for pretty lights, unnecessary cores, or extra RAM that won't increase overall performance on anything other than benchmarks.  Ask those same people if they have an external USB HDD for backup and you get funny looks.  My advice to all is to consider whether or not your data is worth the $100 insurance policy it costs to get an external drive for archiving purposes.  Or get an online backup product.  Some are even free. Just back up and put the data offsite.

As for error recovery/reallocation timing .. without getting too deep into it, the reason why all the NAS, SAN, appliance vendors, and server manufacturers ship enterprise class drives with their servers that have RAID controllers is not JUST to make more money off of you. Sure they like the money, but more importantly, they hate returns. Manufacturers spend millions of dollars qualifying hardware. Part of the process involves choosing disks that meet requirements for error recovery timings.  

The consumer class disks take too long to recover when they pick up bad blocks.  While enterprise class drives remap bad sectors (use a spare reserved area of disk to replace one that failed) in a few seconds, it is not unusual to have a consumer class disk deal with bad sectors by not responding for 30 seconds or longer.  Some RAID controllers, when presented with a disk drive that is taking 30 seconds to respond to a write request, think that the disk failed, so they kill it.

The result is that if you use consumer class drives behind many RAID controllers, the controller will "fail" a perfectly good disk drive just because it took too long to recover. If you have multi-terabyte configurations, then you risk 100% data loss should a bad block be discovered while the array is in degraded mode as it rebuilds. This is why you should invest in enterprise disks if using RAID1/10/5. If your hardware supports RAID levels with 2 redundant disks (RAID6, RAIDZ2), then statistically, you can get away with using consumer class drives ... just as long as you perform frequent consistency checks.
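A toy Python sketch of that timeout behavior follows. The 8-second timeout, the recovery times, and the names `Drive` and `controller_verdict` are all assumptions for illustration; real controllers use their own values and far more involved logic.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    recovery_seconds: float  # time the drive goes silent while remapping (illustrative)


def controller_verdict(drive: Drive, timeout_seconds: float) -> str:
    """Hypothetical policy: a drive that stays silent past the timeout is dropped."""
    if drive.recovery_seconds > timeout_seconds:
        return "failed"  # the disk is healthy, but it gets kicked out of the array
    return "online"


if __name__ == "__main__":
    TIMEOUT = 8.0  # assumed value, not taken from any real controller
    for drive in (Drive("enterprise", 3.0), Drive("consumer", 30.0)):
        print(drive.name, "->", controller_verdict(drive, TIMEOUT))
```

The consumer drive in this sketch is not broken. It is simply slower to answer than the controller is willing to wait.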

Also take time to read an HCL (hardware compatibility list). If the vendor took the time to say XYZ disk with firmware revision whatever is certified, then my gosh, give them the benefit of the doubt that they know more about what disk will work than you do! Have you seen source code for the controller firmware, or have you (or your buddy who said it was OK to use consumer drives) seen the SEV-1/2 bug list for firmware rev xxx? If not, then you have no idea whether or not certain disk errors can have a cascade effect and result in data loss. Remember, pretty much everything works just fine UNTIL something unexpected happens.

While you are at it, keep up on RAID controller firmware, drivers, and management software. Always upgrade them as they are released. Read release notes. I once saved somebody on EE who was upgrading firmware by telling him to read the release notes first. He wrote back and thanked me profusely, as the release notes clearly stated that the firmware update should be preceded by a full backup because it would reformat the logical disks due to metadata changes. Had the user just upgraded the firmware, he would have lost everything.

Also, NEVER update firmware while a RAID system is under stress (rebuilding, in degraded mode, or throwing errors). Never upgrade unless the system is on a UPS. You may brick it in the event of power loss. Murphy's laws come into play here.

Sometimes a disk fails HCL because of firmware bugs too.  Unfortunately I am on strict NDA so can not comment on specific disks and firmware bugs that are only known to large OEM suppliers.  Suffice to say that there are a heck of a lot of SEVERITY-1 bugs out there.  Update firmware when it is made available (but if using RAID controllers, check with the RAID manufacturer first ... do what they tell you to do).

If a data consistency check/rebuild is an unfamiliar term, and you are using a RAID controller, then you should first stop reading, go to your RAID controller, and kick one off.  Now. Then come back.

Consistency checks are vital, and should be run every weekend. They ensure that if you lose a disk then you won't also lose data. You see, all disk drives get bad blocks from time to time. That is why they have thousands or tens of thousands of spare blocks built into them. When you tell the RAID controller to do the consistency check, it reads all blocks from all disks, and in the event it discovers an unreadable block of data, it calculates what is supposed to go there from the redundant data (the XOR parity, or the mirror copy on RAID1) and "fixes" it. When the process completes, every block in the array is readable again. The problem is that if you do not do this, and you have a bad block on disk "A", and disk "B" dies (assuming RAID1/10/5), then there is no way for the controller to reconstruct the bad block on disk "A", so you end up with partial data loss at best.

At worst, when your controller kicks off the rebuild, it will halt and you end up with 100% data loss because it can't handle this situation. (That is why you should purchase premium RAID controllers that have the intelligence to deal with such things.)
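The XOR repair described above can be sketched in a few lines of Python. This is a minimal illustration of single-parity reconstruction on one stripe, not how any particular controller implements it. The helper names `xor_blocks` and `reconstruct` and the 4-byte blocks are made up for the example.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe: List[Optional[bytes]]) -> List[Optional[bytes]]:
    """Repair a stripe (data blocks plus one parity block); None marks an unreadable block."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one block unreadable: single parity cannot recover")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks (data and parity) yields the missing one
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]
    stripe = data + [xor_blocks(data)]  # three data disks plus parity

    # A consistency check finds one unreadable block: it can be rebuilt.
    print(reconstruct([stripe[0], None, stripe[2], stripe[3]])[1])

    # Bad block on disk "A" plus a dead disk "B": nothing left to rebuild from.
    try:
        reconstruct([None, None, stripe[2], stripe[3]])
    except ValueError as err:
        print("unrecoverable:", err)
```

The second case is the scenario a weekly consistency check is meant to prevent: the latent bad block is found and repaired while all the other disks are still there to rebuild it from.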

RAID controllers on motherboards should be avoided except for RAID1. They simply do not have the intelligence to deal with many common error recovery scenarios. As a former RAID firmware architect, I can tell you that 75% of the firmware in a controller deals with error situations. Spend $500+ for a controller and you get one that can pretty much deal with anything and recover gracefully. Spend $10 for a controller (like what you end up with on a motherboard), and error recovery isn't much more than the pseudocode, "IF IO_SUCCESS_STATUS !=TRUE THEN KILL DISK".
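To make the contrast concrete, here is a small Python sketch that sets the one-line policy quoted above next to a more forgiving one. The second policy is purely hypothetical: the retry count and the fallback action are assumptions for illustration, not a description of any real firmware.

```python
from typing import Callable


def naive_policy(io_ok: bool) -> str:
    """The one-line policy from the pseudocode: any failed I/O kills the disk."""
    return "online" if io_ok else "kill disk"


def tolerant_policy(read_block: Callable[[], bool], max_retries: int = 3) -> str:
    """Hypothetical alternative: retry first, then repair the block instead of dropping the disk."""
    for _ in range(max_retries):
        if read_block():
            return "online"
    return "rebuild block from redundancy, keep disk, log error"


if __name__ == "__main__":
    # A read that fails once and then succeeds (a transient error).
    attempts = iter([False, True])
    print("naive:   ", naive_policy(False))
    print("tolerant:", tolerant_policy(lambda: next(attempts)))
```

A transient error costs the naive policy a whole disk, while the tolerant one rides it out on the first retry.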

Finally, a note about "data recovery software". This is the software you run that "recovers" bad blocks and fixes drives. I will write something more in depth on it in the future, but the quick sound bite is that all it does is read and re-read and re-read and re-read all the blocks on the disk in an attempt to "recover" the data. The software cannot help you if the drive won't appear in the BIOS. It won't help if the disk is making funny noises. In fact, it will make things much worse. If the disk had a media or head crash then every moment the disk is powered up could result in further data loss. The bottom line on drive failures is that software recovery is good for potentially recovering from an "error reading block XYZ" problem. For any other issue you need the guys in the bunny suits and a check for $500-$1000.





Author:David
28 Comments
LVL 40

Expert Comment

by:evilrix
A must read for anyone who cares about their data. I voted yes!
0
LVL 74

Expert Comment

by:Glen Knight
An excellent well written article.
You got my vote :)

>>Maxtor HDDs because he always found them "unreliable", and user should go with hitachi
Really? I would go with Maxtor over Hitachi any day :))

Just kidding! :)
0
LVL 76

Expert Comment

by:Alan Hardisty
Hmmm - I had issues with Maxtor - 3 drives dying within 2 weeks of each other - but I was buying the cheapo versions.  Ironically, I like the Seagate ones - made by the same people who produce the Maxtor drives.
0

LVL 47

Author Comment

by:David
Alan - I spent years working for a RAID vendor that purchased 500,000+ Maxtor drives PER MONTH for a while. Nothing wrong with the company as a whole, but they also purchased enterprise class. As for drives dying within 2 weeks of each other, this is to be expected. As I have stated before ... they are probably from the same engineering run, and have seen the same I/O, same temperature, and same duty cycle. When one disk in a RAID dies, the others may not be far behind because of that.
0
LVL 76

Expert Comment

by:Alan Hardisty
Indeed - I read your article with a guilty conscience - I too had been guilty of not favouring a particular manufacturer's drives (no clues as to who) but now have seen the light.

Thanks for correcting my mistake and for writing a great article.

Alan
0
LVL 23

Expert Comment

by:redrumkev
dlethe,

This is an outstanding article. You really put into perspective that what users see as a large company that doesn't care and puts out bad products, is really just a bad "luck of the draw".

On a side note, I think that I have "known" a lot of this stuff, but never really put it together. Thank you for taking the time to do just that!

Kevin
0
LVL 47

Author Comment

by:David
Glad to help. This is really a subset of things I learned over the years. I am tempted to turn it into a bullet list that doesn't go into a whole lot of detail, to keep it clean, with each item linking to a shorter article, since many of the items are worthy of a paper of their own.

Other topics
 *S.M.A.R.T. - as I told the moderator, the vast majority of comments on S.M.A.R.T. are incorrect or misleading.
 *More on RAID / from perspective of architect & what will likely work, not work, and no way has been tested
 *All those data recovery programs, what they really do , how they work, and when you are wasting your time

Any good suggestions on a title or other topics? "David's Disk & RAID Truths" isn't very catchy
0
LVL 47

Author Comment

by:David
P.S. Got LOTS of details on why consumer-class disks really fail, how temperature comes into play; burn-in importance; how busy and idle disks affect longevity; real-world MTBF
This is going to be either 2nd or 3rd article I do next
0
LVL 76

Expert Comment

by:Alan Hardisty
Belated Vote for Helpful from me - looking forward to the next ones already.

Alan
0
LVL 23

Expert Comment

by:redrumkev
David,

How about:

"RAID - A magic pill or proper planning?"

"The evolution of a Disk Junkie" (your life story, or progression of knowledge)

As for a topic, what would be the best solution for the average developer, home user? Example, two notebooks and 1 or 2 desktops in a home. The notebooks are wireless and the desktops are wired... what solution would exist that "does it all" but is "easy to use"... in terms of the hardware. Low cost (pros and cons), mid cost (pros and cons) and high cost (pros and cons), maybe a matrix graphic to illustrate, something like this:
http://www.ad-lister.co.uk/Shared/UserImages/35bc0e55-b13c-4b45-a7e7-bab423d29fa9/Img/microsoft/office2007_compare3.gif

Where the top row is one of the following:
In desktop raid 1
In desktop raid 5
External (dumb) drive
External (dumb) drive (with built in raid 1)
Managed storage (NAS/SAN) etc.

The side column could be:
Cost
Up-time
Rebuild time
Maintenance
Technical Knowledge Needed, etc.

Then just put a number in each box, or a check if those features are available. Or even something like:
NR (not recommended)
V (good)
VV (better)
VVV (best)

See attached.

Kevin

Attachment: Home-Systems-Matrix.doc
0
LVL 47

Author Comment

by:David
Suggestions on topics are good, but only if I break up Disk & RAID, but that probably wouldn't be a bad Idea.  List maybe 25, "things you probably didn't know, but should know about RAID and DISKs"

As for the "best solution".  Not going down there.  Too many variables, and the minute i post it, the doc would be obsolete ... just like everything else in this industry.  To do a decent matrix, I probably need 6 or 7 dimensions, not 2 :)
0
LVL 23

Expert Comment

by:redrumkev
On that note, i think an article about "things you probably didn't know, but should know about RAID and DISKs", would be outstanding. From there, people can make their own decisions, but you are simply providing facts!

I agree that an obsolete "king of all posts" would not be a good way to go!

I look forward to this... love to learn!

Kevin
0
LVL 38

Expert Comment

by:younghv
Sign me up as another guilty one about "that" manufacturer...or "that other" one.
Excellent piece and looking forward to more.

Yes vote above.
0
LVL 3

Expert Comment

by:rizla7
you dont especially need guys in bunny suits. just a harddrive with an approximately similar controller board. rip the plates out, blow as much dust out as you can and copy the data before the new one fails ;p
0
LVL 47

Author Comment

by:David
Well it depends on the disk, with the 1-2TB SATA drives, the tolerance is so tight that it just isn't as easy as it used to be even a few years ago.  
0
LVL 6

Expert Comment

by:Jerkson
dlethe,

Great article, and thanks for saving my butt by making me read the release notes. I love to learn, and I will read (and re-read) all your future articles as well.
0
LVL 15

Expert Comment

by:Ryan_R
Voted yes, thanks for the article.

I've also heard and experienced bad things with Maxtor drives in years past. I tend to stick with WD at the moment, but also have a Samsung drive that seems ok.

I have 4TB spread over 3 HDDs in my main PC at home, I only have the necessities within my profile backed up on a USB drive and again on my laptop, and that's about it.
0

Expert Comment

by:kamutzo
nice article,
i'm wishing for an in depth article on SSD's controller architecture and general build,
people don't know what they buy mostly and there are so many things to understand and consider,
especially secrets and facts on the inside, things that are not always being told by companies or their representatives.
we were discussing a ROC capable SSD with dual PCB's or an extended single one for double bandwidth capacities,
take two 32 or 64GB drives and combine them for a single running dual speed on a SATA3 interface,
it goes off-topic from the article, yet if we deal with storage and you are dealing with RAID controllers, maybe we could merge this :).
0
LVL 10

Expert Comment

by:lucius_the
A very informative article, helped a lot in my decisions on buying server-grade enterprise drives in the future.
0
LVL 18

Expert Comment

by:BigSchmuh
Very good article, I would add the reference to Google "Failure Trends in a Large Disk Drive Population" study where the Google lab showed that:

 *models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.
 *temperature and activity levels were much less correlated with drive failures than previously reported
0
LVL 47

Author Comment

by:David
Thx - actually if you analyze the google study further, you will see that disks run much better at higher temperatures.   So those drive temperature coolers that some people swear by actually lower disk life :)
0
LVL 18

Expert Comment

by:BigSchmuh
@dlethe : I just re-read the Google study...but I am uncomfortable with your assertion that "disks run much better at higher temperature"
Looking at p.6 figure 5 "AFR for average drive temperature", one can see that the all-time lowest AFR (Annual Failure Rates) are in the 30-40°C range.
==> In most datacenters, the drives may be suffering from a lower temperature range...but home users (those who buy coolers) should extend their HDD life with a passive cooler (or simple fan) that allows the drive to come back down from its traditional 55°C context
0
LVL 50

Expert Comment

by:Dave
Thanks for this article, I've filed it away for future reference having finally got around to RAID and now thinking about a NAS

Regards

Dave
0
LVL 19

Expert Comment

by:Rob Hutchinson
Good article, but I still remember the Hitachi DeathStar( DeskStar) laptop drives and avoid them like the plague--maybe these drives had a higher than normal failure rate, but whenever we got called to help a user with a bad laptop hard drive a few years ago it was almost always this model hard drive...hmm, or maybe it was because this was the drive that was shipped most at the time with these laptops?..nah, it was the evil Hitachi DeathStars!
0
LVL 2

Expert Comment

by:Netpro7
Very Nice article and very informative, thank you, a BIG yes from me.
0
LVL 47

Author Comment

by:David
Here is the google temperature vs failure graph.  Google uses standard desktop drives also.    

It supports my statements, at least until the HDD gets to 45 degrees C, then it starts trending up slowly...
[Chart: from Google reliability study - data copyright google.com]

Here is the location of the entire paper that contains the chart above : http://labs.google.com/papers/disk_failures.pdf

0

Expert Comment

by:Paul Constantin
Actually the issue with the IBM deskstars is that the 65GXP series (before they became Hitachi) had some serious QC issues.  I remember this because I used IBM drives almost exclusively especially on client desktops (except some Quantum, Maxtor and Seagate drives on other higher end applications.)

After IBM sold the drive operations to Hitachi, far as I could tell, a huge issue came up with drives being sold to Taiwanese clients having some sort of spyware embedded in their firmware or some other such issue, I remembered reading some myriad articles, but it has been a LONG time. (I'd like to say 2006, but it really is just fuzzy memories at this point.)

Sure, we all miss the days of Quantum and Maxtor being in business on their own, but hey, that's the price of doing business these days.  People age, people decide to quit the biz, people on the board want to retire, and a sell off seems more profitable than another heart attack.  All sorts of things come up.  We live in a world populated by human beings, each with individual needs and wants, and there comes dlethe's comment that each of us is by necessity defined by our experiences.  Nevermind that many of our experiences are limited to a certain set of knowns, and framed by a much larger set of beliefs... beliefs by definition being that which we are told, or we assume but do not know, for sure.  (For example, I did not live in Taiwan and take part of the traffic shaping/verification efforts which resulted in the articles I read back then.  I do not know for sure they are even true.)

Overall I find that most people whining about a certain consumer grade drive being bad don't realize that drives are like ammunition.  A manufacturer makes millions of units of product, be it drives or bullets (as dlethe mentions above), and just like ammunition (a far simpler product which nonetheless, in certain failure modes, can kill or severely injure the user) these come in batches.  Occasionally a bad batch comes out.  Some of you may remember the old desktop greencore (aluminum trace) Athlons which turned into volcanic piles of ash with astonishing frequency, or the early purple cores (copper trace), of which certain batches were overclockable to the limits of one's cooling capability (we pushed one in the lab to more than twice its clock speed running a refrigerated cooler).  Some of you may remember explicitly ordering those specific processors for your own home performance desktop.

This same issue arises over and over with various devices, firmware revisions, batch numbers, lot numbers, you name it.  Many times the whining on places such as Newegg or Tiger Direct starts because a certain bad batch showed up and had serious issues; sometimes it's people with basic or nonexistent knowledge rating themselves as a 5/5 tech expert.  Sometimes a certain troublesome employee wasn't packing them right, or was mishandling product. You name it, it's been done.  Sometimes the users didn't read that a particular product line (Western Digital Red NAS drives, or WD Caviar/Scorpio Black drives) was not intended to be installed into surveillance DVRs despite being rated for 24x7x365 uptime.  (Having a drive hang while it deals with a bad bit instead of simply dropping a frame and moving on during a break-in can result in the surveillance system failing in its primary mission, surveilling a particular event.)  Meanwhile, having a surveillance drive in your desktop because it was cheaper can likewise cause all sorts of nastiness when you put important data on said drive.

I can't begin to say how many articles I read just on Newegg where people rate themselves tech experts and yet fail simple scope consistency checks.  As in, plugged the wrong drive into the wrong system for the wrong reasons (cheap, didn't read drive description, didn't know what it actually did, etc.)  I have had one system on my bench recently into which someone plugged an AV drive.  Then again, I live and work in an area where people make buying decisions solely based on price.  I've got two CAD guys on contract ever since we swapped out their desktop drives in their workstations for actual enterprise class drives.  Happy customers now.
0
LVL 47

Author Comment

by:David
NorthPlateau - Your use of jargon indicates you're an old-timer like me (old as in experience in the industry) with insider knowledge.  Thanks for the war stories.

Remember the stiction problem, or when SATA disks first came out and they were 'only' going to be used for near line storage?  Nobody imagined people would be silly enough to use them as primary storage.
0
