NEILCHAMPION

EVA8100 constantly having failing drives

Hi,

One of our EVA8100 arrays is constantly suffering failed HDDs.

This issue has been ongoing for the last couple of months but over the last 5 days we have had 4 failures.

Below is a brief description of the unit.

The number and size of the drives installed:
      •         240 drives
      •         Manufacturer (HPQ), Model number (BF450DA483 or equivalent), Firmware (HP01-HP07), 450GB FC disks
The firmware level on each drive:
      •         HP01-HP07 (mostly HP06)
The number and model of the RAID card(s):
      •         Controller A: Model (HSV210), Version (CR1FCAxcsp-6250), Control Cache (2048MB), Read Cache (1024MB), Write Cache (1024MB), Mirror Cache (512MB), Total (4096MB)
      •         Controller B: Model (HSV210), Version (CR1FCAxcsp-6250), Control Cache (2048MB), Read Cache (1024MB), Write Cache (1024MB), Mirror Cache (512MB), Total (4096MB)
The firmware of each RAID card:
      •         The management GUI does not appear to have an option to extract this
How much storage space (in size and percentage):
      •         Total: 100360 GB
      •         Allocated: 91877 GB
      •         Allocation level: 92%
      •         Available: 8483 GB

It would be great if someone could please help us in resolving this issue as it is becoming very costly.
David

I can't speak for HP, but I've been involved in similar situations with statistically unlikely levels of HDD failures. Request HP do a failure analysis on the HDDs.  This is something they may not do for you, but I've had to order a few of them with Seagate in similar situations back when I was a field engineer for one of the RAID manufacturers.

Root cause in one case was high levels of EMR due to some heavy equipment installed nearby; another took an electron microscope to diagnose the problem (long story); another was a batch of HDDs that had stiction.   If you are not under HP support, then you can forget them doing it as it is QUITE expensive ... almost cost prohibitive, so it depends on how important a customer you are to HP.  (You're talking tens of thousands of dollars or even six figures in some cases, just to give you some perspective).

Do you have the failed disks?  Even if you can't get a failure analysis done, you could try sending them to somebody who knows what to look for and has some spare time, but this isn't the type of thing that can be diagnosed without special software, lots of experience, and the right hardware.  Even then, if nothing looks obvious, it could still require extraordinary efforts like an electron microscope.

Bottom line, it is still statistically possible for this to happen if the disks all went into service at the same time, the I/O is highly balanced, and they are all in the same RAID groups ... but there is no way to know without some forensic analysis.
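To put a rough number on "statistically unlikely", here is a minimal Python sketch. It assumes drive failures are independent and follow a Poisson distribution, and it assumes an annual failure rate of 3% per drive. That rate is purely an illustrative figure, not a measured value for these disks; only the 240 drives, the 5 days, and the 4 failures come from the question.

```python
import math


def prob_at_least(k, expected):
    """P(X >= k) for a Poisson-distributed count with the given mean."""
    below = sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


DRIVES = 240          # from the question
WINDOW_DAYS = 5       # from the question
FAILURES_SEEN = 4     # from the question
ASSUMED_AFR = 0.03    # ASSUMPTION: 3% annual failure rate per drive

# Expected number of failures in the window if drives fail independently.
expected_failures = DRIVES * ASSUMED_AFR * WINDOW_DAYS / 365

p = prob_at_least(FAILURES_SEEN, expected_failures)
print(f"expected failures in {WINDOW_DAYS} days: {expected_failures:.3f}")
print(f"P(at least {FAILURES_SEEN} failures): {p:.2e}")
```

Under those assumptions the probability comes out as a tiny fraction of one percent, which is why a common cause (same batch, same environment, same workload) is the more plausible explanation than bad luck.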

My suggestion: call HP if you are on support.  If not, buy more disks and migrate to RAID6, and find somebody who will take a look at the dead drives to see if they all died the same way. That will tell you what the next steps are to get the situation under control.
NEILCHAMPION

ASKER

Thanks for your reply but we aren't covered by HP Support.

Could the HDD firmware or the amount of available free space cause this?
ASKER CERTIFIED SOLUTION
David
But couldn't the fact that there is only about 8 TB (8%) free mean that it is trying to write data to HDDs that are almost full, and therefore increasing the error count on the HDDs and causing a predictive failure?
Absolutely not.  There is NO concept of "fullness" inside of the HDD.  A 1000 GB disk drive has 1000 x 1000 x 1000 x 1000 x 8 (addressable) bits. It doesn't matter what you write on the HDD, the number of bits will not change, only whether any arbitrary bit is 1 or 0.

Now if you are out of reserved (spare) sectors, then it is a whole 'nuther thing.

Well, to be technically correct, there are more bits than that: I am not counting the reserved sectors or the ECC bits associated with each block.  But hopefully you get the concept.  The disk has no idea what the data looks like, whether it is a partition, or used by Windows, or inside of a DVR or even a UNIX system.  It just sees 1s and 0s.
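The point can be sketched in a few lines of Python. This is only a toy model: the 512-byte sector size and the `bytearray` standing in for a drive are assumptions for illustration, and `addressable_bits` and `write_sector` are hypothetical helper names, not part of any real tool.

```python
SECTOR_BYTES = 512  # ASSUMPTION: sector size, for illustration only


def addressable_bits(capacity_gb):
    """Data bits on a drive of the given decimal-GB capacity."""
    return capacity_gb * 1000 * 1000 * 1000 * 8


def write_sector(disk, lba, data):
    """Overwrite one sector in place; the disk never grows or shrinks."""
    if len(data) != SECTOR_BYTES:
        raise ValueError("data must be exactly one sector")
    start = lba * SECTOR_BYTES
    if start < 0 or start + SECTOR_BYTES > len(disk):
        raise ValueError("LBA out of range")
    disk[start:start + SECTOR_BYTES] = data


toy_disk = bytearray(8 * SECTOR_BYTES)  # tiny stand-in for a real drive
bits_before = len(toy_disk) * 8
write_sector(toy_disk, 3, b"\xff" * SECTOR_BYTES)  # "fill" one sector
bits_after = len(toy_disk) * 8

print(addressable_bits(1000))      # bits on a 1000 GB drive
print(bits_before == bits_after)   # writing changed values, not the bit count
```

Writing to the toy disk flips bit values at an offset but leaves the number of bits unchanged, which is all "fullness" amounts to from the drive's point of view.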
If I were out of reserved (spare) sectors, could this cause the predictive failures?

We seem to be getting both failures and predictive failures.

What could be done to resolve this?
Absolutely, yes. There are thresholds, and one can programmatically read the exact number of remaining spare sectors on each drive, along with the error locations and cumulative error counts, so it is highly probable that the root cause can be determined without going to extraordinary means.

But, you need to have somebody do it for you. The software you need is expensive and even then you need to have a person who knows what they are doing.  It also has to be done out-of-band, which means hooking up disk(s) to a test bench.
Could increasing the amount of free space resolve the issue?
No.  You still don't get it.  My apologies.  An HDD has no concept of free space.  It just does what it is told, meaning it reads and writes a chunk of data at a specific offset as efficiently as possible, and repeats until power is turned off.
So if this is the case, how does an HDD determine that it is about to fail (predictive failure)?
The algorithms are vendor/product specific, but the metrics include
 * # of remaining spare sectors
 * head fly distance
 * changes in RPMs
 * spike in rate of recoverable and unrecoverable read/write errors
 * ECC error bit rate
 * rise in operating temperature
 * time to spinup

Just a few off the top of my head.  At this point you should talk to a storage pro. It is just not possible to diagnose and tell you WHY this is happening because the diagnostic data requires hardware/software you do not have.   By the way, many EE experts have a way to contact them listed in their profile, and do paid consulting.
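As a rough illustration of how threshold-based prediction works in principle, here is a minimal Python sketch. Everything in it is hypothetical: the `DriveHealth` fields, the limit values, and the sample readings are invented for the example, since the real algorithms and thresholds are vendor-specific and not public.

```python
from dataclasses import dataclass


@dataclass
class DriveHealth:
    """Hypothetical snapshot of a few of the metrics listed above."""
    spare_sectors_left: int
    temperature_c: float
    spinup_seconds: float
    recovered_errors_per_day: float


# ASSUMPTION: illustrative limits only, not real vendor thresholds.
MIN_SPARE_SECTORS = 100
MAX_TEMPERATURE_C = 55.0
MAX_SPINUP_SECONDS = 15.0
MAX_RECOVERED_ERRORS_PER_DAY = 50.0


def predictive_failure_reasons(h):
    """Return the list of tripped thresholds (empty means healthy)."""
    reasons = []
    if h.spare_sectors_left < MIN_SPARE_SECTORS:
        reasons.append("spare sectors nearly exhausted")
    if h.temperature_c > MAX_TEMPERATURE_C:
        reasons.append("operating temperature too high")
    if h.spinup_seconds > MAX_SPINUP_SECONDS:
        reasons.append("spin-up time too long")
    if h.recovered_errors_per_day > MAX_RECOVERED_ERRORS_PER_DAY:
        reasons.append("recoverable error rate spiking")
    return reasons


healthy = DriveHealth(2000, 38.0, 8.0, 1.0)   # made-up readings
worn = DriveHealth(12, 41.0, 9.0, 300.0)      # made-up readings

print(predictive_failure_reasons(healthy))
print(predictive_failure_reasons(worn))
```

Note that none of the inputs involve how much of the drive's capacity is in use, which matches the earlier point that free space plays no part in a predictive failure.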