
FAS2040 Disk Failure

We have dual controllers on a FAS2040. Each controller has 1 aggregate: SAS disks on one, SATA on the other. Both are configured with RAID-DP, and we have 1 spare disk assigned to each controller. We had a failed disk, and now System Manager shows the notification "There are insufficient spare disks". With RAID-DP and this spare used up, what happens if we lose another disk while waiting for the replacement to arrive? Should EACH controller have 2 spares?

ASKER CERTIFIED SOLUTION

Paul Solovyovsky


SWRegistration

ASKER

Just to verify, we would not have data loss OR a system shutdown right away with a second disk failure, right? It seems I've read that you have approximately 24 hours before the system shuts down if no action is taken. Does that seem correct?

SWRegistration

ASKER

Ok, I think I've found the answer to the last one. I couldn't find anything to let us know if we were ok with just having the one spare per set. I'll share what I just found. Thank you for your assistance.

A few things to consider. Remember that there can be multiple aggregates on the system, each of which can consist of multiple RAID groups. The command aggr status -r will show the RAID groups.
You can have two failed drives in every aggregate and still not lose data, because each aggregate is a separate entity. In addition, you can lose two drives in each RAID group of a _single_ aggregate without losing data, because each RAID group has its own RAID-DP protection. So, if you have an aggregate with 4 RAID-DP RAID groups, you could lose 8 drives without losing data, as long as exactly two come from each RAID group. For the record, I've seen this: an entire shelf powered off, but only two drives from that shelf were in any single RAID group, so no data loss.
If you lost a third drive in a RAID-DP RAID group, that RAID group would fail, the aggregate would go offline, and you'd lose data.
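
To make the per-RAID-group rule above concrete, here is a minimal Python sketch. It is only an illustration of the arithmetic, not anything Data ONTAP runs: the function name, the dictionary layout, and the RAID group labels (rg0 to rg3) are all made up for the example.

```python
from typing import Dict

# RAID-DP tolerates up to two failed disks in each RAID group.
RAID_DP_MAX_FAILURES = 2


def aggregate_survives(failed_per_raid_group: Dict[str, int],
                       max_failures: int = RAID_DP_MAX_FAILURES) -> bool:
    """Return True if no RAID group has more failed disks than it tolerates.

    failed_per_raid_group maps a (hypothetical) RAID group name to the
    number of failed disks currently in that group.
    """
    return all(n <= max_failures for n in failed_per_raid_group.values())


# The shelf power-off case: 4 RAID groups, 2 failed disks in each (8 total).
shelf_outage = {"rg0": 2, "rg1": 2, "rg2": 2, "rg3": 2}

# A third failure inside a single RAID group.
third_failure = {"rg0": 3, "rg1": 0, "rg2": 0, "rg3": 0}

print(aggregate_survives(shelf_outage))   # True: 8 disks lost, data intact
print(aggregate_survives(third_failure))  # False: rg0 has failed
```

The point is that the total number of failed disks does not matter, only the count inside each RAID group.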

Paul Solovyovsky

Here's what NetApp says:

How Data ONTAP handles a failed disk that has no available hot spare

When a failed disk has no appropriate hot spare available, Data ONTAP puts the affected RAID group into degraded mode indefinitely and the storage system automatically shuts down within a specified time period.

If the maximum number of disks have failed in a RAID group (two for RAID-DP, one for RAID4), the storage system automatically shuts down in the period of time specified by the raid.timeout option. The default timeout value is 24 hours.

To ensure that you are aware of the situation, Data ONTAP sends an AutoSupport message whenever a disk fails. In addition, it logs a warning message in the /etc/messages file once per hour after a disk fails.
Attention: If a disk fails and no hot spare disk is available, contact technical support.
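
Reading the quoted text against the original question, the behaviour can be sketched roughly as follows. This is only my interpretation of the passage above, written as Python for clarity: the enum, the function name, and the dictionary keys are hypothetical, and the 24 hours is just the documented default for the raid.timeout option.

```python
from enum import Enum
from typing import Optional, Tuple


class RaidGroupState(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"                   # runs indefinitely, no timer
    SHUTDOWN_PENDING = "shutdown pending"   # raid.timeout countdown running
    FAILED = "failed"                       # more failures than RAID tolerates


# Maximum failed disks per RAID group, as stated in the quoted text.
MAX_FAILED = {"raid_dp": 2, "raid4": 1}

# Default for the raid.timeout option, per the quoted text.
DEFAULT_RAID_TIMEOUT_HOURS = 24


def classify(raid_type: str,
             failed_without_spare: int,
             raid_timeout_hours: int = DEFAULT_RAID_TIMEOUT_HOURS
             ) -> Tuple[RaidGroupState, Optional[int]]:
    """Classify a RAID group by its failed disks that have no hot spare.

    Returns the state and the hours until automatic shutdown
    (None when no shutdown timer applies).
    """
    limit = MAX_FAILED[raid_type]
    if failed_without_spare == 0:
        return RaidGroupState.NORMAL, None
    if failed_without_spare < limit:
        return RaidGroupState.DEGRADED, None
    if failed_without_spare == limit:
        return RaidGroupState.SHUTDOWN_PENDING, raid_timeout_hours
    return RaidGroupState.FAILED, None


# One more RAID-DP disk fails with no spare left: degraded, no timer.
print(classify("raid_dp", 1))
# Two RAID-DP disks failed with no spare: shutdown after raid.timeout.
print(classify("raid_dp", 2))
```

So under this reading, with RAID-DP and the spare already used, one more failure leaves the RAID group degraded but running, and the 24-hour countdown only starts once two disks in the same RAID group are down.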

SWRegistration

ASKER

Thanks again for your assistance!