SWRegistration
asked on
FAS2040 Disk Failure
We have dual controllers on a FAS2040. Each controller has one aggregate: SAS disks on one, SATA on the other. Both are configured with RAID-DP, and one spare disk is assigned to each controller. A disk failed, and System Manager now shows the notification "There are insufficient spare disks". With RAID-DP and the spare now gone, what happens if we lose another disk while waiting for the replacement to arrive? Should EACH controller have 2 spares?
ASKER
Ok, I think I've found the answer to the last one. I couldn't find anything to let us know if we were ok with just having the one spare per set. I'll share what I just found. Thank you for your assistance.
A few things to consider. Remember that there can be multiple aggregates on the system, each of which can consist of multiple raid groups - aggr status -r will show the raid groups.
You can have two failed drives in every aggregate and still not lose data, because each aggregate is a separate entity. In addition, you can lose two drives in each raid group of a _single_ aggregate without losing data, because each raid group is in its own raid-dp setup. So, if you have an aggregate with 4 raid-dp raid groups, you could lose 8 drives, as long as two come from each raid group, without losing data. For the record, I've seen this - an entire shelf powered off, but only two drives from that shelf were in any single raid group, so no data loss.
If you lost a third drive in a raid-dp raid group, that raid group would fail and the aggregate would go offline, and you'd lose data.
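The rule above can be sketched in a few lines of Python. This is only an illustration of the counting argument, not anything Data ONTAP runs; the function name `aggregate_survives` and the list-of-counts input are made up for the example.

```python
from typing import Sequence

# RAID-DP tolerates up to two failed disks per RAID group (from the post above).
RAID_DP_TOLERANCE = 2


def aggregate_survives(failed_per_group: Sequence[int],
                       tolerance: int = RAID_DP_TOLERANCE) -> bool:
    """Return True if no RAID group in the aggregate exceeds its failure tolerance.

    failed_per_group holds one entry per RAID group: the number of failed
    disks currently in that group (hypothetical input format).
    """
    return all(failed <= tolerance for failed in failed_per_group)


# Four RAID-DP groups, two failures in each: 8 disks lost, aggregate still up.
print(aggregate_survives([2, 2, 2, 2]))

# Three failures in one group: that group fails and takes the aggregate with it.
print(aggregate_survives([3, 0, 0, 0]))
```

What matters is how the failures are spread across RAID groups, not the total number of failed disks.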
Here's what NetApp says:
How Data ONTAP handles a failed disk that has no available hot spare
When a failed disk has no appropriate hot spare available, Data ONTAP puts the affected RAID group into degraded mode indefinitely and the storage system automatically shuts down within a specified time period.
If the maximum number of disks have failed in a RAID group (two for RAID-DP, one for RAID4), the storage system automatically shuts down in the period of time specified by the raid.timeout option. The default timeout value is 24 hours.
To ensure that you are aware of the situation, Data ONTAP sends an AutoSupport message whenever a disk fails. In addition, it logs a warning message in the /etc/messages file once per hour after a disk fails.
Attention: If a disk fails and no hot spare disk is available, contact technical support.
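To make the quoted behaviour easier to follow, here is a rough Python sketch of how I read it. It is my interpretation of the text above, not NetApp code; the function name, the state strings and the `failed_without_spare` parameter are all invented for illustration. Only the limits (two for RAID-DP, one for RAID4) and the 24-hour default for raid.timeout come from the quote.

```python
# Maximum failed disks a RAID group can carry, per the quoted NetApp text.
MAX_FAILURES = {"raid_dp": 2, "raid4": 1}

# Default value of the raid.timeout option, per the quoted NetApp text.
DEFAULT_RAID_TIMEOUT_HOURS = 24


def raid_group_state(raid_type: str,
                     failed_without_spare: int,
                     timeout_hours: int = DEFAULT_RAID_TIMEOUT_HOURS) -> str:
    """Describe a RAID group with failed disks and no hot spare to rebuild onto.

    failed_without_spare counts failed disks that could not be
    reconstructed onto a spare (hypothetical parameter).
    """
    limit = MAX_FAILURES[raid_type]
    if failed_without_spare == 0:
        return "normal"
    if failed_without_spare > limit:
        return "failed: data loss"
    if failed_without_spare == limit:
        return f"degraded: system shuts down after {timeout_hours} hours"
    return "degraded"


print(raid_group_state("raid_dp", 1))  # degraded
print(raid_group_state("raid_dp", 2))  # degraded: system shuts down after 24 hours
print(raid_group_state("raid4", 1))    # degraded: system shuts down after 24 hours
```

On that reading, if the original failure has already been rebuilt onto the only spare, the next failed disk in that RAID-DP group would be the first one without a spare: the group runs degraded, and the shutdown timer only starts if a second disk fails before a replacement is in place.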
ASKER
Thanks again for your assistance!