We have just spent a day recovering from a serious data loss following a single disk failure in a RAID10 array.
We have a Dell R510 12-bay server (approximately 15 months old) with 8x Dell SAS 15k 600GB disks and 4x Dell near-line SAS 7.2k 2TB disks. The RAID controller is a PERC H700 with slightly out-of-date firmware.
The RAID config is:
VD0 = 4x600GB in RAID 10 (OS)
VD1 = 4x600GB in RAID 10 (Data)
VD2 = 2x2TB in RAID 1 (Files)
VD3 = 2x2TB in RAID 1 (Backups)
OS is Windows Server 2008 R2 and Hyper-V. We have a number of virtual servers with VHDs placed across the Virtual Disks.
Yesterday we observed that one of the HDDs in VD1 was blinking an error. OK, not a big problem: we shut down gracefully, replaced the physical disk with a cold standby, set it as a hot spare, attached it to the array, and let it rebuild.
On bringing the server up (which presented some other issues for a later post), we found severe corruption and data loss across VD1. Most of the files appeared to be intact, but there were far too many obviously corrupt files across Exchange, SQL, and the file shares, so we decided to go through a full recovery process from backups. This was painful and time-consuming but concluded fine.
Previously we used good old RAID 1 with cheap disks, managed through the standard Windows Server interface. Sure, we lost some disks, but we never lost any data. I thought we'd bolstered our resilience with the advanced PERC/RAID setup and expensive Dell 15k disks, but it seems we have fallen at our first failure. I feel like we purchased an expensive airbag that worked at all times until the car crashed!
So we lost a day, and there will no doubt be some more clean-up. I'm happy our backup strategy held up: in the end we lost virtually nothing bar a few middle-of-the-night spam emails.
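One change we are already considering is scripted monitoring of virtual disk state from the OS, rather than relying on the drive LEDs. PERC H700 controllers are LSI-based, so the LSI MegaCli tool generally works against them; a minimal sketch of parsing `MegaCli -LDInfo -Lall -aALL` output to flag any VD that is not "Optimal" might look like the following. The sample output embedded here is illustrative only, not captured from our controller.

```python
# Hypothetical sketch: scan MegaCli virtual-drive output and report the
# target IDs of any VDs whose State is not "Optimal" (e.g. "Degraded").
# SAMPLE is an illustrative stand-in for `MegaCli -LDInfo -Lall -aALL`.

SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
State               : Optimal
Virtual Drive: 1 (Target Id: 1)
State               : Degraded
"""

def degraded_vds(output: str) -> list[int]:
    """Return target IDs of virtual drives whose State is not Optimal."""
    bad, current = [], None
    for line in output.splitlines():
        if line.startswith("Virtual Drive:"):
            # e.g. "Virtual Drive: 1 (Target Id: 1)" -> 1
            current = int(line.split("Target Id:")[1].strip(" )"))
        elif line.strip().startswith("State") and current is not None:
            state = line.split(":", 1)[1].strip()
            if state != "Optimal":
                bad.append(current)
    return bad

print(degraded_vds(SAMPLE))  # [1]
```

Running something like this on a schedule and alerting on a non-empty result would at least have told us a VD was degraded before we trusted the rebuild.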
So here's the question: we simply cannot afford a repeat of this from a single disk failure. Did we do something obviously wrong? We want to learn the lessons from this and would welcome any advice.