Contigo1

asked on

Data loss and corruption following single disk failure in RAID 10 array on Dell R510 / PERC H700

We have just spent a day recovering from a serious data loss following a single disk failure in a RAID10 array.

We have a Dell R510 12-bay server (approx 15 months old) with 8x Dell SAS 15k 600GB disks and 4x Dell Near-line SAS 7.2k 2TB disks. The RAID controller is a PERC H700 with slightly out-of-date firmware.

The RAID config is:
VD0 = 4x600GB in RAID 10 (OS)
VD1 = 4x600GB in RAID 10 (Data)
VD2 = 2x2TB in RAID 1 (Files)
VD3 = 2x2TB in RAID 1 (Backups)

The OS is Windows Server 2008 R2 with Hyper-V. We have a number of virtual servers with VHDs placed across the virtual disks.
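
(For context, a rough sketch of how the virtual and physical disk state can be dumped for a sanity check, assuming Dell OpenManage Server Administrator and its omreport CLI are installed; the controller index is an assumption and may differ on a given box:)

import subprocess

def omreport(*args: str) -> str:
    """Run an OpenManage Server Administrator report command and return its text output."""
    result = subprocess.run(["omreport", *args], capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Virtual disk layout/state (VD0..VD3) and physical disk state.
    # Controller index 0 is an assumption; adjust to match the PERC H700 slot.
    print(omreport("storage", "vdisk", "controller=0"))
    print(omreport("storage", "pdisk", "controller=0"))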

Yesterday we observed one of the HDDs in VD1 blinking as errored. OK - not a big problem: we shut down gracefully, replaced the physical disk with a cold standby, set it as a hot standby, attached it to the array and let it rebuild.

On bringing the server up (which presented some other issues for a later post), there was severe corruption and data loss across VD1. Most of the files looked like they were there OK, but there were far too many obviously corrupt files across Exchange, SQL, file shares etc., so we decided to go through a full recovery from backups. This was painful and time-consuming but concluded fine.
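
(As an aside, a rough sketch of the sort of hash comparison that can flag corrupt files against a known-good backup copy; the paths below are placeholders, not our actual layout:)

import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(live_root: Path, backup_root: Path) -> list[Path]:
    """List files under live_root that differ from, or are missing in, backup_root."""
    bad = []
    for live_file in live_root.rglob("*"):
        if not live_file.is_file():
            continue
        backup_file = backup_root / live_file.relative_to(live_root)
        if not backup_file.exists() or sha256(live_file) != sha256(backup_file):
            bad.append(live_file)
    return bad

if __name__ == "__main__":
    # Placeholder paths -- point these at the suspect volume and the restored backup copy.
    for f in find_mismatches(Path(r"D:\Data"), Path(r"E:\RestoredBackup\Data")):
        print(f)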

Previously we used good old RAID 1 with cheap disks and software RAID through the standard Windows Server interface. Sure, we lost some disks, but we never lost any data. I thought we'd bolstered our resilience with the advanced PERC/RAID setup and expensive Dell 15k disks, but it seems we have fallen at our first failure. I feel like we purchased an expensive airbag that worked at all times until the car crashed!

So we lost a day and there will no doubt be some more clean-up. I'm happy our backup strategy was fine as we lost virtually nothing bar a few middle-of-the-night spam emails in the end.

So here's the question: We simply can't afford a repeat of this with a single disk failure. Did we do something obviously wrong? We want to learn the lessons from this and would welcome any advice.
ASKER CERTIFIED SOLUTION
Member_2_231077

Contigo1

ASKER

>>we shut down gracefully, replaced the physical disk

>That is bad practice; you should always replace failed disks hot if possible. It shouldn't cause data corruption, though.

OK - thanks for this advice.

>Have you had a previous power outage that lasted longer than the cache battery stayed charged for? Say, about 72 hours. That can lead to a write hole where everything seems to be in order but parity hasn't been updated, since writes aren't always atomic. The write hole is most likely to occur with RAID 5 but can happen with other RAID levels as well, and it only shows up after a disk failure even though the hole was created weeks before that.


No, the servers have remained powered up constantly, barring the odd reboot.
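
(For anyone reading this later, a tiny conceptual sketch of the write hole described above, purely illustrative and not PERC-specific: a write that reaches only one leg of a mirror reads back fine until the up-to-date leg is the one that fails.)

# Conceptual write-hole illustration: a 2-way mirror where a cached write
# reached only one leg. Everything looks healthy until that leg fails.

mirror = {"disk_a": b"new data", "disk_b": b"old data"}  # interrupted write: only disk_a updated

def read(failed: set) -> bytes:
    # Simplified read policy: prefer disk_a, fall back to disk_b.
    for leg in ("disk_a", "disk_b"):
        if leg not in failed:
            return mirror[leg]
    raise IOError("both mirror legs failed")

print(read(failed=set()))        # b'new data'  -- array looks fine
print(read(failed={"disk_a"}))   # b'old data'  -- stale copy only surfaces after the failure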
SOLUTION
arnold
