troubleshooting Question

Data loss and corruption following single disk failure in RAID 10 array on Dell R510 / PERC H700

Contigo1 asked on
Server Hardware, Microsoft Legacy OS, Dell
We have just spent a day recovering from a serious data loss following a single disk failure in a RAID10 array.

We have a Dell R510 12-bay server (approx. 15 months old) with 8x Dell SAS 15k 600GB disks and 4x Dell near-line SAS 7.2k 2TB disks. The RAID controller is a PERC H700 with slightly out-of-date firmware.

The RAID config is:
VD0 = 4x600GB in RAID 10 (OS)
VD1 = 4x600GB in RAID 10 (Data)
VD2 = 2x2TB in RAID 1 (Files)
VD3 = 2x2TB in RAID 1 (Backups)
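
For reference, the layout above can be written out as a quick capacity check. This is only a rough sketch: it assumes nominal disk sizes (2TB taken as 2000 GB), ignores formatting overhead, and the helper names are made up for illustration. Since RAID 1 and RAID 10 both keep two copies of every block, each virtual disk should in principle survive the loss of one disk.

```python
# Nominal usable capacity for each virtual disk listed above.
# Assumption: RAID 1 and RAID 10 both store two copies of every block,
# so usable space is half the raw space (formatting overhead ignored).

VIRTUAL_DISKS = [
    # (name, RAID level, disk count, nominal disk size in GB, purpose)
    ("VD0", "RAID 10", 4, 600, "OS"),
    ("VD1", "RAID 10", 4, 600, "Data"),
    ("VD2", "RAID 1", 2, 2000, "Files"),
    ("VD3", "RAID 1", 2, 2000, "Backups"),
]


def usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a two-way mirrored layout (RAID 1 or RAID 10)."""
    return disk_count * disk_size_gb // 2


def mirror_pairs(disk_count):
    """Number of two-disk mirror pairs in the virtual disk."""
    return disk_count // 2


for name, level, count, size, purpose in VIRTUAL_DISKS:
    print(
        f"{name} ({purpose}): {level}, {count} x {size} GB -> "
        f"{usable_gb(count, size)} GB usable, {mirror_pairs(count)} mirror pair(s)"
    )
```

So VD1, the array that failed, is two mirror pairs striped together, and losing one disk leaves one pair running on a single copy until the rebuild completes.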

OS is Windows Server 2008 R2 and Hyper-V. We have a number of virtual servers with VHDs placed across the Virtual Disks.

Yesterday we observed that one of the HDDs in VD1 was blinking as errored. OK, not a big problem: we shut down gracefully, replaced the physical disk with a cold standby, set it as a Hot Standby, attached it to the array and rebuilt.

On bringing the server up (which presented some other issues for a later post), there was severe corruption and data loss across the VD1 disk. Most of the files looked like they were there OK, but there were far too many obviously corrupt files in Exchange, SQL, file shares etc., so we decided to go through a full recovery from backups. This was painful and time-consuming but concluded fine.

Previously we used good old RAID 1 with cheap disks, configured through the standard Windows Server interface. Sure, we lost some disks, but we never lost any data. I thought we'd bolstered our resilience with the advanced PERC/RAID setup and expensive Dell 15k disks, but it seems we have fallen at our first failure. I feel like we purchased an expensive airbag that worked at all times until the car crashed!

So we lost a day and there will no doubt be some more clean-up. I'm happy our backup strategy held up, as in the end we lost virtually nothing bar a few middle-of-the-night spam emails.

So here's the question: We simply can't afford a repeat of this with a single disk failure. Did we do something obviously wrong? We want to learn the lessons from this and would welcome any advice.
ASKER CERTIFIED SOLUTION
Andrew Hancock - VMware vExpert