Contigo1 asked:

Data loss and corruption following single disk failure in RAID 10 array on Dell R510 / PERC H700

We have just spent a day recovering from a serious data loss following a single disk failure in a RAID 10 array.

We have a Dell R510 12-bay server (approx. 15 months old) with 8x Dell SAS 15k 600GB disks and 4x Dell near-line SAS 7.2k 2TB disks. The RAID controller is a PERC H700 with slightly out-of-date firmware.

The RAID config is:
VD0 = 4x600GB in RAID 10 (OS)
VD1 = 4x600GB in RAID 10 (Data)
VD2 = 2x2TB in RAID 1 (Files)
VD3 = 2x2TB in RAID 1 (Backups)

OS is Windows Server 2008 R2 and Hyper-V. We have a number of virtual servers with VHDs placed across the Virtual Disks.

Yesterday we observed that one of the HDDs in VD1 was blinking to indicate an error. OK - not a big problem: we shut down gracefully, replaced the physical disk with a cold standby, set it as a hot standby, attached it to the array and let it rebuild.

On bringing the server up (which presented some other issues for a later post), we found severe corruption and data loss across VD1. Most of the files appeared to be present, but there were far too many obviously corrupt files across Exchange, SQL, file shares etc., so we decided to go through a full recovery from backups. This was painful and time-consuming but concluded fine.

Previously we used good old RAID 1 with cheap disks, configured through the standard Windows Server interface. Sure, we lost some disks, but we never lost any data. I thought we'd bolstered our resilience with the advanced PERC RAID setup and expensive Dell 15k disks, but it seems we have fallen at our first failure. I feel like we purchased an expensive airbag that worked at all times until the car crashed!

So we lost a day and there will no doubt be some more clean-up. I'm happy our backup strategy was fine as we lost virtually nothing bar a few middle-of-the-night spam emails in the end.

So here's the question: We simply can't afford a repeat of this with a single disk failure. Did we do something obviously wrong? We want to learn the lessons from this and would welcome any advice.
Tags: Microsoft Legacy OS, Server Hardware, Dell

Last comment: David, 8/22/2022 (Mon)
ASKER CERTIFIED SOLUTION
andyalder

Contigo1

ASKER
>>we shut down gracefully, replaced the physical disk

>That is bad practice, you should always replace failed disks hot if possible. Shouldn't cause data corruption though.

OK - thanks for this advice.

>Have you had a previous power outage that lasted longer than the cache battery stays charged for, say about 72 hours? That can lead to a write hole, where everything seems to be in order but parity hasn't been updated, since writes aren't always atomic. The write hole is most likely to occur with RAID 5 but can happen with other RAID levels as well, and it only shows up after a disk failure even though the hole was created weeks before that.


No, the servers have remained powered up constantly, barring the odd reboot.
SOLUTION
arnold
