
Data loss and corruption following single disk failure in RAID 10 array on Dell R510 / PERC H700

Last Modified: 2016-11-23
We have just spent a day recovering from serious data loss following a single disk failure in a RAID 10 array.

We have a Dell R510 12-bay server (approx. 15 months old) with 8 x Dell SAS 15k 600GB disks and 4 x Dell near-line SAS 7.2k 2TB disks. The RAID controller is a PERC H700 with slightly out-of-date firmware.

The RAID config is
VD0 = 4x600GB in RAID 10 (OS)
VD1 = 4x600GB in RAID 10 (Data)
VD2 = 2x2TB in RAID 1 (Files)
VD3 = 2x2TB in RAID 1 (Backups)
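
For reference, the nominal usable capacity and guaranteed fault tolerance of this layout can be worked out with a short sketch. It assumes decimal sizes (2TB = 2000GB), ignores formatting and controller overhead, and the names `usable_gb` and `layout` are purely illustrative:

```python
def usable_gb(level, disks, size_gb):
    """Nominal usable capacity of a mirrored virtual disk, in GB."""
    if level == "RAID1":
        # A two-way mirror stores one disk's worth of data.
        return size_gb
    if level == "RAID10":
        # Striped mirror pairs: half of the disks hold unique data.
        return (disks // 2) * size_gb
    raise ValueError(f"unsupported RAID level: {level}")

# The four virtual disks described above: (level, disk count, disk size in GB).
layout = {
    "VD0": ("RAID10", 4, 600),
    "VD1": ("RAID10", 4, 600),
    "VD2": ("RAID1", 2, 2000),
    "VD3": ("RAID1", 2, 2000),
}

capacities = {vd: usable_gb(*spec) for vd, spec in layout.items()}

for vd, gb in capacities.items():
    # Each of these layouts is only guaranteed to survive one disk failure;
    # RAID 10 survives a second failure only if it hits a different mirror pair.
    print(f"{vd}: {gb} GB usable, guaranteed to tolerate 1 disk failure")
```

On paper, then, a single failed disk in VD1 should not have cost any data, which is why the outcome described below was so unexpected.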

OS is Windows Server 2008 R2 and Hyper-V. We have a number of virtual servers with VHDs placed across the Virtual Disks.

Yesterday we observed that one of the HDDs in VD1 was blinking to indicate an error. OK, not a big problem: we shut down gracefully, replaced the physical disk with a cold standby, set it as a Hot Standby, attached it to the array and rebuilt.

On bringing the server up (which presented some other issues for a later post), there was severe corruption and data loss across VD1. Most of the files looked like they were there OK, but there were so many obviously corrupt files across Exchange, SQL, file shares etc. that we decided to go through a full recovery process from backups. This was painful and time-consuming but concluded fine.

Previously we used good old RAID 1 with cheap disks, configured through the standard Windows Server interface. Sure, we lost some disks, but we never lost any data. I thought we'd bolstered our resilience by using the advanced PERC RAID setup with expensive Dell 15k disks, but it seems we have fallen at our first failure. I feel like we purchased an expensive airbag that worked at all times until the car crashed!

So we lost a day and there will no doubt be some more clean-up. I'm happy our backup strategy was fine as we lost virtually nothing bar a few middle-of-the-night spam emails in the end.

So here's the question: We simply can't afford a repeat of this with a single disk failure. Did we do something obviously wrong? We want to learn the lessons from this and would welcome any advice.

retired saggar maker
CERTIFIED EXPERT
Distinguished Expert 2019
Commented:

Author

Commented:
>>we shut down gracefully, replaced the physical disk

>That is bad practice; you should always replace failed disks hot if possible. It shouldn't cause data corruption though.

OK - thanks for this advice.

>Have you had a previous power outage that lasted longer than the cache battery stayed charged for, say about 72 hours? That can lead to a write hole, where everything seems to be in order but parity hasn't been updated, since writes aren't always atomic. The write hole is most likely to occur with RAID 5 but can happen with other RAID levels as well, and it only shows up after a disk failure, even though the hole was created weeks before that.


No, the servers have remained powered up constantly, barring the odd reboot.
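
To illustrate the write hole described in the quoted comment, here is a minimal sketch of a RAID 5-style stripe with XOR parity. It is a toy model under stated assumptions, not how the PERC H700 works internally: the block contents, the three-data-disk stripe and the helper names `xor_parity` and `rebuild` are all made up for illustration. A RAID 10 array has no parity, but the analogous problem there would be two mirror halves that silently disagree.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity):
    """Reconstruct one missing data block from the survivors plus parity."""
    return xor_parity(surviving_blocks + [parity])

# A consistent stripe across three data disks, plus its parity block.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(stripe)

# Sanity check: with good parity, a lost disk 1 is rebuilt correctly.
good_rebuild = rebuild([stripe[0], stripe[2]], parity)

# Interrupted write: new data reaches disk 0, but the matching parity
# update is lost (for example, cached writes dropped after a long outage).
stripe[0] = b"ZZZZ"
stale_parity = parity

# All disks are still healthy, so normal reads return the expected data
# and nothing looks wrong. Later, disk 1 fails and is rebuilt from the
# surviving data blocks and the stale parity.
rebuilt_disk1 = rebuild([stripe[0], stripe[2]], stale_parity)

print("rebuilt with good parity: ", good_rebuild)
print("rebuilt with stale parity:", rebuilt_disk1)
print("silent corruption:", rebuilt_disk1 != b"BBBB")
```

The point of the sketch is that the inconsistency is invisible until a rebuild needs the parity, which matches the comment that the hole can be created weeks before the disk failure exposes it.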
DavidPresident
CERTIFIED EXPERT
Top Expert 2010
Commented: