Solved

Replaced failing drive in RAID 5 array; now almost all folders are empty

Posted on 2012-12-27
Last Modified: 2016-12-08
My supervisor replaced a failing drive in the RAID 5 array on our PE2900 on Christmas Eve. I came in today to browse the network shares on that array, and all the folders are empty. The free/available space is the same as it was before the bad drive was replaced, but there is no data, and the drive we replaced now has an orange LED. I cannot restart the server for a few hours because people are using one of the VMs hosted in Hyper-V.

Any idea how to fix this?  I've never seen this before.  It's a PERC 6/i controller.
Question by:Mike
12 Comments
 
LVL 34

Expert Comment

by:Paul MacDonald
ID: 38724403
Should the array have rebuilt itself?
 
LVL 33

Expert Comment

by:PowerEdgeTech
ID: 38724406
How EXACTLY did your supervisor replace the failing drive?  What steps did he take?  This can be very important - and could have wiped out the data if done incorrectly.

What are your disk and RAID configurations?
 
LVL 47

Expert Comment

by:dlethe
ID: 38724428
You experienced a multiple failure scenario.  There is nothing you can do but hire a pro or restore from backup.  Even having the system powered on ensures additional data destruction.

The going rate for RAID recovery is $1000 - $1500 PER DISK DRIVE in a RAID config, BTW, so use that to work out whether you restore or go to a professional recovery firm.

P.S. Since a rebuild was involved, be prepared for the worst. Your data may be gone forever, or so corrupted that too much was overwritten for you to get back any file larger than your stripe size.

Every moment that system is powered up into Windows makes things worse too, so if you plan on attempting recovery, shut it down now and go for it.  It won't get any better on its own, so you should minimize the damage by turning it off and keeping it off.
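The per-disk rate quoted above can be turned into a rough range. This is only a sketch of the arithmetic: the function name is made up for illustration, and the disk count of 6 is taken from the asker's later description of the array (whether a recovery firm would count the spare is an assumption).

```python
def recovery_cost_range(disk_count, low_per_disk=1000, high_per_disk=1500):
    """Return (low, high) estimated recovery cost in dollars.

    Uses the $1000 - $1500 per-disk figure quoted in this thread.
    """
    return disk_count * low_per_disk, disk_count * high_per_disk


# Assumption: all 6 disks mentioned later in the thread are billed.
low, high = recovery_cost_range(6)
print(f"Estimated recovery cost: ${low} - ${high}")
```

Comparing that range against the cost of losing whatever changed since the last backup is the restore-versus-recover decision described above.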
 
LVL 9

Author Comment

by:Mike
ID: 38724500
@poweredgetech - He powered down the server, replaced the drive, and turned the server back on.
We had 6 disks in a RAID 5 (one of them was a spare). At some point while I was away from the organization (roughly 13 months), the spare drive failed, and a week before I came back he said he was experiencing issues with the drive we replaced.

@Dlethe - I hope this is not the case.
 
LVL 34

Expert Comment

by:Paul MacDonald
ID: 38724555
So you lost three drives in your RAID5?  The spare should have kicked in when the first one died.  The spare died at some point (leaving you with no spare).  Then the most recent one died?
 
LVL 9

Author Comment

by:Mike
ID: 38724572
No, we lost 1, so we no longer had a spare; then another one started having issues and he swapped that one out.

We only lost 2 drives and replaced 1 of them.
 
LVL 47

Expert Comment

by:dlethe
ID: 38724573
Shadowless127 .. this IS the case.  You improperly replaced a drive, so it is actually possible that your RAID never did get rebuilt in the first place and you were degraded all that time, and metadata was never properly written.  So when you put in a replacement drive, it finally kicked off a rebuild and interleaved data from the drive you put in 13 months ago and spread it around.  This corrupted every block you had.

Damage is massive, but actually it could be recovered with lost data limited to the blocks that have changed since the rebuild.

But turn it off, really. There is nothing you can do, and there is no software you can buy (retail) that will fix this up for you.
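To make the rebuild point concrete, here is a toy sketch of RAID 5 XOR parity in Python. It is not how the PERC firmware works internally; the four-byte chunks, the member order, and the `xor_blocks` helper are all assumptions for illustration. It shows that a rebuild is only correct if every surviving member is current: one stale member is enough to produce garbage for the chunk being rebuilt.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: four data chunks plus one parity chunk (toy sizes).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Healthy rebuild: member 2 is lost and recomputed from the other
# three data chunks plus parity.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])

# Bad rebuild: member 1 holds stale content from an older write,
# so the recomputed chunk no longer matches the lost data.
corrupted = xor_blocks([data[0], b"old!", data[3], parity])

print(rebuilt == data[2])    # healthy rebuild recovers the chunk
print(corrupted == data[2])  # stale input yields a wrong chunk
```

The same effect repeated across every stripe is what turns a rebuild from stale or mismatched members into array-wide corruption.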
 
LVL 55

Expert Comment

by:andyalder
ID: 38724623
It's hard to know what happened when it was powered on with a replacement drive; the drive could have had metadata on it, or it might have been blank. I'd have a look at it with GetDataBack and see if it's just the MFT that has gone corrupt, although any attempt at repair could make things worse.
 
LVL 33

Accepted Solution

by:
PowerEdgeTech earned 500 total points
ID: 38724782
Make a sign and hang it in the server room:  "If it is hot-swap, swap it hot."  

Never power down the system to replace a drive unless it is your only choice ... it is just asking for trouble - and is the reason that drives are "hot-swappable".

Powering up with a new disk can introduce RAID metadata and disk data that does not match the existing array ... the controller has to try to figure out which disks have the right data.  Worst case, the controller decides on its own and is wrong.  Best case, the controller flags it as foreign until the user clears the foreign config on the disk and assigns it as a hot-spare.  However, if the user mistakenly imports the config, it is the same as the controller picking the wrong configuration - the controller tries to reconcile the data between the two "disks" and usually ends up with bits and pieces of each in an unreadable, corrupt pile of scrambled eggs.

At this point, try a reboot - maybe the volume went offline or was dismounted, but plan to restore from backup (or call data recovery).
 
LVL 47

Expert Comment

by:dlethe
ID: 38724853
Nahh, realistically, based on the additional information, you are looking at a situation where, when you built the RAID and the XOR data was virgin, you had a mirror copy of the first chunk of data in a chunk that was in the parity area for the first stripe.   Then that disk stripe went offline, so it was protected from change (or the disk was marked bad, or you had a puncture, or one of several other issues).

But now, after moving things around and rebuilding, that parity chunk got moved to a non-parity area.   Sorry, this is getting too deep to explain without a graph, and it really is moot.   Suffice to say that what was once a stale XOR parity block is now logical block zero.   If it happened to one block, it happened to more than one block.  You can figure this out with a binary editor and take the NTFS apart, but that is too big a pain to walk somebody through, and you would need a non-RAID controller and have to determine the proper offsets to even do this.

That information wouldn't be of any value except to a person doing recovery, or just for the satisfaction of determining root cause.   If you want to learn the root cause, one could programmatically extract the controller's event log (which is volatile, by the way, so don't power off if you *really* want it).   If you were running the mrsas program from Dell at the time this all happened, you would have this information.  If you weren't, then Dell won't do it for you, and the only people who could get it would be an LSI engineer or somebody like my company that has the NDA programming info.   (No, I'm not volunteering or trolling for consulting; it wouldn't do you any good anyway, because even then it wouldn't provide enough information for a recovery.)
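The "mirror copy in the parity area" remark can be illustrated with a small Python sketch. It assumes a freshly initialised stripe whose chunks are all zeros, a five-member array, and four-byte chunks; the `xor_blocks` helper and the sample bytes are made up for illustration and say nothing about the controller's real on-disk layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


chunk_size = 4
members = 5  # assumed: four data chunks plus one parity chunk per stripe

# Freshly initialised stripe: every data chunk is zeros.
stripe = [bytes(chunk_size) for _ in range(members - 1)]

# Only the first chunk of the volume has been written so far.
stripe[0] = b"NTFS"

# XOR with zeros changes nothing, so parity equals the one written chunk.
parity = xor_blocks(stripe)
print(parity == stripe[0])
```

If that parity chunk later goes stale and is then treated as data after a rebuild, what appears at that location is an old copy of the first chunk rather than its current contents, which is the scenario described above.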
 
LVL 9

Author Comment

by:Mike
ID: 38724879
@PoweredgeTech - My boss for some reason thinks a global hot spare is the same as hot swap, and since we didn't have a GHS, he powered it down.  I run this exact same machine at home, and have always hot swapped with no problem.

We were able to get back to normal after a reboot and forcing the new drive online.  We will be putting in a GHS tomorrow for safety.
 
LVL 33

Expert Comment

by:PowerEdgeTech
ID: 38724947
I almost asked that very question ... what was your/his interpretation of "spare".  Don't you hate having to educate your boss?  I would.