Adaptec 5805 Expanding Raid 5 Array

Hi guys,

I was expanding my RAID 5 array on this controller to include two new blank drives (for a total of 6 drives). The expansion was at 15% when I went to bed. It looks as though I had applied a Windows update at some point last night that required a reboot, so about 4 hours later Windows 7 rebooted on me.

When I got up this morning I saw that the box was sitting at the windows login screen, not a good sign.
The drive is inaccessible and Adaptec Storage Manager states there are no logical drives. Is there any way I can recover this lost array? Any advice at all would be appreciated, as there was 3900 GB of data on this logical drive.

Thanks in advance!
JerksonAsked:

SemperWiFiCommented:
Please note: Data recovery is never guaranteed and even ‘non-destructive’ methods can at times still be destructive.
   
Before I get into this let me say one thing. One of the biggest reasons I continue to be such a fan of Adaptec controllers is due to the extremely impressive level of knowledgeable support they are always VERY happy to provide. They stand quite firmly behind their products and you just simply can't beat that with a stick. So, never hesitate to give them a ring.

This is where I omit the age old advice which EVERY IT professional preaches till they are blue in the face. ALWAYS PERFORM BACKUPS PRIOR TO MAJOR CHANGES! I will leave this part out since I think you know this already, if not before, you do now.

For future reference, it is better to manage major changes to arrays in the controller BIOS. While using the in-OS management software (in this case Adaptec Storage Manager), you are subject to interruptions caused by the operating system, which can be potentially harmful. But you know this already, I see. Nonetheless, think of going into the BIOS for these things like being in a private room where there is just you, your controller, and your drives. No interruptions like screaming kids, BSODs, or update-triggered reboots. That being said, quicker functions are certainly doable in the software-based controller managers.

So here is where we are, from my understanding of what you said happened. Firstly, since you can apparently still access your OS, the array in question is not your primary array. This is very good. Go you! Also, since you went to bed, I think I would be assuming correctly that we currently have no idea how far along the array adjustment was when the system rebooted. This being the case, I suggest calling Adaptec tech support now and getting with one of their engineers. They will have you run a 'Support Log' which they can read, and with any luck it might give some hints as to how far along the process was when the system rebooted.

If you would rather not call them, then you will want to enter the controller BIOS and attempt a rebuild of the array. You do so by deleting the current array (this does not write to the drives or destroy data) and re-creating it using the same parameters as the original, but this time with the skip init option. This lets you defer the ‘array build’ process until later and will often allow you to access the data.

Next you will more than likely need to begin the recovery process. I have had very good luck with RAID Reconstructor (http://www.runtime.org/raid.htm), which you will want to run. There is very good documentation on their site, but if you need additional assistance in this area, I'm happy to explain further.

Please understand that without having your support log run and reviewed you are guessing a lot as to what your situation really is in regard to where the process was.
DavidPresidentCommented:
Yes, there is always a way, even if it means paying a few thousand dollars to ontrack.com ...

If your data is worth thousands of dollars then don't touch anything and call ontrack.com and let them direct you.

IF not ..

1) Leave the system on if it is on; leave it off if it is off. Not changing state is the important thing.
2) Get the serial number of the Adaptec controller.
3) Go to adaptec.com and log an incident (you will need the serial number). Tell them it is down, get a ticket number, then call after a bit, use the ticket number, and tell them you are down. You will usually get somebody to talk to pretty quickly.

Adaptec can walk people through the process, as it depends on what is in the event log, your current/new config, firmware revisions, etc.

A botched expansion/extension is not the type of thing you can fix with chkdsk or any of the retail data recovery packages. They will make incorrect assumptions about the logical volume and can make things worse (CAN make things worse, not always will). There are programs that professional data recovery firms use that are privy to the algorithm and location of the Adaptec metadata, and that can diagnose where in the process it stopped and complete it.
 
If Adaptec won't help you then you need to get a scratch drive (or RAID-protected LUN) large enough to hold the entire reconstructed volume, and a non-RAID SCSI adapter, and run runtime.org's RAID reconstructor (over $100, can't remember) and see if it can put humpty dumpty together.  But then you may still have some partitioning/filesystem damage, so this is also going to require a lot of time and effort.  
JerksonAuthor Commented:
SemperWiFi,

I have already left the computer in the state it is in (on) and left all drives as is. I didn't want to attempt anything until I had at least heard from an expert on here. The super-critical data WAS backed up off-site. However, the bulk of the data (media files) is still in limbo.

I have placed a support ticket with Adaptec and we will see if I hear back from them. I have never heard of RAID Reconstructor; I will have to give that a look after I follow Adaptec's instructions.

I will post more updates and questions in here. Thanks to both of you so far.
DavidPresidentCommented:
It is OK to call them after there is an assigned ticket. Tell them you are down because of what is a firmware "feature" (if you want to be PC); I would call this a bug. The agent should put you through immediately.
JerksonAuthor Commented:
Long story short,  an interrupted expansion is one of two things they cannot recover from. The other being a corrupted stripe.

After a firmware upgrade we re-created the 6 drive raid 5, skipping init, and allowed checkdisk to run. It corrected the file table and then began recovering "orphaned" files. I saw my actual filenames being processed. The only reason I allowed checkdisk to run was we had already tried everything they could think of and windows did not recognize any data on the drive.

We will see if I have access to my files in a few hrs when checkdisk completes
DavidPresidentCommented:
Well, they are not being truthful. Recovering from an interrupted expansion is not much more than running code that sees where it left off, then manually expanding. What they are saying is that they don't have a utility they will give you (which is their right, and understandable). Their manual states you need to do a backup before expanding anyway.

Actually, chkdsk is the wrong course of action. You would have been much better off running something else, but it is a moot point. chkdsk is making things much worse. It has no idea that you have a corrupted RAID layout, so it is grabbing data from drives that have parity stripes and is now corrupting perfectly good files that would have been recoverable via some creative use of raw block copies with dd, some scratch drives, and XOR parity testing.
JerksonAuthor Commented:
Well, all I could do is follow the "expert's" course of action. He was on the phone with me for over an hour trying various things.

RAID Reconstructor wouldn't run since my entire user profile was on the RAID volume, and I couldn't select the drives from the drop-downs. Can you elaborate on "raw block copies with dd, and some scratch drives, and XOR parity testing"? Maybe this is something that would have made sense to mention off the bat.
DavidPresidentCommented:
Here is the overview, not the full cookbook. That is just too valuable to make public :)
 -> Figure out the topology. Which drive holds the parity for each data stripe? A human can eyeball this, but for software you need to either know it ahead of time, or write a known pattern, expand it, then kill the expansion to determine the algorithm. Professionals have already done this, so they know how it is laid out.

 -> Run a program that checks XOR parity on a per-stripe basis to locate expanded stripes, trying both the original and expanded LUN sizes (i.e., # of disks).
      You will see a large block of sequential "correct stripes" with n drives, and another set with n+1 drives. Mark them.
     
 -> Look for stripes that might have been written while the system was alive and mark them as unknown. Use statistical analysis (looking for patterns in broken strings and such) to figure out whether each was already expanded, because the booted host needed to write something at that particular stripe while the system was online.
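
A rough Python sketch of the per-stripe XOR check described above. It relies on the fact that in RAID 5 the XOR of every chunk in a row (data plus parity) is zero. Everything else is an assumption made for illustration: the disks are modelled as in-memory byte strings, a row sits at the same offset on every member, and there is no controller metadata area. The names `classify_rows`, `n_old` and `chunk_size` are hypothetical and have nothing to do with Adaptec's real on-disk format.

```python
from functools import reduce
from operator import xor


def xor_chunks(chunks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(xor, column) for column in zip(*chunks))


def parity_ok(chunks):
    """A RAID 5 row is consistent when data XOR parity is all zero bytes."""
    return not any(xor_chunks(chunks))


def classify_rows(disks, n_old, chunk_size):
    """Label each row 'old', 'new', 'ambiguous' or 'neither'."""
    rows = min(len(d) for d in disks) // chunk_size
    labels = []
    for row in range(rows):
        start = row * chunk_size
        chunks = [d[start:start + chunk_size] for d in disks]
        old_ok = parity_ok(chunks[:n_old])  # original members only
        new_ok = parity_ok(chunks)          # all members after expansion
        if old_ok and new_ok:
            labels.append("ambiguous")      # e.g. the added drives are still blank here
        elif old_ok:
            labels.append("old")
        elif new_ok:
            labels.append("new")
        else:
            labels.append("neither")
    return labels


# Toy images: 3 original disks plus 1 added disk, 2-byte chunks, 4 rows.
disks = [
    bytes.fromhex("0102aabb05060100"),
    bytes.fromhex("1020112205060000"),
    bytes.fromhex("0f0fbb9900000000"),
    bytes.fromhex("1e2dffff00000000"),
]
labels = classify_rows(disks, n_old=3, chunk_size=2)
print(labels)
```

Rows labelled "ambiguous" or "neither" are the ones that would need the manual or statistical inspection mentioned in the last step. On a real array you would read the chunks from disk images rather than from byte strings.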

Then use the topology equation you derived to build a function: for logical block(x), which physical disk and offset should it go to? Do the same to get the equation for laying down parity.
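
To make "for logical block(x), which disk and offset" concrete, here is a minimal Python sketch of such a mapping function. It assumes a left-symmetric RAID 5 rotation (the default layout in Linux md). That layout is only an illustrative assumption: the rotation the Adaptec firmware actually uses has to be derived as described above. The function name `locate` and its parameters are hypothetical.

```python
def locate(logical_offset, n_disks, chunk_size):
    """Map a logical byte offset to (data_disk, disk_offset, parity_disk).

    Assumes left-symmetric RAID 5: parity starts on the last disk and moves
    one disk to the left per stripe; data chunks start on the disk right
    after the parity disk and wrap around.
    """
    data_per_stripe = n_disks - 1
    chunk_no, within_chunk = divmod(logical_offset, chunk_size)
    stripe, index = divmod(chunk_no, data_per_stripe)
    parity_disk = (n_disks - 1) - (stripe % n_disks)
    data_disk = (parity_disk + 1 + index) % n_disks
    disk_offset = stripe * chunk_size + within_chunk
    return data_disk, disk_offset, parity_disk


CHUNK = 64 * 1024
# The same logical offset lands in different places in a 4-disk and a 6-disk layout.
print(locate(3 * CHUNK + 10, 4, CHUNK))
print(locate(3 * CHUNK + 10, 6, CHUNK))
```

Because the same logical offset maps to different disks and offsets under the old and new member counts, you need to know which stripes were already expanded before reading anything back.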

With info above, you end up with list of stripes that need to be expanded, and you can figure out where expansion stopped, then you could use runtime.org to start at certain block ranges and give you reconstructed spans of data that you manually recombine.

But again, moot point, the chkdsk killed this. I also didn't say this technique would be easy. With experience it is rather straightforward, and C programs exist to do the recovery. They just aren't available to you due to the intellectual property and the cost to develop them.
JerksonAuthor Commented:
Fair enough. Thanks for the explanation tho
SemperWiFiCommented:
What he is referring to is that for RAID Reconstructor to run, you must have a few things in place.

1. Set up a recovery machine with RAID Reconstructor installed on the local OS.
2. Connect all RAID drives to a non-RAID controller on the recovery machine (plug them into regular SATA ports on the recovery machine).
3. RAID Reconstructor will combine them all and image out to another destination for recovery. This means you will need disk space equal to the size of the damaged array to send said image to.
4. A data recovery tool will now be needed, as RAID Reconstructor is not a data recovery tool; it is a RAID recovery tool. So you will need to use something like GetDataBack on the recovered image to retrieve files.

What dlethe is saying is true: the more you mess with the state of the data, the harder it is to get back. Thus chkdsk in this case is equal to kicking sand around prior to attempting repair of your sand castle. Chkdsk is now trying to restore data without all the facts. You may still get data back, but you just decreased your chances dramatically.
 
SemperWiFiCommented:
Sorry, didn't mean to hit send on that last post yet - I see the Adaptec support engineer told you to do exactly what I originally posted but I'm still glad you contacted them. I do wish you had not run chkdsk though and had gone straight to RAID Reconstructor like we both said.
DavidPresidentCommented:
Probably too late, but if you want a prayer of getting something, image it all to a scratch drive (RAID protected, of course).  Then blow away the entire RAID config, and start over from scratch building and formatting new LUN(s).
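
For what "image it all to a scratch drive" amounts to, here is a minimal Python sketch of a raw block copy, roughly in the spirit of `dd` with `conv=noerror,sync`. The function name `image_device` and the paths in the usage note are hypothetical. It is written against ordinary files; reading a raw device additionally needs administrator rights and, on some platforms, sector-aligned reads, which this sketch does not handle.

```python
def image_device(src_path, dst_path, block_size=1024 * 1024):
    """Copy src to dst block by block; return offsets of blocks that failed to read.

    Unreadable blocks are replaced with zeros so that everything after them
    stays at the correct offset in the image.
    """
    bad_offsets = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            try:
                block = src.read(block_size)
            except OSError:
                bad_offsets.append(offset)
                block = b"\0" * block_size       # placeholder for the bad block
                src.seek(offset + block_size)    # skip past it and carry on
            if not block:
                break                            # end of source
            dst.write(block)
            offset += len(block)
    return bad_offsets
```

Usage would look like `bad = image_device("/path/to/source", "/path/to/scratch/array.img")`, with both paths standing in for your own. The returned list tells you which regions of the image are zero-filled placeholders rather than real data.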

You will have to do this anyway. Just kill the chkdsk and start cloning. There is no other way unless you want to pay somebody a lot more money than what you would have had to pay a few hours ago. (Sorry, it sounds like rubbing salt in the wound, but it is true.)

Also never never never do anything on a Patch Tuesday.  If you must, then disconnect network cable overnight.  I've been burned by MSFT before as well.  
JerksonAuthor Commented:
Ok. So the checkdisk finished and I HAVE ACCESS TO ALL MY FILES!

Well, not quite all. I found about 2 dozen pictures which were corrupt, and I obviously haven't gone through everything.

The strange thing is, the Adaptec Storage Manager shows all 6 1.5 TB drives as being members of the logical drive. However, the logical drive has no more usable space than it did when I had 4 x 1.5 TB drives (before this whole expansion fiasco).

My question now is, how do I go about getting these 2 drives to become usable in the array? Back up all data off the array and re-create it from scratch? Might be an option.
DavidPresidentCommented:
Looks like you got darned lucky.  I was working from premise that MSFT reboot killed it something like halfway through. I should not have assumed that. I will endeavor to do better next time.

Looks like the adaptec probably finished the operation or nearly finished it, and was waiting for operator input for final flush to disk when it rebooted.

The RAID is still suspect even though you have the data. You need to do a full backup, blow all the RAID away, rebuild, then do a restore. You cannot trust the logical device(s); you have no idea what state they are in.
JerksonAuthor Commented:
I definitely agree with you. So far less than 10% of the data seems to be corrupt which is terrific considering I had already come to terms with the fact that the data was indeed gone.
I will gather a series of JBODs and dump the data to them, clear all RAID disks, re-create the array, initialize the array, and restore from the JBODs.

I will also take this opportunity to saturate the controller with an 8th drive so that I never need to expand the array again. Adding another drive would mean adding another controller anyway, as this is an 8-channel controller.

Apparently the firmware I had on the controller greatly reduced the rebuild time compared to the last firmware I had, so that may have hastened the whole process compared to the last time I performed an expand to encompass a 4th drive.

I will split points between the two of you. Unfortunately you (SemperWiFi) didn't mention earlier that RAID Reconstructor had to be run in a machine without the RAID controller, otherwise I could have tried that too. However, before I assign points I will leave this open in case I have questions during my backup/restore process.

Thanks again guys!!

-Jason


DavidPresidentCommented:
If the controller supports it, you should do RAID6.  Last thing you want is 3 days worth of rebuilding then another drive failure.  I take it you are using consumer-grade, not enterprise class disks so risk of drive failure is 2X, risk of data corruption is approx 100X more.  RAID6 eliminates those extra risks.  
JerksonAuthor Commented:
True, these are consumer grade drives. Raid 6 is just like raid 5 with a second distributed parity stripe. Correct?

So I lose the space of two drives but can sustain two simultaneous drive failures?
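
For reference, the capacity arithmetic behind that question, as a small Python sketch. It uses only the standard definitions (RAID 5 gives up one drive's worth of space to parity, RAID 6 two) and ignores controller metadata and filesystem overhead. The function name `usable_tb` is just for illustration.

```python
def usable_tb(n_drives, drive_tb, level):
    """Usable capacity in TB for equal-sized drives, ignoring metadata overhead."""
    parity_drives = {"raid5": 1, "raid6": 2}[level]
    if n_drives < parity_drives + 2:
        raise ValueError(f"{level} needs at least {parity_drives + 2} drives")
    return (n_drives - parity_drives) * drive_tb


print(usable_tb(4, 1.5, "raid5"))  # the original 4-drive array
print(usable_tb(6, 1.5, "raid5"))  # what the 6-drive expansion should have given
print(usable_tb(8, 1.5, "raid6"))  # the planned 8-drive RAID 6
```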
DavidPresidentCommented:
Correct, but more importantly, RAID6 also provides extra parity. When you have RAID5 AND a drive failure, it is likely you are going to get data corruption. If any surviving disk has an unreadable block, or if there was a parity error before the failure, then wrong data will get written on a rebuild.

With RAID6, you have the extra parity, so your data will stay clean.
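
A toy Python illustration of the RAID 5 half of that argument: a rebuild is just the XOR of the survivors, so one bad surviving block silently produces a wrong rebuilt block. The block contents are made up, and the bad block is modelled as a silently wrong byte rather than a read error. RAID 6's second parity is not implemented here; the sketch only shows why a single XOR parity has nothing left to cross-check against once a drive is gone.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data blocks from one stripe, plus their RAID 5 parity.
data = [b"alpha123", b"bravo456", b"charl789"]
parity = xor_blocks(data)

# Disk 1 fails: rebuild its block from the two survivors and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# Same rebuild, but survivor 2 silently returns one wrong byte.
damaged = b"charl78X"
rebuilt_wrong = xor_blocks([data[0], damaged, parity])

print(rebuilt == data[1])        # clean rebuild matches the lost block
print(rebuilt_wrong == data[1])  # damaged survivor yields a wrong block, with no error raised
```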
DavidPresidentCommented:
you should read this article I wrote, lotsa good stuff, and they awarded it with the "coveted" editor's choice.  Good tips

http://www.experts-exchange.com/articles/Storage/Misc/Disk-drive-reliability-overview.html
SemperWiFiCommented:
@ dlethe - I come back to the thread and you have already written everything I would have said. Funny funny!

@ All - I can't believe the data wasn't completely hosed by chkdsk!!! Jerkson I hope you appreciate just how lucky you are. WOW! I think you might consider running out and purchasing a lotto ticket fast!

SO very happy you got your data back! Cheers! Glad we could help.
DavidPresidentCommented:
@ Semper - great minds ...
@ Jerkson - don't wait for a lottery, catch the next flight to vegas :)
JerksonAuthor Commented:
LOL. Overall I have found approx 15% corruption. I'm fine with that :)

I have backed it all up to scratch disks and when I get back I will re-create the array and dump it all back.

Oh, and I bought a lottery ticket. 15mil here I come!

Hey dlethe, I read your article and you even referred to me in it. I was that member that thanked you profusely for forcing me to read the release notes prior to my last controller firmware upgrade!! That was back in Sept http://www.experts-exchange.com/Storage/Storage_Technology/Q_24762729.html

When I re-create the array, what stripe size should I use? The largest my controller supports is 1024KB. The array will be 8x1.5 TB RAID 6.


Thanks guys!
DavidPresidentCommented:
Small world :)


If you are doing a lot of database work, i.e. SQL Server, then 64KB, because that is the size of the I/Os that the app generates. Make sure NTFS is also set for 64KB, so it is all 1:1.
However, I noticed a headache. You have a 6+2 config, so no matter what, you will never be efficient. With a 64KB stripe size, when the O/S writes 16KB, the RAID has to write 6 data disks. Optimal would be a power of 2 for usable disks, e.g. 4+2. But in the grand scheme of things, unless your system just runs benchmarks, a 6+2 won't be that bad.
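
The alignment point can be checked with a little arithmetic. The Python sketch below assumes that "stripe size" here means the per-disk chunk, so a full stripe is the chunk size times the number of data disks. It also assumes that only writes covering whole, aligned full stripes avoid a parity read-modify-write. Both are general RAID assumptions, not statements about how the Adaptec firmware behaves, and the function names are hypothetical.

```python
def full_stripe_kib(data_disks, chunk_kib):
    """Amount of user data in one full stripe, in KiB."""
    return data_disks * chunk_kib


def is_full_stripe_write(offset_kib, length_kib, data_disks, chunk_kib):
    """True when a write starts on a stripe boundary and covers whole stripes."""
    width = full_stripe_kib(data_disks, chunk_kib)
    return length_kib > 0 and offset_kib % width == 0 and length_kib % width == 0


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


for data_disks in (4, 6):
    width = full_stripe_kib(data_disks, 64)
    aligned = is_full_stripe_write(0, 1024, data_disks, 64)
    print(data_disks, width, is_power_of_two(data_disks), aligned)
```

With 64KB chunks, 4 data disks give a 256KB full stripe that divides a 1024KB write evenly, while 6 data disks give 384KB, which does not. That is the inefficiency of the 6+2 layout being described.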
JerksonAuthor Commented:
This is used as a media streaming fileserver on the LAN. Most files are >1 GB.

So always make it 1:1? Isn't a larger stripe better for files of this size?
DavidPresidentCommented:
1:1 is optimal, as there are no unnecessary extra writes. This holds true regardless of the stripe size. Since you are running a streaming media server, the correct answer is also a function of the cache size in the disks and the max # of simultaneous threads. (I did some modeling with SGI years ago, so I tend to look at this deeply.) But since you don't have control over the drivers and the application source code, the best you can get is a generalization. I would go with 64K if "a lot" of simultaneous streams are the norm, and increase the size if there are "fewer". To nail the answer, you will just have to set up real-world testing. 64KB will provide the most consistent, even flow with more concurrent streams.
SemperWiFiCommented:
15% isn't bad at all...good for you!

Ahh... RAID tweaking now :-) Not enough to add at this point for the typing to be useful. Sad I missed it, I suppose I could say I missed out on it because I didn't get back here soon enough or just say it's all dlethe's fault. I'm going with it is ALL dlethe's fault LOL

@ Jerkson - Once again, very glad you got your data back! I know how it can be, we've all lost data at some point. Very very good for you!!!!
JerksonAuthor Commented:
I guess this question is resolved. If I have problems during my raid rebuild / recreate I will post a new question. Thanks very much!