RAID1 Drive Fail. Replaced Drive. Something's still wrong.

Windows 2003 server with 6 disks set up in RAID1. In the middle of the workday yesterday it suddenly went down. I looked at the monitor... flashing cursor on a black screen.

I rebooted. While booting, the computer displays some info about the RAID volumes, and this time it showed a failed drive error or something like that. The failed drive was a member of the boot volume. I decided to replace both drives in the failed array because it's always been my understanding that having two identical hard drives is best (and they're small SATA drives, so cheap). So I popped in both drives, went to the RAID manager and set up a new RAID1 volume using the two blank disks. Then, using a separate computer, I cloned the original, working drive to one of the blanks that I had just set up in RAID1. After cloning, I popped the new disk back in the server (with its mate still attached as well). The computer boots and passes the RAID status check, but then just gets "Error Loading Operating System". I wasted 2 hours trying to get to the Recovery Console so I could run FIXMBR, then finally gave up on that and decided to try booting with all but one boot drive unplugged.

Worked... got into Windows. Upon logging in I got a few errors about corrupt log files (random log files too, like one that belongs to the OS and one that is from a phone monitoring program we have installed). It suggested I run chkdsk and I was nervous, so I obeyed. Ran chkdsk /r, rebooted, made sure the disk check started and then left for the night.

I came back in this morning and the server was once again stuck at a black screen with a blinking cursor. Rebooted and watched for RAID errors. The boot volume was failing (not just one disk but the volume itself). I went into the config menu, was prompted to fix the error... fixed the error... rebooted. Got into Windows.

I installed the Windows-based "Intel Matrix Storage Console" so I could get some info. See attached...

Also, no idea why there's suddenly a missing drive? That should be totally unrelated. Yes, I already made sure it's connected.

Coworkers are here so I can't really work on it until tomorrow, but in the meantime I'd LOVE some help. I truly apologize for the length and sloppiness of this post. I'm not sure exactly which details are most important, so I gave them all.

Thanks a million!!
jpfultonAuthor Commented:
I got it. I pretty much did exactly what I already discussed in this thread. Once everything appeared right and without error I tested all of the drives by pulling data cables one at a time while the server was running. Everything is A-Ok! According to everyone here it looks like I got lucky.
You screwed up. You don't clone RAID drives, because each member carries metadata, which includes information about all the members of the array, boot order, serial numbers and so on. So the metadata has the serial number of the disk you got rid of.

There is no elegant way to repair this other than to use a scratch drive to strip off the metadata, boot from a non-RAID clone, back up, build the RAID, and restore onto the RAID.
dlethe is right.

When the 2nd disk of your RAID 1 mirror failed, you should have replaced only the failed drive to see whether the rebuild would a) complete successfully, and b) solve your initial issue.

Sorry for your misfortune.
jpfultonAuthor Commented:
Is there no way for me to go back to the original 1st disk of the mirror and then rebuild on to a 2nd hard drive? It seems like the metadata on that 1st drive would still all be good, no?
Hi jpfulton, you made some mistakes by following the wrong drive failure procedure.

That said, since the operating system is still alive, you should have two paths to reach your goal. It would be very useful for all of us to know the exact topology of your system, so that we can give better help for your specific case.
I can see in the picture that you have three arrays of disks, each composed of two physical disks.

Array_000: The ROOT volume shows a failed hard drive (failed but present). Both drives are made by SEAGATE. I suppose this is your faulty disk.
Array_001: The storageRAID volume shows a missing disk, which should be a Western Digital like its twin drive on Port 1.
Array_002: Currently working. I suppose it provides extra storage, as described.

The drive you highlighted is failing but present.

Can you please confirm the topology of your system, correcting my description and adding detail where needed? Consider that your second SEAGATE drive (port 2) is not actually a member of your ROOT volume, since I read you never reconstruct the array.

Regarding the metadata for that volume, it's corrupted because the physical drives are different.

Try booting with only the working disk of Array_000 inserted and check the integrity of the array in the Intel storage console (the secondary disk should show as "not present"). Then hot-add the secondary disk and try to rebuild the array; this will also update the metadata for that volume.

jpfultonAuthor Commented:
Array_0000 - Windows knows this as drive C. This is where Windows is installed. The disk marked Port 0 is the one I cloned to. The one marked Port 2 is essentially empty... That's what I want to rebuild to. Both of the disks that are now in the ROOT volume are brand new. The old disks are sitting on my desk.

Array_0001 - Windows knows this as drive R. This is a storage drive only. I have no idea why it says missing hard drive. I just grabbed the serial for it and I'm checking into it as we speak. I had no problems with this at all yesterday. It's somewhat possible I accidentally unplugged it, but I already checked for that once.

Array_0002 - Windows knows this as drive S. This is extra storage. I do nightly backups of drives C and R... the backups are stored here until they are copied to external storage.

"Consider that your second SEAGATE drive (port 2) is not actually a member of your ROOT volume, since I read you never reconstruct the array."

Maybe I'm misunderstanding this, but I actually DID reconstruct the array with the brand new drives... however then I cloned onto one of them using an image of the old drive... perhaps that completely negates the reconstructed array and maybe that's what you mean?
You reconstructed a range of physical blocks, and just got REALLY lucky that the Intel controller is pretty stupid. Other controllers would have barfed all over what you did and never would have booted in the first place.

So if you have 50 blocks of metadata and the disk is 1000 blocks total (just to use easy numbers to illustrate), then your boot block is at physical block #50, the partitioning thinks the disk is 950 blocks total, and the highest block number is 949 (everything starts at zero).

You need to shift it over 50 blocks to the left, and you will end up, in this case, with 50 blocks left over at the end of the drive.
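To make the arithmetic above concrete, here is a minimal Python sketch using the same easy numbers (50 metadata blocks, 1000 blocks total). The function names are made up for illustration, and the layout (metadata at the front of the disk) is just the simplified picture from this post; where a real controller keeps its metadata, and how much, depends on the controller.

```python
# Illustrative numbers from the post above, not real controller values.
METADATA_BLOCKS = 50
TOTAL_BLOCKS = 1000


def usable_blocks(total=TOTAL_BLOCKS, metadata=METADATA_BLOCKS):
    """Number of blocks the partitioning sees once metadata is set aside."""
    return total - metadata


def physical_block(logical, total=TOTAL_BLOCKS, metadata=METADATA_BLOCKS):
    """Map a logical block number (as the OS sees it) to a physical block."""
    if not 0 <= logical < usable_blocks(total, metadata):
        raise ValueError("logical block out of range")
    return logical + metadata


print(usable_blocks())      # 950
print(physical_block(0))    # 50  (the boot block)
print(physical_block(949))  # 999 (highest logical block)
```

"Shifting left by 50" means undoing that mapping, so that logical block 0 lands on physical block 0 again.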

See, it is non-trivial unless you can purchase something like runtime.org's RAID Reconstructor, or use Linux + dd and do a raw bit copy with an offset.
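For anyone wondering what a "raw copy with offset" amounts to, here is a rough Python sketch of the idea, roughly what dd does when told to skip blocks at the start of its input. The function name, the paths and the 512-byte block size are assumptions for illustration only. Try it on image files, not on a live disk, and only with a backup in hand.

```python
import shutil


def copy_with_offset(src_path, dst_path, skip_blocks, block_size=512):
    """Copy src to dst, skipping the first skip_blocks blocks of src.

    Returns the number of bytes written. block_size=512 is an assumption;
    use whatever sector size the disk actually reports.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Jump past the leading blocks (the metadata in the example above).
        src.seek(skip_blocks * block_size)
        # Stream the rest across in 1 MiB chunks.
        shutil.copyfileobj(src, dst, 1024 * 1024)
        return dst.tell()
```

With the example numbers, `copy_with_offset("old.img", "new.img", skip_blocks=50)` would produce an image whose first block is the old physical block #50 (the file names here are hypothetical).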

jpfultonAuthor Commented:
dlethe, thanks for the reply. I read through it and I'm going to try to understand it, but first let me mention this... not sure if this is relevant or if I've made progress... see attachment...

I right-clicked the drive in Array_0000 Port 2 and selected "Mark As Normal"... it gave me an option to rebuild. I clicked Okay. So it's rebuilding right now... 56% complete, 36 minutes left.

As far as Port 3 in Array_0001, I located the drive, unplugged it, pulled it out, plugged it into a new SATA power connector and reconnected it.

So, am I good now probably?
You very well could be. The firmware is overwriting that disk and has (presumably) written correct metadata on it.
Now there is no way of knowing that this is correct, though it likely is... but if this were my computer, I would take the opportunity to do a full backup while it is still online. Then, once the backup is done and the rebuild has completed, shut it down.
 - Make sure both disks are set up in the boot path properly in the BIOS. The disk you are currently booted from is primary; the one it rebuilt is secondary.
 - Fix it if not correct, then boot. Then run the Intel program and make sure it doesn't see any problems.

If it comes up and sees no problems, then it would be prudent to yank the primary disk while it is hot and make sure that you lose no data and the system stays online. If you took a full backup and know for a fact that you can do a restore, then you can safely test. If you have never tested a restore, then I would not risk it; I would accept that there is a small possibility that the RAID can't handle a drive failure, and plan a weekend of testing sometime in the future.
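On the "know for a fact that you can do a restore" point: one way to gain some confidence is to restore the backup to a scratch location and compare it against the live data. Below is a minimal Python sketch of that comparison using SHA-256 checksums. The function names and directory arguments are hypothetical, and it only compares file contents (not permissions, open files or system state), so treat it as a sanity check rather than proof of a bootable restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(original, restored):
    """List files under `original` that are missing or differ in `restored`."""
    original, restored = Path(original), Path(restored)
    problems = []
    for src in sorted(original.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original)
        dst = restored / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list from `compare_trees(live_dir, restored_dir)` means every file found in the original tree has a byte-identical copy in the restored one.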
jpfultonAuthor Commented:
Sorry... no points awarded, because I didn't get any info here that I needed, short of "you screwed up".
Question has a verified solution.
