jpfulton

asked on

RAID1 Drive Fail. Replaced Drive. Something's still wrong.

Windows 2003 server with 6 disks set up in RAID1. In the middle of the workday yesterday it suddenly went down. I looked at the monitor... flashing cursor on black screen.

I rebooted. While booting up, the computer displays some info about the RAID volumes, and this time it showed a failed drive error or something like that. The failed drive was a member of the boot volume. I decided to replace both drives in the failed array because it's always been my understanding that having two identical hard drives is best (and they're small SATA drives, so cheap). So I popped in both drives, went to the RAID manager and set up a new volume in RAID1 using the two blank disks. Then, using a separate computer, I cloned the original, working drive to one of the blanks that I had just set up in RAID1. After cloning, I popped the new disk back in the server (with its mate still attached as well). Computer boots, passes the RAID status check but then just gets "Error Loading Operating System". I wasted 2 hours trying to get to the recovery console so I could run FIXMBR, then I finally gave up on that and decided to try booting up with all but one boot drive unplugged.

Worked... got into Windows. Upon logging in I get a few errors about corrupt log files (random log files too, like one that is resident to the OS and one that is from a phone monitoring program we have installed). It suggested I run chkdsk and I was nervous so I obeyed. Ran chkdsk /r, rebooted, made sure disk check started and then left for the night.

I came back in this morning and the server was once again stuck at a black screen with a blinking cursor. Rebooted and watched for RAID errors. The boot volume was failing (not just one disk but the volume itself). Went into the config menu, was prompted to fix the error... fixed the error... rebooted. Got into Windows.

I installed the Windows-based "Intel Matrix Storage Console" so I could get some info. See attached...

Also, no idea why there's suddenly a missing drive????? That should be totally unrelated. Yes, I already made sure it's connected.

Coworkers are here so I can't really work on it until tomorrow but in the meantime I'd LOVE some help. I truly apologize for length and sloppiness of this post. I'm not sure exactly what details are most important, so I gave them all.

Thanks a million!!
imsc.JPG
David

You screwed up. You don't clone RAID drives, because each one carries metadata that includes information about all the members of the array, boot order, serial numbers and so on. So the metadata has the serial # of the disk you got rid of.

No elegant way to repair this other than to use a scratch drive to strip off the metadata, boot from a non-RAID clone, back up, build the RAID, and restore onto the RAID.
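
To make the serial-number point concrete, here is a toy sketch in Python. It is not Intel's real metadata format; the `ArrayMetadata` record, the `unknown_members` helper and the serial strings are all made up for illustration. It only shows the general idea that metadata written for one set of disks will not list the serials of replacement disks, even if the data on them is a perfect clone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayMetadata:
    """Toy stand-in for on-disk RAID metadata (NOT Intel's real format)."""
    volume_name: str
    member_serials: tuple


def unknown_members(metadata, attached_serials):
    """Return serials of attached disks that the metadata does not list."""
    return [s for s in attached_serials if s not in metadata.member_serials]


# Metadata as written when the original mirror was built (serials are made up).
original = ArrayMetadata("ROOT", ("OLD-A", "OLD-B"))

# A clone carries the old metadata along, but the new disks report new serials.
print(unknown_members(original, ["NEW-A", "NEW-B"]))
```

How a given controller reacts to such a mismatch is firmware-specific, so treat this purely as a picture of the problem, not a diagnostic.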
dlethe is right.

When the 2nd disk of your RAID 1 mirror failed, you should have replaced only the failed drive to see if rebuild would have a) completed successfully, and b) solved your initial issue.

Sorry for your misfortune.
jpfulton

ASKER

Is there no way for me to go back to the original 1st disk of the mirror and then rebuild on to a 2nd hard drive? It seems like the metadata on that 1st drive would still all be good, no?
Hi jpfulton, you did make some mistakes by following the wrong procedure for a drive failure.

That said, since the operating system is still alive, you should have two paths to reach your goal. It would be very useful for all of us to know the exact topology of your system, so we can give better help for your specific case.
From the picture I can see you have three arrays of disks, each made up of two physical disks.

Array_000 : The ROOT volume shows a failed hard drive (failed but present); both drives are made by Seagate. I suppose this is your faulty disk.
Array_001 : The storageRAID volume shows a missing disk, which should be a Western Digital like its twin drive on Port 1.
Array_002 : Currently working; I suppose it provides extra storage as described.

The drive you highlighted is failing but present.

Can you please let me know the correct topology of your system, confirming and detailing my message? Consider that your second SEAGATE drive (port 2) is not actually a member of your ROOT volume, since I read that you never reconstructed the array.

As for the metadata on that volume, it's corrupted because it refers to different physical drives.

Try booting with only the working disk for Array_000 inserted and check the integrity of the array in the Intel storage console (the secondary disk should show as "not present"). Then hot-add the secondary disk and try to reconstruct the array; this will also update the metadata for that volume.

Topology:
Array_0000 - Windows knows this as drive C. This is where Windows is installed. The disk marked Port 0 is the one I cloned to. The one marked Port 2 is essentially empty... That's what I want to rebuild to. Both of the disks that are now in the ROOT volume are brand new. The old disks are sitting on my desk.

Array_0001 - Windows knows this as drive R. This is a storage drive only. I have no idea why it says missing hard drive. I just grabbed the serial for it and I'm checking into it as we speak. I had no problems with this at all yesterday. It's somewhat possible I accidentally unplugged it, but I already checked for that once.

Array_0002 - Windows knows this as drive S. This is extra storage. I do nightly backups of drives C and R... the backups are stored here until they are copied to external storage.

"Consider that your second SEAGATE drive (port 2) is not actually a member of your ROOT volume, since I read that you never reconstructed the array."

Maybe I'm misunderstanding this, but I actually DID reconstruct the array with the brand new drives... however then I cloned onto one of them using an image of the old drive... perhaps that completely negates the reconstructed array and maybe that's what you mean?
You reconstructed a range of physical blocks, and just got REALLY lucky that the Intel controller is pretty stupid. Other controllers would have barfed all over what you did and never would have booted in the first place.

So if you have 50 blocks of metadata, and the disk is 1000 blocks total (just to use easy numbers to illustrate), then your boot block is at physical block #50, the partitioning thinks the disk is 950 blocks total, and the highest block # is 949 (everything starts at zero).

You need to shift it over 50 blocks to the left, and you will end up, in this case, with 50 unused blocks at the end of the drive.

See, it is nontrivial unless you can purchase something like runtime.org's RAID Reconstructor, or use Linux + dd and do a raw bit copy with an offset.
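
For what it's worth, the "raw copy with offset" idea looks roughly like this in Python. This is only a sketch on in-memory fake disks using the easy numbers from above (50 metadata blocks, 1000 blocks total); the 512-byte block size and the `copy_with_offset` helper are assumptions for illustration, and the real metadata size and location on an Intel array would have to be confirmed before touching an actual drive. It mirrors what `dd` does with its `bs=` and `skip=` operands.

```python
import io

BLOCK_SIZE = 512  # bytes per block; assumed, check the real sector size


def copy_with_offset(src, dst, skip_blocks, block_size=BLOCK_SIZE, chunk_blocks=2048):
    """Copy src to dst, skipping the first skip_blocks blocks of src.

    Returns the number of bytes written to dst.
    """
    src.seek(skip_blocks * block_size)
    written = 0
    while True:
        chunk = src.read(chunk_blocks * block_size)
        if not chunk:
            break
        dst.write(chunk)
        written += len(chunk)
    return written


# Fake 1000-block "disk": 50 blocks of metadata (M) followed by 950 of data (D).
total_blocks, metadata_blocks = 1000, 50
disk = io.BytesIO(
    b"M" * (metadata_blocks * BLOCK_SIZE)
    + b"D" * ((total_blocks - metadata_blocks) * BLOCK_SIZE)
)
out = io.BytesIO()
written = copy_with_offset(disk, out, skip_blocks=metadata_blocks)
print(written // BLOCK_SIZE)  # 950 data blocks, now starting at block 0
```

On a real system `src` and `dst` would be raw devices or image files opened in binary mode, and you would want to work on a scratch copy, never the only good disk.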




dlethe, thanks for the reply. I read through it and I'm going to work on understanding it, but first let me mention this... not sure if this is relevant or if I've made progress... see attachment...

I right clicked the drive in Array_0000 Port 2 and selected "Mark As Normal"... it gave me an option to rebuild. I clicked Okay. So it's rebuilding right now... 56% complete, 36 minutes left.

As far as Port 3 in Array_0001, I located the drive, unplugged it, pulled it out, plugged it into a different SATA power connector and reconnected it.

So, am I good now probably?
imsc-1.JPG
You very well could be. The firmware is overwriting that disk and has (presumably) written correct metadata on it.
Now, there is no way of knowing for sure that this is correct, though it likely is ... but if this were my computer, I would take the opportunity to do a full backup while it is still online. Then, once the backup is done and the rebuild has completed, shut it down and:
 - Make sure both disks are set up in the boot path properly in the BIOS. The disk you are currently booted from is primary; the one it rebuilt is secondary.
 - Fix it if not correct, then boot. Then run the Intel program and make sure it doesn't see any problems.

If it comes up and sees no problems, then it would be prudent to yank the primary disk while it is hot and make sure that you lose no data and the system stays online. If you took a full backup and know how to do a restore, and know for a fact that you do know how to do a restore, then you can safely test. If you have never tested a restore then I would not risk it, and would accept the fact that there is a small possibility that my RAID can't handle a drive failure and plan a weekend of testing sometime in the future.
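
One way to "know for a fact" that a restore works is to restore the backup to a scratch location and compare it against the live data. Below is a minimal sketch, assuming a plain file-level backup; the `file_digest` and `compare_trees` helpers are made up for this example and are not part of any backup product. It hashes every file under the original tree and reports anything missing or different in the restored copy.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(original_root, restored_root):
    """Return relative paths that are missing or differ in restored_root."""
    problems = []
    for dirpath, _dirs, files in os.walk(original_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, original_root)
            dst = os.path.join(restored_root, rel)
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                problems.append(rel)
    return sorted(problems)


# Tiny demo with temporary directories standing in for drive C and a restore.
with tempfile.TemporaryDirectory() as live, tempfile.TemporaryDirectory() as restored:
    for root in (live, restored):
        with open(os.path.join(root, "data.txt"), "w") as f:
            f.write("same")
    with open(os.path.join(live, "only_in_original.txt"), "w") as f:
        f.write("x")
    print(compare_trees(live, restored))  # ['only_in_original.txt']
```

Files that are open or changing on a running server (logs, databases) will legitimately differ, so an empty result is the ideal, not a guarantee you will see on a live system.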
ASKER CERTIFIED SOLUTION
jpfulton

Sorry... no points awarded because I didn't get any info here that I needed short of "you screwed up"