PurpleWine


Array Disk 0:0 Rebuild Failed

So I have a 2800 with three 73 GB drives in it. One died, so I pulled it out and replaced it with an identical drive. It tries the rebuild and fails, over and over, every few seconds. Is there something I need to do on the drive? I was under the impression I just needed to pull the old one and plug in the new one. Any ideas? The server is running fine, but I'd love to know I was back up to one extra drive. Thanks!!
PurpleWine

ASKER

Screen shot
array.jpg
ASKER CERTIFIED SOLUTION
David
P.S. Since you DID boot the O/S, the only course of action is manual repair of the known bad blocks (which your Windows-based array manager won't tell you about). So it is either a data recovery firm, the $90.00 smartmon-ux, or some other program that will selectively repair the bad block(s) so the recovery can complete.

Do something ... fast ... the next drive failure will result in 100% data loss.  If this was my computer I would turn it off until ready to proceed. Every second your system is booted could be the last one for the surviving disks.

These are the same surviving disks that likely came from the same engineering batch and had the exact same I/O load, temperature exposure, duty cycle, age, etc. In other words, the other 2 disks are more likely to die, because whatever environmental issues caused the first drive failure affect the other drives in exactly the same way.
So this server is running Server 03 and Exchange 03, and is a file server. All I've done is pull the drive and replace it (nothing in the BIOS, as I did this all with the server running). Email is working fine, we've come across no bad files, and the backups all run fine.

I'm out of my league here with RAID, so I could be wrong, but I don't see why we are at the point of sending out to a data recovery company....


And what is a JBOD controller?  (I'll check and post what controller is in the server)
It is a critical point, because if you cannot get the drive to rebuild, it means the array is damaged and the only way to fix it is to wipe it out and start over. There is, however, one thing you can try. Before doing anything, make sure you have a backup while the data is accessible. Try to rebuild the drive from the BIOS utility for the controller (CTRL-M>Objects>Physical Drives>Enter/Rebuild); sometimes it works without the additional load of the OS. Switch the RAID controller to SCSI mode in the BIOS (you will see a scary message, but continue) and hit CTRL-A to enter the SCSI controller. In Disk Utilities, run a Verify on all 3 drives. Once it completes on all of them, attempt the rebuild again.

Please make sure you have a backup of your important stuff while you can. They are right: you could lose this at any time. If it goes down permanently with no backup, your options are data recovery and downtime, or losing the data on the server and starting over.

JBOD = Just a Bunch Of Disks, but I'm not sure that is necessary.
Clarification: If the drive doesn't rebuild while in the CTRL-M utility outside of Windows, THEN you can attempt the Verify in SCSI mode.
Respectfully, PowerEdgeTech is profoundly wrong. Yes, most people do not know how to fix this. I do.

I've been working for RAID manufacturers as a developer since the '90s and recover arrays with these symptoms all the time. All you need to do is take the advice I posted earlier in the thread. You just need to run the smartmon-ux -verify command, which will give you a list of troublesome physical block numbers and what is wrong with them. Then you use the software to remap just the offending blocks. The array will then rebuild.

Piece o' cake.
(The above is somewhat simplified. There are caveats: you need to make sure the GLIST isn't maxed out so you have spare blocks, and it assumes the blocks are merely unreadable, rather than something like munged drive firmware or massive media failure ... but the symptoms do not indicate this.)
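For anyone following along, here is a rough sketch of what a verify pass conceptually does: read every block and note which ones error out. This is not how smartmon-ux is implemented, and it does not decode sense data or remap anything. The names `find_unreadable_blocks` and `make_simulated_disk` are made up for illustration, and the "disk" is simulated in memory so the sketch is safe to run anywhere.

```python
def find_unreadable_blocks(read_block, total_blocks):
    """Try to read every block; return the block numbers that fail."""
    bad = []
    for lba in range(total_blocks):
        try:
            read_block(lba)
        except OSError:
            # A real tool would also record *why* the read failed.
            bad.append(lba)
    return bad


def make_simulated_disk(bad_lbas, block_size=512):
    """Hypothetical stand-in for a drive: listed blocks raise an I/O error."""
    bad_set = set(bad_lbas)

    def read_block(lba):
        if lba in bad_set:
            raise OSError(5, "Input/output error")  # errno 5 is EIO
        return bytes(block_size)

    return read_block


# Simulate a surviving disk with two unreadable blocks.
surviving_disk = make_simulated_disk([17, 204])
bad_blocks = find_unreadable_blocks(surviving_disk, 1024)
print(bad_blocks)  # [17, 204]
```

The point of the list is that only those few blocks need attention; the rest of the disk is readable and does not have to be touched.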
I obviously do not have the experience working with this that dlethe has, but the verify performs a very similar function and has worked very well for me in the past.  That said, I gladly defer to those with greater experience.  However, since the data is accessible, I think obtaining a backup is more prudent than preparing to send drives to data recovery at this point :)
I have a complete backup. They are still running just fine. (thankfully:)

Working on getting smartmon-ux, but they don't make it easy to purchase on their website :)
Sorry if I seemed harsh PowerEdgeTech, we're all here to contribute and even I learn things on these forums :)

Anyway the problem with the verify in the BIOS is that it does not give you the sense ASC ASCQ bytes, nor does it decode them, and you need to know specifics of the error before trying to repair it.  But I admit, I am one of those people who shoot for 100% recovery and even the "failed" disk drive may be good enough to use to recover the lost stripe using XOR and by examining the neighboring blocks in the stripe and doing other tricks that I do not wish to disclose due to trade secrets.

In any event, my purpose here is to apologize if I wrote anything that detracts from your valuable contributions to Experts Exchange. It was not my intent if I came off that way.
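The XOR recovery mentioned above can be sketched in a few lines. This assumes a RAID 5 style layout (the thread never states the RAID level, so that is an assumption) where each stripe holds data blocks plus one parity block that is the XOR of the data. The helper name `xor_blocks` and the sample block contents are made up for illustration; none of the undisclosed tricks are shown here.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# One hypothetical stripe across three disks: two data blocks and parity.
d0 = b"EXCHANGE"
d1 = b"FILESRVR"
parity = xor_blocks([d0, d1])

# If the disk holding d1 is replaced, its block is rebuilt from the others.
rebuilt_d1 = xor_blocks([d0, parity])
print(rebuilt_d1 == d1)  # True
```

This also shows why a rebuild stalls on a bad block: rebuilding `d1` needs both `d0` and `parity` for that stripe, so a single unreadable block on either surviving disk leaves that stripe with nothing to XOR against.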
Not only can I not find a way to buy the software on their site, their phone has been disconnected..... Not a good start. Any other software recommendations, or a known place to buy this one?
Don't know what is up with the phone, but I got an email saying the writer of the software will be giving me a call to ask a few questions and get me set up (that's cool, I could use a little hand holding :). I'll let you know how things progress...

Thanks for the help so far guys!
OK, so David (the software writer) at SANtools is helping me out with this. I'm putting an Adaptec SCSI card in a workstation, plugging in the drives, and running the verify. He gave me his number so I can get help when I'm ready to run this. Not sure when I'll do it (hopefully tonight; I need to acquire a SCSI card and adapters), but I will certainly keep you posted. Thanks again.
No harm done. I was simply adding a solution I have encountered over the years. I have added your suggestion to my list of things to become familiar with to broaden my horizons. Thanks and good luck!
Shespawn

There are a few other options I have run into to repair this problem.
First, it is quite possible you replaced a bad drive with another bad drive. It does happen.
There is a possibility that your controller needs an update.
Finally, my favorite.... since you have a backup, try re-seating the drives. Make sure you number them so you know which drive came from where, then pop 'em out and put them right back where they belong. You can of course also try to re-seat your controller card when you do this.
And of course, CHKDSK is your friend :)
Sorry, a number of problems got in the way of this. But I now have a system built with the SANtools software installed and registered, and a SCSI card that sees the hard drives individually. I think I am set to give this a try again this weekend....
SANtools was great at taking care of the bad blocks and getting the disks to rebuild. Thanks!