Solved

Array Disk 0:0 Rebuild Failed

Posted on 2010-01-07
18
Medium Priority
?
512 Views
Last Modified: 2012-05-08
So I have a 2800 with three 73 GB drives in it. One died, so I pulled it out and replaced it with an identical drive. It tries the rebuild and fails, tries and fails, tries and fails, every few seconds. Is there something I need to do on the drive? I was under the impression I just needed to pull the old one and plug in the new one... Any ideas? The server is running fine, but I'd love to know I was back up to one extra drive... Thanks!!
Question by:PurpleWine
18 Comments
 
LVL 3

Author Comment

by:PurpleWine
ID: 26201006
Screen shot
array.jpg
0
 
LVL 47

Accepted Solution

by:
David earned 2000 total points
ID: 26201093
What is the error message?

I'll take an educated guess in that you have an unrecoverable read error, and it gives up.  Your raid controller can't handle this situation.  Almost none do. That is why moving forward you need to run regular consistency checks to fix such things.

If above is the case then this first means that you have 1KB data loss for each unrecoverable read.

The only solutions I am aware of:
 - A data recovery firm, but it will cost you a minimum of $2000 because there are 3 disks.
 - Using something like santools' smartmon-ux to scan and remap just the bad blocks (and document them) on the 2 surviving disks. This will clean it up so you can get a full rebuild. You'll need to plug the disks into a JBOD SCSI controller to get it to work.

(Note: if the problem is that the REPLACEMENT drive is bad, then obviously just use a different one, but this assumes your RAID engine is smart enough to start the rebuild on a different disk. If you ever tried to run the O/S, and at any time did this outside of the BIOS, then don't do this because it will make things worse.)
0
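Editor's sketch, for readers newer to RAID: in a three-disk RAID 5, each block on the replacement drive is recomputed by XOR-ing the matching blocks on the two surviving drives. The toy Python model below shows why one unreadable block on a survivor is enough to stop a rebuild. It is not the controller's actual firmware logic; `rebuild_disk`, `UnreadableBlock`, and the use of `None` to mark an unreadable block are made up for illustration.

```python
class UnreadableBlock(Exception):
    """Hypothetical error: a surviving disk could not return a block."""


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_disk(survivor_a, survivor_b):
    """Recompute every block of the missing disk from the two survivors.

    Each survivor is a list of bytes objects; None stands in for a block
    the drive cannot read. This toy rebuild gives up at the first
    unreadable block instead of skipping it.
    """
    rebuilt = []
    for lba, (a, b) in enumerate(zip(survivor_a, survivor_b)):
        if a is None or b is None:
            raise UnreadableBlock(f"cannot rebuild block {lba}")
        rebuilt.append(xor_blocks(a, b))
    return rebuilt


# Both survivors fully readable: the rebuild completes.
rebuilt = rebuild_disk([b"\x01\x02", b"\x03\x04"], [b"\x10\x20", b"\x30\x40"])

# One unreadable block on a survivor: the rebuild aborts.
try:
    rebuild_disk([b"\x01\x02", None], [b"\x10\x20", b"\x30\x40"])
except UnreadableBlock as err:
    print(err)
```

In this model the data on the running array is still reachable everywhere except the one bad block, which matches a server that "runs fine" while the rebuild keeps failing.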
 
LVL 47

Expert Comment

by:David
ID: 26201158
P.S., since you DID boot the O/S, the only course of action is manual repair of the known bad blocks (which your Windows-based array manager won't tell you about), so it is either a data recovery firm or the $90.00 smartmon-ux, or some other program that will selectively repair the block(s) that are bad so the recovery can complete.

Do something ... fast ... the next drive failure will result in 100% data loss.  If this was my computer I would turn it off until ready to proceed. Every second your system is booted could be the last one for the surviving disks.

The same surviving disks that were likely the same engineering batch, had the exact same I/O load, same temperature exposure, same duty cycle, same age, etc..  In other words, the other 2 disks are more likely to die because whatever environmental issues that caused the first drive failure affect the other drives in exactly the same way.
0

 
LVL 3

Author Comment

by:PurpleWine
ID: 26201600
So this server is running server 03, Exchange 03, and is a file server. All I've done is pulled the drive, and replaced it (nothing in BIOS as I did this all with the server running)  Email is working fine, we've come across no bad files, and the backups all run fine.

 While I'm out of my league here with RAID, so I could be wrong, but I don't see why we are at the point of sending out to a data recovery company....


And what is a JBOD controller?  (I'll check and post what controller is in the server)
0
 
LVL 33

Expert Comment

by:PowerEdgeTech
ID: 26201799
It is a critical point, because if you cannot get the drive to rebuild, it means the array is damaged and the only way to fix it is to wipe it out and start over.  There is, however, one thing you can try.  What you will want to do before doing anything is make sure you have a backup while the data is accessible.  Try to rebuild the drive from the BIOS utility for the controller (CTRL-M>Objects>Physical Drives>Enter/Rebuild) - sometimes it works without the additional load of the OS.  Switch the RAID controller to SCSI mode in the BIOS - you will see a scary message, but continue and hit CTRL-A to enter the SCSI controller.  In Disk Utilities, run a Verify on all 3 drives.  Once complete on all of them, then attempt the rebuild again.

Please make sure you have a backup of your important stuff while you can - they are right - you could lose this at any time.  If it goes down permanently with no backup, your options are Data Recovery/Downtime or lose the data on the server and start over.

JBOD = Just a Bunch Of Disk, but I'm not sure that is necessary.
0
 
LVL 33

Expert Comment

by:PowerEdgeTech
ID: 26201816
Clarification:  If the drive doesn't rebuild while in the CTRL-M utility outside of Windows, THEN you can attempt the Verify in SCSI mode.
0
 
LVL 47

Expert Comment

by:David
ID: 26201875
Respectfully, PowerEdgeTech is profoundly wrong.  Yes, most people do not know how to fix this.  I do.

I've been working for RAID manufacturers as a developer since the '90s and recover arrays all the time with these symptoms.  All you need to do is take advice I posted earlier in the thread. You just need to run the smartmon-ux -verify command which will give you a list of troublesome physical block numbers, and what is wrong with them.   Then you use the software to remap just the offending blocks.  The array will then rebuild.

Piece 'o cake.
0
 
LVL 47

Expert Comment

by:David
ID: 26201897
(Above is somewhat simplified. There are caveats: you need to make sure that the GLIST isn't maxed out so you have spare blocks, and it assumes that the blocks are simply unreadable, rather than something like munged drive firmware or massive media failure ... but the symptoms do not indicate this.)
0
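Editor's sketch: the "verify" step described above boils down to reading every block on each surviving disk and recording which ones fail. The Python below is a minimal, read-only illustration of that idea, assuming a Unix-like system where the disk (or a test image file) can be opened by path. It is not how smartmon-ux works internally, which the thread does not describe; a real tool talks SCSI to the drive, decodes the sense data, and can remap blocks, none of which happens here. The name `find_unreadable_blocks` and the 512-byte default block size are assumptions.

```python
import os


def find_unreadable_blocks(path, block_size=512):
    """Return the indices of blocks that raise an I/O error when read.

    Read-only: nothing is repaired or remapped. Reads go through the
    operating system, so caching and read-ahead can hide or blur errors.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        total_blocks = (size + block_size - 1) // block_size
        for index in range(total_blocks):
            try:
                os.pread(fd, block_size, index * block_size)
            except OSError:
                # The read failed; note the block number and keep scanning.
                bad.append(index)
    finally:
        os.close(fd)
    return bad
```

The point of keeping the list rather than stopping at the first failure is the same as in the advice above: you want every troublesome block documented before anything is repaired.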
 
LVL 33

Expert Comment

by:PowerEdgeTech
ID: 26202066
I obviously do not have the experience working with this that dlethe has, but the verify performs a very similar function and has worked very well for me in the past.  That said, I gladly defer to those with greater experience.  However, since the data is accessible, I think obtaining a backup is more prudent than preparing to send drives to data recovery at this point :)
0
 
LVL 3

Author Comment

by:PurpleWine
ID: 26202097
I have a complete backup. They are still running just fine. (thankfully:)

Working on getting smartmon-ux, but they don't make it easy to purchase on their website :)
0
 
LVL 47

Expert Comment

by:David
ID: 26202172
Sorry if I seemed harsh PowerEdgeTech, we're all here to contribute and even I learn things on these forums :)

Anyway the problem with the verify in the BIOS is that it does not give you the sense ASC ASCQ bytes, nor does it decode them, and you need to know specifics of the error before trying to repair it.  But I admit, I am one of those people who shoot for 100% recovery and even the "failed" disk drive may be good enough to use to recover the lost stripe using XOR and by examining the neighboring blocks in the stripe and doing other tricks that I do not wish to disclose due to trade secrets.

In any event, purpose is to apologize if I wrote anything that detracts from your valuable contributions to experts-exchange. It was not my intent if I came off that way.
0
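Editor's sketch of the XOR idea mentioned above: in RAID 5 the parity block of a stripe is the XOR of its data blocks, so any single missing block in a stripe equals the XOR of the blocks that are still readable. That is why a "failed" drive can still be useful: if it can return its block for the one stripe where a survivor has a bad block, the stripe is recoverable. This Python example shows only the textbook arithmetic, not the undisclosed techniques referred to in the comment; `recover_missing` is a hypothetical helper name.

```python
from functools import reduce


def recover_missing(stripe):
    """Return the one missing block of a RAID 5 stripe.

    `stripe` is a list of equal-length bytes objects (data blocks plus
    parity) with exactly one entry set to None for the unreadable block.
    """
    missing = [i for i, blk in enumerate(stripe) if blk is None]
    if len(missing) != 1:
        raise ValueError("XOR can recover exactly one missing block per stripe")
    present = [blk for blk in stripe if blk is not None]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), present)


# Three-disk stripe: two data blocks and their parity.
d0 = b"\xaa\x55"
d1 = b"\x0f\xf0"
parity = bytes(x ^ y for x, y in zip(d0, d1))

# The block on the second disk is unreadable; rebuild it from the others.
recovered = recover_missing([d0, None, parity])
```

With two blocks missing from the same stripe, XOR alone cannot help, which is the case where the old drive's copy of that stripe becomes worth trying to read.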
 
LVL 3

Author Comment

by:PurpleWine
ID: 26202296
Not only can I not find a way to buy the software on their site, their phone has been disconnected..... Not a good start. Any other software recommendations, or a known place to buy this one?
0
 
LVL 3

Author Comment

by:PurpleWine
ID: 26202449
Don't know what is up with the phone, but got an email that the writer of the software will be giving me a call to ask a few questions and get me setup ( that's cool, I could use a little hand holding :)  I'll let you know how things progress...

Thanks for the help so far guys!
0
 
LVL 3

Author Comment

by:PurpleWine
ID: 26203664
Ok, so David (the software writer) at SANtools is helping me out with this. I'm putting an adaptec scsi card in a workstation, plugging in the drives, and running the verify. He gave me his number so I can get help when I'm ready to run this. Not sure when I'll do it (Hopefully tonight, I need to acquire a scsi card and adapters)  But I will certainly keep you posted. Thanks again.
0
 
LVL 33

Expert Comment

by:PowerEdgeTech
ID: 26206586
No harm done.  I was simply adding a solution I have encountered over the years.  I have added your suggestion to my list of things to become familiar with to broaden my horizons.  Thanks and good luck!
0
 

Expert Comment

by:Shespawn
ID: 26289538
There are a few other options I have run into to repair this problem.
First, it is quite possible you replaced a bad drive with another bad drive. It does happen.
There is a possibility that your controller needs an update.
Finally, my favorite... since you have a backup, try re-seating the drives.  Make sure you number them so you know which drive came from where, then pop 'em out and put them right back where they belong.  You can of course also try to re-seat your controller card when you do this.
And of course, CHKDSK is your friend :)
0
 
LVL 3

Author Comment

by:PurpleWine
ID: 26326527
Sorry, a number of problems in the way of this. But I now have a system built with the SANtools installed and registered, and a scsi card that sees the HD individually. I think I am set to give this a try again this weekend....
0
 
LVL 3

Author Closing Comment

by:PurpleWine
ID: 31674041
SANTools was great at taking care of the bad blocks and getting the disks to rebuild. Thanks!
0


Question has a verified solution.
