• Status: Solved

RAID5 drive failure, rebuilt OK, now OS will not load

I have an HP server that had a drive failure. I installed a new drive and rebuilt the RAID 5. The rebuild completed and the array shows as optimized. There are two arrays, array 0 and array 3. Array 0 is about 5 GB and array 3 is about 470 GB. I have set each one as the bootable array, without success.

After getting past the Adaptec screen, it says array 1 is missing or degraded. I have tried booting from array 0 and from array 3 without success. It either goes into a constant restart loop or says that I need to select a boot device and try again.

I don't know why the OS won't start. Any help is appreciated.
Asked by: NVHG
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
The OS is corrupted or has disappeared, and that is why it will not start.

I would suggest booting from a BartPE CD-ROM or a GParted Live CD to check whether there are any bootable partitions on the disk.
 
noxcho, Global Support Coordinator, commented:
Rebuilt array: did you erase it?
Check whether you have anything on these arrays now.
 
David, President, commented:
First, you have a low-end RAID controller that probably doesn't even have an on-board battery backup, so data corruption is pretty common after a drive failure. The long-term solution is to get a better controller with a BBU (battery backup unit). Also, if you don't have enterprise-class disks, you put your data at risk, especially with RAID 5. Do both of these and the risk of data loss after a drive failure drops by several orders of magnitude.

To assess the situation, boot the system from the CD, get a binary editor, and look at the raw logical drive to see whether it presents a boot partition in block 0. Take a binary image of the raw partition to a scratch drive (or over the network), and run recovery software on the imaged copy to assess the damage.

Without knowing what the controller is presenting as the contents of the logical drive, nobody can tell you how to fix it with certainty.
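As a rough illustration of the "look at block 0" step, the sketch below decodes the classic MBR layout from the first 512 bytes of a drive or image: the 0x55AA signature at the end of the sector and the four 16-byte partition entries starting at offset 446. The function names are made up for this example, and the device path you would pass in (for instance a Linux device node exposed by a live CD) is an assumption about your setup. It only reads, never writes.

```python
import struct

SECTOR_SIZE = 512
BOOT_FLAG_ACTIVE = 0x80


def parse_mbr(sector: bytes) -> dict:
    """Decode the classic MBR layout from the first 512 bytes of a drive."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes of block 0")
    # A valid MBR ends with the two bytes 0x55 0xAA.
    signature_ok = sector[510:512] == b"\x55\xaa"
    partitions = []
    # Four 16-byte partition entries start at offset 446.
    for index in range(4):
        entry = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        boot_flag = entry[0]
        part_type = entry[4]
        # Starting LBA and sector count are little-endian 32-bit values.
        start_lba, sector_count = struct.unpack_from("<II", entry, 8)
        if part_type == 0:
            continue  # unused slot
        partitions.append({
            "index": index,
            "active": boot_flag == BOOT_FLAG_ACTIVE,
            "type": part_type,  # 0x07 is the type byte used for NTFS
            "start_lba": start_lba,
            "sectors": sector_count,
        })
    return {"signature_ok": signature_ok, "partitions": partitions}


def read_block0(path: str) -> bytes:
    """Read block 0 from a device node or image file, read-only."""
    with open(path, "rb") as handle:
        return handle.read(SECTOR_SIZE)
```

A missing signature or an empty partition list on the logical drive would point toward a damaged MBR or toward the controller presenting the wrong data. A sane-looking table does not prove the rest of the array is intact.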
 
sifuedition commented:
RAID 5 works off of logical AND. If the bits are the same, the result is 1. If they are different, the result is 0. This allows a unique result that is transferable.
1 x 1 = 1
1 x 0 = 0
0 x 1 = 0
0 x 0 = 1

Take a moment and see how this works. On any line, if you remove a number, you can do the exact same math to the remaining numbers and the result is the missing number.

That's great, and it allows the RAID controller to replace missing data. Consider those four lines to be the first four stripes of your RAID array. Each column represents a separate hard drive. The three hard drives each contribute their part of line 1 to assemble the data that was saved. If the very first number becomes corrupted (the first 1 would be from the first stripe on disk 1) and the RAID controller checks the consistency of the data, it can use the data from the other two hard drives to replace the bad data on disk 1. If that number is corrupted but no check is run until all of disk 3 fails, you no longer have enough data to rebuild. That stripe is missing two pieces.

c = corrupt   f = failed disk

c x 1 = f
1 x 0 = f
0 x 1 = f
0 x 0 = f

This sounds like what happened with your system. As dlethe stated, a low-end RAID controller is not as good at managing or checking your data, and it is prone to losing data that was pending a write when it loses power. Additionally, low-end hard drives tend to develop bit or sector failures more frequently than the RAID controller scans for errors, especially as they get older.

This does not necessarily mean all is lost. Forensic data recovery services such as DriveSavers and their competitors can frequently recover most if not all of the data, if it is critical enough to justify the price.

If cost is an issue or the data is not that important, then self-recovery will require a lot of work as dlethe has also stated.

This is pure supposition based on your description, but for what it's worth: if you have two RAID arrays but they are enumerated 0 and 3, it seems that your second array is being detected as a RAID array but not recognized as the original array 1. That would imply that the corruption is in the metadata on the disk and is therefore difficult to repair on your own. The actual OS and data are in limbo because of how the metadata is stored. The metadata is usually in the initial blocks or sectors. That could mean that your actual data is still untouched further into the disk; it just can't be found because the "map" to that data, the metadata, is corrupt. However, corruption in those blocks is unusual and is frequently a sign of disks beyond repair.
 
David, President, commented:
Actually, parity is generated via XOR, not AND. AND would be:
1 x 1 = 1
1 x 0 = 0
0 x 1 = 0
0 x 0 = 0
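To make the XOR point concrete, here is a minimal sketch with two tiny data blocks and one parity block. The block contents and names are invented for the illustration; a real controller does the same thing across whole stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks from one stripe, plus the parity the controller would store.
data_a = bytes([0b1100, 0b1010])
data_b = bytes([0b0110, 0b0011])
parity = xor_blocks([data_a, data_b])

# Lose any one block: XOR of the survivors gives it back.
rebuilt_a = xor_blocks([data_b, parity])
rebuilt_b = xor_blocks([data_a, parity])

# If a surviving block is silently corrupt when another disk fails, the
# rebuild still runs to completion but regenerates the wrong data.
corrupt_b = bytes([data_b[0] ^ 0b0001, data_b[1]])
bad_rebuild_a = xor_blocks([corrupt_b, parity])
```

With AND, a stored 0 could have come from several different inputs, so a lost block could not be recovered. XOR is its own inverse, which is why the same operation both generates parity and rebuilds a missing block. The last two lines show the failure mode discussed above: one corrupt survivor plus one failed disk yields a "successful" rebuild of bad data.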

 
NVHG (author) commented:
Okay, so I booted to BartPE and don't really know how it works. There are not a lot of options, and I could find nothing about bootable partitions.

How do you use BartPE?
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
When you boot from the CD-ROM, can you access your data?

E.g. the C: drive?
 
NVHG (author) commented:
Well, no. But I tested BartPE on my laptop too; it boots there, and it is missing the C: drive as well. I opened the A43 utility and it shows nothing besides:

RAMDisk (B:)
BartPE (X:)

My laptop works. What is the deal?
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
So you see no C: drive on either the laptop or the server? If so, download the GParted Live CD.

BartPE may not have the correct storage drivers loaded.
 
noxcho, Global Support Coordinator, commented:
Boot the machine from this CD: http://sourceforge.net/projects/partedmagic/
Then see whether you can find any NTFS-formatted partition there.
 
NVHG (author) commented:
Okay. I downloaded and ran Parted Magic. When I run the disk health check it says:
Cannot retrieve SMART data.
Device open failed, or device did not return an identify device structure (clicked "show output")
vendor: adaptec
product: device 3
revision: v1.0
user capacity: 458 GB
logical block size: 512 bytes
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Terminate command early due to bad response to IEC mode page. A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

When I click on the partition editor I can see my 2 partitions:
/dev/sda1 426.76 GB
unallocated 6.49 GB

When I go to the file manager I can access my data! So is it probably the RAID controller?
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Something is clearly wrong with your boot partition, and it needs fixing. Possibly the Master Boot Record needs repairing.

I would back up your data now.
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
What is the OS?

Boot from the Windows CD-ROM again and run a repair.
 
NVHG (author) commented:
Server 2003.

You recommend that I save the data and then do a repair on the server with the Windows Server 2003 disc?
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Yes, that is correct.
 
David, President, commented:
The reason you got the "Cannot retrieve SMART data" message, the mode page errors, and so on is that the software you are using is talking to the RAID controller, not to the disks. The Adaptec RAID controller, as I mentioned before, is low-end and won't support any of these diagnostics.

Your data is corrupt, and this software is not going to be able to discern whether you have corrupted metadata, an improperly rebuilt RAID, a blown MBR, or a perfectly healthy RAID with file system corruption.

If you attempt to run the repair and it turns out that the problem is metadata or an improperly rebuilt RAID, then you will lose all hope of ever getting the data back. If budget limits you to a DIY solution, then you must take an image copy of the entire device presented by the RAID to a scratch drive, and then run software like runtime.org's NTFS reconstructor against that copy. If it shows massive damage, then most likely the RAID metadata is off.

If that is the case, you need to attach the disks to a non-RAID controller, image all the physical disks, analyze the raw data, and manually reconstruct a virtual image. Then you work with that. But just running Parted Magic, BartPE, or GParted on the logical device as presented by the RAID controller will NOT give you enough of the picture to make the correct decision...

... unless you are willing to bet 100% of your data that the RAID is perfect and the problem is file system corruption.
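For a sense of what "manually reconstruct a virtual image" involves, here is a simplified sketch that reassembles a logical volume from per-disk images. It assumes a left-symmetric parity rotation, a known stripe size, a known disk order, and no controller metadata at the start of each disk. None of those are known for this Adaptec controller, and recovery tools typically try many combinations to find the right one. The function names are hypothetical.

```python
from functools import reduce


def parity_disk(row: int, disk_count: int) -> int:
    """Left-symmetric rotation: parity starts on the last disk and moves left."""
    return (disk_count - 1) - (row % disk_count)


def reassemble_raid5(disk_images, stripe_size, missing=None):
    """Rebuild the logical volume from per-disk images.

    disk_images: list of bytes objects in array order (an assumption).
    missing: index of one disk to ignore and regenerate via XOR, or None.
    """
    disk_count = len(disk_images)
    lengths = [len(img) for i, img in enumerate(disk_images) if i != missing]
    rows = min(lengths) // stripe_size
    logical = bytearray()
    for row in range(rows):
        offset = row * stripe_size
        chunks = [
            None if i == missing else img[offset:offset + stripe_size]
            for i, img in enumerate(disk_images)
        ]
        if missing is not None:
            # XOR of all surviving chunks in the row regenerates the lost one.
            survivors = [c for c in chunks if c is not None]
            chunks[missing] = bytes(
                reduce(lambda a, b: a ^ b, col) for col in zip(*survivors)
            )
        p = parity_disk(row, disk_count)
        # Data chunks start on the disk after the parity chunk and wrap around.
        for step in range(1, disk_count):
            logical += chunks[(p + step) % disk_count]
    return bytes(logical)
```

This works entirely on copies in memory, which is the point of imaging first: a wrong guess about layout or stripe size costs nothing, whereas a wrong guess applied to the live array can destroy it.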
 
NVHG (author) commented:
You're right, it is a cheap controller. Since we cannot diagnose anything from Parted Magic, I guess I will need to take an image copy of the entire device presented by the RAID to a scratch drive.

So how do I get started doing this? This really doesn't make sense to me. How can I create an image copy of the device presented by the RAID when it won't start any kind of OS?
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Symantec Ghost, Acronis Backup and Recovery, or Drive Snapshot, run from a bootable CD-ROM.
 
David, President, commented:
No, you need to get a NON-RAID controller. Image each physical drive, either into a data file on a larger disk or by cloning the disk (so you keep the original). A good product that will do it all is runtime.org's RAID Reconstructor, along with their NTFS recovery software. It costs a few hundred bucks, but it will get to the bottom of things, and as consumer RAID recovery software goes, it is one of the better ones.

There are some freebies out there like Clonezilla that can image, but the runtime software will let you build a virtual image in RAM and then attempt a file system reconstruction, also in RAM, so it is convenient. The runtime software is also free to try, and you only pay to reconstruct, so you can at least use it to see what is going on. (Of course, my first recommendation is to call in somebody who knows what they are doing to do this for a fee, but you've come so far that I think you can take it to the next level by reading up.)
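Conceptually, imaging a drive is just a sequential, read-only copy of every byte into a file on the scratch disk. The sketch below shows the idea, assuming a Linux live environment where the source is readable as a device node; the function name and the zero-fill policy for unreadable chunks are choices made for this example. Dedicated imaging tools handle bad sectors far more carefully, so treat this as an explanation rather than a replacement for them.

```python
import os


def image_device(source, destination, chunk_size=1024 * 1024):
    """Copy source to destination, zero-filling chunks that cannot be read.

    Returns (bytes_written, bad_offsets).
    """
    bad_offsets = []
    written = 0
    # The source is opened read-only and unbuffered; it is never written to.
    with open(source, "rb", buffering=0) as src, open(destination, "wb") as dst:
        size = src.seek(0, os.SEEK_END)
        while written < size:
            want = min(chunk_size, size - written)
            src.seek(written)
            try:
                data = src.read(want)
            except OSError:
                data = b""
            if not data:
                # Unreadable chunk: note where it was and keep offsets aligned.
                bad_offsets.append(written)
                data = bytes(want)
            dst.write(data)
            written += len(data)
    return written, bad_offsets
```

Keeping the image the same length as the source, even where reads fail, matters because the RAID and file system structures are located by offset. All later analysis then runs against the image, never the original disks.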
 
 
David, President, commented:
Hanccocka's advice is good if, and only if, the RAID is perfect and the damage is purely file-system based. You have no way to determine that at this point, so I am looking at it from the bottom up, assuming nothing.

The problem is that if there is massive damage, like a partially reconstructed RAID under the covers, then the file system reconstruction will fail miserably.

There is nothing wrong with trying Ghost and filesystem-only work, but remember that every I/O you run on the busted RAID could be the last, and since you have no way of knowing whether it is properly rebuilt, you have only a limited window of safety. So it comes down to whether you put a $250 deductible on your new car or buy no collision insurance at all and hope for the best.

As I do recovery professionally at times, I have to be anal and overcautious, so that is just me. If this were a premium controller, I would trust the rebuild to have been done properly, but since it is that Adaptec POS, I take nothing for granted :)
 
NVHG (author) commented:
Thanks for your help, guys.