Solved

RAID5 drive failure, rebuilt OK, now OS will not load

Posted on 2011-10-04
Last Modified: 2012-05-12
I have an HP server that had a drive failure. I installed a new drive and rebuilt the RAID 5. The rebuild completed successfully and the array shows as optimal. There are 2 arrays, Array 0 and Array 3. Array 0 is about 5 GB and Array 3 is about 470 GB. I have tried setting each one as the bootable array, without success.

After getting past the Adaptec screen, it says array 1 is missing or degraded. I have tried booting to array 0 and array 3 without success. It either goes into a constant restart loop or says that I need to select a boot device and try again.

I don't know why the OS won't start. Any help is appreciated.
Question by:NVHG
21 Comments
 
LVL 124
ID: 36916216
The OS is corrupted or has disappeared, and that is why it will not start.

I would suggest booting from a BartPE CD-ROM or the GParted Live CD to check whether there are any bootable partitions on the disk.
 
LVL 47

Expert Comment

by:noxcho
ID: 36916271
Rebuilt array: did you erase it?
Check whether there is anything on these arrays now.
 
LVL 47

Expert Comment

by:David
ID: 36916423
First, you have a low-end RAID controller that probably doesn't even have an on-board battery backup, so data corruption is pretty common after a drive failure. The long-term solution is to get a better controller with a BBU. Also, if you don't have enterprise-class disks, you put your data at risk, especially with RAID 5. Fix both of these and the risk of data loss after a drive failure drops by several orders of magnitude.

To assess the situation, boot the system from a CD, get a binary editor, and look at the raw logical drive to see whether it presents a boot partition in block 0. Take a binary image of the raw partition to a scratch drive (or over the network), and run recovery software on the imaged copy, not on the original.

Without knowing what the controller is presenting in the contents of the logical drive, nobody can tell you how to fix it with certainty.
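
As a rough illustration of what "look at block 0" means, the Python sketch below reads the first 512-byte sector of an image file and checks for the 0x55AA boot signature and the four classic MBR partition entries. The function name inspect_mbr and the image path given on the command line are hypothetical, and the sketch assumes a classic MBR layout with 512-byte sectors. It only reads, so it is safe to point at an imaged copy.

```python
import struct
import sys

SECTOR_SIZE = 512  # assumption: 512-byte logical blocks


def inspect_mbr(sector0: bytes) -> dict:
    """Report whether block 0 looks like an MBR and list its non-empty partition entries."""
    if len(sector0) < SECTOR_SIZE:
        raise ValueError("need at least one full 512-byte sector")

    # A valid MBR ends with the bytes 0x55 0xAA at offsets 510-511.
    signature_ok = sector0[510:512] == b"\x55\xaa"

    partitions = []
    for index in range(4):
        # The partition table starts at offset 446; each entry is 16 bytes.
        entry = sector0[446 + 16 * index: 446 + 16 * (index + 1)]
        status, ptype = entry[0], entry[4]
        start_lba, sectors = struct.unpack_from("<II", entry, 8)
        if ptype == 0:
            continue  # type 0 means the slot is unused
        partitions.append({
            "index": index,
            "bootable": status == 0x80,  # 0x80 marks the active (boot) partition
            "type": ptype,               # e.g. 0x07 is NTFS/exFAT
            "start_lba": start_lba,
            "sectors": sectors,
        })
    return {"signature_ok": signature_ok, "partitions": partitions}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python inspect_mbr.py logical_drive.img
    with open(sys.argv[1], "rb") as image:
        print(inspect_mbr(image.read(SECTOR_SIZE)))
```

If the signature is missing or no entry is marked bootable, that points toward a damaged MBR or partition table, but it does not by itself prove the RAID underneath is consistent.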
 
LVL 6

Expert Comment

by:sifuedition
ID: 36944323
Raid 5 works off of logical AND. If the bits are the same, the result is 1. If they are different, the result is 0. This allows a unique result that is transferrable.
1 x 1 = 1
1 x 0 = 0
0 x 1 = 0
0 x 0 = 1

Take a moment and see how this works. On any line, if you remove a number, you can do the exact same math to the remaining numbers and the result is the missing number.

That's great and allows the RAID controller to replace missing data. Consider those four lines to be the first four stripes of your RAID array. Each column represents a separate hard drive. The three hard drives combine line 1 from each to assemble the data that was saved. If the very first number becomes corrupted (the first 1 would be from the first stripe on disk 1) and the RAID controller checks the consistency of the data, it can use the data from the other two hard drives to replace the missing data on disk 1. If that number is corrupted, but no check is run until all of disk 3 fails, you no longer have enough data to rebuild. That stripe of data is missing two pieces.

C = corrupt   F = failed disk

C x 1 = F
1 x 0 = F
0 x 1 = F
0 x 0 = F

This sounds like what happened with your system. As dlethe stated, a low-end RAID controller is not as good at managing or checking your data and is prone to losing data that was pending a write when it loses power. Additionally, low-end hard drives tend to develop bit or sector failures more frequently than the RAID controller scans for errors, especially as they get older.

This does not necessarily mean all is lost. Forensic data recovery like DriveSavers and their competitors can recover most if not all of the data frequently if it is critical enough to pay the price.

If cost is an issue or the data is not that important, then self-recovery will require a lot of work as dlethe has also stated.

This is total supposition from your description, but for what it's worth, if you have two RAID arrays but they are enumerated 0 and 3, it seems that your second array is being detected as a RAID array but not recognized as the original array 1. That would imply that the corrupted data is in the metadata on the disk and therefore difficult to repair on your own. The actual OS and data are in limbo because of how the metadata is stored. The metadata is usually on the initial blocks/sectors. That could mean that your actual data is still untouched further into the disk and just can't be found because the "map" to that data, the metadata, is corrupt. However, corruption on those blocks is unusual and is frequently a sign of disks beyond repair.
 
LVL 47

Expert Comment

by:David
ID: 36944449
Actually, parity is generated via XOR, not AND.   AND would be:
1 x 1 = 1
1 x 0 = 0
0 x 1 = 0
0 x 0 = 0
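
To make the difference concrete, here is a small Python sketch of XOR parity on a stripe with two data chunks. The chunk values and the helper names xor_parity and recoverable are made up for illustration; a real controller does the same bytewise XOR per stripe in firmware.

```python
from functools import reduce
from operator import and_, xor


def xor_parity(chunks):
    """Bytewise XOR of equally sized chunks."""
    return bytes(reduce(xor, column) for column in zip(*chunks))


def recoverable(op):
    """True if one known bit plus the parity bit always pins down the other bit."""
    seen = {}
    for a in (0, 1):
        for b in (0, 1):
            seen.setdefault((a, op(a, b)), set()).add(b)
    return all(len(candidates) == 1 for candidates in seen.values())


d0 = bytes([0x0F, 0xF0])           # chunk on disk 1 (illustrative values)
d1 = bytes([0x33, 0x55])           # chunk on disk 2
parity = xor_parity([d0, d1])      # chunk written to the parity disk

# Disk 2 "fails": XOR the survivors to get its contents back.
rebuilt_d1 = xor_parity([d0, parity])

print(rebuilt_d1 == d1)                      # True
print(recoverable(xor), recoverable(and_))   # True False
```

With AND, a known 0 and a parity of 0 are consistent with the missing bit being either 0 or 1, which is why it cannot be used to rebuild a lost disk.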

 

Author Comment

by:NVHG
ID: 36983091
Okay, so I booted to BartPE and don't really know how it works. There are not a lot of options, and I could find nothing about bootable partitions.

How do you use BartPE?
 
LVL 124
ID: 36983106
When you boot from the CD-ROM, can you access your data?

e.g. C: drive?
 

Author Comment

by:NVHG
ID: 36983219
Well, no. But I tested it on my laptop too, which does work, and the C: drive is missing there as well. I opened the A43 utility and it does not show anything besides:

RAMDisk (B:)
BartPE (X:)

My laptop works.  What is the deal?
 
LVL 124
ID: 36983231
So you see no C: drive on either the laptop or the server? If so, download the GParted Live CD.

BartPE may not have the correct storage drivers loaded.
 
LVL 47

Expert Comment

by:noxcho
ID: 36984813
Boot the machine from this CD: http://sourceforge.net/projects/partedmagic/
Then see whether you can find any NTFS-formatted partition there.
 

Author Comment

by:NVHG
ID: 36987707
Okay. I downloaded and ran Parted Magic. When I run the disk health check it says:
Cannot retrieve SMART data.
Device open failed, or device did not return an identify device structure (Clicked show output)
vendor: adaptec
product: device 3
revision: v1.0
user capacity: 458 GB
logical block size: 512 bytes
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Terminate command early due to bad response to IEC mode page. A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

When I click on the partition editor I can see my 2 partitions:
/dev/sda1 426.76 GB
unallocated 6.49 GB

When I go to the file manager I can access my data! So is the problem probably the RAID controller?
 
LVL 124
ID: 36987747
Something is clearly incorrect with your Boot Partition, which requires fixing. Possibly the Master Boot Record needs fixing.

I would back up your data now.
 
LVL 124
ID: 36987755
What is the OS?

Re-run the Windows CDROM, to Repair.
 

Author Comment

by:NVHG
ID: 36987765
Server 2003.

You recommend that I save the data and then do a repair on the server with the Windows Server 2003 disc?
 
LVL 124
ID: 36987791
Yes. that is correct.
 
LVL 47

Expert Comment

by:David
ID: 36987880
The reason you got the "Cannot retrieve SMART data" message, the mode page errors, and so on is that the software you are using is talking to the RAID controller, not to the physical disks. The Adaptec RAID controller, as I mentioned before, is low-end and won't support any of these diagnostics.

Your data is corrupt, and this software is not going to be able to discern whether or not you have corrupted metadata, an improperly rebuilt RAID, a blown MBR, or a perfectly healthy RAID with file system corruption.  

If you attempt to run the repair, and it turns out that the problem is metadata or an improperly rebuilt RAID, then you will lose all hope of ever getting the data back. If budget limits you to a DIY solution, then you must take an IMAGE copy of the entire device presented by the RAID to a scratch drive, and then run some software like runtime.org's NTFS reconstructor on that copy. If it shows massive damage, then most likely the RAID metadata is off.

If that is the case, you need to attach the disks to a non-RAID controller, image all the physical disks, analyze the raw data, and manually reconstruct a virtual image. Then you work with that. But just running Parted Magic, BartPE, or GParted on the logical device as presented by the RAID controller will NOT give you enough of the picture to make the correct decision

... Unless you are willing to risk 100% of your data that the RAID is perfect, and the problem is file system corruption.  
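
For the imaging step, the idea is a block-by-block copy that keeps going past unreadable areas so the offsets in the image still line up with the source. The Python sketch below shows that idea only; the function name image_device, the block size, and the example paths are assumptions, and purpose-built tools handle retries and logging far better. It opens the source read-only and must be run from a boot CD environment with the scratch drive mounted.

```python
import os
import sys


def image_device(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block; unreadable blocks are zero-filled and reported."""
    bad_offsets = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # works for regular files and Linux block devices
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                block = src.read(want)
            except OSError:
                # Keep the image aligned with the source by writing zeros
                # in place of the block that could not be read.
                bad_offsets.append(offset)
                block = b"\x00" * want
            if not block:
                break  # unexpected end of input
            dst.write(block)
            offset += len(block)
    return bad_offsets


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python image_device.py /dev/sda /mnt/scratch/logical.img
    bad = image_device(sys.argv[1], sys.argv[2])
    print(f"{len(bad)} unreadable block(s)", bad[:10])
```

All recovery attempts then run against the image file, so a failed attempt costs nothing but time.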
 

Author Comment

by:NVHG
ID: 36988043
You're right, it is a cheap controller. Since we cannot diagnose anything from Parted Magic, I guess I will need to take an IMAGE copy of the entire device presented by the RAID to a scratch drive.

So how do I get started doing this? It really doesn't make sense to me. How can I create an IMAGE copy of the device presented by the RAID when it won't start any kind of OS?
 
LVL 124
ID: 36988199
Symantec Ghost, Acronis Backup and Recovery, Drive Snapshot from a Bootable CDROM.
 
LVL 47

Expert Comment

by:David
ID: 36988254
No, you need to get a NON-RAID controller.   Image each physical drive into either a data file onto a larger disk or clone the disk (so you keep the original).  A good product that will do it all is runtime.org's RAID reconstructor, and their NTFS recovery software.  A few hundred bucks, but it will get to the bottom of things, and as consumer raid recovery software goes, it is one of the better ones.

There are some freebies out there like clonezilla which can image, but the runtime software will let you build a virtual image in RAM, then attempt a file system reconstruction, also in RAM, so it is convenient.    The runtime stuff is also free to try, but you pay to reconstruct, so you can at least use it to see what is going on.   (Of course my first recommendation, call in somebody who knows what they are doing to do this for a fee ... but you've come so far, I think you can take it to the next level by reading)
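
To give a feel for what "building a virtual image" from the physical disk images involves, here is a minimal Python sketch that de-stripes RAID 5 member images held in memory. Everything about the layout is an assumption: it uses a left-asymmetric parity rotation, takes the chunk size and disk order as given, and ignores any controller metadata at the start or end of each disk. The real layout of this Adaptec controller is not known from this thread, which is exactly what recovery tools try to work out by analysis.

```python
def destripe_raid5(members, chunk_size):
    """Assemble the logical data from RAID 5 member images.

    ASSUMPTION: left-asymmetric layout, i.e. parity sits on the last disk
    for stripe 0 and moves one disk to the left on each following stripe,
    while data chunks fill the remaining disks in ascending disk order.
    """
    n = len(members)
    stripes = min(len(m) for m in members) // chunk_size
    logical = bytearray()
    for stripe in range(stripes):
        parity_disk = (n - 1) - (stripe % n)
        start = stripe * chunk_size
        for disk in range(n):
            if disk == parity_disk:
                continue  # parity chunks carry no user data
            logical += members[disk][start:start + chunk_size]
    return bytes(logical)


# Tiny made-up example: 3 disks, 2-byte chunks, ".." stands in for parity.
disk0 = b"AA" + b"CC" + b".."
disk1 = b"BB" + b".." + b"EE"
disk2 = b".." + b"DD" + b"FF"
print(destripe_raid5([disk0, disk1, disk2], chunk_size=2))  # b'AABBCCDDEEFF'
```

If the assumed chunk size, disk order, or rotation is wrong, the output is scrambled at chunk boundaries, which is one way to tell that a guess does not match the array.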
 
 
LVL 47

Accepted Solution

by:
David earned 1500 total points
ID: 36988327
Hanccocka's advice is good, if and only if the RAID is perfect, and damage is purely file system based.  You have no way to determine that at this point.  So I am looking at it from the bottom up, assuming nothing.

The problem is that if there is massive damage, like a partially reconstructed RAID underneath the covers, then the file system reconstruction will fail miserably.

There is nothing wrong with trying the Ghost and filesystem-only work, but remember that every I/O you run on the busted RAID could be the last, and since you have no way of knowing whether it was properly rebuilt, you have only a window of safety. So it just comes down to whether you put a $250 deductible on your new car, or buy no collision insurance at all and hope for the best.

As I do recovery at times professionally, I have to be anal and overcautious, so that is just me.  If this was a premium controller, then I would trust the rebuild to have been done properly, but since it is that Adaptec POS, I take nothing for granted :)

 

Author Closing Comment

by:NVHG
ID: 36989678
Thanks for your help guys.