PERC 3/Di RAID 5 Rebuild on PE2500

Only 4 disks are installed in the PowerEdge 2500: disks 0, 1, 2 and 3.
All four are configured as a single RAID 5 with 2 containers: one 4GB container (set to boot) and one 46GB container (for data).

Disk 3 indicates an amber light condition. A user pulls that drive out of the backplane and then reseats it. The intent here was to just have it rebuild. When drive 3 is reseated, drive 0 indicates an amber light and the server locks. On the next attempt to reboot, it is reported that no boot devices are found.

So, booting up into the PERC configuration utility, it shows that drive 0 and drive 3 are missing in the container properties. It looks like this for both containers:

0:00:0 --Missing Drive--
0:01:0 <drivename> <drivespace>
0:02:0 <drivename> <drivespace>
49:62:0 --Missing Drive--

Watching the entire POST process of the machine shows that the two containers are both found, but in a state of "unknown". And it always ends with "no boot device available, press F1 to retry, press F2 to enter setup".

Enter Dell tech support

After some explanations and some poking and prodding, they send 2 replacement drives and a backplane and wish us well with our backups (not bashing here, just sayin).
A little more prodding gets the idea to use the <CTRL-R> function inside the PERC config util.

This 'rebuild' option is performed with drive 3 removed.

Upon rebooting the containers now report as "critical" and not unknown (this is good, yes?)
And, the config util looks like this now for each container:

0:00:0 <drivename> <drivespace>
0:01:0 <drivename> <drivespace>
0:02:0 <drivename> <drivespace>
49:62:0 --Missing Drive--

To me, this is looking very promising... But it still goes "no boot device available, press F1 to retry, press F2 to enter setup".
And every time I enter the config util, I am forced to "accept" the new configuration EVEN THOUGH NO CHANGES WERE MADE SINCE I LAST ACCEPTED. This is the part that bugs me.

Back with Dell support, I am told that the critical container should be bootable, or at least that a rebuild should take place if I insert a NEW drive into drive 3. This makes sense to me, but neither option seems to work.
The part that keeps bugging me is that on each and every boot, no matter how many times I "accept" the new configuration, the config never "sticks".

Any ideas or suggestions as to why the container, which seems to be reported correctly, cannot be found to boot?
(I double checked the system BIOS and the SCSI BIOS for correct boot orders and they are correct)

Thanks

egrylls

I'll ask the stupid questions: do you have the latest firmware for both the PERC and the system board? Also, you might check that the drives you are replacing with are (hopefully) the same make and model. At the very least, make sure that you are not mixing different RPMs, as I have had Dell send me 10k's when the replacements should have been 15k's and that didn't work!!

My other comment is that I don't know why it would be insisting you accept the new config, unless it's potentially a dead battery or something? I would replace the PERC, install all the original drives and tell it to read the config from disk. That MIGHT help you get your RAID back. But if you really lost 2 drives, or something on there is written differently, you might have to rebuild it from scratch. Either way, I would trash that PERC card for a spare.

kkohl

ASKER

Well, this is at least something I haven't heard yet.
As far as the stupid questions, the RPMs are the same, but I am hesitant to upgrade the PERC or system board firmware, as I have seen numerous warnings to upgrade the drivers first and I can't do that yet :-)

Dead battery is interesting. On the note of replacing the PERC, we have a thought of taking down another identical PE2500, removing the drives from it, inserting the three "reporting as present" drives from the critical RAID, and attempting a boot from that angle.

SOLUTION

egrylls

egrylls

I would add that I have done this PERC trick before, and in fact walked our most junior admin through it over the phone just this weekend on a 2650, and it worked like a charm. His issue was power, but he had to read the config from disk and was up within 20 minutes after populating the new chassis.

ASKER CERTIFIED SOLUTION

jamietoner

kkohl

ASKER

Did some further testing with Dell tech and decided that trying to boot in the other chassis would not be beneficial.

With one of the new disks in (and all others removed), we initialized and created a new container. On each boot the container information was retained. Therefore, as Jamie points out, the array itself is corrupt and not the PERC or any internal setting.

So, I am left with what I believe to be three good drives and one bad drive of a corrupted array.

The risks of the CTRL-R failing on this dual container setup were understood (no idea why it was like that...) but c'est la vie.

Going to be using a RAID reconstructor for some non-backed-up data. All else is good.

Thank you much for the responses

egrylls

Yeah - once you initialized, that put you up the creek sans paddle. I must have missed that in the original post.

kkohl

ASKER

-- the initialization took place on new drives, not on any of the original ones --

As best as I can figure, it looks like this is the gist of what happened...

Four-drive RAID 5 array split into 2 containers (4GB and 44GB).
Drive 3 fails and is removed.
Drive 3 is reinserted to attempt a rebuild. This reinsertion caused Drive 0 to report as failed.
The RAID is broken at this point and the computer locks up.
Upon a reboot attempt, no boot container is found.
Per tech support, Drive 3 is removed and the <CTRL-R> option is used on Drive 0.
The forced rebuild fails, most likely because of the split containers on one array.
Drive 0 reports as present but the RAID is corrupt and non-recoverable.

It is my belief that there was a good chance that drive 0 was really a false failure and had a shot at recovery by forcing it back online... the split containers prevented this. Thanks for the responses.
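
For anyone following the reasoning in this thread: the reason three surviving members out of four matter is that RAID 5 keeps one XOR parity block per stripe, so any single missing member can be recomputed from the rest, while two missing members cannot. The Python below is only a minimal sketch of that idea. The function names are made up for illustration, parity is kept in a fixed last position (real controllers rotate it across drives), and nothing here reflects the PERC 3/Di on-disk metadata or what a commercial RAID reconstructor actually does.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block.

    Illustrative layout only: parity always sits in the last position.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover_stripe(stripe):
    """Rebuild a stripe whose missing members are given as None.

    One missing member is recomputed from the survivors.
    More than one missing member raises ValueError.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one member missing; parity cannot rebuild this")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Four members, as in the array discussed here: three data blocks plus parity.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

# Only "drive 3" missing: the stripe can still be rebuilt.
one_missing = stripe[:3] + [None]
print(recover_stripe(one_missing) == stripe)

# "Drive 0" and "drive 3" both missing: nothing to rebuild from.
two_missing = [None] + stripe[1:3] + [None]
try:
    recover_stripe(two_missing)
except ValueError as err:
    print("unrecoverable:", err)
```

The sketch assumes the surviving members still hold consistent data. If a forced rebuild or an initialization has rewritten any of them, the XOR no longer yields the original contents, which is the situation a reconstruction tool has to work around.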