PhillyGee

PE2900 hangs in POST at DRAC

I have a customer that is having a boot problem during POST.  The server advances through POST until it attempts to make a connection through the DRAC (see attached file).  Can someone tell me what steps I could have her take to troubleshoot the issue?
kali---boot-issues.bmp
ASKER CERTIFIED SOLUTION
PowerEdgeTech
I'm not saying it's impossible, but I've never seen it hang at this spot where it was anything but a damaged OS (there are messages that are normally seen when the DRAC is malfunctioning).  I think it's much more likely to be the OS than the server.
I appreciate the feedback.  Given your name, I think you might know a little more about PE servers than myself.  :-)

For my own knowledge, shouldn't the server throw back some type of error saying that it can't find an OS or something similar?  Is this "hang" somewhat of a normal thing?
"shouldn't the server throw back some type of error saying that it can't find an OS or something similar"

Well, the server doesn't have control at this point.  It has told the "hard drive" to boot, and the drive is trying - spinning its wheels trying to find/read/load boot files, unable to get far enough to have even splashed an image on the screen.  The OS is hung doing its thing, and as far as the server is concerned, it is up and running on it - it could be chugging along authenticating users, serving web pages, etc., as far as server management (BMC) knows.

If the boot device didn't exist (failed RAID array, controller is disabled, etc.), then the server would have said "no boot device found" after attempting all boot devices unsuccessfully.
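To illustrate the hand-off described above, here is a minimal Python sketch. It is only a simplified model, not how any real BIOS is written; the `BootDevice` class and `firmware_boot` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BootDevice:
    name: str
    present: bool  # False models a failed RAID array or a disabled controller


def firmware_boot(devices: List[BootDevice]) -> str:
    """Return what the firmware would report in this simplified model."""
    for dev in devices:
        if dev.present:
            # Control passes to the boot code on this device. If that code
            # hangs afterwards, the firmware never finds out, so no error
            # message is ever shown.
            return f"handed off to {dev.name}"
    # Only reached when every device in the boot order was tried and failed.
    return "no boot device found"
```

In this model a hung boot loader and a healthy OS look identical to the firmware: both end in "handed off". The error message only appears when nothing could be handed off to at all.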

Again, I can't promise that is what is going on in this particular situation, but I've seen it happen MANY times and believe it is the most likely scenario at this point.
PhillyGee,

Sorry to hijack your question, I hope that you get your "hang" issue fixed.  Please let us know the solution.

PowerEdgeTech,

Thank you for the quick PE server lesson.

-Chris
Hopefully it helps pg too ;)
PhillyGee

ASKER

Thanks for all the feedback, guys.  She's now sending me more information a little at a time.
- a failed 400GB SAS hard drive in (what looks like) a four disk RAID5 array
- a failed cache battery.
"- a failed 400GB SAS hard drive in (what looks like) a four disk RAID5 array"

OS corruption could have occurred from the failed disk (errors on remaining disks, etc.).

"- a failed cache battery."

OS corruption could have occurred from the write cache data being lost before being committed to disk when important files were in a critical state.

Barring data corruption, RAID 5 should still be operational with a single failed disk.
If there IS data corruption, is all lost so that even RAID rebuild can't fix it?
Depends on the extent of the damage ... it may "only" be OS corruption, in which case, it may be repairable (Recovery Console - chkdsk /r, fixmbr, fixboot; and other utilities such as SFC).  If it is array corruption, there is less that can be done ... a rebuild might be successful and may work normally, or the rebuild may fail.  With the array already degraded, if there are any errors in the data on the remaining disks, then the rebuild will probably just fail trying to read it.
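For anyone wondering why errors on the surviving disks matter so much, here is a rough Python sketch of the XOR parity idea behind RAID 5. It is a toy model of one stripe, assuming simple byte-wise XOR parity; it is not the PERC firmware's actual code, and the function names are made up for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe: List[Optional[bytes]]) -> bytes:
    """Recover the single missing block (None) of a stripe from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("exactly one block per stripe can be recovered")
    # The missing block is the XOR of every surviving block, parity included.
    return xor_blocks([block for block in stripe if block is not None])
```

Every surviving block feeds into the XOR, so the rebuilt data is only as good as what can be read from the remaining disks. A real controller would normally fail or flag the rebuild when it hits an unreadable sector rather than return wrong data.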
Thank you all for the help.  This issue is still outstanding.  The customer is running RH Linux (something I know nothing about). After trying a number of things, the customer sought help from RH, who booted to rescue mode and re-installed grub to the MBR. After rebooting, the server was still not getting to stage 1 of grub.  They didn't think it was an OS problem.
Now I am told that the four drives are not in a RAID 5 but in two mirrors.
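One way to sanity-check that a boot loader reinstall actually landed in the MBR is to look at the first 512-byte sector of the boot disk. The Python sketch below is only a rough heuristic: the 0x55AA signature at offset 510 is standard for an MBR, but searching the boot code area for the text "GRUB" is an assumption about what grub's stage 1 leaves behind, and the function names are made up for this example.

```python
def read_first_sector(path: str) -> bytes:
    """Read the first 512 bytes of a disk or disk image (path is caller-supplied)."""
    with open(path, "rb") as disk:
        return disk.read(512)


def inspect_mbr(sector: bytes) -> dict:
    """Report basic facts about a 512-byte MBR sector."""
    if len(sector) < 512:
        raise ValueError("need a full 512-byte sector")
    return {
        # Standard MBR boot signature in the last two bytes of the sector.
        "boot_signature_ok": sector[510:512] == b"\x55\xaa",
        # Heuristic (assumption): grub's boot code contains the text "GRUB".
        "mentions_grub": b"GRUB" in sector[:446],
    }
```

If both checks pass and the server still never reaches stage 1, that points away from the disk contents and toward whatever decides which device gets booted.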
PowerEdgeTech, you're going to love this one. A brilliant colleague of mine came up with the solution.
The problem was that a controller setting had been changed in the BIOS, and an alert setting was set to Disable.
There are two PERCs in this box - a PERC6i (boot controller) and a PERC5e (to a PV MD1000 disk array).
After a 400GB hard drive and the cache battery failed on the PERC6i controller, the PERC5e BIOS setting somehow set itself to "Enable", causing a conflict, as the boot PERC BIOS was also (properly) set to "Enable". On top of that, the “Enable BIOS Stop On Error” setting was disabled on both array controllers, so no error was ever reported.
The BIOS settings were corrected, both the hard drive and cache battery have been replaced and everything is now working smoothly.
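For anyone keeping notes on a box with more than one PERC, the rule we ended up with can be written down as a small Python check. This is only a sketch: it does not query any hardware, the settings are typed in by hand from the controller setup screens, and `check_controller_bios` is a made-up name.

```python
from typing import Dict, List


def check_controller_bios(bios_enabled: Dict[str, bool], boot_controller: str) -> List[str]:
    """Flag settings that break the rule: BIOS enabled on the boot controller only."""
    problems = []
    if not bios_enabled.get(boot_controller, False):
        problems.append(f"{boot_controller}: BIOS should be enabled on the boot controller")
    for name, enabled in bios_enabled.items():
        if name != boot_controller and enabled:
            problems.append(f"{name}: BIOS enabled on a non-boot controller")
    return problems
```

With the settings as we found them, `check_controller_bios({"PERC6i": True, "PERC5e": True}, "PERC6i")` flags the PERC5e; with the PERC5e BIOS disabled it returns an empty list.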

I don't know what it is with PowerEdge BIOS. I've seen it where the BIOS would spontaneously switch the embedded RAID controller setting in "Integrated Devices" from RAID Enabled to SCSI Enabled but this one is a new one for me.

Thank you both for your input.
I didn't help much, but thank you for the assist points.  I'm just glad you were able to get it figured out.

Hats off to your colleague for figuring it out.
This is a new one for me too; however, I do not often use external storage devices that would connect via an (E)xternal PERC.  But as for:

"I don't know what it is with PowerEdge BIOS. I've seen it where the BIOS would spontaneously switch the embedded RAID controller setting in "Integrated Devices" from RAID Enabled to SCSI Enabled"

I've never seen this happen "spontaneously" ... there is usually something to precipitate it - CMOS battery failure, power event, BIOS update, hardware failure, etc.  The BIOS default for Embedded RAID is OFF or SCSI Enabled (also keep in mind, this is only for older systems, like the 26x0/28x0), so anytime the BIOS configuration stored in NVRAM is cleared or corrupted, the setting will return to its default.

In any case, I'm glad you figured it out.