vgdexex asked:

POST errors on reboot

Hello,

Upon examining the system logs of an HP ProLiant ML350 G6 (running Windows SBS 2008), we found a number of Event ID 1001 (HP System) entries containing information about a failed power supply and a non-redundant array. (Event details are in the attached file.)

On further examination, we found that these events occurred each time the server rebooted, but when checking on the reported devices, none showed as faulty.

In the case of the power supply, iLO reported that everything was OK and that power was fully redundant. When physically checking the server, both power supplies were firmly attached and showed green lights.

In the case of the disk array, checking with the HP Array Configuration Utility showed only that everything was OK. The lights on the disks themselves were all green.

Any thoughts as to why the event log disagrees with iLO and the ACU?
Help would be appreciated.
LOG.txt
Member_2_231077

Need to see the Insight Diagnostics log; you can get to that through the System Management Homepage. If there are a lot of "power redundancy lost" errors followed by "power redundancy restored" errors, then look at the UPS if you have one and make sure it has a pure sine-wave output.

Array Diagnostic Utility log would also be helpful, but post as attachment rather than in body of thread as it's huge.
Avatar of vgdexex

ASKER

The Insight Diagnostics error log was completely empty.
The Integrated Management Log has been attached.

I also ran the HP Insight Diagnostics hardware test, which reported that everything is OK. That log is attached as well.

Also attached the Array Diagnostic Utility log.
ArrayDiagnosticUtilityLog.zip
InsightDiagnosticIntegratedManag.html
InsightDiagnosticLog.html
Disk 1I:1:1 has had a bus fault causing a rebuild. It might be advisable to update all of them to firmware HPD2; HPD0, which you have, can cause them not to park the heads when powered off (that's the 1TB ones).

Can't see why it's bothering to tell you about the power supply unplugged message on boot though, that was a year ago.

Ah, it may be the RAID controller firmware:  "Fixed an issue where SATA disks could occasionally be marked as failed or missing during boot when no cache module is installed on the controller." But that may apply to SAS too...

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3883890&swItem=MTX-f751d8f0bc5042e7b439e68553&prodNameId=3883931&swEnvOID=54&swLang=8&taskId=135&mode=4&idx=2
Avatar of vgdexex

ASKER

I will prepare for upgrading the firmware on the RAID-controller since I cannot reboot the server at the moment.

Thanks for the help so far.
Avatar of David Johnson, CD
"Can't see why it's bothering to tell you about the power supply unplugged message on boot though, that was a year ago."
Where did you see the reference to 2011?
Towards the top of the IML, but it was probably a case of just plugging one cable in and then being a bit slow with the second. I would go into the IML and mark everything as repaired, then shut down, power-cycle and see if the messages have gone from the system log. If that doesn't sort it then you can clear down the IML, but that shouldn't be necessary.
Avatar of vgdexex

ASKER

I did mark all the messages in the IML as repaired. The next reboot is scheduled for Thursday, so I'll get back with the results then.
Avatar of vgdexex

ASKER

Firmware for the array-controller has been successfully updated to version 5.70. Did not install any other firmware updates.

However, the messages about the power supply not being fully redundant remain, as does the notification about the arrays, and this time there was a new one:

"POST Error: 1716-Slot X Drive Array - Unregenerable Media Errors Detected on Drives during previous Rebuild or Auto-Reliability Monitoring (ARM) scan. Problem will be fixed automatically when the sector(s) are overwritten."

All the event messages are attached.

As before, HP System Management reports that everything is OK with the arrays and power supply.

If I interpret the new POST error correctly, it will resolve itself given time?
LOG2.txt
Is that after a full power-cycle?
It is still complaining about one power supply. That supply may be on its way out and unable to support a full load. POST is a point at which the current draw is quite high. I also think you have a marginal drive.
The machine isn't complaining about it though, or it would be in the IML. It's just in the Windows event log, which is why I asked whether it was a power-cycle as opposed to a soft reboot. Can't tell how the machine feels without a new IML uploaded.
Avatar of vgdexex

ASKER

The server was completely shut down before powering up again.

Attached the remaining logs, although when checking the IML I could see no mention of the power supplies.
ArrayDiagnosticUtilityLog2.zip
InsightDiagnosticLog2.html
InsightDiagnosticIntegratedLog2.html
The problem is the error listed below. It means that there was a read error on the remaining disk during a rebuild, so the parity/mirror data couldn't be rebuilt. To complete the rebuild, the controller skipped over the bad block. Until that block is overwritten there's not much you can do. It's quite likely that a chkdsk was run previously, which would have flagged that block as bad to the OS; unfortunately, that means the OS will never attempt to write to that sector, so the error will never clear. The only way around it is to back up, delete the array, create a new one and then restore. Bit of a pain I'm afraid. RAID isn't perfect: it doesn't protect against one disk with a bad block and another one that has failed.

Post Error - Message: 1716-Slot X Drive Array - Unregenerable Media Errors Detected on Drives during previous Rebuild or Auto-Reliability Monitoring (ARM) scan. Problem will be fixed automatically when the sector(s) are overwritten. - Error: 223
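To illustrate why a rebuild can leave a block flagged until it is overwritten, here is a toy Python model of a two-disk mirror. It is only a sketch of the reasoning in this post, not how the Smart Array firmware actually works; `rebuild_mirror` and `host_write` are hypothetical names, and an unreadable sector is modelled simply as `None`.

```python
from typing import List, Optional, Set, Tuple

Block = Optional[bytes]  # None models a sector the drive cannot read


def rebuild_mirror(survivor: List[Block]) -> Tuple[List[bytes], Set[int]]:
    """Copy the surviving disk onto a replacement, skipping unreadable blocks.

    Returns the replacement's contents and the set of block numbers that
    could not be regenerated because no second copy exists to read from.
    """
    replacement: List[bytes] = []
    unrecoverable: Set[int] = set()
    for lba, data in enumerate(survivor):
        if data is None:
            # Nothing to copy from: write a placeholder and remember the block.
            replacement.append(b"\x00")
            unrecoverable.add(lba)
        else:
            replacement.append(data)
    return replacement, unrecoverable


def host_write(unrecoverable: Set[int], lba: int) -> None:
    """A host write supplies fresh data, so the block is no longer flagged."""
    unrecoverable.discard(lba)


survivor = [b"a", b"b", None, b"d"]   # block 2 has a media error
new_disk, bad = rebuild_mirror(survivor)
print(sorted(bad))                    # block 2 is flagged after the rebuild

host_write(bad, 0)                    # writing elsewhere changes nothing
print(sorted(bad))

host_write(bad, 2)                    # overwriting the flagged block clears it
print(sorted(bad))
```

In this model the flag only goes away when the host writes to that exact block, which is why a file system that avoids the sector would leave the error in place.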
Avatar of vgdexex

ASKER

Would it be possible to take the disk with the bad block, wipe it completely and then reinsert it into the array to be rebuilt, or is that not an option?

Backing up and then restoring the array does sound like quite a bit of pain to perform, but leaving the error in place will probably cause even bigger headaches later on.
ASKER CERTIFIED SOLUTION
Member_2_231077

Avatar of vgdexex

ASKER

It is quite possible that Windows has marked the sectors as bad.
Found the following events in the system-log.


Event ID: 24685 Source: Cissesrv
Array controller P410i [Embedded] has reported an uncorrectable read error during surface analysis operations for logical drive 1. A media error was encountered that is not correctable due to media errors on other physical drive(s) belonging to this logical volume. The uncorrectable media defects are between logical block address 11305728 and logical block address 11305983. The host will be unable to read some blocks between this address range until the blocks are overwritten. Capacity expansion operations must be avoided while the blocks are unreadable.

It repeats for a number of different address ranges. I doubt listing all the ranges would yield any additional information. (This event has been logged 4777 times in the past 24 hours.)
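With thousands of these events, it may help to pull the address ranges out of the message text and merge them to see how much of the logical drive is affected. The Python sketch below does that under two assumptions that are not confirmed in this thread: that the reported range is inclusive, and that a logical block is 512 bytes. The function names are hypothetical, and the regular expression is based only on the wording of the event quoted above.

```python
import re
from typing import Iterable, List, Tuple

SECTOR_BYTES = 512  # assumption: 512-byte logical blocks

# Matches the address range wording used in the Cissesrv event text above.
RANGE_RE = re.compile(
    r"between logical block address (\d+) and logical block address (\d+)"
)


def extract_ranges(messages: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) LBA pairs found in the event message texts."""
    ranges = []
    for text in messages:
        match = RANGE_RE.search(text)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


def merge_ranges(ranges: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or directly adjacent inclusive ranges."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def range_bytes(start: int, end: int) -> int:
    """Size in bytes of an inclusive LBA range at the assumed sector size."""
    return (end - start + 1) * SECTOR_BYTES


events = [
    "The uncorrectable media defects are between logical block address "
    "11305728 and logical block address 11305983.",
]
for start, end in merge_ranges(extract_ranges(events)):
    print(start, end, range_bytes(start, end))
```

For the single range quoted above this works out to 256 blocks, or 128 KiB at the assumed sector size.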

Running a defrag on the partitions that reside on the array.

Backing up the information on the Array and then rebuilding it does sound like it would clear the errors, but it will take quite some time to perform.
The newer Smart Array controllers have a licensed feature that allows you to move a logical disk from one array to another which would sort it out, but no use for your controller I'm afraid.
Avatar of vgdexex

ASKER

The server is currently performing defragmentation on the partitions on the affected array.
If that doesn't help, the only option would be to rebuild the array.
Avatar of vgdexex

ASKER

Unfortunately, a complete defragmentation did not solve the issue.

Scheduling a rebuild of the array won't be possible for a while because it is a lengthy operation, so it will take some time before I can report any results.
How long are questions usually left open without any further replies?
You can leave it open as long as you like, until a cleanup volunteer posts a reminder or you end up with too many open questions and can't post new ones.
Avatar of vgdexex

ASKER

Alright,

I'll be back when I have more information.
Avatar of vgdexex

ASKER

Was finally able to rebuild the array as suggested, unfortunately it did not solve the issue.

These steps were performed:
- Boot the server with an Ubuntu desktop CD.
- Perform a complete dd of the affected disk to an external drive (turning it into a disk image).
- Copy the disk-image back onto the array.
- Reboot.
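The exact dd command used is not shown in this thread. As an illustration of what the imaging step does, here is a minimal Python sketch of a chunked copy that zero-fills any chunk it cannot read, which is roughly the idea behind `dd conv=noerror,sync`. The function name and the example paths are hypothetical, and this is not a replacement for dd on a production server.

```python
import os
from typing import List


def copy_image(src_path: str, dst_path: str, chunk_bytes: int = 1 << 20) -> List[int]:
    """Copy src to dst chunk by chunk, zero-filling chunks that fail to read.

    Returns the byte offsets of the chunks that were short or unreadable.
    """
    failed: List[int] = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        total = src.seek(0, os.SEEK_END)
        offset = 0
        while offset < total:
            want = min(chunk_bytes, total - offset)
            try:
                src.seek(offset)
                data = src.read(want) or b""
            except OSError:
                data = b""
            if len(data) < want:
                # Keep offsets aligned by padding the missing part with zeros.
                failed.append(offset)
                data = data.ljust(want, b"\x00")
            dst.write(data)
            offset += want
    return failed


# Hypothetical usage; both paths are placeholders, not taken from the thread:
# bad_offsets = copy_image("/dev/sdX", "/mnt/external/array.img")
```

Note that a copy like this only moves the data; it does not delete or re-create the logical drive on the controller.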

While it booted without problem afterwards, the errors still returned in the log.

What other options are there to pursue?
You didn't delete the array/logical disk and re-create it?
Just to confirm, which errors returned in the log? Both the PSU and array messages?
Avatar of vgdexex

ASKER

It would seem as if I failed to follow the very easy instructions.
No, we didn't delete the array after copying the data to another drive.

And yes, both errors, the PSU and array messages, returned.
Perhaps you'll have to clear down the IML to get rid of the PSU message. Presumably you've verified that both PSUs work by unplugging power from one and then the other (which will put errors in the IML again, of course).
Avatar of vgdexex

ASKER

A little update.

After installing additional battery-backed memory on the array controller, the affected array suddenly began rebuilding itself. Now that it is done, the Array-Configuration manager predicts that both drives in the affected array are going to fail sometime soon.

By the sound of this additional information, it would be a good idea to get replacement disks as soon as possible.
Avatar of vgdexex

ASKER

Unfortunately, we couldn't get this issue solved despite the help we got here.

The server has been scheduled for replacement.
Avatar of vgdexex

ASKER

This would most likely have solved the issue, but I am unable to verify it.