
HPE ProLiant ML350 Gen10 | VMware ESXi 6.0.0 Update 3 (Build 7967664)

Hello,
Today we received the following messages from iLO:
1. HPE iLO 5 AlertMail-043: (CRITICAL) Slot 0 Smart Array - Drive is failed: Port 1I Box 3 Bay 2
2. HPE iLO 5 AlertMail-044: (CAUTION) Slot 0 Smart Array - Logical drive status changed to recovering

We've managed to get a new HDD since the server is still covered by a Care Pack.

I wanted to check the controller/array state via the CLI and ran:
./ssacli ctrl all show detail
and got:
Adapter: Microsemi Corp (Error: Not responding)
Driver Name: smartpqi
Driver Version: 1.0.2.1028
PCI Address (Domain:Bus:Device.Function): 0000:3B:00.0

In vSphere Client for all drives inside array:
Unspecified 0 Dr_Stat_2I3_B007: In Critical Array - Assert

For Chassis:
System Chassis 1 SysHealth_Stat - Transition to non-critical from OK

Right now I don't know what to do... I'm a little scared to change the HDD tomorrow.
Before that I will do a full backup with Veeam and try restarting the host to get the controller running again. Hopefully the host comes up. If not... Happy Halloween...

I didn't find anything about this situation, so I'm asking you, hoping for an answer.

Thanks in advance.

Lukas

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

What RAID level is defined?

Lukas Kaderavek

ASKER

I can't say exactly - I think it is RAID 5-0.

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

If the RAID set is fault tolerant, you can replace the disk.

Member_2_231077

I can read through an ADU report for you if you want, although it may not work with the adapter not responding to ssacli:

hpssacli ctrl all diag file=/tmp/ADUreport.zip ris=off xml=off zip=on

(HPE's instructions say to set XML and RIS on, but they just make the report harder to read.)

Mr Tortu(r)e

Hi,

My comments:

In vSphere Client for all drives inside array:
Unspecified 0 Dr_Stat_2I3_B007: In Critical Array - Assert

I suppose this is normal, since the whole RAID array is at risk: if it is RAID 5 you can lose 1 disk (which has already happened) but not 2.
So "critical" should mean your data is now unprotected.

Right now I don't know what to do... I'm a little scared to change the HDD tomorrow.

I don't understand why.
You have a failed drive, and you have a brand new drive waiting to be inserted, don't you?
You should have an error LED on the failed drive.
Remove it and put the new one in.

Also, it is mandatory to have a good backup, and even more so before this type of operation. There is not a big risk in an HDD replacement; it happens all the time.

You can find the RAID type and logical drive status in iLO under System Information / Storage.

Mr Tortu(r)e

The only thing is that if the array is being rebuilt, you should not replace the drive at that time.
Sometimes a drive fails, then the array rebuilds and the drive stays alive for a while,
or the drive fails again and then you have an error LED and can replace it.
What do the physical HDD LEDs on the front of the server show?

Lukas Kaderavek

ASKER

Hello,

first of all thank you for your input.

@AndrewHancock: Yes, it has fault tolerance, and yes, disk replacement is available. The last time was in 08/2019, and there was no problem getting the status of the controller, array and disks out of ssacli.

@AndyAlder: I already tried that and here's the output:
[root@S-ESX-1:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl all diag file=/tmp/ADUreport.zip ris=off xml=off zip=on

Error: An ADU report cannot be generated in VMware ESXi using the diag command
due to limitations of the operating system. You must use the SSADUESXI
utility executed from a remote machine in order to obtain the report.

Obtain and install the Smart Storage Diagnostics Utility (SSADU) CLI.

Please see the VMware Utilities User Guide for more information.

@MrTorture: Yes, it is no problem to replace a disk, and as I mentioned we have already done this on this machine. The array is built to tolerate two failed disks. And I understand that during a rebuild no other disk should fail or be replaced.

@AndyAlder: I tried ssacli for Windows, which doesn't work since all machines are virtualized, and I can only do this while I'm on site.
So I tried the VMware vSphere CLI and got:
Error: The controller identified by "slot=0" was not detected.

We already spoke with HP and their technical service representative. Tomorrow afternoon a technician will come and change the HDD after a backup and a hopefully clean reboot, with ssacli and the controller working again.
Otherwise the controller needs to be changed as well, with or without a storage restore. That is the worst case, but it has to be managed under these conditions.

I'll come back with updates...

Member_2_231077

Might be worth their time to bring a spare controller as well although it sounds like a software issue.

Lukas Kaderavek

ASKER

They will bring a spare controller as well, sorry, I forgot to mention that.
And yes, to me it also seems like a software issue, but it is still a bad situation. I had hoped that someone had already seen such an issue on an HPE Gen10 with ESXi 6.0 and could put my mind at ease.

Lukas Kaderavek

ASKER

Hello,
The array has Unrecoverable Read Errors.
And it is RAID 6 (ADG) across 7 disks.

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Rebuild and restore (unless the support person has any special skills).

Member_2_231077

Controller should fix UREs in background when idle; but when is a VMware server idle?

The best thing to do after the restore is to generate a weekly ADU report and search it for "read errors hard". Then at least you'll know you have a problem, since disk manufacturers may not treat UREs as a predictive failure, and neither does the controller, although diagnostics will fail the drive.
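
The weekly check could be scripted roughly as follows. This is a minimal sketch, assuming the ADU report has been extracted to plain text and that each drive section has a "Serial Number" line followed later by a "Read Errors Hard" line with a hexadecimal value; the layout of a real report may differ. The function name find_hard_read_errors and the sample serials are hypothetical.

```python
import re

# Assumed line formats (adjust to the real report layout):
#   Serial Number <serial>
#   Read Errors Hard 0x<hex count>
SERIAL_RE = re.compile(r"^\s*Serial Number\s+(\S+)", re.IGNORECASE)
HARD_RE = re.compile(r"^\s*Read Errors Hard\s+(0x[0-9a-fA-F]+)", re.IGNORECASE)


def find_hard_read_errors(report_text):
    """Return {serial: count} for every drive with a non-zero hard read error count."""
    flagged = {}
    serial = None  # most recently seen serial number
    for line in report_text.splitlines():
        match = SERIAL_RE.match(line)
        if match:
            serial = match.group(1)
            continue
        match = HARD_RE.match(line)
        if match:
            count = int(match.group(1), 16)  # counters are hexadecimal
            if count > 0:
                flagged[serial or "unknown"] = count
    return flagged


if __name__ == "__main__":
    # Hypothetical sample text, not taken from a real report.
    sample = (
        "Serial Number DRIVE-A\n"
        "Read Errors Hard 0x00000000\n"
        "Serial Number DRIVE-B\n"
        "Read Errors Hard 0x00000030\n"
    )
    print(find_hard_read_errors(sample))
```

Any drive this returns is worth a closer look, even if the controller has not flagged it as predictive failure.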

Lukas Kaderavek

ASKER

The HP technician changed the controller as well; right now the rebuild is running.
The technician told us two things could happen: either the error is gone after the rebuild once the sectors are overwritten, or another disk fails shortly after the rebuild, in which case it will be replaced as well.

Otherwise a full backup and restore is necessary...

Could you please help me get an ADU report out of ESXi 6 and ssacli? None of the commands I tried work.

Lukas Kaderavek

ASKER

I've managed to generate a report via esxcli on Windows:

ssaduesxi.exe --server=IPADDRESS --thumbprint=XX:E4:XX:14:XX:40:XX:4A:XX:CE:XX:52:XX:75:XX:57:XX:A8:XX:D9 --user=USER --password=PASSWORD --file=adu-report_29012020.zip --log

Attached the report
adu-report_29012020.zip

Member_2_231077

That's just the serial log, which is very hard to read through. I think you have to omit --log to get the normal report.
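
To make the difference between the two invocations explicit, here is a small sketch that assembles the ssaduesxi.exe argument list using only the options that appear in the command quoted in this thread (--server, --thumbprint, --user, --password, --file, --log). The helper name build_ssaduesxi_args and its parameters are hypothetical, the values are placeholders, and leaving out --log is only the suggestion made here, not something checked against HPE's documentation.

```python
def build_ssaduesxi_args(server, thumbprint, user, password, report_file, serial_log=False):
    """Build the argument list for ssaduesxi.exe (options as quoted in this thread)."""
    args = [
        "ssaduesxi.exe",
        f"--server={server}",
        f"--thumbprint={thumbprint}",
        f"--user={user}",
        f"--password={password}",
        f"--file={report_file}",
    ]
    if serial_log:
        # Per the comment above, --log seems to produce only the serial log.
        args.append("--log")
    return args


if __name__ == "__main__":
    # Placeholder values only; nothing is executed here.
    cmd = build_ssaduesxi_args("IPADDRESS", "THUMBPRINT", "USER", "PASSWORD", "adu-report.zip")
    print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run on the Windows machine where the utility is installed.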

Lukas Kaderavek

ASKER

Here you go…rebuild is finished!

Lukas Kaderavek

ASKER

forgot to upload...
adu-report_29012020_1.zip

Member_2_231077

I don't like the look of this drive, though this is only a first glance:

Serial Number 6830A0J9F9ZF
Firmware Revision HPD3
Product Revision HP EH000900JWHPP
Reference Time 0x000bddfc
Sectors Read 0x000000313eb1be61
Read Errors Hard 0x00000030
Read Errors Retry Recovered 0x00000000
Read Errors ECC Corrected 0x0000000000000001
Sectors Written 0x00000022e157aa9e
Write Errors Hard 0x00000000
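
The counters in this excerpt are hexadecimal. As a quick reading aid, the following sketch converts the values quoted above to decimal; the dictionary simply copies the figures from the excerpt, and the variable names are illustrative.

```python
# Values copied from the report excerpt above (hexadecimal strings).
counters_hex = {
    "Sectors Read": "0x000000313eb1be61",
    "Read Errors Hard": "0x00000030",
    "Read Errors Retry Recovered": "0x00000000",
    "Read Errors ECC Corrected": "0x0000000000000001",
    "Sectors Written": "0x00000022e157aa9e",
    "Write Errors Hard": "0x00000000",
}

# int(value, 16) accepts the "0x" prefix when the base is 16.
counters = {name: int(value, 16) for name, value in counters_hex.items()}

for name, value in counters.items():
    print(f"{name}: {value}")
```

Read Errors Hard 0x30 is 48 in decimal, so this drive has logged 48 hard read errors while the write error counter is still zero.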

Lukas Kaderavek

ASKER

OK, I will look into which drive it is and try to get a replacement.

Since it is RAID 6 and can tolerate two failed disks, could the errors be gone after replacement and rebuild?

Member_2_231077

"
Logical Drive 1 Warning This logical drive has Unrecoverable Media Errors Detected on Drives during previous Rebuild or Background Surface Analysis (ARM) scan.
Errors will be fixed automatically when the sector(s) are overwritten.
Backup and Restore are recommended.
"

Without building a new array and restoring a backup onto it, it's pretty hard to get rid of that. The errors will be fixed if the sectors are overwritten, but the VMs may have added the bad blocks to their bad block tables, and after that they won't write to those sectors. You would have to shuffle whole VMDKs from one part of the disk to another by cloning the VMs and deleting the originals until the sectors get overwritten. At least you can do that with VMware, as it doesn't run fsck AFAIK. You can't do it under Windows, as chkdsk puts them in the bad block table without forcing an overwrite of the sector.

Lukas Kaderavek

ASKER

OK, then I will move the VMs to an iSCSI LUN or another suitable ESXi host and rebuild the array.
Thanks for your help.
Lukas
