Lukas Kaderavek


HPE ProLiant ML350 Gen10 | VMware ESXi 6.0.0 Update 3 (Build 7967664)

Hello,
today we received these messages from iLO:
1. HPE iLO 5 AlertMail-043: (CRITICAL)  Slot 0 Smart Array - Drive is failed: Port 1I Box 3 Bay 2
2. HPE iLO 5 AlertMail-044: (CAUTION) Slot 0 Smart Array - Logical drive status changed to recovering

We were able to get a new HDD because the server is still covered by a Care Pack.

I wanted to check the controller/array state via the CLI and ran:
./ssacli ctrl all show detail
and got:
Adapter: Microsemi Corp (Error: Not responding)
   Driver Name: smartpqi
   Driver Version: 1.0.2.1028
   PCI Address (Domain:Bus:Device.Function): 0000:3B:00.0

In the vSphere Client, every drive in the array shows:
Unspecified 0 Dr_Stat_2I3_B007: In Critical Array - Assert

For Chassis:
System Chassis 1 SysHealth_Stat - Transition to non-critical from OK

Right now I don't know what to do... I'm a little scared to replace the HDD tomorrow.
Before that I will run a full backup with Veeam and try restarting the host to get the controller running again. Hopefully the host comes back up. If not... Happy Halloween...

I couldn't find anything about this situation, so I'm asking you and hoping for an answer.

Thanks in advance.

Lukas
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

What RAID level is defined?
Lukas Kaderavek

ASKER

I can't say exactly; I think it is RAID 5+0.
If the RAID set is fault tolerant, you can replace the disk.
Member_2_231077

I can read through an ADU report for you if you want, although generating one may not work while the adapter is not responding to ssacli:

hpssacli ctrl all diag file=/tmp/ADUreport.zip ris=off xml=off zip=on

(HPE's instructions say to turn XML and RIS on, but they just make the report harder to read.)
Hi,

My comments:

In vSphere Client for all drives inside array:
Unspecified 0 Dr_Stat_2I3_B007: In Critical Array - Assert

I suppose that is normal, because the whole RAID array is at risk: with RAID 5 you can lose one disk (which has already happened) but not two.
So "critical" should mean your data is now unprotected.
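To make the "critical" state concrete, here is a minimal Python sketch of how many further drive failures an array can absorb. The table and the function name remaining_tolerance are illustrative assumptions, not something the controller or ssacli exposes; the figures are the usual guaranteed tolerances for each RAID level.

```python
# Guaranteed number of simultaneous drive failures each RAID level survives.
# Illustrative table only; nested levels such as RAID 50 depend on which
# parity group the failed drives belong to and are left out on purpose.
GUARANTEED_TOLERANCE = {"RAID 0": 0, "RAID 5": 1, "RAID 6": 2}


def remaining_tolerance(level: str, failed_drives: int) -> int:
    """Return how many more drives may fail before data is lost."""
    tolerance = GUARANTEED_TOLERANCE[level]
    if failed_drives > tolerance:
        raise ValueError(f"{level} cannot survive {failed_drives} failed drives")
    return tolerance - failed_drives


if __name__ == "__main__":
    for level in ("RAID 5", "RAID 6"):
        more = remaining_tolerance(level, 1)
        print(f"{level} with one failed drive can lose {more} more")
```

With one drive already failed, RAID 5 has no margin left, while RAID 6 can still lose one more drive.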


Right now I don't know what to do...I'm a little scared to change the HDD tomorrow.
I don't understand why.
You have a failed drive and a brand-new drive waiting to be inserted, don't you?
The failed drive should show an error LED.
Remove it and insert the new one.

A good backup is mandatory in any case, and even more so before this type of operation. That said, an HDD replacement carries little risk; it happens all the time.

You can find the RAID type and logical drive status in iLO under System Information / Storage.
The one caveat is that you should not replace the drive while the array is rebuilding.
Sometimes a drive fails, the array rebuilds, and the drive stays alive for a while longer.
Or the drive fails again, and then you get an error LED and can replace it.
What do the physical HDD LEDs on the front of the server show?
Hello,

First of all, thank you for your input.

@AndrewHancock: Yes, it has fault tolerance, and yes, a replacement disk is available. The last replacement was in 08/2019, and back then there was no problem getting the status of the controller, array and disks from ssacli.

@AndyAlder: I already tried that and here's the output:
[root@S-ESX-1:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl all diag file=/tmp/ADUreport.zip ris=off xml=off zip=on

Error: An ADU report cannot be generated in VMware ESXi using the diag command
       due to limitations of the operating system. You must use the SSADUESXI
       utility executed from a remote machine in order to obtain the report.

       Obtain and install the Smart Storage Diagnostics Utility (SSADU) CLI.

       Please see the VMware Utilities User Guide for more information.

@MrTorture: Yes, replacing a disk is no problem, and as I mentioned we have already done this on this machine. The array is built to tolerate two faulty disks. And I understand that during a rebuild no other disk should fail or be replaced.

@AndyAlder: I tried ssacli for Windows, which does not work since all machines are virtualized, and I can only do this while I'm onsite.
So I tried the VMware vSphere CLI and got:
Error: The controller identified by "slot=0" was not detected.

We have already spoken with HP and their technical service representative. Tomorrow afternoon a technician will come and replace the HDD, after the backup and a hopefully clean reboot that brings ssacli and the controller back.
Otherwise the controller needs to be replaced as well, with or without a storage restore. That really is the worst case, but it has to be handled under these conditions.

I'll come back with updates...
It might be worth their time to bring a spare controller as well, although it sounds like a software issue.
They will bring a spare controller as well; sorry, I forgot to mention that.
And yes, to me it also looks like a software issue, but it is still a bad situation. I thought someone might already have seen such an issue on an HPE Gen10 with ESXi 6.0 and could put my mind at ease.
Hello,
the array has Unrecoverable Read Errors.
It is RAID 6 (ADG) across 7 disks.
Rebuild and restore (unless the support person has some special skills).
The controller should fix UREs in the background when idle, but when is a VMware server idle?

The best thing to do after the restore is to generate a weekly ADU report and search it for "read errors hard". Then at least you'll know you have a problem, since disk manufacturers may not treat UREs as a predictive failure, and neither does the controller, although diagnostics will fail the drive.
The HP tech replaced the controller as well; the rebuild is running right now.
He told us two things could happen: either the error is gone after the rebuild, once the sectors are overwritten, or another disk fails shortly after the rebuild, in which case it will be replaced as well.

Otherwise a full backup and restore is necessary...

Could you please help me get an ADU report out of ESXi 6 and ssacli? None of the commands I tried work.
I've managed to create a report remotely from Windows:

ssaduesxi.exe --server=IPADRESS --thumbprint=XX:E4:XX:14:XX:40:XX:4A:XX:CE:XX:52:XX:75:XX:57:XX:A8:XX:D9 --user=USER --password=PASSWORD --file=adu-report_29012020.zip --log

Attached the report
adu-report_29012020.zip
That's just the serial log, which is very hard to read through. I think you have to omit --log to get the normal report.
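As a rough illustration of the invocation without --log, here is a small Python sketch that assembles the argument list for a dated report. The option names are the ones used in the command posted above; the function name build_ssaduesxi_args, the ISO-dated file name and the sample values are assumptions made for illustration, and nothing here has been checked against the utility itself.

```python
from datetime import date


def build_ssaduesxi_args(server, thumbprint, user, password, day):
    """Assemble an ssaduesxi command line for a normal (non --log) report."""
    # Hypothetical naming scheme: one report file per calendar day.
    report_name = f"adu-report_{day.isoformat()}.zip"
    return [
        "ssaduesxi.exe",
        f"--server={server}",
        f"--thumbprint={thumbprint}",
        f"--user={user}",
        f"--password={password}",
        f"--file={report_name}",
        # --log is deliberately left out so the normal report is produced.
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute the real host details.
    args = build_ssaduesxi_args("192.0.2.10", "AA:BB:CC", "USER", "PASSWORD", date.today())
    print(" ".join(args))
```

The resulting list could be handed to subprocess.run(args, check=True) on the Windows machine where the utility is installed, for example from a weekly scheduled task.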
Here you go…rebuild is finished!
forgot to upload...
adu-report_29012020_1.zip
I don't like this one, but this is only a first glance.

Serial Number                        6830A0J9F9ZF
   Firmware Revision                    HPD3
   Product Revision                     HP      EH000900JWHPP  
   Reference Time                       0x000bddfc
   Sectors Read                         0x000000313eb1be61
   Read Errors Hard                     0x00000030
   Read Errors Retry Recovered          0x00000000
   Read Errors ECC Corrected            0x0000000000000001
   Sectors Written                      0x00000022e157aa9e
   Write Errors Hard                    0x00000000
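The earlier advice to search each report for "Read Errors Hard" can be sketched in Python as follows. This is only a sketch: it assumes the report text has the same "name, whitespace, value" layout as the excerpt above and that each drive block starts with a "Serial Number" line; the function name and the embedded sample are illustrative, and a full ADU report may contain other "Serial Number" lines that need extra filtering.

```python
import re

# Sample copied from the excerpt in this thread (a few fields omitted).
SAMPLE = """\
Serial Number                        6830A0J9F9ZF
   Firmware Revision                    HPD3
   Sectors Read                         0x000000313eb1be61
   Read Errors Hard                     0x00000030
   Read Errors Retry Recovered          0x00000000
   Write Errors Hard                    0x00000000
"""

# Only the two fields needed here: which drive, and its hard read error counter.
FIELD = re.compile(r"^\s*(Serial Number|Read Errors Hard)\s+(\S+)\s*$")


def drives_with_hard_read_errors(report_text):
    """Map drive serial number -> hard read error count (non-zero only)."""
    flagged = {}
    serial = None
    for line in report_text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        name, value = match.groups()
        if name == "Serial Number":
            serial = value
        elif serial is not None:
            count = int(value, 16)  # counters are printed as hexadecimal
            if count:
                flagged[serial] = count
    return flagged


if __name__ == "__main__":
    print(drives_with_hard_read_errors(SAMPLE))
```

For the excerpt above this flags drive 6830A0J9F9ZF with 48 hard read errors, since 0x30 is 48 in decimal.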
OK, I will look into which drive it is and try to get a replacement.

Since it is RAID 6 and two disks can fail, could the errors be gone after the replacement and rebuild?
"
Logical Drive 1 Warning  This logical drive has Unrecoverable Media Errors Detected on Drives during previous Rebuild or Background Surface Analysis (ARM) scan.
Errors will be fixed automatically when the sector(s) are overwritten.
Backup and Restore are recommended.
"

Without building a new array and restoring a backup onto it, it's pretty hard to get rid of that. The errors will be fixed when the sectors are overwritten, but the VMs may have added the bad blocks to their bad block tables, and after that they won't write to those sectors. You would have to shuffle whole VMDKs from one part of the disk to another, cloning the VMs and deleting the originals, until the sectors get overwritten. At least you can do that with VMware, as it doesn't run fsck AFAIK; you can't do it under Windows, because chkdsk puts the blocks in the bad block table without forcing an overwrite of the sector.
OK, then I will move the VMs to an iSCSI LUN or another suitable ESXi host and rebuild the array.
Thanks for your help.
Lukas