Can't boot ML370 G5 server after one RAID 5 drive failed and a second shows predicted failure.
HP ProLiant ML370 G5 server, degraded RAID 5 volume: one drive dead, a second in predicted failure. I need to recover the server, which currently can't boot. The Array Configuration Utility shows "Code 786: Background parity initialization queued or in progress on Logical Drive 1." I tried, and the server currently can't boot into its SBS 2003 OS.
If the second disk is predicted to fail, it could already be too late, as it may be more than a prediction by now. Besides that, RAID 5 isn't very reliable, and during a rebuild it is quite likely that other disks will fail in the process.
So, first you should move to another raid type that is more resilient.
You should also immediately get another OS; SBS 2003 is antique and has been out of support for years. Using such an out-of-date OS is totally irresponsible and careless. You open your gates to all possible malware, spyware, viruses, etc., as there are no security patches available for that OS.
You'll certainly want to look at an OS update - as Rindi said, 2003 is downright dangerous.
Maybe take a look at RAID10 for your next array setup.
You're probably going to need to recover from backups. If you don't have those, you may end up having to try and salvage data from the existing RAID 5 using 3rd party tools.
Presumably you are using the SmartStart CD to get into the ACU. There is an ADU on there too (or a diagnostics tab in the ACU) where you can generate an ADU report to a USB stick. If you upload it, we can look through it and shed more light on the problem.
ASKER
This is a new customer who approached us to try to recover the OS and get the server to boot. They only have a dump of their accounting database, and no other backups.
Sure they need a new server, but they would like to try to recover this one before getting a new one.
When I try to boot, it's telling me there's no system disk, ADU diagnostics is saying that the volume 1 is up (RAID 5), and that there's a boot record present.
I was trying to repair the OS from the Windows SBS 2003 CD 1, but it asks for RAID controller drivers (F6) and can't read them off the USB floppy drive I used. When I tried to install a regular floppy drive in this server, I realized that the floppy drive cable has the second pin blocked on one end and the third pin on the other end, so I couldn't connect the data cable to the floppy drive. I guess it is an HP proprietary cable?
ASKER
Sure, I was trying to install the floppy drive since my USB floppy drive didn't do the job, some comments are saying that is because it is not an HP USB floppy drive.
"I was trying to repair the OS from the Windows SBS 2003 "
Keep in mind that the longer you run the disk with the predicted failure, the more likely it will fail further, perhaps completely. I'd look at booting from a USB stick (preferred) or a CD that has the RAID controller drivers available. I think the drivers have to be on a floppy for SBS 2003, but you could use something newer that doesn't have the requirement (and may even include the drivers). Once booted, check the array to see how readable it is. If it is, I'd back up critical data, then try to do a full backup to an external drive using whatever full backup program you prefer. If successful (which is questionable but possible), then you could look at restoring the backup to a new set of drives.
"Failed plus predicted failure" usually means that the predictive-failure drive was kicked out of the array long ago, and the recently failed drive was the last straw.
As others mentioned backups are ......
ASKER
I have already replaced the failed SAS 73 GB drive, which might have failed a long time ago. I also added a hot spare drive for volume 1. When I boot to Acronis, it can't see a volume to back up, but it sees the RAID 5 volume as something to add as an additional disk.
In the RAID controller, how many disks are in the RAID 5 volume, and how many are OK?
You have to look through the log to see which disk was kicked out last. If the drive you replaced was the last one, is it really dead?
On HPE MegaRAID/Smart Array controllers, it is a tricky situation.
The answer to arnold's question "how many disks and how many are ok?" is important here.
"When I boot to Acronis it can't see a volume to backup, but sees RAID 5 volume to add as additional disk." What size volume does it see? How does it see the disk as being configured?
ASKER
As I said earlier, one drive failed a long time ago. The second drive is a "predictive failure". The RAID controller disabled the logical drive to prevent damage. The third drive seemed OK. Once I replaced the drive that failed a long time ago, I went ahead and re-enabled the logical drive, which has been showing up since then. I tried to boot the server after that and it didn't see a system (boot) disk.
After that I tried to repair the OS using the OS Windows SBS 2003 setup CD, and I stopped when I couldn't load the RAID controller driver (P400).
ASKER
The original RAID 5 was built from three 73 GB 10k SAS disks. I didn't see the rebuild happen at all. I might have to check the logs to see if it did rebuild, but the RAID controller is still showing this informational status message:
Code 786, Background parity initialization is currently queued or in progress on Logical Drive 1 (136.7 GB, RAID 5). If background parity initialization is queued, it will start when I/O is performed on the drive.
It would be very informative if you could boot from a USB stick with a version of Windows that already has the drivers for the RAID controller. Server 2019 likely does, and Windows 10 may also. Once booted, you can use diskpart to confirm which disks are seen (list disk) and which volumes are seen (list vol). Note the drive letter of the largest volume. Exit diskpart and do a dir /s on that drive letter. That will tell you a lot about the state of the array.
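If the rescue environment you boot happens to have Python available (an assumption on my part, a plain Windows install USB will not), the same "dir /s" style check can be sketched as a short script. It only reads directory listings and file sizes, never writes, and collects every path that throws an error. The function name survey_tree is made up for this illustration.

```python
import os

def survey_tree(root):
    """Walk root read-only, counting what is listable and noting what is not."""
    stats = {"dirs": 0, "files": 0, "bytes": 0, "errors": []}

    def on_error(exc):
        # os.walk calls this when a directory cannot be listed
        stats["errors"].append((exc.filename, str(exc)))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        stats["dirs"] += 1
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stats["bytes"] += os.path.getsize(path)
                stats["files"] += 1
            except OSError as exc:
                # File entry exists but its metadata cannot be read
                stats["errors"].append((path, str(exc)))
    return stats
```

Calling survey_tree("E:\\") (the drive letter is only an example) and looking at the length of stats["errors"] gives a rough picture: a long error list, or a nearly empty tree on a 136 GB volume, suggests the file system is in poor shape.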
"it will start when I/O is performed on the drive": that implies to me that you need to access the volume to initiate the process. I don't know if that is absolutely correct.
Somewhere in the controller you can see the status of the rebuild. If it is stuck at 0% for a while, that would indicate to me that it won't start until you access the drive. If it is between 0% and 100%, it is rebuilding.
Keep in mind that the rebuild will be based on the data of the old drives, including the one with predictive failure. If it has any unreadable sectors, you could have more of a problem.
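To illustrate why unreadable sectors on a surviving drive matter, here is a toy sketch of single-parity reconstruction on a three-member set like this one (two data blocks plus one parity block per stripe). It is only an illustration of XOR parity in general, not the P400's actual on-disk layout, and the byte strings are invented sample data.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 3-drive RAID 5: two data blocks and their parity.
d0 = b"ACCOUNTS"
d1 = b"DATABASE"
parity = xor_blocks([d0, d1])

# The drive holding d1 is replaced; it is rebuilt from the two survivors.
rebuilt_d1 = xor_blocks([d0, parity])
assert rebuilt_d1 == d1

# If a survivor returns bad data for even one byte, the rebuilt block
# is silently wrong at that same offset.
damaged_d0 = b"ACCOXNTS"
bad_rebuild = xor_blocks([damaged_d0, parity])
assert bad_rebuild != d1
```

With only one parity block per stripe there is nothing left to cross-check against during the rebuild, which is why the state of the predictive-failure drive is so important here.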
If the volume were up, the system would have booted.
Perhaps the dead drive was a standalone boot drive?
ASKER
There were no standalone drives. This was originally a RAID 5 with three SAS 73 GB drives. I did run HDD Regenerator on volume 1 and it found no errors. The F8 Smart Array utility shows all three drives as OK. When I boot with the SmartStart CD, it shows the second drive as "predicted failure", but shows the RAID 5 volume there. I would like to start the rebuild, but I didn't see how.
It's not clear if the controller thinks that all three drives are part of the array and that all three drives are active parts of the array (as opposed to one being available for rebuild).
ASKER
I also added another hard drive as a hot spare for the RAID 5 volume, not showing on the top screen since I added it after I took that photo.
The next step is to try to access it from a different boot device (CD or USB stick). That will tell you a lot about the state of the files.
ASKER
Logical drive more info still points to Parity Initialization Status as "Queued"?
The failed drive was something else, which could explain the no-boot-device error.
ASKER
Well, I have replaced the failed drive, and I didn't replace the one that's "predicted fail". The reason why rebuild didn't start yet could be the status of that "predicted fail" drive. I tried to do the backup by booting into Acronis image backup, but it didn't see a volume to backup.
Check the controller's boot setting to see what it points to for the boot volume.
Have you tried booting the server with Linux (Ubuntu or Linux Mint) in hopes of loading the array?
ASKER
Is there a way to preload controller driver when booting with Acronis?
I haven't used Acronis for years, but I assume it does have such an option; how would you restore an image otherwise? But if the RAID array is down, making an image backup would probably not be possible, as the software would require access to the array and the disk partitions on it.
What might work is to rather use a disk cloning software, where you clone 1 disk to the other 1:1, & then repeat that with the other disk. But to do that, you would have to connect the disks to a non-RAID SAS controller to clone the disks. You'd also have to make really sure that you don't select the wrong disk to clone by mistake.
After that you might just be able to import the newly cloned disks into the RAID controller.
As far as I know this is a real hardware RAID controller, so you would boot from the array, & not a single disk.
ASKER
I have two rear USB ports, one is used for keyboard, the other one seems to not work. I could try the one where keyboard is by plugging the keyboard to the front USB. I still believe that I need an HP USB floppy drive for that to work.
ASKER
I could take the drive that's predicted fail to another server and try to do Acronis disk image of it and restore it to another (better) disk? Then I would put it back to this HP ML370 G5 server. Would that work?
No. That wouldn't work as I explained above. For Disk imaging to work (that is actually the wrong description, it should rather be "Partition Imaging"), you need access to the partitions, which you wouldn't have on a disk that was part of a RAID 5 Array.
Imaging one drive from a RAID set could be disastrous; cloned metadata on a disk with a different serial number may baffle the controller.
You could install a couple of new disks in spare bays, set up RAID 10 and install Windows onto them using SmartStart and get to the data that way, but depending on firmware and system ROM version it can be difficult to change the boot order. Not all firmware versions have this option:
ASKER
So, I could let's say add another two SAS drives, and Install a different OS, and then try to access the original RAID 5 volume? The server only has 4GB RAM, so I'll have to either increase / add more RAM or use the OS that could be installed with 4 GB of RAM.
Once that OS boots, it will give you access to the RAID 5 if it is online, which, going by the image, it is.
You had four disks in the system with a three-disk RAID 5 of 137 GB.
Not sure which other OS the ML370 G5 supports in terms of hardware ..... and Drivers.
The issue/risk, since you added a hot spare, is getting the rebuild going from the predictive-failure drive to the hot spare. The other concern is that the drive that died last was not part of the RAID 5 volume; otherwise you would have had a RAID 5 volume of 200+ GB with one failed drive.
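The capacity reasoning can be checked with a line of arithmetic: RAID 5 spends one member's worth of space on parity, so usable space is (members - 1) x member size. A small sketch, assuming the per-drive capacity implied by the 136.7 GB three-drive volume the controller reports (the function name is just for illustration):

```python
def raid5_usable_gb(member_count, member_gb):
    """Usable capacity of a RAID 5 set: one member's worth goes to parity."""
    if member_count < 3:
        raise ValueError("RAID 5 needs at least three members")
    return (member_count - 1) * member_gb

# Per-drive capacity implied by the reported 136.7 GB, 3-drive volume.
member_gb = 136.7 / 2

three_drive = raid5_usable_gb(3, member_gb)
four_drive = raid5_usable_gb(4, member_gb)
print(f"3 drives: about {three_drive:.0f} GB")  # about 137 GB
print(f"4 drives: about {four_drive:.0f} GB")   # about 205 GB
```

So a 136.7 GB volume is consistent with a three-member set, and a four-member set of the same drives would have shown up as roughly 205 GB.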
Use 2008/2012 or something Linux-based to boot the system and load the Smart Array drivers, then rescan and you should see the array, but since you are presenting the replaced drive as a hot spare......
Within the RAID controller/boot options, there should be a boot search order, which might help identify whether the single dead drive was the sole OS drive.
Or the issue is that the dead drive was the prior hot spare: a 3-drive RAID 5 with a hot spare.
When the predictive failure was detected, the system transitioned to the hot spare, which became active.
When this drive failed, the RAID should have shown a degraded state, but it does not.
If the controller, on boot, reassembled the RAID from the predictive-failure drive plus the other two, that would mean it wiped/lost information, since the data on the predictive-failure drive was outdated, but ...
I have seen it: a drive was kicked out of the array.
On reboot the controller brought it back into the array, and data on the array was consolidated based on the newly rejoined drive (data sync); no intervention was required. ...
Old drives, the more you take time, the more likely bad things could happen.
It wiped the other array even when I created a volume via the controller and during the OS deployment, told it to use the existing array (created earlier).
Pulling the RAID 5 array drives out to install the OS volume, and then importing the RAID 5 config by reading the metadata from the disks, would be an option if the volume were solid, but the current volume's state is unknown.
Another thought, if there is another similar server:
Note which two drives are not having issues (1, 2).
Power off this server, pull the two drives, and see whether the array can be imported into the other system in degraded form.
This way you can see what, if any, data is present, to either rebuild the array in a system that can boot... You have to import the config from the disks within the controller BIOS on boot, or they will be seen as foreign disks....
RISK: total data loss.
If you have a SAS HBA to which you can attach multiple drives,
using https://www.runtime.org/raid.htm might be an approach to see whether it can access the RAID, imaging the disks individually (make sure to note their positions so that you can return them to the HP server while its power is off).
Raw media copy
RAID has two sets of header data, the RAID config metadata and then the data ......
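For what "raw media copy" means in practice, here is a minimal sketch, assuming each disk is attached to a plain (non-RAID) HBA and shows up as a raw device or file. It copies block by block, controller metadata and all, and zero-fills any block that cannot be read so that offsets stay aligned. The function name and paths are illustrative only, and a real imaging tool will handle retries and errors far better than this.

```python
import os

def raw_copy(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block; zero-fill blocks that cannot be read.

    Returns a list of (offset, length) ranges that had to be zero-filled.
    The source is opened read-only and is never written to.
    """
    bad_ranges = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # seeking to the end also works for devices
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                chunk = src.read(want)
            except OSError:
                chunk = None
            if not chunk:
                # Unreadable block: record it and keep the image aligned with zeros
                bad_ranges.append((offset, want))
                chunk = b"\x00" * want
            dst.write(chunk)
            offset += len(chunk)
    return bad_ranges
```

The list of zero-filled ranges is the useful part: it tells you exactly which regions of which member could not be read before any RAID reconstruction tool is pointed at the images.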
SmartStart install is going to be simplest as it puts the drivers on for you but that requires an OS the same age as the machine so 2003 or maybe 2008.
Thank you for the clarification on the drivers.
One could go down the USB path with F6 drivers on it using an OS that supports loading drivers from other than a floppy drive. This presumes that HP has drivers for such a later version of Windows Server.
The other issue: I'm not sure whether the ML370 can boot from USB.
ASKER
The USB floppy drive is used to provide the RAID controller (P400) drivers to the OS. This is a standard procedure.
The problem I have now is with the RAID controller (P400). I used to be able to boot into SmartStart, but now it hangs there and the progress bar never goes anywhere. I might have to replace the RAID controller, which might have been giving me issues from the beginning.
ASKER
This server only has 4 GB of RAM, so Windows 2008 couldn't be installed, and I'll have to go with 2003.
I can't recall that Windoze 2008 couldn't get installed on a system with 4GB or less RAM. If I remember correctly, there was even a 32bit version of 2008 (2008r2 was 64bit only I believe). 32bit OS's should easily install & run on a system with 4GB RAM, but I think also the 64bit versions should be installable.
If you can't install that OS, you must be having another issue, not the amount of RAM.
But some servers have a setting in the BIOS that allows you to limit the max RAM for the installation. I think that was mainly used to get OS/2 installed.
How did you determine: "windows 2008 couldn't be installed"?
ASKER
I haven't seen that it has transitioned. It is still saying the same thing (code 786).
ASKER
I could install Windows 2008 32-bit or Windows 2003 32-bit. I do have a Windows 2003 Enterprise 32-bit disc with me, so I could install that on the pair of 73 GB SAS drives that I have installed and set up as a mirrored volume.
You should install an OS on a new volume; within the controller, that volume should be indicated as the boot device.
I would boot using the Win2k3 media in this case and use F6 to present the uncompressed drivers.
Floppy or CD media will do (with CD media, you may have to swap discs in and out until the drivers are loaded, and then a rescan should detect the volumes).
ASKER
I tried installing win 2003 using smart start cd, but after the reboot it didn't boot to the OS. I booted off the win 10 USB, and went into diskpart. List disk shows RAID disk, and a new mirror that I created for the new Win 2003 OS. When I list partitions under win 2003 (mirror), it only shows one 4096 MB primary partition. RAID 5 (136 GB) disk doesn't show any partitions. What's your take on this? In BIOS I have "C" as a boot option.
ASKER
Now it says that the Parity Init Status is "In Progress". Before, it was "Queued".
Unlike the regular boot search, the RAID controller has a single boot volume it is set to, versus scanning through existing volumes until it finds a bootable/active one.
ASKER
Ok, finally the Parity Initialization has completed. What does that really mean?
ASKER
I currently have all three of the RAID 5 volume drives blinking together every 2 seconds
ASKER
Also, how do I run hpacucli on this server to get more info about the RAID 5 volume?
Also if you click the diagnostics tab in the ACU you can get an ADU report on USB stick which has all the info you could ever want.
ASKER
I don't have an option to select the boot volume. Do you know where I can find the firmware update for the P400?
Linux/Supplemental is at https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_7e4ae58f0d9c42f6a3d47b587c - this can be used as a supplement to the firmware DVD
DVD (amazingly still available) https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_4372f442c3424063809e3d9198 but requires HPE account.
Alt link probably good http://www.mediafire.com/file/9m4rhs9mny15da4/firmware-10.10-0.zip
There is an alternative method to swap Array A with Array B, but don't use it until the other options are used up. Pull the 3 disks and boot from the mirrored pair; they should become Array A and so be the boot volume. Shut down and add the RAID 5 disks back, and the controller will make that Array B, since A is already in use.
ASKER
Performance must be pretty poor, it's running RAID 5 without cache battery but it's not as if that's something that happened recently, probably sold that way.
ASKER
I do have an option to change the boot volume now after firmware update.
ASKER
I did boot to Windows 2008, but the RAID 5 volume is showing in disk manager and asking me to initialize it, which I don't want to do so I don't erase it.
The other possibility is to use TestDisk.
Potentially the partition table got corrupted......
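Before running any repair tool, a read-only look at the first sector shows whether a classic MBR partition table is still there. The sketch below only parses 512 bytes that you hand it; the layout used (four 16-byte entries starting at offset 446, and a 0x55 0xAA signature at offset 510) is the standard MBR format, while the function name is made up for this example.

```python
import struct

def parse_mbr(sector):
    """Return (has_signature, partitions) for a 512-byte MBR sector."""
    if len(sector) < 512:
        raise ValueError("need at least 512 bytes")
    has_signature = sector[510:512] == b"\x55\xaa"
    partitions = []
    for i in range(4):
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        boot_flag, part_type = entry[0], entry[4]
        # Start LBA and sector count are little-endian 32-bit values
        lba_start, sector_count = struct.unpack_from("<II", entry, 8)
        if part_type != 0:  # type 0 marks an unused slot
            partitions.append({
                "slot": i,
                "active": boot_flag == 0x80,
                "type": part_type,
                "lba_start": lba_start,
                "sectors": sector_count,
            })
    return has_signature, partitions
```

On Windows the sector could be read with something like open(r"\\.\PhysicalDrive1", "rb").read(512) from an elevated prompt; the drive number is an assumption, so check it against diskpart's list disk first. A missing signature or four empty slots would fit with Disk Management asking to initialize the disk.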
That being said, a competent data recovery company should be well versed in it or similar tools. If they couldn't recover anything, it's unlikely that a regular tech would. I'd give it a try, though, just for my peace of mind.
ASKER
I know, but it's out of my hands now.
So one is dead and another is dying: not good for a RAID 5.
You are looking at getting replacement disks for both, rebuilding the array, and restoring from a good backup (you do have that, right?).