sglee
asked on
ESXi turns itself off randomly
Hi,
I have an HP ProLiant DL380 G5 server with 5 hard drives in RAID 5, running VMware ESXi 5.1.
It has been running fine all these years (about 10 years in production).
But last Thursday, it shut itself off during the day. When I visited onsite, I saw the power light in solid amber. I turned it on and it loaded the OS just fine and all virtual machines came back online.
I discovered that it shut itself off again on Friday night (approximately 18 hours after its first shutdown).
When I went onsite this afternoon, the power light was solid amber again. I turned it back on and all VMs started OK.
If this were a Windows server, I could check the System log in Event Viewer to determine whether there was some type of hardware failure.
But I am not familiar with the ESXi event log system or how to access it.
Where do you suggest I start? It has shut itself down twice in two days and I feel that it may go down again tomorrow unless I address the problem.
It has redundant power supplies, and I confirmed from the back of the rack that both PSU lights are solid green.
The RAID appears to be in working order because all 5 hard drive lights were blinking green.
Can you help?
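Editor's note on where ESXi keeps its logs: on ESXi 5.x the host logs are under /var/log (for example vmkernel.log, vmkwarning.log and hostd.log), reachable from the ESXi Shell or over SSH once those are enabled. Depending on how the scratch location is configured, they may not survive a power-off. The following is only a minimal sketch for skimming copies of those files on a workstation with Python 3; the function name and the keyword list are illustrative assumptions, not an official VMware list.

```python
from typing import Iterable, List, Sequence

# Illustrative guesses at interesting substrings; adjust to taste.
SUSPECT_KEYWORDS = ("panic", "machine check", "thermal", "power")


def find_suspect_lines(lines: Iterable[str],
                       keywords: Sequence[str] = SUSPECT_KEYWORDS) -> List[str]:
    """Return the log lines that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [line.rstrip("\n") for line in lines
            if any(k in line.lower() for k in lowered)]


if __name__ == "__main__":
    import sys
    # Usage: python scan_logs.py vmkernel.log vmkwarning.log hostd.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in find_suspect_lines(fh):
                print(f"{path}: {hit}")
```

This is a plain substring filter, so expect false positives; it only narrows down where to start reading around the time of each shutdown.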
The server looks pretty dusty. When you get a chance, remove the servers and blow them out. iLO will really help in telling you what the cause of the shutdown is.
ASKER
I will check to see if there is an Ethernet cable connected to the iLO port.
ASKER
I don't think the iLO port is connected to the network switch because I don't see an IP address assigned to it.
I will go onsite and connect iLO to the switch.
So everyone seems to be in agreement that the shutdown is caused by some kind of hardware failure on the HP server, and that the ESXi OS can never shut the hardware down?
if someone was to hack it they could shut it down!!!
I don't see any error LEDs lit on the insight display, these are driven by the iLO and it can light them even if the server is in a shutdown state like this. Can't see the back but maybe the UPS shut it down?
BTW, don't leave G5s at standby since the PSUs sometimes melt in standby.
ASKER
@andyalder
On both shutdown occasions, the Insight Display did not show any problems with the memory or the power supply units.
I have seen amber lights on both memory and power supply units in the past, and had to re-seat memory modules and replace bad power supplies.
This time, since I don't see any light on either the RAM or the power supply units, I have to assume that the problem is in some other component.
What are your plans for this server ?
1. Old hardware.
2. ESXi out of maintenance ?
Time to migrate to new hardware and ESXi ?
ASKER
@Andrew
Time to migrate to new hardware and ESXi ? --> Absolutely, I agree with you 100%. It is up to the owner and I can only bring it to his attention.
He may end up with nothing and all VMs lost....
If the problem were on the motherboard, the internal health LED would be red, but it appears green. It's the one next to the UID LED that you can press.
ASKER
I got the iLO login screen. How do I find out what the default username and password are?
Unless you've changed them, they were usually on a brown package ticket on the server.
Otherwise you'll have to reset them, which can be done from the ESXi bash prompt, or at the POST screen at server boot.
ASKER
I am not the one who purchased and installed this server, so I am not aware of a brown package ticket. If I could, I would like to figure it out without rebooting the server.
Did you mean to say “ESXi Batch Prompt”?
ESXi bash prompt (e.g. the console shell is a bash/sh) or remotely via SSH.
No I meant BASH
Bash is the GNU Project's shell. Bash is the Bourne Again SHell. Bash is an sh-compatible shell that incorporates useful features from the Korn shell (ksh) and C shell (csh). It is intended to conform to the IEEE POSIX P1003.2/ISO 9945.2 Shell and Tools standard. It offers functional improvements over sh for both programming and interactive use. In addition, most sh scripts can be run by Bash without modification.
Source
https://www.gnu.org/software/bash/
here is how to reset the ilo password on ESXi
https://cloudpathshala.com/2018/08/12/how-to-reset-or-configure-ilo-password-via-esxi-shell/
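Editor's note: password resets of this kind are typically done by feeding a small RIBCL XML file to HP's hponcfg utility (hponcfg -f <file>), which must already be present on the host. As a hedged sketch only, the Python below builds such a file; the function name and the placeholder login password are assumptions made for illustration, and the linked article remains the reference for the actual procedure on ESXi.

```python
import xml.etree.ElementTree as ET


def build_ilo_password_reset(user: str, new_password: str) -> str:
    """Build a RIBCL document that sets a new password for an iLO user."""
    ribcl = ET.Element("RIBCL", VERSION="2.0")
    # Assumption: when the script is applied locally through hponcfg the
    # LOGIN credentials are not validated, so a placeholder is used here.
    login = ET.SubElement(ribcl, "LOGIN",
                          USER_LOGIN="Administrator", PASSWORD="placeholder")
    user_info = ET.SubElement(login, "USER_INFO", MODE="write")
    mod_user = ET.SubElement(user_info, "MOD_USER", USER_LOGIN=user)
    ET.SubElement(mod_user, "PASSWORD", value=new_password)
    return ET.tostring(ribcl, encoding="unicode")


if __name__ == "__main__":
    # Write the file, copy it to the host, then apply it there with hponcfg.
    with open("reset_ilo_password.xml", "w") as fh:
        fh.write(build_ilo_password_reset("Administrator", "ChangeMe-Now1"))
```

Generating the XML rather than typing it by hand mainly avoids quoting mistakes; the file should be deleted afterwards since it contains the new password in clear text.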
ASKER
You are not running the OEM version of ESXi, or you've not installed the OEM VIB and tools (and adding those would need a reboot anyway).
ASKER
I will have to reboot it then.
What do I need to do when the server is rebooting?
Is there a sticker around the server that might have password information?
when the server boots, you will see it states Press F8 to access iLo.
Press F8 and change the password.
The username and password are usually on a brown package tag attached to the rear of the server with keys!
ASKER
I will reboot the server after work hours today and press F8 to change the password.
I will report back.
As you are going into iLo just a thought you can also read the event logs! As well as changing the password!
ASKER
G5 and earlier had the silly tie-on labels.
WRONG SERVER!!!! it's the server above......with 5 disks!!!!
ASKER
@David
Andrew is correct. The VMWare box is the server at the top with 5 HDs.
ASKER
Andrew,
I checked the back of the server and I did not see any tag like that.
I will have to reboot and press [F8].
The UPS mentioned above is a good call. We had a dodgy UPS (no errors, no warnings): every time there was a small electrical brownout, the load would go onto the UPS and the UPS would FAIL, causing the server to reboot!!! Although the server had two PSUs, one was also duff and gave no error!
It was also a DL380 G5, which is now retired, and we will shortly be removing it from the rack...
It took 5 months to resolve.....
ASKER
There is an APC UPS below the two HP servers (one is running ESXi and the other is running Windows Server). Both HP servers are connected to the same UPS. The Windows server has not shut itself down so far.
Especially if both PSUs are run from it rather than one directly to the mains. We also don't know if it is set to power on automatically until sglee reboots it.
iLo 2 Log but the latest IML Entry states...
ASR Detected by System ROM.
(Automatic Server Recovery!)
ASKER
It turned itself off on Thursday morning last week, which was 10/10/19. I turned it back on about an hour after it went off.
I think it turned itself off again, I am guessing, sometime Friday evening, which would be 10/11/19.
I came in Saturday around noon and turned on the server.
This afternoon, 10/15/19, I turned it off manually and turned it back on.
Who has root access to the server ?
any vCenter Server ?
Do the event logs on the Windows VMs state an unexpected shutdown? For example, if the ESXi host had crashed, it would usually show a PSOD, not power off.
Any issues with UPS
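Editor's note on checking the Windows VM event logs: an unexpected shutdown normally shows up in the System log as Event ID 6008 (source EventLog) and, on newer Windows versions, Event ID 41 (source Kernel-Power). A minimal sketch, assuming the System log has been exported from Event Viewer to CSV with an "Event ID" column (the column name, file name and function names here are assumptions):

```python
import csv
from typing import Dict, Iterable, List

# 6008: previous shutdown was unexpected; 41: Kernel-Power, rebooted
# without a clean shutdown.
UNEXPECTED_SHUTDOWN_IDS = {"41", "6008"}


def unexpected_shutdowns(rows: Iterable[Dict[str, str]],
                         id_column: str = "Event ID") -> List[Dict[str, str]]:
    """Keep only the rows whose event ID marks an unclean shutdown."""
    return [row for row in rows
            if (row.get(id_column) or "").strip() in UNEXPECTED_SHUTDOWN_IDS]


def load_events(path: str) -> List[Dict[str, str]]:
    """Read an exported event log CSV into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return list(csv.DictReader(fh))


if __name__ == "__main__":
    for event in unexpected_shutdowns(load_events("system_log.csv")):
        print(event)
```

If the timestamps of these events on the VMs line up with the times the host was found off, that supports a sudden loss of power or a reset rather than an orderly shutdown.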
ASKER
I am the only one who has root access to the server.
No vCenter.
I will check event viewer on two windows VM servers.
If the UPS had caused this shutdown, it would have shut down both physical servers, since both are hooked up to it.
Is APC PowerChute configured to shut down the server when the battery gets low?
ASKER
David,
Neither of the two servers is connected to the APC UPS via USB cable.
ASKER
I am going to hook up the Windows Server box to the APC UPS using a USB cable and check the battery charge status. I will report back.
if you have dual power supplies plug one PS into the UPS and the other into the wall outlet.
ASKER
So far, the ESXi host has not turned itself off.
I have not tested the APC UPS yet.
I was hoping iLO log would help me figure out what went wrong, but unfortunately it did not.
ASKER
This morning users reported that they had lost their connection to the server.
I went to the site and ESXi host was (1) making very loud constant noise (like airplane taking off) (2) nothing on the screen (3) Power was on.
I turned off the power and turned it back on like I have done in the past two occasions and everything came back normal. I checked the power supplies and all lights were green. No dust that I could see around ventilation holes.
I just checked iLO log and I did not see anything.
I was hoping that I would see some type of "fan" related messages recorded before it became "frozen".
Since I have not used iLO before, I am not familiar with what is available in this program to help me identify what went wrong.
ASKER
When the fans go bonkers the iLO is to blame and that is on the mobo which costs about £50, but as others have said they need to fork out for a new machine. If you buy a new Smart Array based ProLiant with ESXi on a SD card you can move the old disks into it as new rev controllers still read old rev disks since backwards compatibility is guaranteed. Would take less than 30 minutes to migrate but admittedly it would cost about $1000.
ASKER
Based on IML log, do you think I should replace both power supplies?
No. The only events this year are Automatic System Recovery. This is it rebooting itself when it discovers it has hung.
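Editor's note on what an ASR entry implies: Automatic Server Recovery is a hardware watchdog. A timer on the server keeps counting while a health driver in the running OS periodically resets it; if the OS hangs and the timer is not reset in time, the hardware restarts the machine and logs the event. The toy model below only illustrates that mechanism; the class name and the 600-second timeout are made-up values for the example, not HP's implementation.

```python
class WatchdogTimer:
    """Toy watchdog: expires if not kicked within timeout_s seconds."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick = 0.0

    def kick(self, now: float) -> None:
        """Called periodically by a healthy OS to reset the timer."""
        self.last_kick = now

    def expired(self, now: float) -> bool:
        """True once more than timeout_s has passed since the last kick."""
        return now - self.last_kick > self.timeout_s


if __name__ == "__main__":
    wd = WatchdogTimer(timeout_s=600)  # illustrative timeout
    wd.kick(now=0)
    wd.kick(now=300)                   # OS still alive, timer reset
    # The OS hangs after t=300 and stops kicking the timer.
    for t in (600, 900, 901):
        print(t, "reset server" if wd.expired(t) else "ok")
```

In other words, an ASR entry says that the OS stopped responding; it does not by itself say which component caused the hang.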
ASKER
"This is it rebooting itself when it discovers it has hung." --> So far, the server has turned itself off TWICE, and two days ago it was found "frozen / hung" with constant LOUD fan noise and a black screen (no ESXi console menu options). I had to power off the server and turn it back on.
Are you saying that it is time to replace the motherboard?
Other experts, is this also your opinion?
Yes, I say it is the mobo. There are two variants, one only supports 50xx and 51xx CPUs but 436526-001 supports all of them and is only £20 on fleabay. Takes about 30 minutes to swap.
ASKER
I will either replace the MB or get a used HP ProLiant DL380 G5 server, install ESXi and restore the virtual machines from the backup software.
Thank you all for your help.
Even if you buy a new server there is no need to restore the VMs unless they are corrupt, just transfer the disks, the controller will pick up the config from them.
ASKER
"Even if you buy a new server there is no need to restore the VMs unless they are corrupt, just transfer the disks, the controller will pick up the config from them." --> I see plenty of HP ProLiant DL380 G5 servers on eBay to choose from. Do I have to make sure the one that I am buying has a Smart Array P400 controller, or does every HP ProLiant DL380 G5 server come with a P400 controller?
It's a PCIe card, but you already have one. It's extremely unlikely that it is causing the fault as even if it locked the PCIe bus up the fans would not have gone on full blast. Some have the cheaper E200 controller that only does RAID 0/1/10
ASKER
So what you are saying is that I shouldn't bother checking whether a P400 controller comes with the replacement HP ProLiant DL380 G5 server?
Just move the controller from current server to the new server, right?
yes.
ASKER
What parts do I have to pay attention to when buying another HP ProLiant DL380 G5 server?
Obviously I need to have the same amount of RAM or more.
How about the CPU?
Is there any other component that I have to check?
If needed you move the RAM to the new box. CPUs too although that depends on the mobo rev as mentioned earlier. Do you want me to go to site and do it for you? I am in UK South although I will be drunk for most of the coming weekend.
ASKER
I am going to just buy a whole server with RAM and CPU, and then move the hard drives and the P400 controller from the existing server to the new server. I am not fond of changing the mobo; it is too risky. Used servers are cheap on eBay.
In effect you are changing the mobo via your method as well as mine. The only difference is I would pull it from the metal chassis whereas you will swap it as a mobo+chassis combination. The real catch is the two variants - HP called both G5 but Dell sensibly added a generation in their numbering when Intel made a new CPU/chipset flavour.
ASKER
I purchased the same model from eBay as a backup. Should the production server go bad, I am going to transfer the RAID controller along with the hard drives.
Meanwhile I am going to set up ESXi on the backup server and restore the virtual machines from the backup system to see if that works too.
That way I will have two options should the production server go down permanently.
ASKER
I have another question.
If I can take the P400 RAID controller along with the hard drives out of the current server (HP ProLiant DL380 G5), install them in another server and get it working, is it possible to do the same with a newer model than the HP ProLiant DL380 G5?
The reason for asking is that, for some reason, on the HP ProLiant DL380 G5 I can't install Windows 7 or 10 as a virtual machine. I could only install an XP virtual machine.
Please post a new question.
Just connect a LAN cable from the iLO port to the switch and configure the iLO with an address on the same subnet as your machine, then access it. It's a great GUI tool. There you can check the complete diagnostic report of the machine hardware, boot/BIOS issues, etc.
All the best