sglee

ESXi turn itself off randomly

User generated image
User generated image
User generated image

Hi,

 I have an HP ProLiant DL380 G5 server with 5 hard drives in RAID 5 running VMware ESXi 5.1.
 It has been running fine all these years (about 10 years in production).
 But last Thursday, it shut itself off during the day. When I went onsite, I saw the power light was solid amber. I turned it on and it loaded the OS just fine and all virtual machines came back online.
 I discovered that it shut itself off again on Friday night (approximately 18 hours after its first shutdown).
 When I went onsite this afternoon, the power light was solid amber again. I turned it back on and all VMs started OK.
 If this were a Windows server, I could check the System log in Event Viewer to determine whether there was some type of hardware failure.
 But I am not familiar with the ESXi event log system or how to access it.
 Where do you suggest I start? It has shut itself down twice in two days and I feel that it may go down again tomorrow unless I address the problem.

 It has redundant power supplies and I confirmed from the back of the rack server that both PS units show solid green.
 The RAID is in working order because all 5 hard drive lights were blinking green.

 Can you help?
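
On the question of where ESXi keeps its logs: the host writes plain-text logs under /var/log (for example vmkernel.log, vmkwarning.log and hostd.log), which can be read from the ESXi shell or copied off with a tool such as WinSCP. Below is a minimal Python sketch for narrowing a copied log down to the minutes before a shutdown. It assumes each line starts with a UTC timestamp such as 2019-10-10T13:55:30.500Z. The function name, the keyword list and the sample lines are made up for illustration and are not taken from this server.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log line (UTC, trailing Z).
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Hypothetical keywords worth a second look; adjust to taste.
KEYWORDS = ("error", "warning", "fail", "power", "fan", "temperature")


def lines_before(lines, shutdown_time, window_minutes=30, keywords=KEYWORDS):
    """Return log lines from the window before shutdown_time that mention a keyword."""
    start = shutdown_time - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            when = datetime.strptime(stamp, TS_FORMAT)
        except ValueError:
            continue  # continuation line or a different format: skip it
        if start <= when <= shutdown_time and any(k in rest.lower() for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Made-up sample lines, only to show the expected shape of the input.
    sample = [
        "2019-10-10T13:10:00.000Z cpu0:1234)example message, nothing special",
        "2019-10-10T13:55:30.500Z cpu1:1234)example WARNING: something failed",
        "not a timestamped line",
    ]
    for hit in lines_before(sample, datetime(2019, 10, 10, 14, 0, 0)):
        print(hit)
```

To use it on a real file, pass the open file object as `lines` and the approximate shutdown time (in UTC) as `shutdown_time`. If the host hung or lost power abruptly, the log may simply stop without any warning lines, which is itself a hint that the cause is below the OS.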
ASKER CERTIFIED SOLUTION
Avatar of Paul Solovyovsky
Paul Solovyovsky
This solution is only available to members.
Check the hardware logs of the machine. Connecting to iLO is the best method, as Paul said.

Just connect a LAN cable from the iLO port to a switch port, put your machine on the iLO subnet, and access it. It's a great GUI tool. There you can check the complete diagnostic report of the machine's hardware, boot and BIOS issues, etc.

All the best.
The server looks pretty dusty. When you get a chance, remove the servers and blow them out. iLO will really help in telling you what the cause of the shutdown is.
Avatar of sglee
sglee

ASKER

I will check to see if there is an Ethernet cable connected to the iLO port.
Avatar of sglee

ASKER

I don't think the iLO port is connected to the network switch because I don't see an IP address assigned to it.
I will go onsite and connect iLO to the switch.

So everyone seems to be in agreement that the shutdown is caused by some kind of hardware failure on the HP server, and that the ESXi OS would never shut the hardware down by itself?
It is rare for servers to shut down by themselves. In the event of a power outage I have my servers start staged: the storage box starts immediately, and downstream servers are paused to allow the storage box to be ready to receive connections.
User generated image
If someone were to hack it they could shut it down!!!
I don't see any error LEDs lit on the insight display, these are driven by the iLO and it can light them even if the server is in a shutdown state like this. Can't see the back but maybe the UPS shut it down?

BTW, don't leave G5s at standby since the PSUs sometimes melt in standby.
Avatar of sglee

ASKER

@andyalder
On both shutdown occasions, the Insight Display did not show any problems with the memory or the power supply units.
I have seen amber lights on both memory and power supply units in the past, and had to reseat memory modules and replace bad power supplies.
This time, since I don't see any light on either the RAM or the power supply units, I have to assume that the problem is in some other component.
What are your plans for this server ?

1. Old hardware.
2. ESXi out of maintenance ?

Time to migrate to new hardware and ESXi ?
Avatar of sglee

ASKER

@Andrew
Time to migrate to new hardware and ESXi ? --> Absolutely, I agree with you 100%. It is up to the owner and I can only bring it to his attention.
He may end up with nothing and all VMs lost....
If the problem were on the motherboard the internal health LED would be red, but it appears green. It's the one next to the UID LED that you can press.
Avatar of sglee

ASKER

I got iLO login screen. How do I find out what the default username and password are?
Unless you've changed them, they were usually on a brown package ticket on the server.

Otherwise you'll have to reset them, which can be done from the ESXi bash prompt, or at the POST screen at server boot.
Avatar of sglee

ASKER

I am not the one who purchased and installed this server, so I am not aware of a brown package. If I could, I would like to figure it out without rebooting the server.
Did you mean to say “ESXi Batch Prompt”?
ESXi bash prompt  (e.g. the console shell is a bash/sh) or remotely via SSH.

No I meant BASH

Bash is the GNU Project's shell. Bash is the Bourne Again SHell. Bash is an sh-compatible shell that incorporates useful features from the Korn shell (ksh) and C shell (csh). It is intended to conform to the IEEE POSIX P1003.2/ISO 9945.2 Shell and Tools standard. It offers functional improvements over sh for both programming and interactive use. In addition, most sh scripts can be run by Bash without modification.

Source
https://www.gnu.org/software/bash/

Here is how to reset the iLO password on ESXi:

https://cloudpathshala.com/2018/08/12/how-to-reset-or-configure-ilo-password-via-esxi-shell/
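
The linked procedure relies on HP's hponcfg utility, which is fed an XML (RIBCL) script and is typically run as `hponcfg -f ilo.xml` from /opt/hp/tools/. As a rough illustration of what such a script contains, the Python sketch below builds one. It assumes the HP tools are installed on the host. The function name and the placeholder password are hypothetical, and the element names follow HP's published sample scripts as far as I know, so check them against the samples for your iLO firmware before use.

```python
import xml.etree.ElementTree as ET


def build_password_reset_script(user, new_password):
    """Build a RIBCL script that sets a new password for an existing iLO user."""
    ribcl = ET.Element("RIBCL", VERSION="2.0")
    # When hponcfg runs locally on the host, the LOGIN credentials are not
    # checked, so a dummy password is used here (assumption; verify for your setup).
    login = ET.SubElement(ribcl, "LOGIN", USER_LOGIN=user, PASSWORD="unused")
    user_info = ET.SubElement(login, "USER_INFO", MODE="write")
    mod_user = ET.SubElement(user_info, "MOD_USER", USER_LOGIN=user)
    ET.SubElement(mod_user, "PASSWORD", value=new_password)
    return ET.tostring(ribcl, encoding="unicode")


if __name__ == "__main__":
    # "Administrator" is the usual default iLO account name; the password is a placeholder.
    print(build_password_reset_script("Administrator", "ChangeMe123"))
```

The printed XML would be saved as ilo.xml and passed to hponcfg. Generating it from a script is only a convenience; typing the same few lines into a text editor works just as well.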
Avatar of sglee

ASKER

User generated image
I tried to follow the instructions on "how to reset the ilo password on ESXi".
a. Login into the ESXi shell via putty.  -- Done.
b. Go to the directory /opt/hp/tools/  -- /opt/hp does not exist. I searched for ilo.xml using WinSCP, but no such file exists.
You are not running the OEM version of ESXi, or you've not installed the OEM VIB and tools (and adding those would need a reboot anyway).
Avatar of sglee

ASKER

I will have to reboot it then.
What do I need to do when the server is rebooting?
Is there a sticker around the server that might have password information?
When the server boots, you will see a prompt to press F8 to access iLO.

Press F8 and change the password.

The username and password are usually on a brown package tag attached to the rear of the server with keys!
Avatar of sglee

ASKER

I will reboot the server after work hours today and press F8 to change the password.
I will report back.
On the G6 it is on a pullout on the left side.
User generated image
As you are going into iLO anyway, just a thought: you can also read the event logs as well as change the password!
Avatar of sglee

ASKER

DAVID,

I don’t see a pullout on my DL380 G5.
63D0A796-2FF6-496D-8F6E-4FA6869B055.jpeg
G5 and earlier had the silly tie-on labels.
I see some password info here
User generated image
here is the tab
User generated image
WRONG SERVER!!!! it's the server above......with 5 disks!!!!
Avatar of sglee

ASKER

@David
Andrew is correct. The VMware box is the server at the top with 5 HDs.
As Andy also posted the G5s came with a brown parcel type package label!

Here's one I've got on my desk, this is what you are looking for, this one is more cream than brown...

User generated image
Sometimes they are hanging on the back of the server, unless they have been removed!
Avatar of sglee

ASKER

Andrew,

  I checked the back of the server and I did not see any tag like that.
  I will have to reboot and press [F8].
The UPS mentioned above is a good call. We had a dodgy UPS (no errors, no warnings): every time there was a small electrical brownout, the load would go onto the UPS and the UPS would FAIL, causing the server to reboot!!! Although the server had two PSUs, one of them was also duff, and it gave no error!

It was also a DL380 G5, which is now retired and will shortly be removed from the rack...

It took 5 months to resolve.....
Avatar of sglee

ASKER

There is an APC UPS below the two HP servers (one is running ESXi and the other is running Windows Server). Both HP servers are connected to the same UPS. The Windows server has not shut itself down so far.
Especially if both PSUs are run from it rather than one directly to the mains. We also don't know if it is set to power on automatically until sglee reboots it.
Avatar of sglee

ASKER

User generated image
I successfully logged in to iLO. Where do I go from here?
iLo 2 Log but the latest IML Entry states...

ASR Detected by System ROM.

(Automatic Server Recovery!)
Avatar of sglee

ASKER

User generated image
It turned itself off on Thursday morning last week, which is 10/10/19. I turned it on about an hour after it went off.
I think it turned itself off again, I am guessing, sometime Friday evening, which would be 10/11/19.
I came in Saturday around noon and turned on the server.
This afternoon, 10/15/19, I turned it off manually and turned it back on.
Who has root access to the server ?

any vCenter Server ?

Do the event logs on the Windows VMs report an unexpected shutdown? That would point to the ESXi host going down, although a host crash would usually show a PSOD rather than power the server off.

Any issues with UPS
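
For the Windows VM check above: a guest that loses its host abruptly normally logs Event ID 6008 (source EventLog, "The previous system shutdown ... was unexpected") in the System log on the next boot, and newer Windows versions also log Event ID 41 (Kernel-Power). Below is a minimal Python sketch for picking those out of a System log saved as CSV from Event Viewer. The column names ("Date and Time", "Source", "Event ID") are what an English-language export typically uses and are an assumption here, as is the sample data.

```python
import csv
import io

# Event IDs that indicate the guest did not shut down cleanly.
UNEXPECTED_SHUTDOWN_IDS = {"41", "6008"}


def unexpected_shutdowns(csv_text):
    """Return (date/time, source, event id) for rows that signal an unclean shutdown."""
    reader = csv.DictReader(io.StringIO(csv_text))
    found = []
    for row in reader:
        event_id = (row.get("Event ID") or "").strip()
        if event_id in UNEXPECTED_SHUTDOWN_IDS:
            found.append((row.get("Date and Time"), row.get("Source"), event_id))
    return found


if __name__ == "__main__":
    # Made-up rows in the assumed export layout, for illustration only.
    sample = (
        "Level,Date and Time,Source,Event ID,Task Category\n"
        "Information,10/10/2019 9:05:12 AM,EventLog,6005,None\n"
        "Error,10/10/2019 9:05:12 AM,EventLog,6008,None\n"
        "Information,10/10/2019 9:06:40 AM,Service Control Manager,7036,None\n"
    )
    for entry in unexpected_shutdowns(sample):
        print(entry)
```

The timestamps of any matches give a lower bound on when the host came back, which helps line the guest's view up against the iLO and IML entries.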
Avatar of sglee

ASKER

I am the only one who has root access to the server.
No vCenter.
I will check Event Viewer on the two Windows VM servers.
If the UPS had caused this shutdown, both physical servers would have shut down, since both are hooked up to this UPS.
I would have thought you would have seen some "power restored" items in the logs. I know that on my G6 (also iLO 2), if I restart the machine for updates etc., I see the "power restored" items.
Under power management, do you have automatic power on set to off?
User generated image
Is APC PowerChute configured to shut down the server when the battery gets low?
Avatar of sglee

ASKER

David,

 Neither of the two servers is connected to the APC UPS via USB cable.
Avatar of sglee

ASKER

I am going to hook up Windows Server box to APC UPS using USB cable and check out battery charge status.  I will report back.
If you have dual power supplies, plug one PS into the UPS and the other into the wall outlet.
Avatar of sglee

ASKER

So far, the ESXi host has not turned itself off.
I have not tested the APC UPS yet.
I was hoping iLO log would help me figure out what went wrong, but unfortunately it did not.
Avatar of sglee

ASKER

This morning users reported that they lost connection from the server.
I went to the site and the ESXi host was (1) making a very loud constant noise (like an airplane taking off), (2) showing nothing on the screen, and (3) still powered on.
I turned off the power and turned it back on, as I had done on the past two occasions, and everything came back to normal. I checked the power supplies and all lights were green. There was no dust that I could see around the ventilation holes.
I just checked iLO log and I did not see anything.
I was hoping that I would see some type of "fan" related messages recorded before it became "frozen".
Since I have not used iLO before, I am not familiar with what is available in this program to help me identify what went wrong.
Avatar of sglee

ASKER

User generated image
I am going through iLO screen by screen and category by category, and spotted this log.
Do I need to replace one or both PS units?
Fyi, I have NOT plugged or unplugged power supply units during the past three self shutdowns.
I have replaced one of two power supply units in the last 12 months or so.
When the fans go bonkers the iLO is to blame and that is on the mobo which costs about £50, but as others have said they need to fork out for a new machine. If you buy a new Smart Array based ProLiant with ESXi on a SD card you can move the old disks into it as new rev controllers still read old rev disks since backwards compatibility is guaranteed. Would take less than 30 minutes to migrate but admittedly it would cost about $1000.
Avatar of sglee

ASKER

Based on IML log, do you think I should replace both power supplies?
No. The only events this year are Automatic Server Recovery (ASR). This is it rebooting itself when it discovers it has hung.
Avatar of sglee

ASKER

"This is it rebooting itself when it discovers it has hung." --> So far, the server has turned itself off TWICE, and two days ago it was found "frozen / hung" with constant LOUD fan noise and a black screen (no ESXi console menu options). I had to power off the server and turn it back on.

Are you saying that it is time to replace the motherboard?

Other experts, is this also your opinion?
Yes, I say it is the mobo. There are two variants, one only supports 50xx and 51xx CPUs but 436526-001 supports all of them and is only £20 on fleabay. Takes about 30 minutes to swap.
Avatar of sglee

ASKER

I will either replace the MB or get a used HP ProLiant DL380 G5 server, install ESXi and restore the virtual machines from the backup software.

Thank you all for your help.
Even if you buy a new server there is no need to restore the VMs unless they are corrupt, just transfer the disks, the controller will pick up the config from them.
Avatar of sglee

ASKER

"Even if you buy a new server there is no need to restore the VMs unless they are corrupt, just transfer the disks, the controller will pick up the config from them." --> I see plenty of HP ProLiant DL380 G5 servers on eBay to choose from. Do I have to make sure the one I am buying has a Smart Array P400 controller, or does every HP ProLiant DL380 G5 server come with a P400 controller?
It's a PCIe card, but you already have one. It's extremely unlikely that it is causing the fault as even if it locked the PCIe bus up the fans would not have gone on full blast. Some have the cheaper E200 controller that only does RAID 0/1/10
Avatar of sglee

ASKER

So what you are saying is: don't bother checking whether a P400 controller comes with the new HP ProLiant DL380 G5 server?
Just move the controller from current server to the new server, right?
Avatar of sglee

ASKER

What parts do I have to pay attention when buying another HP ProLiant DL380 G5 Server?
Obviously I need to have the same amount of RAM or more.
How about CPU?
Any other component that I have to make sure?
If needed you move the RAM to the new box. CPUs too although that depends on the mobo rev as mentioned earlier. Do you want me to go to site and do it for you? I am in UK South although I will be drunk for most of the coming weekend.
Avatar of sglee

ASKER

I am going to just buy a whole server with RAM and CPU, and then move the hard drives and the P400 controller from the existing server to the new server. I am not fond of changing the mobo; it is too risky. Used servers are cheap on eBay.
In effect you are changing the mobo via your method as well as mine. The only difference is I would pull it from the metal chassis whereas you will swap it as a mobo+chassis combination. The real catch is the two variants - HP called both G5 but Dell sensibly added a generation in their numbering when Intel made a new CPU/chipset flavour.
Avatar of sglee

ASKER

I purchased the same model from eBay as a backup. Should the production server go bad, I am going to transfer the RAID controller along with the hard drives.
Meanwhile I am going to set up ESXi on the backup server and restore the virtual machines from the backup system to see if that works too.
That way I will have two options should the production server go down permanently.
Avatar of sglee

ASKER

I have another question.
If I can take the P400 RAID controller along with the hard drives out of the current server (HP ProLiant DL380 G5), install them in another server and get it working, is it possible to do the same with a model newer than the HP ProLiant DL380 G5?
The reason for asking is that, for some reason, on the HP ProLiant DL380 G5 server I can't install Windows 7 or 10 as a virtual machine. I could only install an XP virtual machine.