Member_2_6538061 asked:

Virtual Machine Stops Responding - VMWARE ESXI

I have a couple of VMs running on a VMware ESXi host, running esxi-6.5.0-20170104001-standard. One of those VMs routinely stops responding, both in the VMware console and over SSH. I have to log into the console, where my only option is to shut the VM down and start it again. The server comes back up, runs for a while, and then stops again. That might take a couple of days, or it might be shorter.
This VMware host also runs another VM that is essentially the same, a web server. That VM is rock solid and runs really well, with no downtime.

Host details:  
model:  PowerEdge T710
CPU:  8 CPUs x Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
24 gigs ram
datastore 1
datastore 2

The faulty vm (1) is configured with
4 cpus
8 gigs ram
Ubuntu 20.04
Originally installed on datastore 2 and moved to datastore 1 to see if the datastore was the problem.

The rock solid vm (2) is configured with:
4 cpus
8 gigs ram
This server is on datastore 1 and is solid.  

Another rock solid vm (3)
2 cpus
8 gigs ram
datastore 2

I had VM (1) on another host running ESXi 6.5, but had the same problems. I installed a fresh Ubuntu Server 20.04 on the current host and moved the database and web files over, essentially a clean install. I am still experiencing the problem, and only on this one VM. I ran tail -f to watch the vmware-vmsvc-root-log file, and when the server stopped responding, the output below is what was on the screen. I then hit shutdown in the VMware console and started the VM again.

User generated image

I would appreciate it if anybody could give some insight into how best to troubleshoot this and figure out where the problem is: the Ubuntu server, the VM configuration, or the VM host.

Thanks in advance.

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Can I just confirm, do you mean Shutdown Guest OS?

ASKER

No. I log into the VMware console and the VM shows it is running, but the guest OS is showing the image from above and does not respond to anything. The only choice I have is to restart the VM. I am unable to interact with the guest OS, in this case Ubuntu 20.04, because it is frozen and will stay like that until I power off the VM and power it back on.
Okay, that's what I wanted to confirm: it's a Power Off and Power On, and not a Shutdown (Guest OS).
noci

Were the log files in the guest checked after the machine was restarted? Anything there?
The kernel log, or wherever critical or emergency messages are logged?
It may help to set up an external log server and send syslog output live to that server.
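
For anyone without a log server to hand, here is a rough sketch of a throwaway receiver. It assumes the guest forwards syslog over UDP (for example with an rsyslog rule like `*.* @logserver:5514`). The port 5514, the file name remote-syslog.log and the names `parse_priority`, `SyslogHandler` and `serve` are made up for this example and do not come from the thread.

```python
import re
import socketserver

# Hypothetical output file for this sketch.
LOG_PATH = "remote-syslog.log"

# Syslog messages start with a priority value in angle brackets, e.g. "<34>".
PRI_RE = re.compile(r"^<(\d{1,3})>")

SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def parse_priority(message):
    """Return (facility, severity_name) for a raw syslog line, or None if it has no PRI."""
    match = PRI_RE.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    # PRI encodes facility * 8 + severity.
    return pri // 8, SEVERITIES[pri % 8]


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        line = data.decode("utf-8", errors="replace").strip()
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(f"{self.client_address[0]} {line}\n")


def serve(host="0.0.0.0", port=5514):
    """Listen for UDP syslog datagrams and append each one to LOG_PATH."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on another machine and tailing remote-syslog.log there should keep the last lines the guest managed to send before it froze, even if they never reach the guest's own disk.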
I will post those tomorrow. I did look through them but didn't see anything particularly useful. I will also set up the syslog server.
I reviewed the logs and found the following in the syslog:
Dec 11 00:00:12 servername multipathd[695]: sda: add missing path
Dec 11 00:00:12 servername multipathd[695]: sda: failed to get udev uid: Invalid argument
Dec 11 00:00:12 servername multipathd[695]: sda: failed to get sysfs uid: Invalid argument
Dec 11 00:00:12 servername multipathd[695]: sda: failed to get sgio uid: No such file or directory

Googled it and found some discussion about adding a parameter to the *.vmx file
disk.EnableUUID = "TRUE"

or Edit Settings -> Options tab -> General -> Configuration Parameters in ESX UI.

"The problem is that VMWare by default doesn't provide the information needed by udev to generate /dev/disk/by-id entries. Apart from ESX, VMWare Workstation (my case) is also affected. After rebooting VM with this parameter set, the disk are visible in /dev/disk/by-id and multipathd doesn't complain anymore."


I have entered this into the vmware parameters of the vm and syslog stopped complaining.  I am giving it a few hours to see if the server stays up.  

I will repost here with updated status.  


Just to add, this must be a local configuration issue (somewhere), as we have many Ubuntu 20.04 virtual machines, here and across many clients, with no issues, and we don't have disk.EnableUUID = "TRUE" in the configuration file.

When you created the VM, did you select Ubuntu 64-bit?

It would also be interesting to check your build of ESXi.
I am running ESXi version 6.5.  

Guest OS:  Ubuntu Linux (64-bit)
Compatibility ESXi 6.5 and later (VM version 13)
VMware Tools Yes
CPUs 4
Memory 8 GB

While adding disk.EnableUUID = "TRUE" removed the errors in the Syslog, it didn't help the overall issue.  

I had started a tail -F /var/log/syslog and when it crashed, I found the following:
kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1441!  
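
If the syslog is long, a small script can pull out lines like this one. The sketch below assumes the message always has the same shape as the line above ("kernel BUG at <file>:<line>!"). The function name `find_kernel_bugs` is made up for this example.

```python
import re

# Matches lines such as: "kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1441!"
KERNEL_BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")


def find_kernel_bugs(lines):
    """Return a list of (source_path, line_number) for each kernel BUG line found."""
    hits = []
    for text in lines:
        match = KERNEL_BUG_RE.search(text)
        if match:
            hits.append((match.group("path"), int(match.group("line"))))
    return hits
```

It accepts any iterable of lines, so an open file object for /var/log/syslog (or for a copy kept on a remote log server) can be passed in directly.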

Googled it and found a VMware KB article: https://kb.vmware.com/s/article/2151480
The issue was fixed in 6.5 U1, so either upgrade or, if you don't want to upgrade, use the workaround below:
  • Add the vmxnet3.rev.30 = FALSE parameter in the vmx file of virtual machine:
    1. Power off the virtual machine 
    2. Edit the vmx file and add the below parameter:
      vmxnet3.rev.30 = FALSE 
    3. Power on the virtual machine
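
The edit in step 2 can also be scripted. This is only a sketch that works on the text of a .vmx file: it sets a parameter if it is missing and replaces it if it is already there. It writes values in the quoted `key = "value"` form used for disk.EnableUUID earlier in this thread, which is an assumption about what the file expects, and the function name `set_vmx_param` is made up for this example. Work on a copy of the file, with the VM powered off.

```python
import re


def set_vmx_param(vmx_text, key, value):
    """Return vmx_text with `key = "value"` set, replacing an existing entry if present."""
    new_line = f'{key} = "{value}"'
    # Match an existing "key = ..." line, ignoring case and leading spaces or tabs.
    pattern = re.compile(
        rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.IGNORECASE | re.MULTILINE
    )
    if pattern.search(vmx_text):
        return pattern.sub(lambda _match: new_line, vmx_text, count=1)
    # No existing entry: append one, making sure the previous line is terminated.
    if vmx_text and not vmx_text.endswith("\n"):
        vmx_text += "\n"
    return vmx_text + new_line + "\n"
```

For example, `set_vmx_param(text, "vmxnet3.rev.30", "FALSE")` returns the updated text, which then has to be written back to the .vmx file before powering the VM on again.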

I added the parameter and have started the VM again. I will let you know how it goes. The very weird thing is that the other VM (identical configuration and purpose) isn't affected at all. I have a monitor set up that will tell me if it crashes. I will update the post when I know more. Currently, uptime is 25 minutes and counting...
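
For reference, a monitor like that can be as simple as the sketch below. This is not the monitor actually used here, just a minimal example that assumes a refused or timed-out TCP connection to the SSH port is a good enough sign that the guest has hung. The names `port_is_open` and `watch` and the 60-second interval are made up for this example.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port=22, interval=60):
    """Poll host:port forever and print a timestamped line whenever the state changes."""
    last_state = None
    while True:
        state = port_is_open(host, port)
        if state != last_state:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {host}:{port} is {'up' if state else 'DOWN'}")
            last_state = state
        time.sleep(interval)
```

Running `watch("servername")` from another machine gives a rough timestamp for each freeze, which can then be matched against the syslog entries.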



ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
This solution is only available to members.
It might also depend on usage patterns, which would explain why one VM works and the other doesn't...

Probably won't know either way.  Both are configured the same, only the stable one is much busier.  Whatever the case, it probably isn't worth the effort to dig that far down.  The parameter change seems to have worked.  I have an uptime of 6 hours now.  
I am working on upgrading ESXi. I need to ensure the backups are complete and that I can recover if something goes wrong during the upgrade. Thanks for the help and for checking back in. Appreciate it.
No problems, it's always worth keeping on top of ESXi updates, as VMware sneak in things like this, as well as important CVE security updates!

Especially around new OS releases: you cannot expect a new OS release to work on, or to have been tested on, versions of ESXi that came out before the OS was released.