VM crashing intermittently

Hi,
I just started this job and inherited a troublesome HP ProLiant DL380e Gen8 running VMware ESXi 5.1 and hosting two VMs, Server1 and Server2. Server1 is configured badly: 4 500GB SATA drives as JBOD on an embedded controller with NO RAID, where 1 drive holds ESXi and the other 3 are spanned into one 1.5TB logical drive for datastore1 (Server1). Server2 is on a separate controller in slot 2 that has 4 500GB SATA drives in RAID-5, creating another 1.5TB logical volume.

Server1 randomly shuts off by itself, and it stalls at 95% both as it powers off and as it powers back up. Once Server1 powers off, the ESXi host becomes completely unresponsive to vSphere, and after about an hour Server2 powers off as well. The box itself stays powered on, but somebody has to go into the server room, do a hard shutdown, and then turn it back on. We can then power the VMs back up with vSphere, with Server1 hanging at 95% for several minutes until everything comes back up. This has happened once a day for the last 3 months.

I don't care for VMware, so I'm no pro, and I'm begging for help. The C: drive has 44GB free and the data partition E: has only 12.9GB free. The server is set to manage the swap file automatically on C:, and right now it's using about 1200MB for that. Where can I look to figure out what is going on? The other techs have even been on the phone with VMware, but those guys don't seem to be much help. Any ideas?

Thank you!!
JasonAsked:

Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
I think I would start this server from scratch.

0. Back up all VMs.

1. Get the RAID configured correctly.

2. Re-install the HP OEM version of ESXi on a USB flash drive or SD card. Use the latest 5.1 build and make sure it's patched.

3. Make sure all the server firmware is up to date using the HP SmartStart firmware DVD, including the network interfaces and the storage controller.

4. Restore the VMs and review the VM configuration.
0

JasonAuthor Commented:
Thanks Andrew.  That was going to be the end result but is there a log somewhere in ESXi that will tell me exactly why the server is powering off?
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
There are two places for logs:

/var/log/vmkernel.log - there may be some evidence in here.

Also, if you look in the VM folder on the datastore, there is a vmware.log; this is the log for the VM.

You may find something in there, but sometimes nothing is reported.

Also, have you checked the hardware for errors (e.g. memory) by running a memory test?

Also, what is the OS of the VMs?
0

JasonAuthor Commented:
The OS is Server 2012 Std.  Supposedly the guy before me ran diags.  Note, server2 stays up fine until server1 tanks the whole thing.
Thanks, I'll check out the logs.
0
JasonAuthor Commented:
So when I opened up the server, I discovered they never enabled the second processor on the board.  I switched the jumper and moved the RAM around but my question is now, will the VMs use that second processor automatically when it's available or do I have to configure something?  I'd like each vm to use a separate processor, right?
0
JasonAuthor Commented:
CPU/MMU virtualization is set to automatic.  Also, the Swapfile location is set to default.  What would provide the best performance for that?
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
"So when I opened up the server, I discovered they never enabled the second processor on the board.  I switched the jumper and moved the RAM around but my question is now, will the VMs use that second processor automatically when it's available or do I have to configure something?  I'd like each vm to use a separate processor, right?"

This would cause hardware issues. ESXi will use the memory and processor automatically. Also remember that VMware vSphere is licensed per socket; if you do not have the correct licenses, the license will be violated.

You may want to check what you are licensed for, if this is a paid-for license.

Also make sure the memory is balanced between the processors and is correct for the sockets.

"CPU/MMU virtualization is set to automatic.  Also, the Swapfile location is set to default.  What would provide the best performance for that?"

These are default values and normal.
0
JasonAuthor Commented:
I checked vSphere and it says VMware vSphere 5 Essentials, licensed for 2 physical CPUs (unlimited cores per CPU).  So I don't need to change anything to utilize both of them?  It doesn't seem like any of this should have worked on one CPU if, even now with two CPUs, Server1 was pegged at 95%.  It has calmed down a bit since posting and Server1 CPU utilization is steady around 45%, bouncing up to around 80%.  I'm wondering if the CPU maxing out is crashing the VM, but it doesn't make sense that it would also cause the ESXi host to crap out and stop responding.
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
"I checked the vSphere and it says VMware vSphere 5 Essentials Licensed for 2 physical CPUs (unlimited cores per CPU).  So I don't need to change anything to utilize both of them?"

No, you are fine.

"It doesn't seem like any of this should have worked using one CPU if,now with two cpus, Server 1 was pegged at 95%.  It has calmed down a bit since posting and server 1 CPU utilization is steady around 45%, bouncing up around 80%.  Wondering if maybe CPU maxing out is crashing the VM but doesn't make sense that it causes the ESXi host to crap out and not respond as well."

If a server has not been configured correctly, and a CPU exists in the socket but has been disabled, this is not a normal installation for a hypervisor, so anything could happen.

Now that you have changed the configuration, you need to test, test and test again, and observe whether the issues are repeated.

There are situations where an ESXi hypervisor's CPU and memory can max out and the host does become unresponsive.

I do not think you have given any information as to:

1. HOST specification, e.g. memory and CPUs.

2. VM specification.

E.g. if you have allocated silly CPU and memory figures to the VMs, this can have the effect you describe.
0
JasonAuthor Commented:
Well, we upgraded the firmware on drives 2-4, which hold the main datastore, and turned on that second CPU, and we were good for a day. But Server1 crapped out twice yesterday and made the host stop responding so we couldn't connect remotely, and an hour or so later Server2 bombed out and we had to do a power cycle.
Server1:  
RAM:  10240MB
CPUs: 4
Video Card: Video Card
VMCI device:  Restricted
SCSI Controller 0: LSI Logic Parallel
Hard Disk 1 : Virtual Disk
CD/DVD drive 1:  Client Device
Network Adapter 1:  Data Network

Server 2:  
RAM: 8192MB
CPUs: 4
Video Card: Video Card
VMCI device:  Restricted
SCSI Controller 0: LSI Logic Parallel
Hard Disk 1 : Virtual Disk
CD/DVD drive 1:  Client Device
Network Adapter 2:  Data Network
0
JasonAuthor Commented:
Sorry,
Server1:  
SCSI Controller 0:  LSI Logic SAS
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
And the host CPUs and memory?

So the total memory assigned to the VMs is 18GB

and 8 vCPUs (which could be a lot!)

What are the servers' functions?

And could you explain in more detail what happens when the server "bombs out"? That's not really a technical term!

see here

vSMP (virtual SMP) can hurt virtual machine performance when too many vCPUs are added to virtual machines that cannot use them effectively. Examples of servers that can use vSMP correctly: SQL Server, Exchange Server.

Many VMware administrators think adding lots of processors will increase performance. Wrong! (And because they can, they just go silly!) Sometimes there is confusion between cores and processors, but what we are adding is additional processors in the virtual machine.

So 4 vCPUs assigned to the VM makes it a 4-way SMP (quad-processor) server. If you have an Enterprise Plus license you can add 8 (and the OS will recognise them all only if you have the correct OS license).

If applications can take advantage of them (e.g. Exchange, SQL), adding additional processors can/may increase performance.

So the usual rule of thumb is: try 1 vCPU, then try 2 vCPUs, and knock back to 1 vCPU if performance is affected. Only use vSMP if the VM can take advantage of it.

Example: a VM with 4 vCPUs allocated!

My simple layman's explanation of the "scheduler"!

As you have assigned 4 vCPUs to this VM, the VMware scheduler has to wait until 4 cores are free and available. To do this, it has to pause the first cores until the 4th is available, and during this time the paused cores are not available for other processes. This is my simplistic view, but the bottom line is that adding more vCPUs to a VM may not give you the performance benefits you expect unless the VM and its applications are optimised for additional vCPUs.

See here
http://www.vmware.com/resources/techresources/10131

see here
http://www.gabesvirtualworld.com/how-too-many-vcpus-can-negatively-affect-your-performance/

http://www.zdnet.com/virtual-cpus-the-overprovisioning-penalty-of-vcpu-to-pcpu-ratios-4010025185/

also there is a document here about the CPU scheduler

www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf

https://blogs.vmware.com/vsphere/2013/10/does-corespersocket-affect-performance.html
0
JasonAuthor Commented:
The VMware.log might as well be in Chinese.  
02T22:33:52.052Z| vmx| I120: OvhdMem: memsize 10240 MB VMK fixed 74 pages var 0 pages cbrcOverhead 0 pages total 5199 pages
2015-04-02T22:33:52.052Z| vmx| I120: VMMEM: Maximum Reservation: 165MB (MainMem=10240MB SVGA=4MB) VMK=20MB
2015-04-02T22:33:52.087Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'ok', 'current', install possible
2015-04-02T22:33:52.090Z| vmx| I120: Destroying virtual dev for scsi0:0 vscsi=8192
2015-04-02T22:33:52.090Z| vmx| I120: VMMon_VSCSIStopVports: No such target on adapter
2015-04-02T22:33:52.092Z| vmx| I120: MKS PowerOff
2015-04-02T22:33:52.102Z| vmx| I120: scsi0:0: numIOs = 0 numMergedIOs = 0 numSplitIOs = 0 ( 0.0%)
2015-04-02T22:33:52.102Z| vmx| I120: Closing disk scsi0:0
2015-04-02T22:33:52.129Z| vmx| I120: DISKLIB-VMFS  : "/vmfs/volumes/50e5f795-df7cb1ce-fc16-ac162db20a54/LSServer1/LSServer1-000001-delta.vmdk" : closed.
2015-04-02T22:33:52.146Z| vmx| I120: DISKLIB-VMFS  : "/vmfs/volumes/50e5f795-df7cb1ce-fc16-ac162db20a54/LSServer1/LSServer1-flat.vmdk" : closed.
2015-04-02T22:33:52.278Z| vmx| I120: WORKER: asyncOps=2 maxActiveOps=1 maxPending=0 maxCompleted=1
2015-04-02T22:33:53.392Z| vmx| I120: Vix: [5763 mainDispatch.c:3867]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2015-04-02T22:33:53.392Z| vmx| I120: Vix: [5763 mainDispatch.c:3886]: VMAutomation: Ignoring ReportPowerOpFinished because the VMX is shutting down.

2015-04-02T22:33:53.533Z| vmx| I120: Vix: [5763 mainDispatch.c:3867]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2015-04-02T22:33:53.533Z| vmx| I120: Vix: [5763 mainDispatch.c:3886]: VMAutomation: Ignoring ReportPowerOpFinished because the VMX is shutting down.
2015-04-02T22:33:53.534Z| vmx| I120: Transitioned vmx/execState/val to poweredOff
2015-04-02T22:33:53.534Z| vmx| I120: VMX idle exit
2015-04-02T22:33:53.534Z| vmx| I120: VMIOP: Exit
2015-04-02T22:33:53.808Z| vmx| I120: Vix: [5763 mainDispatch.c:861]: VMAutomation_LateShutdown()
2015-04-02T22:33:53.808Z| vmx| I120: Vix: [5763 mainDispatch.c:811]: VMAutomationCloseListenerSocket. Closing listener socket.
2015-04-02T22:33:53.811Z| vmx| I120: Flushing VMX VMDB connections
2015-04-02T22:33:53.811Z| vmx| I120: VmdbDbRemoveCnx: Removing Cnx from Db for '/db/connection/#1/'
2015-04-02T22:33:53.811Z| vmx| I120: VmdbCnxDisconnect: Disconnect: closed pipe for pub cnx '/db/connection/#1/' (0)
2015-04-02T22:33:53.817Z| vmx| I120: VMX exit (0).
2015-04-02T22:33:53.817Z| vmx| I120: AIOMGR-S : stat o=2 r=6 w=0 i=0 br=98304 bw=0
2015-04-02T22:33:53.817Z| vmx| I120: OBJLIB-LIB : ObjLib cleanup done.
2015-04-02T22:33:53.817Z| vmx| W110: VMX has left the building: 0.


Is that saying April 2?  Because that is where the log stops.
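
For anyone else squinting at these lines, here is a minimal Python sketch of how the format can be read. It assumes each line is `timestamp| thread| level: message`, as in the excerpt above, and that the trailing `Z` means UTC; `parse_line` is a hypothetical helper name, not a VMware tool. The sample lines are copied verbatim from the excerpt.

```python
from datetime import datetime, timezone

# Three lines copied verbatim from the vmware.log excerpt above.
SAMPLE = """\
2015-04-02T22:33:52.092Z| vmx| I120: MKS PowerOff
2015-04-02T22:33:53.534Z| vmx| I120: Transitioned vmx/execState/val to poweredOff
2015-04-02T22:33:53.817Z| vmx| W110: VMX has left the building: 0.
"""


def parse_line(line):
    """Split 'timestamp| thread| level: message' into parts; return None if it does not fit."""
    parts = line.split("| ", 2)
    if len(parts) != 3:
        return None
    stamp, thread, rest = parts
    try:
        # Assumption: the trailing 'Z' marks the timestamp as UTC.
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    except ValueError:
        return None
    level, _, message = rest.partition(": ")
    return when, thread, level, message


entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
last = entries[-1]
print(last[0].isoformat(), last[2], last[3])
```

The last parsed entry carries the date 2015-04-02, so yes, the final line of this log was written on April 2.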
0
JasonAuthor Commented:
"What are the servers function?"

Server1 is a domain controller and hosts a SQL database for a program the users call Seradex but the computer calls ActiveERP.  It has a db called Seradex.

When it "bombs out" we can't see anything.  It just goes offline and we can't get to it remotely.  That is why I was looking for logs.
0
JasonAuthor Commented:
"Is that saying April 2?"
Duh, yeah, after I expanded the datastore browser, it says the VMware.log was last modified on 4.2.2015 on server1 (MainDataStore).
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
It's not good practice to have anything share a Domain Controller.

So what does Server2 do ?

What is the HOST CPU and MEMORY ?

HOST specifications.

you should have other vmware*.logs ?

do pings stop ?

does RDP stop ?

vSphere Client connected to ESXi server?

Console Output ?
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
I've also noticed your VM is running on a snapshot, which is not good. Also see my EE Article:

HOW TO: VMware Snapshots :- Be Patient

Do you have free space on the datastore?

If the snapshot fills the datastore, the VM will stop. Performance of a VM running on a snapshot will also be poor.
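
The clue is in the vmware.log excerpt posted earlier, where the VM's disk is closed as `LSServer1-000001-delta.vmdk` alongside `LSServer1-flat.vmdk`. As a rough Python sketch of spotting such files in a folder listing: the six-digit pattern is an assumption based only on the names seen in this thread, and `snapshot_files` is a hypothetical helper.

```python
import re

# Assumption: snapshot disks are named "<vm>-NNNNNN.vmdk" or "<vm>-NNNNNN-delta.vmdk",
# matching the "LSServer1-000001-delta.vmdk" name in the log excerpt.
SNAPSHOT_NAME = re.compile(r"-\d{6}(-delta)?\.vmdk$")


def snapshot_files(names):
    """Return the file names that look like snapshot disks."""
    return [name for name in names if SNAPSHOT_NAME.search(name)]


# The two disk files named in the vmware.log excerpt.
files = ["LSServer1-000001-delta.vmdk", "LSServer1-flat.vmdk"]
print(snapshot_files(files))  # ['LSServer1-000001-delta.vmdk']
```

Any match in the VM folder suggests the VM is writing to a snapshot delta rather than to its base disk.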
0
JasonAuthor Commented:
"It's not good practice to have anything share a Domain Controller."

That's what I've been saying.  Server1 and Server2 are DCs.  Server2 is the DHCP and DNS server as well and is a license server for Solidworks.  What is the point of having your domain controller and its backup on the same vm host, you ask?  Yeah, me too.

"What is the HOST CPU and MEMORY ?"
8 CPUs x 1.795 GHz
Intel Xeon E5
Memory:  24541.20 MB
MainDataStore:  1.36TB  Free:  81.88GB
DataStore2:  1.36TB  Free  1.21TB
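
To put the figures from this thread side by side, here is a rough Python sketch of the allocation arithmetic. All numbers are the ones quoted in the posts; the 1 TB = 1024 GB conversion is an assumption about how the client reports capacity.

```python
# VM allocations and host figures as quoted in this thread.
vms = {
    "Server1": {"vcpus": 4, "ram_mb": 10240},
    "Server2": {"vcpus": 4, "ram_mb": 8192},
}
host_cpus = 8
host_ram_mb = 24541.20
datastore_capacity_gb = 1.36 * 1024  # assumption: 1 TB reported as 1024 GB
datastore_free_gb = 81.88

total_vcpus = sum(vm["vcpus"] for vm in vms.values())
total_ram_mb = sum(vm["ram_mb"] for vm in vms.values())

vcpu_ratio = total_vcpus / host_cpus
ram_fraction = total_ram_mb / host_ram_mb
free_fraction = datastore_free_gb / datastore_capacity_gb

print(f"vCPUs per host CPU: {vcpu_ratio:.2f}")
print(f"VM RAM as a share of host RAM: {ram_fraction:.1%}")
print(f"MainDataStore free: {free_fraction:.1%}")
```

On these numbers the two VMs are assigned as many vCPUs as the host reports CPUs and about three quarters of host RAM, while MainDataStore has roughly 6% free.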
0
JasonAuthor Commented:
Wait, I just noticed something after typing that.
Server2:  
Provisioned Space:  158.79GB
Used Space:  158.79GB
Host CPU MHz:  2064
Mem:  8083

Server1:
Provisioned Space 2.93TB
Used:  1.28TB
Host Cpu:  2850
Mem:  10311
0
JasonAuthor Commented:
Yet in resource allocation it says:  
Server2   Hard Disk 1  DataStore2
Server1   Hard Disk 1  MainDataStore
0
JasonAuthor Commented:
Pings don't stop and Server2 stays up for about an hour but RDP stops as does vSphere and console output.
0
JasonAuthor Commented:
Snapshot.png
Snapshot?
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
Snapshots often do not appear in the Snapshot Manager.

Check the disks.
0
JasonAuthor Commented:
Alright, well I'm waiting for somebody to buy the new drives, I guess.  In the meantime, server1 hasn't crashed since I messed around with it.  Not sure why.  I'll close this issue since a rebuild is in order.  Oh one more thing, what did you see that made you think it was running on a snapshot?  I've been all over the place and haven't looked in the vm folder but was curious as to what you saw.
0
JasonAuthor Commented:
You are right though.  Server1.vmdk hasn't been modified since 4/1/15 and server1-0000001.vmdk exists.  What steps would you recommend to properly merge the two and get a good backup before I wipe it and start over?
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
First, check that you have enough disk space for the snapshot merge.

1. Create a new Snapshot (Take Snapshot)
2. Wait 60 seconds
3. Select DELETE ALL and Be Very Patient.

How large is the snapshot? You can check via the datastore browser.
0
JasonAuthor Commented:
Sorry for the delay.  Server1.vmdk is 1.3TB and the Server1-000001.vmdk is 56.5GB.
0
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
Okay, go ahead with my last post.
0
JasonAuthor Commented:
This thing is also showing me an alarm for Power Supply 3 and lost redundancy.  This server doesn't even have 2 power supplies let alone 3.  How do I disable this alarm?
0
JasonAuthor Commented:
In the datastore details, it shows MainDataStore has a capacity of 1.36TB and 78.53GB free.  Will it be ok?
0
JasonAuthor Commented:
Ok to do the snapshot, that is?
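
For a quick sanity check of the numbers, here is a small Python sketch comparing the snapshot size with the free space. The rule it applies (free space at least as large as the existing delta) is an assumption for illustration, not VMware guidance, and `has_headroom` is a hypothetical helper. The new snapshot taken in step 1 will also grow while the merge runs, so treat the result as a rough indication only.

```python
# Sizes reported in this thread, in GB.
snapshot_delta_gb = 56.5    # Server1-000001.vmdk
datastore_free_gb = 78.53   # MainDataStore free space


def has_headroom(free_gb, delta_gb, safety_factor=1.0):
    """Assumed rule of thumb: free space should be at least safety_factor x the delta size."""
    return free_gb >= delta_gb * safety_factor


print(has_headroom(datastore_free_gb, snapshot_delta_gb))
print(round(datastore_free_gb - snapshot_delta_gb, 2), "GB to spare")
```

Raising `safety_factor` gives a more cautious check; at 1.5, for example, the same figures no longer pass.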
0
JasonAuthor Commented:
Thanks Andrew.  HP is sending us a new motherboard and we bought 8 new 1TB hard drives.  Will be utilizing the solution outlined above this Friday night.  Fingers crossed.
0