Windows 2016 Data Center VM Crashes

Thomas Grassi
Windows 2016 Data Center.
VMware vCenter ESXi 6.5
Exchange 2016 CU 10 Enterprise.

I have a DAG set up with 2 nodes.

Node 2 crashes every so often.

I get these event errors:

Log Name:      System
Source:        MSExchange Server
Date:          3/23/2019 2:22:29 PM
Event ID:      9009
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      TGCS021-N2.our.network.tgcsnet.com
Description:
Microsoft Exchange Server 'TGCS021-N2' initiated bug check for server 'TGCS021-N2'. (Source: ActiveMonitoring, Identity: ServiceHealthActiveManagerForceReboot, Context: <LocalThrottlingResult IsPassed="true" MinimumMinutes="720" TotalInOneHour="0" MaxAllowedInOneHour="-1" TotalInOneDay="0" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" />
<GroupThrottlingResult IsPassed="true" TotalRequestsSent="2" TotalRequestsSucceeded="2" MinimumMinutes="600" TotalInOneDay="0" MaxAllowedInOneDay="4" ThrottlingInProgressServers="TGCS021-N2" RecoveryInProgressServers="" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" Comment="">
  <ServerStats>
    <TGCS021-N2 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="true" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T19:20:14.9290778Z" SystemBootTimeUtc="2019-03-21T19:09:32.495102Z" />
    <TGCS021-N1 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="false" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T03:59:12.4006239Z" SystemBootTimeUtc="2019-03-21T03:57:21.48871Z" />
  </ServerStats>
</GroupThrottlingResult>, Reason: Responder initiated)
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="MSExchange Server" />
    <EventID Qualifiers="32768">9009</EventID>
    <Level>3</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2019-03-23T18:22:29.505686600Z" />
    <EventRecordID>148673</EventRecordID>
    <Channel>System</Channel>
    <Computer>TGCS021-N2.our.network.tgcsnet.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>TGCS021-N2_TGCS021-N2_2019-03-23T18:05:57.1753728+00:00</Data>
    <Data>2019-03-23T18:05:57.1753728+00:00</Data>
    <Data>2019-03-23T18:05:57.1753728+00:00</Data>
    <Data>ActiveMonitoring</Data>
    <Data>ServiceHealthActiveManagerForceReboot</Data>
    <Data>&lt;LocalThrottlingResult IsPassed="true" MinimumMinutes="720" TotalInOneHour="0" MaxAllowedInOneHour="-1" TotalInOneDay="0" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" /&gt;
&lt;GroupThrottlingResult IsPassed="true" TotalRequestsSent="2" TotalRequestsSucceeded="2" MinimumMinutes="600" TotalInOneDay="0" MaxAllowedInOneDay="4" ThrottlingInProgressServers="TGCS021-N2" RecoveryInProgressServers="" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" Comment=""&gt;
  &lt;ServerStats&gt;
    &lt;TGCS021-N2 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="true" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T19:20:14.9290778Z" SystemBootTimeUtc="2019-03-21T19:09:32.495102Z" /&gt;
    &lt;TGCS021-N1 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="false" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T03:59:12.4006239Z" SystemBootTimeUtc="2019-03-21T03:57:21.48871Z" /&gt;
  &lt;/ServerStats&gt;
&lt;/GroupThrottlingResult&gt;</Data>
    <Data>Responder initiated</Data>
    <Data>svchost</Data>
    <Data>TGCS021-N2</Data>
    <Data>TGCS021-N2</Data>
    <Data>MSExchangeHMWorker</Data>
  </EventData>
</Event>


Log Name:      System
Source:        Microsoft-Windows-Kernel-PnP
Date:          3/23/2019 2:15:56 PM
Event ID:      219
Task Category: (212)
Level:         Warning
Keywords:      
User:          SYSTEM
Computer:      TGCS021-N2.our.network.tgcsnet.com
Description:
The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{b70da062-66d8-11e8-a815-806e6f6e6963}#0000000008100000.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-PnP" Guid="{9C205A39-1250-487D-ABD7-E831C6290539}" />
    <EventID>219</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>212</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2019-03-23T18:15:56.677157700Z" />
    <EventRecordID>148515</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="432" />
    <Channel>System</Channel>
    <Computer>TGCS021-N2.our.network.tgcsnet.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="DriverNameLength">70</Data>
    <Data Name="DriverName">SWD\WPDBUSENUM\{b70da062-66d8-11e8-a815-806e6f6e6963}#0000000008100000</Data>
    <Data Name="Status">3221226341</Data>
    <Data Name="FailureNameLength">14</Data>
    <Data Name="FailureName">\Driver\WudfRd</Data>
    <Data Name="Version">0</Data>
  </EventData>
</Event>




Log Name:      System
Source:        Microsoft-Windows-WER-SystemErrorReporting
Date:          3/23/2019 2:18:36 PM
Event ID:      1001
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      TGCS021-N2.our.network.tgcsnet.com
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x000000ef (0xffffe6828a114080, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: d607026c-ff48-4d16-88b6-f5ec84d002ca.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-WER-SystemErrorReporting" Guid="{ABCE23E7-DE45-4366-8631-84FA6C525952}" EventSourceName="BugCheck" />
    <EventID Qualifiers="16384">1001</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2019-03-23T18:18:36.298446000Z" />
    <EventRecordID>148514</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>System</Channel>
    <Computer>TGCS021-N2.our.network.tgcsnet.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="param1">0x000000ef (0xffffe6828a114080, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)</Data>
    <Data Name="param2">C:\Windows\MEMORY.DMP</Data>
    <Data Name="param3">d607026c-ff48-4d16-88b6-f5ec84d002ca</Data>
  </EventData>
</Event>


Anyone have an idea?

Thanks Tom
Commented:
The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{b70da062-66d8-11e8-a815-806e6f6e6963}#0000000008100000.

Do you have any plug and play/USB devices connected to the VM?
How many hard disks are connected in the VM?
All the hard disks are from SAN storage datastore?
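
The `Status` value in the Event 219 data above is logged in decimal (3221226341), which is hard to search for. A minimal sketch, assuming Python is handy, that converts it to the hexadecimal NTSTATUS form and splits out the standard bit fields. `decode_ntstatus` is a hypothetical helper name, not part of any Windows tool.

```python
def decode_ntstatus(value: int) -> dict:
    """Split a 32-bit NTSTATUS value into its standard bit fields."""
    value &= 0xFFFFFFFF
    return {
        "hex": f"0x{value:08X}",
        "severity": value >> 30,           # 0=success, 1=informational, 2=warning, 3=error
        "customer": (value >> 29) & 0x1,   # 0 means a Microsoft-defined code
        "facility": (value >> 16) & 0xFFF,
        "code": value & 0xFFFF,
    }

# Status value copied from the Event 219 <Data Name="Status"> field
status = decode_ntstatus(3221226341)
print(status)
```

The hex form is 0xC0000365, an error-severity code. I believe this maps to STATUS_FAILED_DRIVER_ENTRY in ntstatus.h, but that name is from memory and worth confirming against the header.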
Alex, Senior Infrastructure Analyst

Commented:
A couple of things,

You might want to redact some details in your errors: server name, domain, etc.

So, are any other servers on this host having issues? Have you checked that the server isn't overloaded? 0x000000ef means a critical process has failed, which could be brought on by the co-stop (%CSTP) times being too high, crashing the kernel and then causing the server to reboot.

How have you configured your VM, and what is the spec of your hardware? If node 2 is crashing every so often, it could be happening during an indexing run that causes it to fall over (I've seen it before).

Check your Host as well, run ESXTOP and check your times for your second node.
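
If watching the live ESXTOP screen is awkward, batch mode (`esxtop -b -d 5 -n 60 > esxtop.csv`) writes the samples to a CSV that can be scanned offline. A rough sketch of that scan, assuming the column headers contain `Group Cpu(<id>:<vm name>)` and end in `% CoStop` or `% Ready`. The thresholds, the group id and the sample rows are made up for illustration, not real output from this host.

```python
import csv
import io

def flag_cpu_contention(csv_text, vm_name, limits=None):
    """Return {counter: peak value} for counters whose peak exceeds its limit."""
    # Assumed rule-of-thumb thresholds, adjust to taste
    limits = limits or {"% CoStop": 3.0, "% Ready": 5.0}
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    flagged = {}
    for idx, name in enumerate(header):
        if "Group Cpu(" not in name or vm_name not in name:
            continue
        for counter, limit in limits.items():
            if name.endswith("\\" + counter):
                peak = max(float(row[idx]) for row in samples)
                if peak > limit:
                    flagged[counter] = peak
    return flagged

# Tiny hand-written sample in the same shape as esxtop batch output
prefix = "\\\\host\\Group Cpu(1234:TGCS021-N2)\\"
sample = (
    f'"(PDH-CSV 4.0) (UTC)(0)","{prefix}% Ready","{prefix}% CoStop"\n'
    '"03/23/2019 18:00:00","2.10","0.40"\n'
    '"03/23/2019 18:00:05","7.50","4.20"\n'
)
print(flag_cpu_contention(sample, "TGCS021-N2"))
```

An empty result means neither counter went over its limit during the capture window.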

The other thing to do is check the hardware: drop the host into maintenance mode and then run a comprehensive memory check.

Regards
Alex
Thomas Grassi, Systems Administrator

Author

Commented:
Afthab T

Do you have any plug and play/USB devices connected to the VM? No.
How many hard disks are connected in the VM? 3: Drive C, Drive F and Drive G.
All the hard disks are from SAN storage datastore? All my VMs are on NFS storage.


Alex

Thanks for the heads up.

So, are any other servers on this host having issues? No, this is the only server having this issue at this time.

How have you configured your VM and what is the spec on your hardware? The VM has 32 GB of static RAM, 8 CPUs and 3 disk drives (thin).
All my VMs are set up the same way; they just differ in memory, CPU count and disk size.

I will schedule some time to run ESXTOP.
I will also have to schedule time to drop into maintenance mode.

Thanks

Any other ideas or suggestions welcome.
Alex, Senior Infrastructure Analyst

Commented:
Hey ya,

ESXTOP doesn't need downtime, you can just enable SSH and jump on.

How are you configuring your CPUs? As in, how many cores/sockets are you using, and what is the spec on your host?

The memory check could be the important one.

Thanks
Alex
Thomas Grassi, Systems Administrator

Author

Commented:
Alex,

I activated SSH on the ESXi host, SSH'd to the host and ran ESXTOP.

What should I look for?


How are you configuring your CPUs? As in, how many cores/sockets are you using, and what is the spec on your host?

cpu socket
Thomas Grassi, Systems Administrator

Author

Commented:
It just crashed again. This is crazy.
Thomas Grassi, Systems Administrator

Author

Commented:
It is now crashing daily. It is a Windows 2016 issue, not a VMware issue.

Need help.
Senior Infrastructure Analyst
Commented:
Right, I'm back.

Your socket configuration is WRONG. You need to bring that down to a single socket with 8 cores. Then see if that helps.
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
0x000000EF is very bad.

This means a critical process has died.

The most common reason for this is memory starvation, so RAM exhausts + then swap space fills up.

You can use directions here https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0xef--critical-process-died for info about how to analyze your C:\Windows\MEMORY.DMP file + likely this won't help.

Analyzing your dump file will tell you about the process that died, which might relate to a primary problem (like a driver aborting).

More than likely you'll find some odd error that makes no sense, because edge conditions like all RAM + Swap exhausted are very difficult to catch + report in any useful way.

If your memory.dmp file makes no sense, look through your system logs just prior to the crash + you might find some clues as to the primary cause.
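
One way to line up the system log entries around a crash is to sort them by their `SystemTime` values. A small sketch using the three timestamps from the events posted in the question. `parse_event_time` is a hypothetical helper, and the labels are just shorthand for the event sources.

```python
from datetime import datetime, timezone

def parse_event_time(stamp):
    """Parse an Event Log SystemTime string, trimming the fraction to microseconds."""
    base, _, frac = stamp.rstrip("Z").partition(".")
    micro = int((frac + "000000")[:6])
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(
        microsecond=micro, tzinfo=timezone.utc
    )

# (label, SystemTime) pairs copied from the Event Xml blocks in the question
events = [
    ("9009 MSExchange Server: initiated bug check", "2019-03-23T18:22:29.505686600Z"),
    ("219 Kernel-PnP: WudfRd failed to load", "2019-03-23T18:15:56.677157700Z"),
    ("1001 BugCheck: rebooted from 0x000000ef", "2019-03-23T18:18:36.298446000Z"),
]

timeline = sorted((parse_event_time(ts), label) for label, ts in events)
for when, label in timeline:
    print(when.strftime("%H:%M:%S"), label)
```

Note the times are UTC. The 9009 event data also carries an earlier 18:05:57 timestamp, which may be worth comparing against the other entries.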

This is always a tough one to debug.
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
Note: I'm imagining you've installed all updates on this machine. If not, then install all updates now + see if problem clears.
Alex, Senior Infrastructure Analyst

Commented:
To be honest,

With John's explanation, that could correlate with the core setup: you're throwing your vNUMA and pNUMA all over the place, so the VM can't process the memory assigned to it, which then causes it to crash.

Possible.

Cheers

Alex
Top Expert 2005

Commented:
You should be able to select 2 vCPUs with 4 cores each, or 4 vCPUs with 2 cores each. You also want to make sure VMware Tools is installed on the guest.
Alex, Senior Infrastructure Analyst

Commented:
No, you can't do that. Because he's on ESXi 6.5, you prioritise cores over sockets.
Thomas Grassi, Systems Administrator

Author

Commented:
Guys, I will make the CPU changes tonight.

Will post results.
Top Expert 2005

Commented:
After checking into this a bit more, vNUMA doesn't come into play until the socket count is 9 or more.  He is below that.

Having said that, the preferred method, based on smarter people at VMware, is to exhaust core count before incrementing socket count (as alluded to above).

So, what this means to you is:

1)  Confirm the real number of cores on the Physical CPU so you don't assign more than you have.
2)  Favour vCore over vCPU.   So, 1 vCPU (socket), 8 cores.
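
The two points above can be sketched as a quick check. This is only an illustration of the "fewest sockets first, never more cores than the physical CPU has" rule from this thread. `topology_options` is a hypothetical helper, and the physical core counts (8 and 4) are example values, since the host spec was not posted.

```python
def topology_options(vcpus, physical_cores_per_socket):
    """List (sockets, cores_per_socket) splits of vcpus, fewest sockets first."""
    options = []
    for sockets in range(1, vcpus + 1):
        if vcpus % sockets:
            continue  # skip splits that do not divide evenly
        cores = vcpus // sockets
        # Point 1: never assign more cores per socket than the physical CPU has
        if cores <= physical_cores_per_socket:
            options.append((sockets, cores))
    return options

# Point 2: the first entry is the preferred layout (cores favoured over sockets)
print(topology_options(8, 8))
print(topology_options(8, 4))
```

With 8 physical cores per socket the first option is 1 socket with 8 cores. With only 4 physical cores per socket the first valid option becomes 2 sockets with 4 cores each.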

Based on all the Proliants we have, we try to optimize memory lanes by evening out the socket/core counts for VMs that need performance, so we balance the sockets which in turn use different memory channels.  For most, this wouldn't be really noticeable, so point 2 is fine and what VMware prefers.
Thomas Grassi, Systems Administrator

Author

Commented:
Guys, on the server I see this for the CPUs:

See attached image of task manager.

MY CPU setup
Thomas Grassi, Systems Administrator

Author

Commented:
Update

Server crashed today, so I powered it off and changed the CPU configuration. Now I have 2 sockets and 8 virtual processors.

Will let it run and see how it goes.
Thomas Grassi, Systems Administrator

Author

Commented:
Thank you guys

After I changed the sockets to 4 and changed the network adapters from E1000E to VMXNET 3, the server has been up for several days.
