
Windows 2016 Data Center VM Crashes

Windows 2016 Data Center.
VMware vCenter, ESXi 6.5
Exchange 2016 CU 10 Enterprise.

I have a DAG set up with 2 nodes.

Node 2 crashes every so often.

I get these event errors:

Log Name: System
Source: MSExchange Server
Date: 3/23/2019 2:22:29 PM
Event ID: 9009
Task Category: General
Level: Warning
Keywords: Classic
User: N/A
Computer: TGCS021-N2.our.network.tgcsnet.com
Description:
Microsoft Exchange Server 'TGCS021-N2' initiated bug check for server 'TGCS021-N2'. (Source: ActiveMonitoring, Identity: ServiceHealthActiveManagerForceReboot, Context: <LocalThrottlingResult IsPassed="true" MinimumMinutes="720" TotalInOneHour="0" MaxAllowedInOneHour="-1" TotalInOneDay="0" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" />
<GroupThrottlingResult IsPassed="true" TotalRequestsSent="2" TotalRequestsSucceeded="2" MinimumMinutes="600" TotalInOneDay="0" MaxAllowedInOneDay="4" ThrottlingInProgressServers="TGCS021-N2" RecoveryInProgressServers="" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" Comment="">
<ServerStats>
<TGCS021-N2 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="true" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T19:20:14.9290778Z" SystemBootTimeUtc="2019-03-21T19:09:32.495102Z" />
<TGCS021-N1 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="false" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T03:59:12.4006239Z" SystemBootTimeUtc="2019-03-21T03:57:21.48871Z" />
</ServerStats>
</GroupThrottlingResult>, Reason: Responder initiated)
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="MSExchange Server" />
<EventID Qualifiers="32768">9009</EventID>
<Level>3</Level>
<Task>1</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2019-03-23T18:22:29.505686600Z" />
<EventRecordID>148673</EventRecordID>
<Channel>System</Channel>
<Computer>TGCS021-N2.our.network.tgcsnet.com</Computer>
<Security />
</System>
<EventData>
<Data>TGCS021-N2_TGCS021-N2_2019-03-23T18:05:57.1753728+00:00</Data>
<Data>2019-03-23T18:05:57.1753728+00:00</Data>
<Data>2019-03-23T18:05:57.1753728+00:00</Data>
<Data>ActiveMonitoring</Data>
<Data>ServiceHealthActiveManagerForceReboot</Data>
<Data><LocalThrottlingResult IsPassed="true" MinimumMinutes="720" TotalInOneHour="0" MaxAllowedInOneHour="-1" TotalInOneDay="0" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" />
<GroupThrottlingResult IsPassed="true" TotalRequestsSent="2" TotalRequestsSucceeded="2" MinimumMinutes="600" TotalInOneDay="0" MaxAllowedInOneDay="4" ThrottlingInProgressServers="TGCS021-N2" RecoveryInProgressServers="" ChecksFailed="" TimeToRetryAfter="0001-01-01T00:00:00.0000000" Comment="">
<ServerStats>
<TGCS021-N2 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="true" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T19:20:14.9290778Z" SystemBootTimeUtc="2019-03-21T19:09:32.495102Z" />
<TGCS021-N1 TotalSearched="0" MostRecentEntryStartTimeUtc="0001-01-01T00:00:00" MostRecentEntryEndTimeUtc="0001-01-01T00:00:00" TotalActionsInADay="0" IsThrottlingInProgress="false" IsRecoveryInProgress="false" HostProcessStartTimeUtc="2019-03-21T03:59:12.4006239Z" SystemBootTimeUtc="2019-03-21T03:57:21.48871Z" />
</ServerStats>
</GroupThrottlingResult></Data>
<Data>Responder initiated</Data>
<Data>svchost</Data>
<Data>TGCS021-N2</Data>
<Data>TGCS021-N2</Data>
<Data>MSExchangeHMWorker</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-Kernel-PnP
Date: 3/23/2019 2:15:56 PM
Event ID: 219
Task Category: (212)
Level: Warning
Keywords:
User: SYSTEM
Computer: TGCS021-N2.our.network.tgcsnet.com
Description:
The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{b70da062-66d8-11e8-a815-806e6f6e6963}#0000000008100000.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-Kernel-PnP" Guid="{9C205A39-1250-487D-ABD7-E831C6290539}" />
<EventID>219</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>212</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2019-03-23T18:15:56.677157700Z" />
<EventRecordID>148515</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="432" />
<Channel>System</Channel>
<Computer>TGCS021-N2.our.network.tgcsnet.com</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="DriverNameLength">70</Data>
<Data Name="DriverName">SWD\WPDBUSENUM\{b70da062-66d8-11e8-a815-806e6f6e6963}#0000000008100000</Data>
<Data Name="Status">3221226341</Data>
<Data Name="FailureNameLength">14</Data>
<Data Name="FailureName">\Driver\WudfRd</Data>
<Data Name="Version">0</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-WER-SystemErrorReporting
Date: 3/23/2019 2:18:36 PM
Event ID: 1001
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: TGCS021-N2.our.network.tgcsnet.com
Description:
The computer has rebooted from a bugcheck. The bugcheck was: 0x000000ef (0xffffe6828a114080, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: d607026c-ff48-4d16-88b6-f5ec84d002ca.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-WER-SystemErrorReporting" Guid="{ABCE23E7-DE45-4366-8631-84FA6C525952}" EventSourceName="BugCheck" />
<EventID Qualifiers="16384">1001</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2019-03-23T18:18:36.298446000Z" />
<EventRecordID>148514</EventRecordID>
<Correlation />
<Execution ProcessID="0" ThreadID="0" />
<Channel>System</Channel>
<Computer>TGCS021-N2.our.network.tgcsnet.com</Computer>
<Security />
</System>
<EventData>
<Data Name="param1">0x000000ef (0xffffe6828a114080, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)</Data>
<Data Name="param2">C:\Windows\MEMORY.DMP</Data>
<Data Name="param3">d607026c-ff48-4d16-88b6-f5ec84d002ca</Data>
</EventData>
</Event>

Does anyone have an idea?

Thanks Tom

Afthab T

The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{b70da062-66d8-11e8-a815-806e6f6e6963}#0000000008100000.

Do you have any plug and play/USB devices connected to the VM?
How many hard disks are connected in the VM?
Are all the hard disks on a SAN storage datastore?
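For reference, the Status value in that event 219 (3221226341) is a decimal NTSTATUS code. A minimal sketch of decoding it is below. The symbolic name in the lookup table is taken from the Windows NTSTATUS reference as I remember it, not from anything in this thread, so treat it as an assumption and verify it.

```python
# Status value copied from the Kernel-PnP event 219 EventData in the question.
status_decimal = 3221226341

# NTSTATUS codes are normally written as 8-digit hex.
status_hex = f"0x{status_decimal:08X}"

# The top two bits of an NTSTATUS are the severity field.
SEVERITIES = {0: "success", 1: "informational", 2: "warning", 3: "error"}
severity = SEVERITIES[status_decimal >> 30]

# Tiny lookup table with only the code seen here.
# Assumption: name taken from the NTSTATUS reference, not from this thread.
KNOWN_CODES = {0xC0000365: "STATUS_FAILED_DRIVER_ENTRY"}
name = KNOWN_CODES.get(status_decimal, "unknown")

print(status_hex, severity, name)
```

If that name is right, the driver's initialization call failed, which on its own would not normally explain a reboot.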

Alex

A couple of things,

You might want to redact some details in your errors (server name, domain, etc.).

So, are any other servers on this host having issues? Have you checked that the server isn't overloaded? 0x000000ef is a critical process failing, which could be brought on by the %CSTP (co-stop) times being too high, thus crashing the kernel and then causing the server to reboot.
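As a rough sketch of reading that bugcheck line, the snippet below splits the event 1001 description into the code and its four parameters. The meaning given to the first two parameters follows Microsoft's bug check 0xEF page as I understand it, so treat those comments as an assumption. The dump file is still the real source of truth.

```python
import re

# Description text copied from the WER-SystemErrorReporting event 1001 above.
description = (
    "The computer has rebooted from a bugcheck. The bugcheck was: "
    "0x000000ef (0xffffe6828a114080, 0x0000000000000000, "
    "0x0000000000000000, 0x0000000000000000)."
)

match = re.search(r"bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)", description)
bugcheck_code = int(match.group(1), 16)
params = [int(p, 16) for p in match.group(2).split(", ")]

# Only the code from this thread is listed.
BUGCHECK_NAMES = {0xEF: "CRITICAL_PROCESS_DIED"}
bugcheck_name = BUGCHECK_NAMES.get(bugcheck_code, "unknown")

# Assumption (from the 0xEF documentation): parameter 1 is the address of the
# process object, and parameter 2 is 0 when a process died, 1 when a thread died.
died = "process" if params[1] == 0 else "thread"

print(f"{bugcheck_name}: a critical {died} died, object at {params[0]:#x}")
```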

How have you configured your VM and what is the spec on your hardware? If node 2 is crashing every so often, it could be happening during an index, and that causes it to fall over (seen it before).

Check your host as well: run ESXTOP and check the CPU times for your second node.

The other thing to do is check the hardware, drop it into maintenance mode and then run a comprehensive memory check.

Regards
Alex

Member_2_6492660_1

ASKER

Afthab T

Do you have any plug and play/USB devices connected to the VM? No.
How many hard disks are connected in the VM? 3: Drive C, Drive F and Drive G.
All the hard disks are from SAN storage datastore? All my VMs are on NFS storage.

Alex

Thanks for the heads up.

So, are any other servers on this host having issues? No, this is the only server having this issue at this time.

How have you configured your VM and what is the spec on your hardware? The VM has 32 GB of static RAM, 8 CPUs and 3 disk drives (thin provisioned).
All my VMs are built the same way; they just differ in memory, CPU count and disk size.

To run ESXTOP I will schedule some time.
To drop into maintenance mode I will also have to schedule time.

Thanks

Any other ideas or suggestions welcome.

Alex

Hey ya,

ESXTOP doesn't need downtime, you can just enable SSH and jump on.
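If you would rather capture numbers than watch the live screen, esxtop has a batch mode (for example `esxtop -b -d 5 -n 60 > esxtop.csv`) that writes CSV. Below is a minimal sketch that pulls the peak ready and co-stop values for one VM out of such a file. The column naming (`\\host\Group Cpu(id:vmname)\% Ready` and `\% CoStop`), the host name `esx01`, the group id and every sample value are assumptions for illustration, so check them against the header of your own capture.

```python
import csv
import io

def peak_cpu_contention(csv_text, vm_name):
    """Return the peak '% Ready' and '% CoStop' seen for one VM's CPU group."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = {}
    for idx, col in enumerate(header):
        # Assumed header shape: \\host\Group Cpu(<id>:<vm name>)\<counter>
        if "Group Cpu(" in col and f":{vm_name})" in col:
            for counter in ("% Ready", "% CoStop"):
                if col.endswith("\\" + counter):
                    wanted[counter] = idx
    peaks = {counter: 0.0 for counter in wanted}
    for row in reader:
        for counter, idx in wanted.items():
            peaks[counter] = max(peaks[counter], float(row[idx]))
    return peaks

# Made-up sample in the assumed layout; not real data from this host.
SAMPLE = (
    r'"(PDH-CSV 4.0) (UTC)(0)",'
    r'"\\esx01\Group Cpu(12345:TGCS021-N2)\% Ready",'
    r'"\\esx01\Group Cpu(12345:TGCS021-N2)\% CoStop"' "\n"
    '"03/23/2019 18:00:00","4.10","0.20"\n'
    '"03/23/2019 18:00:05","12.75","3.60"\n'
)

peaks = peak_cpu_contention(SAMPLE, "TGCS021-N2")
print(peaks)
```

As far as I know the group-level figures are summed across the VM's vCPUs, so divide by the vCPU count before comparing them with any per-vCPU rule of thumb.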

How are you configuring your CPUs? As in, how many cores/sockets are you using, and what is the spec on your host?

The memory check could be the important one.

Thanks
Alex

Member_2_6492660_1

ASKER

Alex,

I activated SSH on the ESXi host.

I SSH'd to the host and ran ESXTOP.

What should I look for?

How are you configuring your CPU's? As in how many cores/sockets are you using and what is the spec on your host?

Member_2_6492660_1

ASKER

It just crashed again. This is crazy.

Member_2_6492660_1

ASKER

It is now crashing daily. It is a Windows 2016 issue, not a VMware issue.

I need help.

ASKER CERTIFIED SOLUTION

Alex


David Favor

0x000000EF is very bad.

This means a critical process has died.

The most common reason for this is memory starvation, so RAM exhausts + then swap space fills up.

You can use directions here https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0xef--critical-process-died for info about how to analyze your C:\Windows\MEMORY.DMP file + likely this won't help.

Analyzing your dump file will tell you about the process that died, which might relate to a primary problem (like a driver aborting).

More than likely you'll find some odd error that makes no sense, because edge conditions like all RAM + Swap exhausted are very difficult to catch + report in any useful way.

If your memory.dmp file makes no sense, look through your system logs just prior to the crash + you might find some clues as to the primary cause.
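A minimal sketch of that kind of timeline check is below, using only the timestamps already posted in the question. Note that the event 9009 text itself says Exchange "initiated bug check" via the ServiceHealthActiveManagerForceReboot responder, and the timestamp inside its EventData is earlier than the time the event was logged. The labels are my own shorthand.

```python
import re
from datetime import datetime

def parse_event_time(text):
    """Parse an event timestamp; trims fractions beyond microseconds."""
    text = text.replace("Z", "+00:00")
    text = re.sub(r"(\.\d{6})\d+", r"\1", text)
    return datetime.fromisoformat(text)

# Timestamps copied from the three events in the question (all UTC).
events = [
    ("9009 logged (Exchange initiated bug check)", "2019-03-23T18:22:29.505686600Z"),
    ("1001 logged (rebooted from bugcheck 0xEF)", "2019-03-23T18:18:36.298446000Z"),
    ("219 logged (WudfRd failed to load)", "2019-03-23T18:15:56.677157700Z"),
    ("9009 EventData timestamp", "2019-03-23T18:05:57.1753728+00:00"),
]

timeline = sorted((parse_event_time(ts), label) for label, ts in events)
for when, label in timeline:
    print(when.isoformat(), label)
```

Sorted this way, the timestamp carried inside the 9009 event comes first, ahead of the driver warning and the bugcheck report, which were written during or after boot.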

This is always a tough one to debug.

David Favor

Note: I'm imagining you've installed all updates on this machine. If not, then install all updates now + see if problem clears.

Alex

To be honest,

With John's explanation, that could correlate with the core setup. You're throwing your vNUMA and pNUMA all over the place, so it can't process the memory assigned to it, which is then causing it to crash.

Possible.

Cheers

Alex

Netman66

You should be able to select 2 sockets with 4 cores each, or 4 sockets with 2 cores each. You also want to make sure VMware Tools is installed on the guest.

Alex

No, you can't do that. Because he's on ESXi 6.5, you prioritise cores over sockets.

Member_2_6492660_1

ASKER

Guys I will make the cpu changes tonight

Will post results

Netman66

After checking into this a bit more, vNUMA doesn't come into play by default until the vCPU count is 9 or more. He is below that.

Having said that, the preferred method, based on smarter people at VMware, is to exhaust core count before incrementing socket count (as alluded to above).

So, what this means to you is:

1) Confirm the real number of cores on the physical CPU so you don't assign more than you have.
2) Favour cores over sockets. So, 1 virtual socket with 8 cores.
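The two points above can be sketched as a small helper that lists the socket/core splits for a given vCPU count, fewest sockets first, and drops any split that needs more cores per socket than the physical CPU has. The physical core counts used below are assumptions, since the host spec was never posted in this thread.

```python
def topologies(total_vcpus, physical_cores_per_socket):
    """List (sockets, cores_per_socket) pairs, fewest sockets first.

    Pairs needing more cores per socket than the physical CPU offers
    are left out (point 1 above).
    """
    result = []
    for sockets in range(1, total_vcpus + 1):
        if total_vcpus % sockets == 0:
            cores = total_vcpus // sockets
            if cores <= physical_cores_per_socket:
                result.append((sockets, cores))
    return result

# Assumption: the host CPU has at least 8 cores per socket.
print(topologies(8, 8))

# Assumption: a 6-core physical CPU, where 1 socket x 8 cores does not fit.
print(topologies(8, 6))
```

The first entry in each list is the layout that favours cores over sockets (point 2).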

Based on all the Proliants we have, we try to optimize memory lanes by evening out the socket/core counts for VMs that need performance, so we balance the sockets which in turn use different memory channels. For most, this wouldn't be really noticeable, so point 2 is fine and what VMware prefers.

Member_2_6492660_1

ASKER

Guys, on the server I see this for the CPUs:

See attached image of task manager.

Member_2_6492660_1

ASKER

Update

The server crashed today, so I powered it off and changed the CPU configuration. I now have 2 sockets and 8 virtual processors.

Will let it run and see how it goes.

Member_2_6492660_1

ASKER

Thank you guys

I changed the sockets to 4 and changed the network adapters from E1000E to VMXNET3.

The server has been up for several days.