lanmastersINC

Server 2016 VM crawling, unable to resolve DNS, can’t ping but is still accessible on the network

For the past few weeks I have been having an issue with a server and need some help figuring out what's going on.

Machine in question:
Windows Server 2016 on Hyper-V
6 cores
12GB RAM
Host has 10G Network card
Machine is also a DC
Primary function is file server

I found out about this problem when the backups were failing. The server was unable to connect to the networked BDR. Pinging the BDR worked fine, but resolving its admin portal in a browser failed. Rebooting the server fixes the issue, and it goes back to working perfectly fine for several hours until it errors out again, typically after about 7-10 hours.

On the last failure, Task Manager was saying I have 0% CPU usage and -1% memory usage... Resource Monitor was more reasonable and said 10-15% CPU and 31% RAM. Network utilization was below 100 Kbps and disk usage below 100 KB/s. I couldn't ping anything with cmd or PowerShell. The command simply does not run; the cursor blinks indefinitely (at least for several hours).

Yet I could remotely connect to the machine via ScreenConnect and access the shared network folders on the server from another PC on the same network. The server is also pretty slow compared to normal, while other VMs on the same host with fewer resources are running fine. Also, while I was able to connect remotely via ScreenConnect, I could not with the Hyper-V Manager connection. The window opened and I could interact with the system (click, type), but the image on screen didn’t change, as if the monitor were turned off.

Event viewer had a few different errors:

System log:  

Microsoft-Windows-GroupPolicy: The processing of Group Policy failed. Windows could not obtain the name of a domain controller. This could be caused by a name resolution failure. Verify your Domain Name System (DNS) is configured and working correctly.

Tcpip: A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use.

Some Googling led me to a post about registry edits to increase TcpTimedWaitDelay, MaxUserPort, TcpNumConnections, and TcpMaxDataRetransmissions. I increased all of those, and then the issue changed. I no longer get the same Tcpip error but now get AFD errors like this:

Closing a UDP socket with local port number 15757 in process 1280 is taking longer than expected. The local port number may not be available until the close operation is completed. This happens typically due to misbehaving network drivers. Ensure latest updates are installed for Windows and any third-party networking software including NIC drivers, firewalls, or other security products.  

So I made sure the system is completely patched and reinstalled the Hyper-V Integration Services, but I am still having the same issues.
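
For anyone hitting the same Tcpip and AFD events, one way to see which process is holding ports is to group the open endpoints by owning process. This is a minimal sketch and was not run in this thread. It assumes the NetTCPIP cmdlets that ship with Server 2016, and that the commands are run before the machine reaches the state where network tools hang.

```powershell
# Show the dynamic (ephemeral) port ranges currently configured for TCP and UDP.
netsh int ipv4 show dynamicport tcp
netsh int ipv4 show dynamicport udp

# Count UDP endpoints per owning process, largest first.
Get-NetUDPEndpoint |
    Group-Object -Property OwningProcess |
    Sort-Object -Property Count -Descending |
    Select-Object -First 10 -Property Count,
        @{ Name = 'PID';     Expression = { $_.Name } },
        @{ Name = 'Process'; Expression = { (Get-Process -Id $_.Name -ErrorAction SilentlyContinue).ProcessName } }

# Same breakdown for TCP connections, split by state (for example TimeWait or Established).
Get-NetTCPConnection |
    Group-Object -Property OwningProcess, State |
    Sort-Object -Property Count -Descending |
    Select-Object -First 10 -Property Count, Name
```

If one process owns thousands of endpoints, that is a likely candidate for the port exhaustion. The PID named in an AFD event (1280 in the example above) can be matched the same way with `Get-Process -Id`, as long as the machine has not been rebooted since the event was logged.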

Rebooting the computer returns it to normal. I can’t safely reboot the computer once it starts giving issues. It hangs on closing some process (not consistent) during shutdown and I have to either power off or reset the VM. If the server is operating properly, a reboot works like it should with no hangs.

Any and all suggestions would be greatly appreciated! This issue is driving me up a wall.
Philip Elder

Broadcom Gigabit NICs bound to the virtual switch? DISABLE Virtual Machine Queues for _all_ physical ports in the Broadcom driver.

I have an EE article that may have additional guidance: Some Hyper-V Hardware and Software Best Practices.
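
The VMQ state can also be checked and changed from PowerShell on the host rather than in the driver properties dialog. This is a minimal sketch, assuming the NetAdapter module cmdlets are available on the host; `<adapter name>` is a placeholder, not a name from this thread.

```powershell
# List the VMQ state of every VMQ-capable physical adapter on the host.
Get-NetAdapterVmq | Format-Table -Property Name, InterfaceDescription, Enabled -AutoSize

# Disable VMQ on a single adapter. Replace the placeholder with a name reported by Get-NetAdapter.
Disable-NetAdapterVmq -Name '<adapter name>'

# Confirm the change.
Get-NetAdapterVmq -Name '<adapter name>'
```

Changing VMQ may briefly interrupt traffic on that adapter, so it is safer to do this in a maintenance window.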
lanmastersINC

ASKER

I heard about an issue with Broadcom NICs; fortunately (and unfortunately) I have an Intel NIC in this host. I'll take a deeper look at that link. From a cursory glance, I have already done a few things on that list.
On the host, please run the following in PowerShell and post the results:

Get-NetAdapter
Get-NetLbfoTeam
Get-VMSwitch

I suggest copying and pasting into a TXT file and attaching that to your reply please.
PS C:\Users\Administrator> Get-NetAdapter

Name                      InterfaceDescription                    ifIndex Status       MacAddress             LinkSpeed
----                      --------------------                    ------- ------       ----------             ---------
vEthernet (Virtual Swi... Hyper-V Virtual Ethernet Adapter #3          21 Up           00-1E-67-RE-DA-CT        10 Gbps
vEthernet (Virtual Swi... Hyper-V Virtual Ethernet Adapter #2          18 Up           A0-36-9F-RE-DA-CT        10 Gbps
Ethernet 3 Top            Intel(R) I350 Gigabit Network Conn...#2      14 Disconnected 00-1E-67-RE-DA-CT          0 bps
Ethernet 2 Bottom         Intel(R) I350 Gigabit Network Connec...      13 Disconnected 00-1E-67-RE-DA-CT          0 bps
Ethernet 10G Intel        Intel(R) Ethernet Converged Network ...      12 Up           A0-36-9F-RE-DA-CT        10 Gbps


PS C:\Users\Administrator> Get-NetLbfoTeam
PS C:\Users\Administrator> Get-VMSwitch

Name                       SwitchType NetAdapterInterfaceDescription
----                       ---------- ------------------------------
Virtual Switch to 10G Port External   Intel(R) Ethernet Converged Network Adapter X540-T1
Virtual Switch to 1G Port  External   Intel(R) I350 Gigabit Network Connection
PORT2bottom-Testing        External   Intel(R) I350 Gigabit Network Connection #2
PS-ExpertsExg-1.txt
ASKER CERTIFIED SOLUTION
Philip Elder
This solution is only available to members.
I reviewed your article and made some changes. I adjusted some vCPU allotments so they are more reasonable now. It's a dual-CPU host with 8 cores total (4 cores each), so I assigned 2 vCPUs to each of the 4 VMs running on the host.

I also modified the NICs. I stopped sharing the virtual NIC on the Intel 10G with the host for management and made it exclusive to the vSwitch. I also teamed the 2 integrated 1G NICs for management.

I'll monitor for any changes through the rest of the week.
NIC-Changes.txt
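
The thread does not show the exact commands used for these NIC changes. As a rough sketch of what they can look like in PowerShell, the following assumes the switch and adapter names from the earlier `Get-VMSwitch` and `Get-NetAdapter` output, that the two 1G adapters are no longer bound to a virtual switch, and a made-up team name `MgmtTeam`.

```powershell
# Stop sharing the 10G virtual switch with the host OS (removes the host's vEthernet adapter on it).
# Run this from a console session; a remote session over that adapter would drop.
Set-VMSwitch -Name 'Virtual Switch to 10G Port' -AllowManagementOS $false

# Team the two onboard 1G ports for host management. 'MgmtTeam' is a hypothetical name.
New-NetLbfoTeam -Name 'MgmtTeam' `
    -TeamMembers 'Ethernet 2 Bottom', 'Ethernet 3 Top' `
    -TeamingMode SwitchIndependent `
    -LoadBalancingAlgorithm Dynamic

# Verify the result.
Get-NetLbfoTeam
Get-VMSwitch | Format-Table -Property Name, SwitchType, AllowManagementOS -AutoSize
```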
After the long weekend the issue still remains. Nothing has changed in terms of errors or frequency, though.
Any A/V client on the host or guest? Remove it.
Yes, we run ESET File Protection. I disabled it a while back and still saw the issue, but the config has changed since then, so I'll do that again and report back.

Thanks for helping!
Still no luck so far. But we have decided to kill the server and replace it. Pretty simple, it being a VM and all.

I appreciate you helping troubleshoot this! Thank you.
We decided to replace the server with a fresh Windows install. I'm unsure what was actually going wrong, but these suggestions helped. The best practices article is great!