Server Hardware
The problem manifests a little differently each time, but the root cause seems to be one of the virtual machines freezing up, becoming sluggish, or losing network share accessibility, often taking the host and the other virtual servers down with it.
We initially had problems with two T420s ordered back in August 2012, another T420 in November 2012, and now a T620 just arrived March 2013.
I’ll provide as much detail about the environment and attempted fixes as possible, because we’re at a loss as to what to do next. The current-generation PowerEdge R720 servers do not seem to have the same problems even though they’re set up exactly the same way. Previous-generation PowerEdge 2900, T710, and T610 servers all work fine when the virtual servers are moved to them as a loaner Hyper-V host.
HARDWARE:
Problem Server 1: Dell PowerEdge T420 – Dual Xeon E5-2430 2.2GHz/15M cache/6-core – 24GB RAM (4 x 8GB RDIMM 1333MHz) – 4 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs – PERC H310 RAID5 – On-board Broadcom 5720 Dual Port
Problem Server 2: Dell PowerEdge T420 – Dual Xeon E5-2430 2.2GHz/15M cache/6-core – 24GB RAM (4 x 8GB RDIMM 1333MHz) – 4 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs – PERC H310 RAID5 (same exact config)
Problem Server 3: Dell PowerEdge T420 – Dual Xeon E5-2430 2.2GHz/15M cache/6-core – 24GB RAM (4 x 8GB RDIMM 1333MHz) – 4 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs and 2 x 2TB 7200 SATA 3.5” Hot-Plug HDDs – PERC H310 RAID5/RAID1 – On-board Broadcom 5720 Dual Port
Problem Server 4: Dell PowerEdge T620 – Dual Xeon E5-2620 2.0GHz/15M cache/6-core – 32GB RAM (2 x 16GB RDIMM 1333MHz) – 2 x 300GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs and 6 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs – PERC H710P RAID1/RAID5 – Broadcom 5719 QP and Intel i350 DP NICs
OPERATING SYSTEM on Hyper-V Host:
- Microsoft Windows Server 2008 R2 w/ GUI Standard and Enterprise
- Microsoft Windows Server 2012 w/ GUI Standard and Datacenter
- 100GB C: partition for OS on SAS RAID5 volumes
- Loaded via Dell OpenManage DVD
- Loaded via Dell LifeCycle Controller boot option
- Loaded straight from Windows media DVD
- Loaded OEM from Dell factory
- All MS updates installed prior to adding Hyper-V role
- Dell BIOS/Firmware/Drivers all updated using latest SUU DVD available and/or support.dell.com drivers
- Latest Dell OpenManage installed to view hardware issues, etc
OPERATING SYSTEM on Virtual Servers:
- Microsoft Windows Server 2008 R2 w/ GUI Standard and Enterprise
- Microsoft Windows Server 2012 w/ GUI Standard and Datacenter
- 100GB C: partition for OS on first virtual IDE controller
- Loaded straight from Windows media ISO
- Additional VHDs and VHDXs both on IDE controllers and SCSI interface
- All MS updates installed
- Hyper-V Integration Services installed and up to date
(NOTE: The options above reflect our various attempts at fixing the problem; we’ve done several wipe/reloads and fresh installs of Windows during our troubleshooting saga.)
NEVER any antivirus on the host or virtual servers
Virtual servers backed up within their OS via StorageCraft ShadowProtect (and also without any backup for testing)
Virtual servers all loaded fresh, not P2V’d from old servers
Misc virtual servers running typical line-of-business apps, file server, print server, small SQL servers, SBS 2011, Active Directory, Networking (DNS/DHCP/RRAS/WINS,etc) (NOTE: These VMs can be moved to loaner servers as-is and then work fine)
HYPER-V Networking:
- NIC 1 dedicated to host machine
- NIC 2 selected via Hyper-V setup as dedicated NIC and setup as first virtual switch
- Broadcom NICs using Dell Broadcom and native Microsoft drivers
- Intel NICs using Dell Intel drivers
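For reference, the dedicated-NIC virtual switch described above can be created from PowerShell on Server 2012 (on 2008 R2 this is done through Hyper-V Manager's Virtual Network Manager GUI instead). This is only a sketch; the switch and adapter names are placeholders, not our actual config:

```shell
# List physical NICs to confirm the name of the adapter to bind the switch to
Get-NetAdapter

# Create an external virtual switch bound to the dedicated NIC.
# -AllowManagementOS $false keeps the host off this NIC, matching the
# "NIC 1 dedicated to host, NIC 2 dedicated to the virtual switch" layout.
New-VMSwitch -Name "External-vSwitch" -NetAdapterName "NIC2" -AllowManagementOS $false
```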
TROUBLESHOOTING ATTEMPTS:
- Numerous wipe/reloads of the Windows operating system on both host and VMs, both Server 2008 R2 and Server 2012
- Virtual disks using both VHD and VHDX formats, both static and dynamically sized
- Additional data VHD(X)s added to VMs onto the virtual IDE and SCSI controllers within Hyper-V manager
- VMs using both Dynamic RAM and static RAM settings
- Turning off NUMA spanning in Hyper-V settings
- Turning off Virtual Machine Queues (VMQ) in Hyper-V settings
- Turning off VMQ on physical NIC advanced settings
- Turning off TCP/IP offloading on physical NIC advanced settings
- Disable Power Saving in NIC settings
- Using Microsoft drivers for Broadcom NICs as well as Broadcom Drivers
- Disabled on-board Broadcom NICs and tried add-in Intel PCIe NIC
- Having host VSS snapshots (MS Shadow Copies) disabled and enabled on default 7am/12pm schedule
- Having VM VSS snapshots disabled and enabled on default 7am/12pm schedule
- Removed an added USB 3.0 PCIe card in the first two servers, never added on last two
- First two servers got better after replacing ordered PERC H310 RAID controller with more powerful PERC H710 RAID card
- Third server replaced H310 with PERC H710P
- Fourth server ordered with PERC H710P controller
- Dell replaced motherboards, CPUs, RAM, backplanes, etc on the first problem server
- Ensured PERC virtual disk (physical) policies set per recommended settings:
- Read Policy: Adaptive Read Ahead (already set)
- Write Policy: Write Back (already set)
- Disk Cache Policy: Disabled (already set)
- Ran numerous Dell DSET and MS Product Support Reports for Dell support to analyze
- First two servers seem to be running OK after replacing the PERC H310 with the PERC H710
- Third server client is running a couple VMs on it, but file server and SBS server VMs on loaner T710
- Fourth server just installed recently, ran for about 1 week with no problems and DC/file server locked up Friday and Monday
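The PERC virtual disk policies listed above can be checked and set with the OpenManage Server Administrator CLI rather than the GUI. A hedged sketch follows; the controller and vdisk IDs are placeholders for our layout, so confirm yours with omreport first:

```shell
# Show current virtual disk policies on controller 0
omreport storage vdisk controller=0

# Apply the recommended policies (controller/vdisk IDs are placeholders)
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 readpolicy=ara
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 writepolicy=wb
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 diskcachepolicy=disabled
```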
CRASH / FREEZE / LOCK-UP BEHAVIOR:
- The problem occurs very randomly. It can run fine for 1 to 3 weeks and then return suddenly.
- It does seem to happen during the business day only, under load from the client PCs
- It seems to be most commonly the server with the file shares on it, accessible via mapped drive letter(s)
- We were able to replicate the problem with one of the T420s in our shop with 3 laptops connected via gigabit switch
- Having redirected folders (Desktop and My Docs) via GPO seems to really increase the frequency of the problem
- No relevant Windows Event Viewer entries are created, on the host or the VM (nothing to track down)
- Sometimes just the file sharing server loses its SMB shares, not accessible from client PCs or itself using \\servername
- Sometimes the locked-up VM still responds to PINGs while its shares are not accessible
- Sometimes the VM is completely locked up and doesn’t respond
- Sometimes the VM will perform a safe shutdown, although sluggish, other times we must use Hyper-V Power Off option
- Sometimes it takes the other VMs with it
- The other VMs may slow down, become unresponsive, or be responsive but not able to perform a safe shutdown via Hyper-V
- The host may act fine, but other times get very sluggish also
- The host sometimes will perform a safe shutdown, other times must be forced down via power button or remote DRAC
- The host sometimes recovers itself after 5 to 10 minutes
- The VMs work perfectly fine when moved from the PowerEdge Tx20 generation server to previous Dell PE generation servers
- We’ve installed about 5 rack PowerEdge R720s in the same manner without any problems, it just seems to be the towers
- Hyper-V on the host with misc VMs running on it is our typical setup… we’re not reinventing anything or creating funky setups
GENERAL DIRECTION / RESOLUTION THOUGHTS:
- After the problems on the first two T420s were fixed by swapping the cheaper PERC H310 for the PERC H710, we thought we were in the clear
- After the problems with the third T420 we made sure to never again select the PERC H310, but the more powerful H710
- Having the better RAID card definitely makes the Hyper-V host more resilient to completely locking up
- So we ordered a T620 thinking it was some sort of motherboard, backplane, or architecture issue on the T420s
- We did have a R720 that had slow network performance, high PING returns to its VMs but turning off VMQ on the NIC fixed that
- Misc app servers running SQL or other databases don’t seem to be as affected as ones with mapped drives or redirection
- Because of the file share problems, our best guess is it’s some obscure NIC setting, driver, network-related OS option, registry key, etc.
- It could be a disk throughput problem, but with fast SAS drives and the better RAID card we don’t think so
Sorry for such a long post, but I wanted to provide as much info as possible. We’re desperate because these servers have gone to new clients to whom we recommended replacing their old slow servers with new hardware, promising better reliability. Our reputation and trust are severely damaged with most of these clients. The first two T420s have been running well for about two months, but we’re still fighting the last T420 and the new T620.
Thanks,
Tim Jackson
We did have Performance Monitor running during the last lockup of the T620. We saw a spike in the PhysicalDisk % Disk Read Time counter just before it. In this instance, the server VMs recovered after about 5 minutes of being unavailable.
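For anyone wanting to reproduce that capture, here is a sketch of logging the same disk counters with the built-in typeperf tool from an elevated command prompt; the 5-second interval and output filename are arbitrary choices, not what we actually used:

```shell
# Log PhysicalDisk read-time and queue counters every 5 seconds to a CSV
typeperf "\PhysicalDisk(_Total)\% Disk Read Time" "\PhysicalDisk(_Total)\Avg. Disk sec/Read" "\PhysicalDisk(_Total)\Current Disk Queue Length" -si 5 -o disk_perf.csv
```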
Any other thoughts?
http://social.technet.microsoft.com/wiki/contents/articles/15576.hyper-v-update-list-for-windows-server-2012.aspx
http://social.technet.microsoft.com/wiki/contents/articles/1349.hyper-v-update-list-for-windows-server-2008-r2.aspx
Are you using SR-IOV? If so, those Dells have a specific BIOS setting that needs to be enabled AND you need the latest NIC drivers to support that feature on the host and the guest (since the hardware is passed through).
Only two guest OSes currently support SR-IOV – Server 2012 and Windows 8. Unless you have a very specific reason to use it, it might be more beneficial to uncheck the box (if checked) in Virtual Switch Manager and disable it in the BIOS.
In the BIOS, under the Integrated Devices screen, there is a global setting "SR-IOV Global Enable" – if you’re using SR-IOV, enable it and find supporting drivers (Dell does not have the latest Broadcom drivers on their site – or at least didn’t a month ago). My opinion is to disable it and not use the feature until all of your guests support it.
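On Server 2012 you can check SR-IOV state from PowerShell before touching the BIOS. A hedged sketch; note that a switch's IovEnabled property can only be set at creation time, so disabling SR-IOV on an existing switch means recreating it:

```shell
# Per-NIC SR-IOV capability and current state
Get-NetAdapterSriov

# Whether each virtual switch was created with SR-IOV, and why it may not be working
Get-VMSwitch | Format-List Name, IovEnabled, IovSupport, IovSupportReasons
```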
I had issues with my 2012 Hyper-V install and VMs being very slow (disk-wise), and I had to go directly to the controller manufacturer's site (Intel) and use their drivers to fix that.
I suspect it may be I/O related but it's hard to tell based on what you explain.
Are there any Event Log entries on the VMs or host that correlate to the times you experience problems? It may help pinpoint an area to focus our attention on.
I assume you're using 64-bit OSes?
Maybe you could run a memory test - is this OEM memory from Dell?
Yes, all 64-bit OS to be able to access memory above 3.5GB.
Yes, memory tests have been run by Dell support, and it's all Dell factory-installed RAM.
Do you have Virtualization and DEP enabled in hardware?
We're very experienced with Hyper-V. We've been using this setup since PowerEdge 2900/2950 generation. The VMs work fine on all older models of Dell PowerEdge servers.
Reach out to me at my alias here at gmail.
I have to step out right now for a few hours, but I'll get back to you then.
Based on your results with the RAID controller upgrades improving the situation I feel like it's a controller problem.
Are you running SAS or SATA drives?
Have you back-dated firmware to see if a previous version of the controller firmware alleviates the symptoms? I'm running only the latest firmware, but I've considered backdating.
I'm glad you chimed in... I couldn't believe we were the only ones having this problem.
On the latest T620 server... we purchased an R720 for the client, moved the VMs... and all is running fine. However, prior to the switch we did turn OFF Microsoft Shadow Copies on the host Hyper-V volumes. That seemed to lessen the problem, but I don't think it eliminated it.
We'll soon have the T620 back in the shop for testing, but we really can't replicate the load the actual client put on it.
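For anyone replicating the Shadow Copies change, here is a hedged sketch using the built-in vssadmin tool on the host (run elevated; the drive letter is a placeholder). The per-volume snapshot schedule itself is disabled from the volume's Shadow Copies properties tab or Task Scheduler, not from vssadmin:

```shell
# Inventory existing host-level shadow copies and their diff-area storage
vssadmin list shadows
vssadmin list shadowstorage

# Remove all existing shadow copies for a given volume (placeholder drive letter)
vssadmin delete shadows /for=D: /all /quiet
```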
My server is also running SAS drives, which of course means drive operation is RAID controller overhead. Regardless of 10K or 7200 RPM drives, I can still create the behavior with a high-I/O operation. The performance impact is unacceptable. I've never seen this kind of issue before, and I've deployed several OEM Server 2012 Hyper-V hosts, some of which are running a couple of VMs on 7200 RPM RAID5 arrays. The most glaring difference is SATA drives and an Areca RAID card, not a Dell PERC. I typically don't work with Dell server hardware, and now I know why.
They recently switched from a battery backup to a persistent memory cache on the RAID cards (read: a low-quality SSD) for dirty-shutdown scenarios.
If you haven't already done so, definitely push the newest firmware to the PERC and mainboard (data backups recommended prior to flashing). It is also helpful to rule out subsystem and drive issues, so you may want to check each drive for errors.
I would suspect, based on the experiences thus far, that the performance of the persistent cache is below that of the rest of the PERC, which may be resolved by firmware.
I will report back when I have a 100% fix for this issue. I will mention that Dell replaced our PERC and it did make significant differences in how the issue is handled; at the very least we now have proper error reporting of the issue, which is a step in the right direction.
Lockups of this variety (under high I/O and low CPU) are almost exclusively the RAID card, the RAID subsystem (backplane/multiplexer), or a really faulty individual drive (typically with a failing controller board).
Are you confident, after all of your troubleshooting, that the T620 platform is stable and the issue is indeed related to the MS Shadow Copies? The other server we put onsite has the same base install of 2008 R2 on it; if MS Shadow Copies is turned on, do you think we need to be concerned? I was following this case because it so closely aligns with the issues we've seen all week. I'd love to hear your feedback.
My conclusion for the issue is that during a large data, heavy I/O operation the system functioned normally up until the point that the dynamic VHDX needed to expand to allow for more space. At this point, the system needed to map out and allocate the physical locations on the drive for the remaining data to be written. Since the system had SAS drives, this is a process handled entirely by the RAID controller - and, since the PERC was defective, this blew out the entire system, causing everything on those drives to be completely unresponsive for an extended period of time. With a replaced PERC, this allocation process was shortened to a very small amount of time, then completely eliminated altogether with a fixed-size VHDX.
What's most interesting about this is that I suspect the issue would not exist with SATA drives, and in most drive operations the allocation process for new writes would be small, as it would allocate space for single files, not a massive stream of backup data. Ultimately, in my case, the hardware was to blame. It seems that the PERCs with SAS drives are not as robust as they should be for I/O-intensive applications, but a combination of minor reconfiguration and replacement of the bad hardware means the system is ready for production.
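The dynamic-to-fixed change described above can be scripted on Server 2012; a hedged sketch with placeholder paths (the VM must be shut down first, and you need enough free space for the full-size copy):

```shell
# Convert a dynamically expanding VHDX to a fixed-size copy (paths are placeholders)
Convert-VHD -Path "D:\VMs\fileserver.vhdx" -DestinationPath "D:\VMs\fileserver-fixed.vhdx" -VHDType Fixed
```

After re-pointing the VM's disk at the new file and confirming it boots, the original dynamic VHDX can be deleted.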
I have a T620 using SATA drives, and none of my disks are in a RAID configuration. I have been seeing lots of Disk and NTFS errors in Event Viewer. I too have run the Dell diagnostics and had them report no errors. I am also not using any VMs, but I run 2 SQL DBs on this SBS 2011 install (a migration). RAM load is high, CPU usage is low, and we do not have a lot of data I/O going on. Wednesday's crash damaged the Exchange store, which was successfully repaired, but today's crash has caused more issues with EX2010.
It looks like I will need to contact Dell about some of the hardware and definitely firmware and drivers.
One definite fix is to disable Virtual Machine Queues (VMQ) in the physical NIC properties. Also download the latest Dell SUU (Server Update Utility) and apply all firmware/driver/BIOS updates. The current ISO is version 14.3; the last was v7.3, but I think Dell moved to a year/month version numbering scheme.
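On Server 2012 the VMQ fix can be applied from PowerShell instead of the NIC's Advanced properties tab (2008 R2 lacks these cmdlets). A hedged sketch; the wildcard is a placeholder, so scope it to the NICs actually bound to your virtual switches:

```shell
# Check which adapters currently have VMQ enabled
Get-NetAdapterVmq

# Disable VMQ on all physical NICs and bounce them so the change takes effect
Disable-NetAdapterVmq -Name "*"
Restart-NetAdapter -Name "*"
```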
Have been through three complete loops of updating drivers and firmware, then updating again, and again as the months passed.
Event logs at the host and VM level are clean, the VMs just become so slow that they are functionally offline.
Before we turned off Shadow Copies and limited backups to only use the Hyper-V VSS Writer, the VMs would not recover from being so sluggish until they were rebooted, which could take hours.
We've seen the read queue on the RAID5 array as high as 29 seconds during backup. Dell is starting to posture with "Your use may exceed the design intent of these servers." Never mind that we're just trying to make a backup using the embedded backup software in Windows Server 2012.
We referred Dell to this thread as evidence that others were having the same issue, but they claim they cannot find any support case(s) for your problem.
Also, we have in-place upgraded the 2012 servers to 2012 R2. Not sure if that actually fixed anything, but it made me feel better being on the latest OS.
Dell tried to blame the USB 3.0 controllers we installed to facilitate faster backups to external drives (even though the bottleneck was clearly at the read queue on the RAID5 array), but the problem manifested when we ran a backup to the RAID5 array on the server itself. Then they claimed that test overloaded the array, even though they asked for it.
They are shipping a replacement T620 now, but I'm not optimistic that more of the same will lead to a different result.
Of course, the obvious failure here is that current-generation PowerEdge servers should have USB 3.0 built in already!
I've mentioned that several times. I also offered to install an "approved" USB 3.0 card of their choosing. They told me yesterday that none are tested/approved.
They want to test a backup to a USB 2.0 connected drive or a shared folder on another server, but I suspect that would shift the bottleneck to the write side and cover up the problem.
FYI, we've had a paid PSS case open with Microsoft since last fall as well. They've been worse than useless; I can't even get them to return emails or call me.
Needless to say we will not be using Dell T series servers again.
Our next two customer migrations were to HP servers and went just fine.