Server Hardware
The problem manifests a little differently each time, but the root cause seems to be one of the virtual machines freezing up, becoming sluggish, or losing network share accessibility, often taking the host and the other virtual servers down with it.
We initially had problems with two T420s ordered back in August 2012, another T420 in November 2012, and now a T620 just arrived March 2013.
I’ll provide as much detail about the environment and attempted fixes as possible, because we’re at a loss as to what to do next. The current-generation PowerEdge R720 servers do not seem to have the same problems even though they’re set up exactly the same way. Previous-generation PowerEdge 2900, T710, and T610 servers all work fine when the virtual servers are moved to them as a loaner Hyper-V host.
HARDWARE:
Problem Server 1: Dell PowerEdge T420 – Dual Xeon E5-2430 2.2GHz/15M cache/6-core – 24GB RAM (4 x 8GB RDIMM 1333MHz) – 4 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs – PERC H310 RAID5 – On-board Broadcom 5720 Dual Port
Problem Server 2: Dell PowerEdge T420 – Dual Xeon E5-2430 2.2GHz/15M cache/6-core – 24GB RAM (4 x 8GB RDIMM 1333MHz) – 4 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs – PERC H310 RAID5 (same exact config)
Problem Server 3: Dell PowerEdge T420 – Dual Xeon E5-2430 2.2GHz/15M cache/6-core – 24GB RAM (4 x 8GB RDIMM 1333MHz) – 4 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs and 2 x 2TB 7200 SATA 3.5” Hot-Plug HDDs – PERC H310 RAID5/RAID1 – On-board Broadcom 5720 Dual Port
Problem Server 4: Dell PowerEdge T620 – Dual Xeon E5-2620 2.0GHz/15M cache/6-core – 32GB RAM (2 x 16GB RDIMM 1333MHz) – 2 x 300GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs and 6 x 600GB 15K SAS 6Gbps 3.5” Hot-Plug HDDs – PERC H710P RAID1/RAID5 – Broadcom 5719 QP and Intel i350 DP NICs
OPERATING SYSTEM on Hyper-V Host:
- Microsoft Windows Server 2008 R2 w/ GUI Standard and Enterprise
- Microsoft Windows Server 2012 w/ GUI Standard and Datacenter
- 100GB C: partition for OS on SAS RAID5 volumes
- Loaded via Dell OpenManage DVD
- Loaded via Dell LifeCycle Controller boot option
- Loaded straight from Windows media DVD
- Loaded OEM from Dell factory
- All MS updates installed prior to adding Hyper-V role
- Dell BIOS/Firmware/Drivers all updated using latest SUU DVD available and/or support.dell.com drivers
- Latest Dell OpenManage installed to view hardware issues, etc
OPERATING SYSTEM on Virtual Servers:
- Microsoft Windows Server 2008 R2 w/ GUI Standard and Enterprise
- Microsoft Windows Server 2012 w/ GUI Standard and Datacenter
- 100GB C: partition for OS on first virtual IDE controller
- Loaded straight from Windows media ISO
- Additional VHDs and VHDXs both on IDE controllers and SCSI interface
- All MS updates installed
- Hyper-V Integration Services installed and up to date
(NOTE: The options above reflect our various attempts at fixing the problem; we’ve done several wipe/reloads and fresh installs of Windows during our troubleshooting saga.)
NEVER any antivirus on the host or virtual servers
Virtual servers backed up within their OS via StorageCraft ShadowProtect (and also without any backup for testing)
Virtual servers all loaded fresh, not P2V’d from old servers
Misc virtual servers running typical line-of-business apps, file server, print server, small SQL servers, SBS 2011, Active Directory, Networking (DNS/DHCP/RRAS/WINS,etc) (NOTE: These VMs can be moved to loaner servers as-is and then work fine)
HYPER-V Networking:
- NIC 1 dedicated to host machine
- NIC 2 selected via Hyper-V setup as dedicated NIC and setup as first virtual switch
- Broadcom NICs using Dell Broadcom and native Microsoft drivers
- Intel NICs using Dell Intel drivers
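For reference, the dedicated-NIC virtual switch described above can be created from PowerShell on Server 2012 (on 2008 R2 this is done through Hyper-V Manager's Virtual Network Manager GUI instead). This is only a sketch; the switch and adapter names are placeholders, not our actual config:

```shell
# List physical NICs to confirm the name of the adapter to bind the switch to
Get-NetAdapter

# Create an external virtual switch bound to the dedicated NIC.
# -AllowManagementOS $false keeps the host off this NIC, matching the
# "NIC 1 dedicated to host, NIC 2 dedicated to the virtual switch" layout.
New-VMSwitch -Name "External-vSwitch" -NetAdapterName "NIC2" -AllowManagementOS $false
```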
TROUBLESHOOTING ATTEMPTS:
- Numerous wipe/reloads of the Windows operating system on both host and VMs, both Server 2008 R2 and Server 2012
- Virtual disks using both VHD and VHDX formats, both static and dynamically sized
- Additional data VHD(X)s added to VMs onto the virtual IDE and SCSI controllers within Hyper-V manager
- VMs using both Dynamic RAM and static RAM settings
- Turning off NUMA spanning in Hyper-V settings
- Turning off Virtual Machine Queues (VMQ) in Hyper-V settings
- Turning off VMQ on physical NIC advanced settings
- Turning off TCP/IP offloading on physical NIC advanced settings
- Disable Power Saving in NIC settings
- Using Microsoft drivers for Broadcom NICs as well as Broadcom Drivers
- Disabled on-board Broadcom NICs and tried add-in Intel PCIe NIC
- Having host VSS snapshots (MS Shadow Copies) disabled and enabled on default 7am/12pm schedule
- Having VM VSS snapshots disabled and enabled on default 7am/12pm schedule
- Removed an added USB 3.0 PCIe card in the first two servers, never added on last two
- First two servers got better after replacing ordered PERC H310 RAID controller with more powerful PERC H710 RAID card
- Third server replaced H310 with PERC H710P
- Fourth server ordered with PERC H710P controller
- Dell replaced motherboards, CPUs, RAM, backplanes, etc on the first problem server
- Ensured PERC virtual disk (physical) policies set per recommended settings:
- Read Policy: Adaptive Read Ahead (already set)
- Write Policy: Write Back (already set)
- Disk Cache Policy: Disabled (already set)
- Ran numerous Dell DSET and MS Product Support Reports for Dell support to analyze
- First two servers seem to be running OK after replacing the PERC H310 with the PERC H710
- Third server client is running a couple VMs on it, but file server and SBS server VMs on loaner T710
- Fourth server just installed recently, ran for about 1 week with no problems and DC/file server locked up Friday and Monday
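The PERC virtual disk policies listed above can be checked and set with the OpenManage Server Administrator CLI rather than the GUI. A hedged sketch follows; the controller and vdisk IDs are placeholders for our layout, so confirm yours with omreport first:

```shell
# Show current virtual disk policies on controller 0
omreport storage vdisk controller=0

# Apply the recommended policies (controller/vdisk IDs are placeholders)
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 readpolicy=ara
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 writepolicy=wb
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 diskcachepolicy=disabled
```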
CRASH / FREEZE / LOCK-UP BEHAVIOR:
- The problem occurs very randomly. It can run fine for 1 to 3 weeks and then return suddenly.
- It does seem to happen during the business day only, under load from the client PCs
- It seems to be most commonly the server with the file shares on it, accessible via mapped drive letter(s)
- We were able to replicate the problem with one of the T420s in our shop with 3 laptops connected via gigabit switch
- Having redirected folders (Desktop and My Docs) via GPO seems to really increase the frequency of the problem
- No relevant Windows Event Viewer entries are created, on the host or the VM (nothing to track down)
- Sometimes just the file sharing server loses its SMB shares, not accessible from client PCs or itself using \\servername
- Sometimes the locked-up VM still responds to PINGs while its shares are not accessible
- Sometimes the VM is completely locked up and doesn’t respond
- Sometimes the VM will perform a safe shutdown, although sluggish, other times we must use Hyper-V Power Off option
- Sometimes it takes the other VMs with it
- The other VMs may slow down, become unresponsive, or be responsive but not able to perform a safe shutdown via Hyper-V
- The host may act fine, but other times get very sluggish also
- The host sometimes will perform a safe shutdown, other times must be forced down via power button or remote DRAC
- The host sometimes recovers itself after 5 to 10 minutes
- The VMs work perfectly fine when moved from the PowerEdge Tx20 generation server to previous Dell PE generation servers
- We’ve installed about 5 rack PowerEdge R720s in the same manner without any problems, it just seems to be the towers
- Hyper-V on the host with misc VMs running on it is our typical setup… we’re not reinventing anything or creating funky setups
GENERAL DIRECTION / RESOLUTION THOUGHTS:
- After the problems on the first two T420s were fixed by swapping the cheaper PERC H310 for the PERC H710, we thought we were in the clear
- After the problems with the third T420 we made sure to never again select the PERC H310, but the more powerful H710
- Having the better RAID card definitely makes the Hyper-V host more resilient to completely locking up
- So we ordered a T620 thinking it was some sort of motherboard, backplane, or architecture issue on the T420s
- We did have a R720 that had slow network performance, high PING returns to its VMs but turning off VMQ on the NIC fixed that
- Misc app servers running SQL or other databases don’t seem to be as affected as ones with mapped drives or redirection
- Because of the file share problems, our best guess is it’s some obscure NIC setting, driver, network-related OS option, registry key, etc.
- It could be a disk throughput problem, but with fast SAS drives and the better RAID card we don’t think so
Sorry for such a long post, but I wanted to provide as much info as possible. We’re desperate because these servers have gone to new clients to whom we recommended replacing their old slow servers with new hardware, promising better reliability. Our reputation and trust are severely damaged with most of these clients. The first two T420s have been running well for about two months, but we’re still fighting the last T420 and the new T620.
Thanks,
Tim Jackson
We did have Performance Monitor running during the last lockup of the T620. We saw a spike in the PhysicalDisk % Disk Read Time counter just before it. In this instance, the server VMs recovered after about 5 minutes of being unavailable.
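For anyone wanting to reproduce that capture, here is a sketch of logging the same disk counters with the built-in typeperf tool from an elevated command prompt; the 5-second interval and output filename are arbitrary choices, not what we actually used:

```shell
# Log PhysicalDisk read-time and queue counters every 5 seconds to a CSV
typeperf "\PhysicalDisk(_Total)\% Disk Read Time" "\PhysicalDisk(_Total)\Avg. Disk sec/Read" "\PhysicalDisk(_Total)\Current Disk Queue Length" -si 5 -o disk_perf.csv
```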
Any other thoughts?
http://social.technet.microsoft.com/wiki/contents/articles/15576.hyper-v-update-list-for-windows-server-2012.aspx
http://social.technet.microsoft.com/wiki/contents/articles/1349.hyper-v-update-list-for-windows-server-2008-r2.aspx
Are you using SR-IOV? If so, those Dells have a specific BIOS setting that needs to be enabled AND you need the latest NIC drivers to support that feature on the host and the guest (since the hardware is passed through).
Only two guest OSes currently support SR-IOV – Server 2012 and Windows 8. Unless you have a very specific reason to use it, it might be more beneficial to uncheck the box (if checked) in Virtual Switch Manager and disable it in the BIOS.
In the BIOS, under the Integrated Devices screen, there is a global setting "SR-IOV Global Enable" – if you’re using SR-IOV, enable it and find supporting drivers (Dell does not have the latest Broadcom drivers on their site – or at least didn’t a month ago). My opinion is to disable it and not use the feature until all of your guests support it.
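On Server 2012 you can check SR-IOV state from PowerShell before touching the BIOS. A hedged sketch; note that a switch's IovEnabled property can only be set at creation time, so disabling SR-IOV on an existing switch means recreating it:

```shell
# Per-NIC SR-IOV capability and current state
Get-NetAdapterSriov

# Whether each virtual switch was created with SR-IOV, and why it may not be working
Get-VMSwitch | Format-List Name, IovEnabled, IovSupport, IovSupportReasons
```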
I had issues with my 2012 Hyper-V install and VMs being very slow (disk-wise), and I had to go directly to the controller manufacturer's site (Intel) and use their drivers to fix that.
I suspect it may be I/O related but it's hard to tell based on what you explain.
Are there any Event Log entries on the VMs or host that correlate to the times you experience problems? It may help pinpoint an area to focus our attention on.
I assume you're using 64-bit OSes?
Maybe you could run a memory test - is this OEM memory from Dell?
Yes, all 64-bit OS to be able to access memory above 3.5GB.
Yes, memory tests have been run by Dell support, and it's all Dell factory-installed RAM.
Do you have Virtualization and DEP enabled in hardware?
We're very experienced with Hyper-V. We've been using this setup since PowerEdge 2900/2950 generation. The VMs work fine on all older models of Dell PowerEdge servers.
Reach out to me at my alias here at gmail.
I have to step out right now for a few hours, but I'll get back to you then.
Based on your results with the RAID controller upgrades improving the situation I feel like it's a controller problem.
Are you running SAS or SATA drives?
Have you back-dated firmware to see if a previous version of the controller firmware alleviates the symptoms? I'm running only the latest firmware, but I've considered backdating.
I'm glad you chimed in... I couldn't believe we were the only ones having this problem.
On the latest T620 server... we purchased an R720 for the client, moved the VMs... and all is running fine. However, prior to the switch we did turn OFF Microsoft Shadow Copies on the host Hyper-V volumes. That seemed to lessen the problem, but I don't think it eliminated it.
We'll soon have the T620 back in the shop for testing, but we really can't replicate the load the actual client put on it.
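For anyone replicating the Shadow Copies change, here is a hedged sketch using the built-in vssadmin tool on the host (run elevated; the drive letter is a placeholder). The per-volume snapshot schedule itself is disabled from the volume's Shadow Copies properties tab or Task Scheduler, not from vssadmin:

```shell
# Inventory existing host-level shadow copies and their diff-area storage
vssadmin list shadows
vssadmin list shadowstorage

# Remove all existing shadow copies for a given volume (placeholder drive letter)
vssadmin delete shadows /for=D: /all /quiet
```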
My server is also running SAS drives, which of course means drive operation is RAID controller overhead. Regardless of 10K or 7200 RPM drives, I can still create the behavior with a high-I/O operation. The performance impact is unacceptable. I've never seen this kind of issue before, and I've deployed several OEM Server 2012 Hyper-V hosts, some of which are running a couple of VMs on 7200 RPM RAID5 arrays. The most glaring difference is SATA drives and an Areca RAID card, not a Dell PERC. I typically don't work with Dell server hardware, and now I know why.
They recently switched from a battery backup to a persistent memory cache on the RAID cards (read: a low-quality SSD) for dirty-shutdown scenarios.
If you haven't already done so, definitely push the newest firmware to the PERC and mainboard (data backups recommended prior to flashing). It is also helpful to rule out subsystem and drive issues, so you may want to check each drive for errors.
I would suspect, based on the experiences thus far, that the performance of the persistent cache is below that of the rest of the PERC, which may be resolved by firmware.
I will report back when I have a 100% fix for this issue. I will mention that Dell replaced our PERC and it did make significant differences in how the issue is handled; at the very least we now have proper error reporting of the issue, which is a step in the right direction.
Lockups of this variety (under high I/O and low CPU) are almost exclusively the RAID card, the RAID subsystem (backplane/multiplexer), or a really faulty individual drive (typically with a failing controller board).
Are you confident, after all of your troubleshooting, that the T620 platform is stable and the issue is indeed related to the MS Shadow Copies? The other server we put onsite has the same base install of 2008 R2 on it; if MS Shadow Copies is turned on, do you think we need to be concerned? I was following this case because it so closely aligns with the issues we've seen all week. I'd love to hear your feedback.
My conclusion for the issue is that during a large data, heavy I/O operation the system functioned normally up until the point that the dynamic VHDX needed to expand to allow for more space. At this point, the system needed to map out and allocate the physical locations on the drive for the remaining data to be written. Since the system had SAS drives, this is a process handled entirely by the RAID controller - and, since the PERC was defective, this blew out the entire system, causing everything on those drives to be completely unresponsive for an extended period of time. With a replaced PERC, this allocation process was shortened to a very small amount of time, then completely eliminated altogether with a fixed-size VHDX.
What's most interesting about this is that I suspect the issue would not exist with SATA drives, and in most drive operations the allocation process for new writes would be small, as it would allocate space for single files, not a massive stream of backup data. Ultimately, in my case, the hardware was to blame. It seems that the PERCs with SAS drives are not as robust as they should be for I/O-intensive applications, but a combination of minor reconfiguration and replacement of the bad hardware means the system is ready for production.
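The dynamic-to-fixed change described above can be scripted on Server 2012; a hedged sketch with placeholder paths (the VM must be shut down first, and you need enough free space for the full-size copy):

```shell
# Convert a dynamically expanding VHDX to a fixed-size copy (paths are placeholders)
Convert-VHD -Path "D:\VMs\fileserver.vhdx" -DestinationPath "D:\VMs\fileserver-fixed.vhdx" -VHDType Fixed
```

After re-pointing the VM's disk at the new file and confirming it boots, the original dynamic VHDX can be deleted.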
I have a T620 using SATA drives, and none of my disks are in a RAID configuration. I have been seeing lots of Disk and NTFS errors in Event Viewer. I too have run the Dell diagnostics and had them report no errors. I am also not using any VMs, but I run 2 SQL DBs on this SBS 2011 install (a migration). RAM load is high, CPU usage is low, and we do not have a lot of data I/O going on. Wednesday's crash damaged the Exchange store, which was successfully repaired, but today's crash has caused more issues with EX2010.
It looks like I will need to contact Dell about some of the hardware and definitely firmware and drivers.
One definite fix is to disable Virtual Machine Queues (VMQ) in the physical NIC properties. Also download the latest Dell SUU (Server Update Utility) and apply all firmware/driver/BIOS updates. The current ISO is version 14.3; the last was v7.3, but I think Dell moved to a year/month version numbering scheme.
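On Server 2012 the VMQ fix can be applied from PowerShell instead of the NIC's Advanced properties tab (2008 R2 lacks these cmdlets). A hedged sketch; the wildcard is a placeholder, so scope it to the NICs actually bound to your virtual switches:

```shell
# Check which adapters currently have VMQ enabled
Get-NetAdapterVmq

# Disable VMQ on all physical NICs and bounce them so the change takes effect
Disable-NetAdapterVmq -Name "*"
Restart-NetAdapter -Name "*"
```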
Have been through three complete loops of updating drivers and firmware, then updating again, and again as the months passed.
Event logs at the host and VM level are clean, the VMs just become so slow that they are functionally offline.
Before we turned off Shadow Copies and limited backups to only use the Hyper-V VSS Writer, the VMs would not recover from being so sluggish until they were rebooted, which could take hours.
We've seen the read queue on the RAID5 array as high as 29 seconds during backup. Dell is starting to posture with "Your use may exceed the design intent of these servers." Never mind that we're just trying to make a backup using the embedded backup software in Windows Server 2012.
We referred Dell to this thread as evidence that others were having the same issue, but they claim they cannot find any support case(s) for your problem.
Also, we have in-place upgraded the 2012 servers to 2012 R2. Not sure if that actually fixed anything, but it made me feel better being on the latest OS.
Dell tried to blame the USB 3.0 controllers we installed to facilitate faster backups to external drives (even though the bottleneck was clearly at the read queue on the RAID5 array), but the problem manifested when we ran a backup to the RAID5 array on the server itself. Then they claimed that test overloaded the array, even though they asked for it.
They are shipping a replacement T620 now, but I'm not optimistic that more of the same will lead to a different result.
Of course, the obvious failure here is that current-generation PowerEdge servers should have USB 3.0 built in already!
I've mentioned that several times. I also offered to install an "approved" USB 3.0 card of their choosing. They told me yesterday that none are tested/approved.
They want to test a backup to a USB 2.0 connected drive or a shared folder on another server, but I suspect that would shift the bottleneck to the write side and cover up the problem.
FYI, we've had a paid PSS case open with Microsoft since last fall as well. They've been worse than useless; I can't even get them to return emails or call me.
Needless to say we will not be using Dell T series servers again.
Our next two customer migrations were to HP servers and went just fine.