Casey Weaver

asked on

Strange VSS issue: Slow but successful System State Backups

I've got a little over 40 identical 2012 R2 servers being backed up with DPM 2019 (previously DPM 2012). They all work fine except for one. We don't know exactly when in the last few months the problem started, but we do know it was still using DPM 2012 when it began. The issue is that this one server's System State backup takes hours. All of the other systems average 45 minutes to an hour; this one has a fastest time of 3.5 hours, an average of 5 hours, and sometimes takes 7 hours. The System State backup has never been larger than 9-10 GB. The backups never fail. What does fail is the server's services: if the backup takes 3 hours, the server is fine; if it takes 6 hours, we'll see the server start to become unresponsive. Certain monitoring tools fail to connect over remote PowerShell. Ping never drops. The cluster log shows the node failing to respond to control codes multiple times as the backup runs longer and longer, and it can take minutes to get an RDP session working. Once the backup completes (or is canceled), the system returns to normal immediately.
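
For reference, the control-code failures can be pulled out of the cluster log for just a backup window with something like this (the node name and destination are placeholders):

# Generate the cluster log for the affected node, covering the last 8 hours (480 minutes)
Get-ClusterLog -Node "HV-NODE01" -TimeSpan 480 -Destination "C:\Temp"
# Then search C:\Temp\*_cluster.log for the control-code failures noted above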

Unlike all of the other servers, this server shows "Waiting for responses. These may be delayed if a shadow copy is being prepared." when running vssadmin list writers. Googling hasn't helped, as all the results point to an issue where the command hangs there. This server doesn't hang; it just takes about 15 seconds before the writers load and everything pops up. The output (below) is clean: every writer is [1] Stable with no error. To isolate the issue, we uninstalled the DPM agent and used Windows Server Backup to test. It behaves the same way, taking about 3.5 hours to perform a system state backup of about 10 GB. That's writing to the same local C: volume as well, so there's no network at play. Average disk queue length during the backup is about 0.2, and we never saw it go above 1.4.
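
For reference, the Windows Server Backup test amounted to the command below; the Measure-Command wrapper is added here just for illustration, and the backup target is a placeholder (C: was what we wrote to, since it was the only local volume):

# Run a one-off system state backup with Windows Server Backup and time it
Measure-Command {
    wbadmin start systemstatebackup -backupTarget:C: -quiet
}

# Compare the writer enumeration delay before and after
vssadmin list writers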

Here are some of the things tried:
1. Changed the VSS shadow copy storage size from unlimited to 300 MB and back to unlimited.
2. Purged all incremental and backup history and the WindowsImageBackup folder and started over, to make sure there wasn't an issue in the chain. Both DPM and Windows Backup perform the same way after doing that.
3. Analyzed the component store with DISM and ran Disk Cleanup, clearing 4 GB of update files in WinSxS and 25 GB of memory dumps going back over 5 years.
4. Re-registered the VSS DLLs (the usual regsvr32 routine; a sketch of what was run follows this list).
5. Lots of reboots between trying everything.
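
The re-registration in step 4 was essentially the standard sequence below, run from an elevated prompt. The commonly circulated scripts list more DLLs than actually exist on 2012 R2, so treat this as a trimmed sketch rather than a canonical list:

# Stop the VSS services before re-registering
net stop vss
net stop swprv

# Re-register the core VSS COM components
regsvr32 /s ole32.dll
regsvr32 /s oleaut32.dll
regsvr32 /s vss_ps.dll
regsvr32 /s /i swprv.dll

# Restart the services
net start swprv
net start vss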

I'm left scratching my head, and Microsoft is being their usual useless selves, asking for logs and traces and then disappearing for days. Looking for any ideas.

C:\>  vssadmin list writers
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001-2013 Microsoft Corp.


Waiting for responses.
These may be delayed if a shadow copy is being prepared.


Writer name: 'Task Scheduler Writer'
   Writer Id: {d61d61c8-d73a-4eee-8cdd-f6f9786b7124}
   Writer Instance Id: {1bddd48e-5052-49db-9b07-b96f96727e6b}
   State: [1] Stable
   Last error: No error


Writer name: 'VSS Metadata Store Writer'
   Writer Id: {75dfb225-e2e4-4d39-9ac9-ffaff65ddf06}
   Writer Instance Id: {088e7a7d-09a8-4cc6-a609-ad90e75ddc93}
   State: [1] Stable
   Last error: No error


Writer name: 'Performance Counters Writer'
   Writer Id: {0bada1de-01a9-4625-8278-69e735f39dd2}
   Writer Instance Id: {f0086dda-9efc-47c5-8eb6-a944c3d09381}
   State: [1] Stable
   Last error: No error


Writer name: 'System Writer'
   Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
   Writer Instance Id: {af9a2e42-8a04-453e-8e8b-3542d12aec79}
   State: [1] Stable
   Last error: No error


Writer name: 'ASR Writer'
   Writer Id: {be000cbe-11fe-4426-9c58-531aa6355fc4}
   Writer Instance Id: {66b06aa6-b4b2-4dd2-bbe1-e693c93539db}
   State: [1] Stable
   Last error: No error


Writer name: 'Shadow Copy Optimization Writer'
   Writer Id: {4dc3bdd4-ab48-4d07-adb0-3bee2926fd7f}
   Writer Instance Id: {b63b1a22-785f-4a0b-836f-51f13bb098f1}
   State: [1] Stable
   Last error: No error


Writer name: 'Cluster Shared Volume VSS Writer'
   Writer Id: {1072ae1c-e5a7-4ea1-9e4a-6f7964656570}
   Writer Instance Id: {b43c93cb-9351-42c3-a6d2-4066d493e170}
   State: [1] Stable
   Last error: No error


Writer name: 'Registry Writer'
   Writer Id: {afbab4a2-367d-4d15-a586-71dbb18f8485}
   Writer Instance Id: {752560e1-8733-403d-bd43-52419b648526}
   State: [1] Stable
   Last error: No error


Writer name: 'COM+ REGDB Writer'
   Writer Id: {542da469-d3e1-473c-9f4f-7847f01fc64f}
   Writer Instance Id: {c53d1220-3c01-4e79-a9ca-37e45d0e3fb4}
   State: [1] Stable
   Last error: No error


Writer name: 'WMI Writer'
   Writer Id: {a6ad56c2-b509-4e6c-bb19-49d8f43532f0}
   Writer Instance Id: {56e71e02-3b6a-4f32-be82-58e8c1c62a82}
   State: [1] Stable
   Last error: No error


Writer name: 'BITS Writer'
   Writer Id: {4969d978-be47-48b0-b100-f328f07ac1e0}
   Writer Instance Id: {9bdef0d5-42c8-42ae-b0cf-81d0c2ace531}
   State: [1] Stable
   Last error: No error


Writer name: 'Microsoft Hyper-V VSS Writer'
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   Writer Instance Id: {3c584d58-1d6c-4534-a543-d686518d6165}
   State: [1] Stable
   Last error: No error


Writer name: 'Cluster Database'
   Writer Id: {41e12264-35d8-479b-8e5c-9b23d1dad37e}
   Writer Instance Id: {8355f9a4-783b-4920-aaf5-7e6db4682c11}
   State: [1] Stable
   Last error: No error


C:\> vssadmin list providers
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001-2013 Microsoft Corp.


Provider name: 'Microsoft CSV Shadow Copy Helper Provider'
   Provider type: Software
   Provider Id: {26d02d81-6aac-4275-8504-b9c6edc5261d}
   Version: 1.0.0.1


Provider name: 'Microsoft CSV Shadow Copy Provider'
   Provider type: Software
   Provider Id: {400a2ff4-5eb1-44b0-8a05-1fcac0bcf9ff}
   Version: 1.0.0.1


Provider name: 'VSS Null Provider'
   Provider type: Software
   Provider Id: {8202aeda-45bd-48c4-b38b-ea1b7017aec3}
   Version: 10.19.58.0


Provider name: 'Microsoft File Share Shadow Copy provider'
   Provider type: Fileshare
   Provider Id: {89300202-3cec-4981-9171-19f59559e0f2}
   Version: 1.0.0.1


Provider name: 'Microsoft Software Shadow Copy provider 1.0'
   Provider type: System
   Provider Id: {b5946137-7b9f-4925-af80-51abd60b20d5}
   Version: 1.0.0.7


Provider name: 'VSS Null Provider'
   Provider type: Fileshare
   Provider Id: {f4a69dd4-f712-40e3-a6b3-faeff03cb2b8}
   Version: 10.19.58.0


Philip Elder

This is a domain controller where the System State backup is being attempted?
Casey Weaver (Asker)

No, these are Hyper-V hosts. About 52 of them in various Hyper-V clusters. 
A System State backup would be something done on domain controllers?

I don't see the reason why it would be done on a Hyper-V host.
System State backups capture the cluster configuration database, i.e. the cluster metadata: Back up system state and bare metal | Microsoft Docs

Without such a backup, in a situation where you lose the CSV to something like a SAN failure or crypto, you're rebuilding the Hyper-V environment instead of simply bringing your cluster back online with a copy of the cluster metadata. While it may not seem like much to set up a Hyper-V host, it becomes far more critical to recovery speed when you have nearly one thousand VMs in a cluster.
This is what we do:
Protecting a Backup Repository from Malware and Ransomware
Disaster Preparedness: KVM/IP + USB Flash = Recovery. Here’s a Guide

We've recovered clusters from the ground up without backing up the hosts. The data is still on the SAN or in Storage Spaces and needs only to be imported and made highly available. This is all done via PowerShell; a rough sketch of the import step is below.
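
The import-and-make-highly-available step is along these lines (a minimal sketch; the path and VM name are placeholders, not our actual script):

# Re-register an existing VM in place from its configuration file on the CSV
# (path is a placeholder; on 2012 R2 the config file is a GUID-named .xml)
$vm = Import-VM -Path "C:\ClusterStorage\Volume1\VM01\Virtual Machines\<GUID>.xml" -Register

# Make the imported VM highly available in the cluster
Add-ClusterVirtualMachineRole -VMName $vm.Name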

If the entire cluster node set is fried, there are bigger problems to deal with.
The environment deployment is automated, and it all runs on blades with no local storage. Everything is iSCSI or FC boot. If you lose the SAN (as I've personally witnessed when a failed NetApp controller replacement went sideways due to a firmware bug), you've lost the cluster. A metadata backup prevents you from needing to do all of those imports if, for some reason, you lose the quorum disk. Beyond that you get into the mirrored SAN, the DR environment, etc. The goal is to recover hosts and environment in under two hours.

Obviously this is something that should be working, and it has been working on this subset of systems for over 5 years. Even if it weren't a Hyper-V host, this "Waiting for responses" behavior could be happening to any other system; there seem to be plenty of these issues to be found on Google. I'd like to focus on why it is happening, and on what tools Microsoft provides that can break down why VSS is waiting for a response. So far, I haven't seen a way to break down what happens during the vssadmin list writers call to know which writer is causing the hold-up.
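
For reference, timing the call itself and pulling the related VSS and volsnap events around a run looks something like this (a rough sketch, not a full trace):

# Time the writer enumeration
Measure-Command { vssadmin list writers | Out-Null }

# VSS events land in the Application log, volsnap events in the System log
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'VSS'; StartTime = (Get-Date).AddHours(-12) } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message

Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'volsnap'; StartTime = (Get-Date).AddHours(-12) } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message
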
Correlate the updates applied against when the slowdowns started (a quick way to list them is sketched after this list).
Make sure the listed updates don't have components that could impact the performance of the OS.
If the OS is booting from remote storage, verify that the network fabric serving the boot image is not dropping packets.
Monitor In/Out on the switch(es) for dropped frames and/or latency spikes during the VSS snapshots.
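
A quick way to pull the installed updates with their install dates for that correlation (a sketch; run on the affected node):

# List installed updates newest-first to line up against when the slow backups started
Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object HotFixID, Description, InstalledOn
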
1. I've tried to nail down with the backup team when the issue started, but it didn't line up with any update or config change. It was about halfway between updates, and the server had been running with the patches for just over two weeks when it started happening.
2. They get their monthly rollups about 2 weeks after Patch Tuesday (the first week being for testing in DEV). No other system has shown this issue.
3. No issues at the fabric level. Checked the FC switches and the interconnects. Had the DC move the blade to another chassis just to see if anything changed, and it didn't.
4. Overall throughput from the system is just above the Windows baseline during the snapshot phase, peaking at normal backup rates once the data starts getting written. Latency is the same as the other servers accessing the all-flash storage, ~1-2 ms, even during backups. Writing to a local disk set installed for testing is still sub-5 ms.

Even when the system slows down because a backup is running long, Resource Monitor doesn't show any unusual disk queue length or latency. This feels very much like an interrupt/deadlock kind of issue, where the system is waiting to perform an action that it has no physical (CPU/memory/disk/etc.) reason to be waiting for.
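
For anyone who wants to watch the same thing outside Resource Monitor, sampling the standard PhysicalDisk counters during a backup window looks something like this (a sketch; adjust the sample count to cover the run):

# Sample queue length and latency every 5 seconds for 10 minutes
Get-Counter -Counter @(
    '\PhysicalDisk(_Total)\Avg. Disk Queue Length',
    '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer'
) -SampleInterval 5 -MaxSamples 120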