PMIMIS asked:

Veeam backups of Exchange 2010 on ESXi 5.0 fail with error Unable to release guest. Error: Unfreeze error (over VIX): [Backup job failed..

We are trying to set up Veeam backups of our Exchange 2010 server. Veeam reports the following error: "Unable to release guest. Error: Unfreeze error (over VIX): [Backup job failed.]
Error: Unfreeze error (over VIX): [Backup job failed.]"

Our Exchange 2010 server is on SP2 Rollup 3. The OS is Windows 2008 R2 SP1. It runs as a guest on VMware ESXi 5.0. It is the only guest on the host. VMware Tools for Windows version 8.6.5, build-621624 is installed on the server and running.

Our Exchange 2010 environment is currently in the early stages of coexistence with Exchange 2003. Approximately 15 of 250 mailboxes have been moved to databases on the 2010 servers. We have one Exchange 2003 server and two 2010 servers. The 2010 servers are both CAS, HT, MB servers with a CAS array behind Kemp load balancers and a DAG that uses a file server as a witness server. The 2003 server and the other 2010 server are installed directly on physical systems. The 2010 server in question was installed as a VM specifically for the purpose of simplified DR. Down the road we will add a third 2010 server at an overseas site to better serve our Asian offices and provide us with manual site resiliency. It might also be relevant that the VM being backed up sits on a different subnet than the host or the Veeam server. The VM uses the load balancer as its default gateway, but unless there is a virtual service configured, traffic to it goes through the router.

Preliminary troubleshooting indicates possible problems with VSS. Before running the backup job, I confirmed that all writers were in State: [1] Stable, Last error: No error. After the job fails, however, one VSS writer is in a failed state and five are in a waiting for completion state, specifically:

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {332e5d22-32f8-4b59-840c-f95b50a4b2a8}
   State: [9] Failed
   Last error: Timed out

Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {70a1015a-6e10-4df5-a039-03f829eddaa7}
   State: [5] Waiting for completion
   Last error: No error

Writer name: 'WMI Writer'
   Writer Id: {a6ad56c2-b509-4e6c-bb19-49d8f43532f0}
   Writer Instance Id: {b6baa074-43fd-4062-b13e-161e3192bc9c}
   State: [5] Waiting for completion
   Last error: No error

Writer name: 'Cluster Database'
   Writer Id: {41e12264-35d8-479b-8e5c-9b23d1dad37e}
   Writer Instance Id: {de46fd3f-2f6d-4fe8-b812-caf342a29e08}
   State: [5] Waiting for completion
   Last error: No error

Writer name: 'IIS Config Writer'
   Writer Id: {2a40fd15-dfca-4aa8-a654-1f8c654603f6}
   Writer Instance Id: {bb3b9052-2642-4c30-8d11-9293a8b6a7e3}
   State: [5] Waiting for completion
   Last error: No error

Writer name: 'IIS Metabase Writer'
   Writer Id: {59b1f0cf-90ef-465f-9609-6ca8b2938366}
   Writer Instance Id: {aafe0c53-6d13-4ffe-9655-58c28c733eb4}
   State: [5] Waiting for completion
   Last error: No error
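
The same check can be scripted against saved `vssadmin list writers` output. The following is a minimal Python sketch, not a definitive tool: it assumes the text format shown above, the function name `unstable_writers` is hypothetical, and treating "[1] Stable" with "No error" as the only healthy combination is an assumption made for illustration. The 'System Writer' entry in the sample is an illustrative healthy writer, not taken from this thread.

```python
import re

# Matches one writer block in `vssadmin list writers` output.
WRITER_RE = re.compile(
    r"Writer name: '(?P<name>[^']+)'.*?"
    r"State: \[(?P<code>\d+)\] (?P<state>[^\r\n]+).*?"
    r"Last error: (?P<error>[^\r\n]+)",
    re.DOTALL,
)

def unstable_writers(text):
    """Return (name, state, last_error) for writers that are not Stable / No error."""
    problems = []
    for m in WRITER_RE.finditer(text):
        state = m.group("state").strip()
        error = m.group("error").strip()
        # Assumption: state code 1 with "No error" is the only healthy combination.
        if m.group("code") != "1" or error != "No error":
            problems.append((m.group("name"), state, error))
    return problems

sample = """\
Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   State: [9] Failed
   Last error: Timed out

Writer name: 'WMI Writer'
   Writer Id: {a6ad56c2-b509-4e6c-bb19-49d8f43532f0}
   State: [5] Waiting for completion
   Last error: No error

Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error
"""

for name, state, error in unstable_writers(sample):
    print(f"{name}: {state} (last error: {error})")
```

Running this before and after a backup attempt would make it easy to see which writers changed state, without reading the full listing by eye.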

The Microsoft Application Log shows relevant entries related to the failed backup:

(Repeated entries omitted to save space)
Event 2021 The Microsoft Exchange VSS Writer has successfully collected the metadata document in preparation for backup.
Event 9606 Exchange VSS Writer (instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0) has prepared for backup successfully.
Event 2110 The Microsoft Exchange VSS Writer instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0 has successfully prepared for a full or a copy backup of database 'Mailbox Database SEA-B'.  The following database will be backed up: Mailbox Database SEA-B.
Event 2023 The Microsoft Exchange Replication service VSS Writer (Instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0) successfully prepared for backup.
Event 2005 Information Store (4640) Shadow copy instance 1 starting. This will be a Full shadow copy.
Event 9811 Exchange VSS Writer (instance 1) has successfully prepared the database engine for a full or copy backup of database 'Public Folder Database 2010B'.
Event 9608 Exchange VSS Writer (instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0:1) has prepared for Snapshot successfully.
Event 9539 The Microsoft Exchange Information Store database "3751e3f8-909d-426b-b8ff-8e3c4945e8f9: /o=Pacific Market International/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=Outlook-SEA.pmi-worldwide.com/cn=Microsoft Private MDB" was stopped.
Event 2027 The Microsoft Exchange VSS Writer instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0 has successfully frozen the databases.
Event 960 msexchangerepl (2148) This computer is performing a surrogate backup.  The master server is EX-SEA01.
Event 2025 The Microsoft Exchange Replication service VSS Writer (Instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0) successfully prepared for a snapshot.
Event 2027 The Microsoft Exchange VSS Writer instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0 has successfully frozen the databases.
Event 2001 Information Store (4640) Shadow copy instance 1 freeze started.
Event 2001 Information Store (4640) Public Folder Database 2010B: Shadow copy instance 1 freeze started.
Event 9610 Exchange VSS Writer (instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0:1) has frozen the database(s) successfully.
Event 2003 Information Store (4640) Shadow copy instance 1 freeze ended.
Event 2007 Information Store (4640) Shadow copy instance 1 aborted.
Event 9614 Exchange VSS Writer (instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0:1) has aborted the backup successfully.
Event 2029 The Microsoft Exchange VSS Writer instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0 has successfully thawed the databases.
Event 2114 The replication instance for database Mailbox Database SEA-B has started copying log files. The first log file copied was generation 14857.
Event 1000 Attempting to start the Information Store "Mailbox Database SEA-B".
Event 2035 The Microsoft Exchange Replication service VSS Writer (Instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0) has successfully processed the post-snapshot event.
Event 9648 Exchange VSS Writer (instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0:1) has processed the backup shutdown event successfully.
Event 964 msexchangerepl (2148) The surrogate backup to EX-SEA01 has been stopped with error 0xFFFFFFFF.
Event 2037 The Microsoft Exchange Replication service VSS Writer (Instance 7d96a1ab-186d-495b-a6ec-c9682bccb5f0) backup has been successfully shut down.
Event 2114 The replication instance for database Mailbox Database SEA-A has started copying log files. The first log file copied was generation 12507.
Event 1000 Attempting to start the Information Store "Mailbox Database SEA-A".





The Microsoft System logs show a few Information entries related to the failed backup too.
Event 7045 VeeamVssSupport service installs
Event 7036 Volume Shadow Copy service enters running state.
Event 7036 The VeeamVssSupport service entered the running state.
Event 7036 The Application Experience service entered the running state.
Event 7036 The WinHTTP Web Proxy Auto-Discovery Service service entered the stopped state.
Event 7036 The Microsoft Software Shadow Copy Provider service entered the running state.
Event 7036 The COM+ System Application service entered the stopped state.
Event 1       The system time has changed to 2012-09-21T17:07:15.491000000Z from 2012-09-21T17:03:37.474319200Z. (I am pretty sure the time was correct before the backup started. Perhaps the machine was frozen for a while during the backup? I will watch the clock next time I try a backup.)
That is where the job fails. Afterward the VeeamVssSupport service, Volume Shadow Copy service, Microsoft Software Shadow Copy Provider service, and Application Experience service enter stopped state.
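
The size of the clock jump in Event 1 can be worked out from the two timestamps. This is a small Python sketch under stated assumptions: both values are taken to be UTC, as the trailing Z suggests, and the helper name `parse_event_time` is hypothetical. The fractional seconds are truncated to six digits because Python's `datetime` does not store finer resolution.

```python
from datetime import datetime

def parse_event_time(stamp):
    """Parse an event-log UTC timestamp, truncating fractional seconds to microseconds."""
    stamp = stamp.rstrip("Z")
    whole, _, frac = stamp.partition(".")
    frac = (frac + "000000")[:6]  # datetime supports at most 6 fractional digits
    return datetime.strptime(f"{whole}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")

old = parse_event_time("2012-09-21T17:03:37.474319200Z")
new = parse_event_time("2012-09-21T17:07:15.491000000Z")
jump = (new - old).total_seconds()
print(f"Clock jumped forward by {jump:.1f} seconds")
```

The difference comes to roughly 218 seconds (about 3 minutes 38 seconds), which would be consistent with the guest having been stalled for several minutes during the backup attempt.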

Do any of you have any thoughts on this? I am seeing a couple years of people pointing to this issue in various forums, but have yet to find a definitive solution or troubleshooting guide.

Many thanks,
PMIMIS
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Check whether you can snapshot the running VM with the Quiesce option ticked.

Do you get the same error?
Hi,

Which features have been enabled in VMware Tools? If VMware Tools quiescence has been enabled, try disabling it and perform the backup again. To clear the error state of the VSS writer, you might need to reboot the server.

Regards,
Johan
Avatar of PMIMIS

ASKER

Hi Johan,

Rebooting does seem to be required to clear the VSS error state. That is what I did before the last backup test after trying many other approaches. VMware Tools quiescence was already disabled. This backup is running with application-aware image processing.

Thank you for your reply. Please let me know if you have any other thoughts.

Regards,
PMIMIS
Remove VMware Tools and Re-install VMware Tools.
Avatar of PMIMIS

ASKER

Hi hanccocka,

Kicked off a VMware snapshot on the running VM with Quiesce option ticked at 1:15:36.
This caused the server to become unavailable via RDP and caused ping times to spike up to more than 2000 ms, but eventually completed successfully at 1:24:35.

Began snapshot deletion at 1:26:05. Completed at 1:26:29.

Checking the server event logs for any errors now.

Regards,
PMIMIS
If you have slow storage or a very busy server, this can happen.
Hi,

Also check this KB article from VMware:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1007696


Is the firewall running on the Exchange server? If so, try disabling it temporarily.

Johan
Avatar of PMIMIS

ASKER

Hi Hancocka,

Did not get the same errors when running VMware snapshot. Got all sorts of new errors. Exchange REALLY does not like being snapshotted live by VMware.

VMware snapshotting process seems to have been very unpopular with the Exchange server. Event log is littered with Errors for the duration.

MSExchangeIS failed to initialize
SACL Watcher got error opening group policy
Replication network manager error occurred while attempting a cluster operation
Log copier could not communicate with other Exchange server
Store driver could not deliver public folder replication message
Watson report sent for store.exe process
MS Exchange Replication service could not communicate with Exchange Information Store service.
MS Exchange Replication service encountered unexpected error in log replay.
Information Store service terminated so Active manager could quickly dismount all databases.

Things finally started calming down around five minutes after the snapshot was deleted.

Regards,
PMIMIS
Rerun your backup at your Exchange quietest time, with few users and traffic.
Avatar of PMIMIS

ASKER

Okay guys,

I am going to try rebooting the machine to clear the VSS errors and I am then going to temporarily disable the firewall and attempt a Veeam backup.

If that does not work I am going to try uninstalling and reinstalling the VMware Tools before clearing the VSS errors with another reboot and trying again (still with firewall off).
Avatar of PMIMIS

ASKER

I like the idea of rerunning the backup at a quiet time. It would also make sense to make sure there is no online maintenance at that time and to put the server into maintenance mode with the StartDagServerMaintenance.ps1 script. But for now I am going to go forward with the other ideas. This server will only be busier in the future when we have the other 95% of our mailboxes moved from 2003 to 2010.
I have had nothing but problems with the combination of Veeam and Exchange 2010 DAG, to the point of no longer recommending the solution.
I prefer to go back to something doing the backup inside the OS, with my usual choice being Backup Assist.

Not alone in this either. One solution that has worked in the past is this:
http://desktopfeedbag.com/2012/02/23/fixed-exchange-2010-sp1sp2-dag-fail-over-with-veeam/

Simon.
Avatar of PMIMIS

ASKER

Friday night I turned off all online maintenance and ran the backup in the dead of night, with all databases in a passive state. Still failed in exactly the same place. Boo.
I have been experiencing the same issue for some time and have finally found a fix. The issue is the VMware snapshot process. First, make sure that no backup or replication jobs are running for the server experiencing the issue. In vCenter, right-click the server and take a snapshot. Once that snapshot completes, right-click the machine again, open Snapshot Manager, and select Delete All to delete all snapshots. This may take some time, depending on how much data has been stored so far and how busy the server is at the moment. If that does not work, you need to call VMware support and let them remote in and repair your snapshot chain.
Avatar of PMIMIS

ASKER

I spoke with Veeam support. Basically you can't take snapshot based backups unless your disks are fast enough to quiesce the machine in about 20 seconds. That's a shame because one of the benefits of Exchange 2010 is that it works great with cheap, slow SATA disks.

Veeam's recommendation is to move the server to fast storage. We are going to do that for just one server in the three-server DAG and then back up only that one server. The other servers can be rebuilt with the Exchange rebuild option if necessary.
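
To put the roughly 20-second figure in context, it can be compared with the manual quiesced snapshot reported earlier in this thread (started at 1:15:36, completed at 1:24:35). This is a rough Python sketch under stated assumptions: the 20 seconds is the approximate figure quoted from Veeam support, the measured time covers the whole snapshot operation rather than the quiesce phase alone, and the names `QUIESCE_BUDGET` and `elapsed` are hypothetical.

```python
from datetime import datetime, timedelta

# Approximate figure quoted from Veeam support in this thread (an assumption, not a documented limit).
QUIESCE_BUDGET = timedelta(seconds=20)

def elapsed(start, end, fmt="%H:%M:%S"):
    """Elapsed time between two same-day clock readings."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# Clock readings reported for the manual quiesced snapshot.
snapshot_took = elapsed("1:15:36", "1:24:35")
print(f"Quiesced snapshot took {snapshot_took.total_seconds():.0f} s")
print(f"Over budget by a factor of about {snapshot_took / QUIESCE_BUDGET:.0f}")
```

On these numbers the snapshot took 539 seconds, well over an order of magnitude longer than the quoted window, so a modest storage improvement alone may not be enough to close the gap.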
Avatar of PMIMIS

ASKER

I've requested that this question be closed as follows:

Accepted answer: 0 points for PMIMIS's comment #a38502812

for the following reason:

I am giving this solution a B because it will work. It is not getting an A because it does not solve the problem in a very cost effective manner.
I'm objecting because in http:#a38423457, I state "if you have slow storage or a very busy server this can happen".

This is confirmed later by your conversation with Veeam Software - "I spoke with Veeam support. Basically you can't take snapshot based backups unless your disks are fast enough to quiesce the machine in about 20 seconds."

This may not be the solution you sought, but it's the truth, and I believe it should be included and recorded in the Knowledgebase so that future users can find it and realize that SATA disks are not enterprise class. It does not warrant a deletion or lack of points.
Avatar of PMIMIS

ASKER

Microsoft goes out of their way to say SATA disks are more than fast enough to run Exchange 2010 in an Enterprise environment and I can vouch that they work great from the standpoint of Exchange performance, so I don't think the Knowledgebase is well served by just saying SATA disks aren't for Enterprise and Veeam often doesn't work with slow or busy disks. It's also perhaps premature to say that moving the server to faster storage is going to fix the issue since it hasn't happened yet.

I have no objection to giving hanccoka all the points, but I would prefer to tag these last couple comments as the solution since they give a fuller story that is more likely to help others.
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Avatar of PMIMIS

ASKER

So, just to be clear: if you have two identical pieces of hardware with the same cores, memory, and SATA disk arrays, and you build one of them as a physical Exchange server and on the other install ESXi 5.0 and give all of the resources to a single Exchange server, you would expect the virtual machine to be much, much slower?
Correct, with any hypervisor, which is why for performance we use RAW disks and present LUNs directly to the VM, to get the best performance without the virtual disk layer.

Also, because of the CPU scheduler, hypervisors "suck" performance away from the CPU; this is observed with SQL databases and real-time high-performance clusters.

Hypervisors are a compromise, and work well for 90% of workloads but not all.

To end: snapshot usage, exploited by ALL third-party backup programs, has not worked well for real-time, high-throughput applications (e.g. Exchange, SQL, Oracle, Domain Controllers) since snapshots were introduced 15 years ago. There was a time in IT when we had a window to back up out of hours; unfortunately, with 24/7 working, 365 days a year, that no longer applies. Your issue is not the first time I've seen similar problems caused by the VMware snapshot method.

Because there was no resolution for us or our clients, we moved to SAN snapshot solutions from 2004.