NTBackup on Windows SBS 2003r2 Server Hangs intermittently

I've recently installed a Windows Server 2003r2 SBS Edition on an HP ML150 Server with an E200 add-on RAID card.  I patched the server until everything was up to date (as of February 7th, 2009.)

The following weekend I updated to the latest tape and raid drivers without issue.

Two weeks ago (on Tuesday, Feb 9th) I received a call in the early morning that the users could not log in... or get email... or access the internet (DNS.)

I RDP'd into the Terminal Server and could ping the server, but it was not responding to RDP requests.  I went on site and saw that the console for the server was stuck on the gray background of the login screen (the server had been logged in and locked.)

I restarted the server and it came up slowly, but after letting it sit for 2 hrs it was still at "Applying Computer Settings". I restarted the server into Safe Mode and uninstalled the updated RAID and tape drive drivers, thinking that maybe a bad driver had caused the system to crash hard.

The thing I noted was that the server froze at 11:22pm (backups start at 10pm and take 3hrs.)

So I thought maybe the system kicked off the backup and ran into an error.

Anyway, I restarted into Safe Mode and disabled all services, and was then able to restart and log into the system.  I started enabling services one by one until everything that could start would.  DHCP and IIS were still having issues starting.  I spent about 3hrs trying to get the services started, including reregistering the ocx file with regsvr32 (I can't remember the specific DLL, but it's documented in the DHCP recovery documentation from Microsoft.)

Anyway -- with seemingly no intervention on my part (I know it sounds absurd) at about 3pm after spending about 7hrs on the issue the services finally started and the system started performing like normal... I hadn't restarted it since 12:30p -- so I'm not sure why it started working suddenly.

Anyway -- I've restarted the server a few times since then and have been fixing issues related to updating the system to WSS 3.0 since the first hang up.

Last night for the first time since the last system hang... the server froze again.  This time at 7:30pm with the backup starting at 7pm.  The backup logs are empty (0 bytes) for both the 9th and last night (the 17th.)

I logged into the terminal server again and pinged the server, but could not RDP.  The server console was once again frozen on the grey screen.

I powered off the server and restarted it -- it came up fine with no logs after 7:30p until 8:17am when I restarted it.  No errors right before it stopped.

Upon restarting the server the following information was the first new entry stored in the System Event Log:

"The previous system shutdown at 9:06:00 PM on 2/18/2009 was unexpected."

The logs stopped at 7:30pm, but apparently the system didn't register it had hung until 9:06.

Backups that start at 7pm typically finish the backup pass at 9:30pm and the verification pass between 10:40pm and 11:00pm...

Tonight's backup went off without a hitch, so the behavior is not consistent.

At 7:30pm tonight the following Informational Alerts were entered into the Event Log:

"The Removable Storage service was successfully sent a start control."
"The Removable Storage service entered the running state."
"The Volume Shadow Copy service was successfully sent a start control."
"The Volume Shadow Copy service entered the running state."
"The Microsoft Software Shadow Copy Provider service was successfully sent a start control."
"The Microsoft Software Shadow Copy Provider service entered the running state."

And at 9:30pm:
"The Volume Shadow Copy service entered the stopped state."
"The Microsoft Software Shadow Copy Provider service entered the stopped state."

Then at 11:08pm tonight:
"RSM was stopped."
"The Removable Storage service entered the stopped state."

And the Backup Logs are complete and SBS says it's successful.

I've seen people mention applying a hotfix for some VSS issues, but most of those reports are a few years old and relate to servers running 2003r2 SP1, not SP2, with people mentioning that the problems were fixed in SP2.

I'm still willing to try applying redundant hotfixes if that solves the problem.

Any thoughts on this... I hope I was thorough enough and yet still to the point.  :)

-- UPDATE --

I've now run:
regsvr32 msxml.dll
regsvr32 msxml3.dll
regsvr32 msxml4.dll

per:


I've run:
vssadmin list writers

per:
http://www.petri.co.il/forums/showthread.php?t=25841

^^ that seems to be my issue identically

This also seems to partially be my issue (minus the actual error log / message:)
http://www.eggheadcafe.com/software/aspnet/33545710/ntbackup-failing.aspx

Tonight I'm trying:
http://support.microsoft.com/kb/940349

To see if that fixes the issue... of course we won't know for another week or two...

C:\Documents and Settings\Administrator>vssadmin list writers
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001 Microsoft Corp.
 
Writer name: 'System Writer'
   Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
   Writer Instance Id: {8b70819a-81f3-4bcd-8fa8-b90385b29523}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'MSDEWriter'
   Writer Id: {f8544ac1-0611-4fa5-b04b-f7ee00b03277}
   Writer Instance Id: {eb9cd8d4-55f7-49f9-9d8e-2896e49cfd84}
   State: [1] Stable
   Last error: No error
 
Writer name: 'SqlServerWriter'
   Writer Id: {a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}
   Writer Instance Id: {de6d3ee3-d6a4-4e0f-97e8-3209ed3703e4}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'Event Log Writer'
   Writer Id: {eee8c692-67ed-4250-8d86-390603070d00}
   Writer Instance Id: {9eee7ecb-0eb5-46f2-80fe-56f59e934001}
   State: [1] Stable
   Last error: No error
 
Writer name: 'WINS Jet Writer'
   Writer Id: {f08c1483-8407-4a26-8c26-6c267a629741}
   Writer Instance Id: {ee60961c-3792-4f2e-8115-e38d16b08330}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'IIS Metabase Writer'
   Writer Id: {59b1f0cf-90ef-465f-9609-6ca8b2938366}
   Writer Instance Id: {017748bf-5f6f-4a0f-a5ee-20b2c4e176fd}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'COM+ REGDB Writer'
   Writer Id: {542da469-d3e1-473c-9f4f-7847f01fc64f}
   Writer Instance Id: {7c716e84-e6ff-47d7-8101-0ee56488a3ab}
   State: [1] Stable
   Last error: No error
 
Writer name: 'Dhcp Jet Writer'
   Writer Id: {be9ac81e-3619-421f-920f-4c6fea9e93ad}
   Writer Instance Id: {1c65cd90-37aa-40ed-9263-64f88e50ec60}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'Registry Writer'
   Writer Id: {afbab4a2-367d-4d15-a586-71dbb18f8485}
   Writer Instance Id: {6d254596-2668-4a92-969b-bf3cd62917be}
   State: [1] Stable
   Last error: No error
 
Writer name: 'NTDS'
   Writer Id: {b2014c9e-8711-4c5c-a5a9-3cf384484757}
   Writer Instance Id: {1e2419bd-27c5-4a6e-82ec-41893940338d}
   State: [1] Stable
   Last error: No error
 
Writer name: 'SPSearch VSS Writer'
   Writer Id: {57af97e4-4a76-4ace-a756-d11e8f0294c7}
   Writer Instance Id: {dbae9288-e9e5-4aaf-8712-62ae719be159}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'FRS Writer'
   Writer Id: {d76f5a28-3092-4589-ba48-2958fb88ce29}
   Writer Instance Id: {002866ec-eb21-42c9-aef5-07badb04843e}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'BITS Writer'
   Writer Id: {4969d978-be47-48b0-b100-f328f07ac1e0}
   Writer Instance Id: {540c2bbe-3042-4c90-848d-86f35f20a78a}
   State: [5] Waiting for completion
   Last error: No error
 
Writer name: 'WMI Writer'
   Writer Id: {a6ad56c2-b509-4e6c-bb19-49d8f43532f0}
   Writer Instance Id: {7c9f2413-6390-4900-88ed-bb26868e01d3}
   State: [5] Waiting for completion
   Last error: No error
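
To compare writer states from one night to the next, the listing above can be summarized with a small script. This is only a sketch: the function names `parse_writers` and `group_by_state` are made up for illustration, the sample text is just the first two writers copied from the output above, and it assumes the `vssadmin list writers` output has been saved or captured as plain text in the layout shown.

```python
import re
from collections import defaultdict

# Matches lines such as:  Writer name: 'System Writer'
WRITER_RE = re.compile(r"^\s*Writer name:\s*'(?P<name>[^']+)'")
# Matches lines such as:  State: [5] Waiting for completion
STATE_RE = re.compile(r"^\s*State:\s*\[(?P<code>\d+)\]\s*(?P<text>.+?)\s*$")
# Matches lines such as:  Last error: No error
ERROR_RE = re.compile(r"^\s*Last error:\s*(?P<err>.+?)\s*$")


def parse_writers(output):
    """Return one dict per writer: name, state_code, state, last_error."""
    writers = []
    current = None
    for line in output.splitlines():
        m = WRITER_RE.match(line)
        if m:
            current = {"name": m.group("name"), "state_code": None,
                       "state": None, "last_error": None}
            writers.append(current)
            continue
        if current is None:
            continue  # banner lines before the first writer
        m = STATE_RE.match(line)
        if m:
            current["state_code"] = int(m.group("code"))
            current["state"] = m.group("text")
            continue
        m = ERROR_RE.match(line)
        if m:
            current["last_error"] = m.group("err")
    return writers


def group_by_state(writers):
    """Map each state text to the list of writer names reporting it."""
    groups = defaultdict(list)
    for w in writers:
        groups[w["state"]].append(w["name"])
    return dict(groups)


# First two writers from the listing above, used as sample input.
SAMPLE = """\
Writer name: 'System Writer'
   Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
   Writer Instance Id: {8b70819a-81f3-4bcd-8fa8-b90385b29523}
   State: [5] Waiting for completion
   Last error: No error

Writer name: 'MSDEWriter'
   Writer Id: {f8544ac1-0611-4fa5-b04b-f7ee00b03277}
   Writer Instance Id: {eb9cd8d4-55f7-49f9-9d8e-2896e49cfd84}
   State: [1] Stable
   Last error: No error
"""

if __name__ == "__main__":
    for state, names in group_by_state(parse_writers(SAMPLE)).items():
        print(f"{state}: {', '.join(names)}")
```

Running the same script against output captured before and after a backup window would show which writers changed state, without having to read the full listing by eye.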


Asked by: ryansinn

ryansinn (Author) Commented:
Thank you, HP, for a horrible server. The problem was finally fixed in mid-March... no lockups since replacing the following:

We replaced each of these components separately and retested the server:

Physically Damaged SCSI Cable (damaged clip)
-- didn't fix lockups

Bad System Motherboard
-- Reseating all RAM determined that Slot 3 was flakey / bad.
-- found potential bad memory DIMM(s)

Replaced 2 RAM DIMMs
-- failed to boot server when mounted in any slot

Bad E200 RAID Controller (these cards run *HOT* intentionally), due to parts falling over (I think due to heat + poor soldering)
-- New RAID Controller came with out-of-date firmware (1.66)
--- after replacing the controller, upgraded the firmware to 1.80

...

Replacing the system board, memory and RAID controller, then upgrading the firmware on the RAID controller, fixed the issue.

The server has run for 3 months without a crash since replacing all of this hardware... the OS has not been reinstalled since January, so multiple hardware failures created some really inconsistent problems.

Backups have been fine since March as well.

lnkevin Commented:
Most of the time, when NTBackup kicks in, the system has another activity that overlaps with it and takes up resources. I would suggest moving the backup schedule to a few hours after 12:00 and monitoring it to see if the issue is still there. Also, let me know which objects you select to back up. A common problem is that people choose to back up C: while some system files are actively in use, and NTBackup fails to back them up. If you can, take a screenshot of the selection with all tasks expanded and post it here.

K

SysExpert Commented:
Since this is SBS, check what other tasks are running on a schedule, and also turn on the alerting option.

While you are at it, run the SBS BPA (Best Practices Analyzer).


I hope this helps!
 
ryansinn (Author) Commented:
Install the recovery console and attempt to remove the virus from there:
http://support.microsoft.com/kb/216417

ryansinn (Author) Commented:
sorry -- wrong question :)

ryansinn (Author) Commented:
Best Practices only has two issues, which I'm ok with:

The Network Driver is more than a Year Old

The Update for Daylight Savings Time (DST) is not installed... it is, I've tried to rerun it and it says it's already installed.

The Scheduled Tasks look fine as well.

Which "Alerting" option are you talking about?
scheduledtasks.png

lnkevin Commented:
The scheduled tasks do not look fine. You have something set to run every hour. It may randomly start at the same time as your backup and cause memory issues. What is that task (95%)? You should check Task Manager when things start freezing to see which process is taking the CPU and memory.

K
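
Since the console is frozen by the time the hang is noticed, one way to follow this suggestion is to log per-process memory periodically beforehand and read the last entries after a restart. The sketch below only covers the parsing step and makes several assumptions: it expects text in the format produced by `tasklist /fo csv /nh` (image name, PID, session name, session number, memory usage), the helper names `parse_tasklist_csv` and `top_by_memory` are made up, and the sample rows are invented for illustration, not taken from this server. It reports memory only; `tasklist` in this mode does not include CPU usage.

```python
import csv
import io
import re


def parse_tasklist_csv(text):
    """Parse `tasklist /fo csv /nh` style text into (image, pid, mem_kb) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        image, pid, mem = fields[0], fields[1], fields[4]
        digits = re.sub(r"\D", "", mem)  # "12,345 K" -> "12345"
        if not pid.isdigit() or not digits:
            continue
        rows.append((image, int(pid), int(digits)))
    return rows


def top_by_memory(rows, n=3):
    """Return the n rows with the highest memory usage, largest first."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]


# Invented sample rows, for illustration only.
SAMPLE = '''"System Idle Process","0","Console","0","16 K"
"store.exe","2104","Console","0","512,340 K"
"ntbackup.exe","3388","Console","0","48,120 K"
"sqlservr.exe","1760","Console","0","210,004 K"
'''

if __name__ == "__main__":
    for image, pid, mem_kb in top_by_memory(parse_tasklist_csv(SAMPLE)):
        print(f"{image} (PID {pid}): {mem_kb} K")
```

Scheduling a capture every few minutes during the backup window and appending the top entries to a file would leave a record of what was using memory just before the logs stop.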
 
ryansinn (Author) Commented:
Looks fine now.  I think that 95% was the SBS Monitoring Service.  I just looked at Scheduled Tasks now... 95% is gone.
scheduledtasks.png

ryansinn (Author) Commented:
Not sure why it grabbed the wrong screenshot... but here are the updated Scheduled Tasks... no 95%

ryansinn (Author) Commented:
attachment
scheduledtasks.png

lnkevin Commented:
You get my statement properly. You need to look at your scheduled tasks and reorganize them. You have a lot of overlapping tasks set up, such as volume shadow copy and performance data collection. These tasks can start at the same time as the backup and cause the insufficient memory issue. Adding more memory to your system, or organizing your tasks so that no other activities run while NTBackup is running, will free up memory for the backup task.

K

Question has a verified solution.