mjkisic

SBS 2003 randomly freezes during backup but not all the time.

This is driving me nuts. I manage a lot of SBS boxes, and this one I just inherited: a Dell PowerEdge SC1430 with SBS 2003 R2 SP2. The server has randomly frozen several times: video goes out, and mouse, keyboard and network are all unresponsive, so the only option is a hard reboot. There are no errors in the event logs (they are very clean), and this doesn't occur all the time. I am using sbsbackup to an external USB hard drive. Here are the only consistencies:

2/5/11: ntbackup starts at 5:30, server hangs at 6:11
2/14/11: ntbackup starts at 6:30, server hangs at 6:57
2/23/11: ntbackup starts at 7:30, server hangs at 7:46

All ntbackups in between those listed above are successful and usually complete at about 11:00 pm.
Here is what I have done to troubleshoot:
Changed out the USB backup drive after the first time, but it happened again.
Changed the start time of the backup to 1 hour later each time.
Disabled antivirus (it is set to avoid the Exchange and backup drives anyway).
Checked server hardware (memory, hard drive).
Checked scheduled tasks for conflicts; no tasks other than system tasks are listed.
There are no other backup programs running, the server has its updates, and there is plenty of space on both internal and external hard drives. There were scheduled ntbackup routines running unsuccessfully when I took over the server, but I deleted them, removed the scripts and set up sbsbackup to run instead. It seems odd that this occurs almost every 9 days; maybe just a coincidence? I am out of ideas, any help is appreciated!

Thanks
David

One of the root causes can be the result of using desktop-class instead of enterprise class HDDs.  You can get this when the disk goes into a deep recovery state due to a bad block.   Windows will not necessarily report a problem in event log, and if the HDD recovers the data the disk will pass all diagnostics.

NTbackup "causes" this to happen because it is most likely the only time during the week that you try to access the data.

A solution in this case would be to use either hardware or O/S-based RAID1 with enterprise-class disks.  
Have you tried enabling verbose logging in NTBackup to see if it is always hanging on the same files? Maybe a client has some files locked?

When you schedule the SBS backup, it creates an NTBackup job that you can modify.
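
If you do turn on detailed logging, one quick way to compare runs is to pull the last few entries from each log and see whether the hung runs all stop near the same file. Below is a minimal sketch in Python, not an NTBackup feature. It assumes the log is a plain text file and that it is UTF-16 encoded (pass a different `encoding` if yours is not); the function name `last_lines` is just made up for this example. Keep in mind that after a hard freeze the final entries may never have been flushed to disk, so the tail only gives you an approximate location.

```python
from pathlib import Path


def last_lines(log_path, count=5, encoding="utf-16"):
    """Return the last `count` non-empty lines of a text log.

    The default encoding is an assumption about the log file;
    override it if the file turns out to be ANSI or UTF-8.
    """
    text = Path(log_path).read_text(encoding=encoding, errors="replace")
    # Drop blank lines so the tail shows actual log entries.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-count:]
```

Run it against the log from a hung night and the log from a good night (the paths are whatever your backup job writes to) and compare where each one ends.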
The random freezing indicates dirt, bad RAM, items needing reseating, bad capacitors, or a flaky power supply, pretty much in that order.
I, quite literally, take them outside and use a leaf blower which cleans things out much better than Blow-Off.  Blow in through the two exit fan openings (P/S and CPU).  Don't try to get the fans going supersonic and you'll be OK.
I keep seeing 2-5 year old Samsung memory go bad, which bugs the heck out of me. I use the free ISO from http://www.memtest86.com to test.
If the system has one, lift the green shroud over the CPU heatsink and inspect for bad capacitors: http://www.google.com/imgres?imgurl=http://www.thenakedpc.com/dan/Bulging_Capacitors/close-up.jpg&imgrefurl=http://www.thenakedpc.com/dan/Bulging_Capacitors/index.html&usg=__I8uVpjpR_5DoPAsyyxnX-gEshcM=&h=526&w=786&sz=49&hl=en&start=0&zoom=1&tbnid=IXu-R6JwUeRtqM:&tbnh=146&tbnw=213&ei=9mdqTZmWGIausAO-vNGoBA&prev=/images%3Fq%3Dbad%2Bcapacitors%26hl%3Den%26safe%3Doff%26biw%3D1242%26bih%3D766%26gbv%3D2%26tbs%3Disch:1&itbs=1&iact=hc&vpx=363&vpy=101&dur=4609&hovh=184&hovw=275&tx=136&ty=110&oei=9mdqTZmWGIausAO-vNGoBA&page=1&ndsp=21&ved=1t:429,r:1,s:0  If you see them, either get a new systemboard or have someone who knows how to desolder replace them.
Next, reseat everything, especially the CPU chip, and put some new heatsink compound on it while you're at it.
And last, if that still hasn't fixed it, replace the power supply.
mjkisic

ASKER

The system ran a full backup last night without issue, as well as the previous evening. Markdmac, I will enable verbose logging to see if there is an issue with a locked or corrupt file. They do use an Access-based database heavily during the day, and there have been intermittent problems with it requiring a database repair. However, these issues have been on days where there have been no server freezes during the times mentioned.

My first thoughts were hardware, such as HD and RAM. I have run chkdsk and Memtest without any errors reported. I still find it odd that the three freezes have been almost exactly 9 days apart; I would expect hardware issues to be more random than that.

I will report back after collecting some ntbackup log details.

Thanks!
Run the sbs best practices analyzer and see if anything stands out.

You are saying this only occurs during a backup, right?

Could be a memory leak.

To see if it is a memory leak, reboot the server daily and see if the issue goes away.
If so, you've got an errant program or device driver causing the issue.

Make sure all device drivers and the BIOS are up to date.
I see the Broadcom driver for that system has a memory leak, depending upon version.

Have you run a chkdsk and/or defrag?
mjkisic

ASKER

pgm554, I have run best practices, and everything looks good. As stated above, this has happened on a little more than three occasions, including 2/5, 2/14 and 2/23. Sbsbackup is set to run every day; it is just on those days that the system has frozen. I have run chkdsk, will do the defrag just in case, and will check the driver versions.
Just to be mathematical and see if it is a pattern:
Reboot the server halfway through the 9 day cycle. If it hangs 9 days from that reboot, you've got a memory leak.
Also, you should be getting server logs emailed to you daily on the SBS; check them for any process that looks like it may be growing.

Like this:

Top 5 Processes by Memory Usage

Process Name - ID    Memory Usage
sqlservr - 2052      204 MB
sqlservr - 1928      88 MB
services - 428       88 MB
sqlservr - 1980      42 MB
w3wp - 4556          39 MB

Top 5 Processes by CPU Usage

Process Name - ID    CPU Time
sqlservr - 2052      0.4 %
svchost - 876        0.3 %
wmiprvse - 3712      0.2 %
jqs - 1828           0.2 %
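
A rough sketch of what "looks like it may be growing" could mean in practice, assuming you copy the daily Memory Usage figure for each process out of those emailed reports by hand. The process names and numbers in `history` are made up for illustration, the 10 MB per day threshold is arbitrary, and the function names are hypothetical. It fits a straight line through each process's daily readings and flags the ones with a steep upward slope.

```python
def growth_mb_per_day(readings):
    """Least-squares slope of memory readings (MB), one reading per day."""
    n = len(readings)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def suspects(history, threshold_mb_per_day=10.0):
    """Names of processes whose memory grows at least `threshold_mb_per_day`."""
    return sorted(
        name
        for name, readings in history.items()
        if growth_mb_per_day(readings) >= threshold_mb_per_day
    )


# Hypothetical readings, one per daily report (MB).
history = {
    "leaky_service": [200, 230, 262, 290, 321],
    "steady_service": [88, 90, 87, 89, 88],
}

print(suspects(history))
```

A process that just fluctuates up and down will come out with a slope near zero, so it is the steady climbers you are looking for.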
mjkisic

ASKER

If the 9 day pattern holds true, the next lock up will be March 4th, unless it is just a coincidence. Today is about halfway through, so I will reboot the server today. I do get the reports emailed to me; I will review them for a process that is growing and will report back.
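
For anyone wanting to check the date arithmetic, here is a small Python sketch using the three hang dates from the original post. The function name `project_cycle` is made up for this example, and it simply assumes the most recent interval repeats, which is exactly the thing being tested.

```python
from datetime import date, timedelta


def project_cycle(hang_dates):
    """Return (intervals in days, projected next hang, midpoint for a test reboot)."""
    intervals = [(later - earlier).days
                 for earlier, later in zip(hang_dates, hang_dates[1:])]
    cycle = intervals[-1]  # assume the most recent spacing repeats
    next_hang = hang_dates[-1] + timedelta(days=cycle)
    midpoint = hang_dates[-1] + timedelta(days=cycle // 2)
    return intervals, next_hang, midpoint


# Hang dates reported in this thread (2011).
hangs = [date(2011, 2, 5), date(2011, 2, 14), date(2011, 2, 23)]
intervals, next_hang, midpoint = project_cycle(hangs)

print(intervals)   # [9, 9]
print(next_hang)   # 2011-03-04
print(midpoint)    # 2011-02-27
```

If a leak is the cause and the reboot resets it, the hang should move to roughly 9 days after the reboot instead of landing on March 4th.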
You know, since this is driving you nuts, and you have gotten a lot of wonderful advice on where to look ... you should consider a different strategy.   Something that the professionals use.

For around $100 you can get a diagnostic board.  Plug it into the backplane, run some software and it will usually (nothing is infallible, and the more money you spend the more things it will look at) tell you exactly what it is.  Even if it can't tell you what it is, it can certainly tell you what it ISN'T.   Just eliminating (truly eliminating CPU, memory, interrupts, hardware, etc..) will go a long way to finding the ultimate solution.

My gosh, your backups are failing, so this is pretty important to get resolved.  Google "PC diagnostic board", and choose something.  (Usually you can find a nice selection on ebay and equivalent auction sites).

Software-based hardware diagnostics can and WILL sometimes give you bad info. A board in a slot with its own CPU and memory that monitors traffic on the bus can easily detect faults that would otherwise be missed. Even relatively simple DOS-booted memory diagnostic programs will easily miss memory problems.

Also, if your PC is a server class, then is there a BIOS accessible event log?  

mjkisic

ASKER

OK, in reviewing the server health reports, the top two memory-using processes are dbsrv9 and store. From 2/19 to 2/27 they fluctuate from 547 MB to 621 MB, and there is no pattern to the fluctuations. On CPU usage, NTBackup consistently has the highest CPU usage, and there is a pattern here:

2/19   5.8%
2/20   6.3%
2/21   5.9%
2/22   4.9%
2/23   7.2%
2/24   server crashed, restarted
2/25   stopped ntbackups for troubleshooting
2/26   resumed ntbackup, 3.7%
These numbers may be just a reflection of the server freezing on the 24th?
mjkisic

ASKER

Thanks pgm554. We did install SEP 12.0 on the 9th of Feb, though we had a crash before that, on the 5th. However, I am going to restart the dbsrv9 service and see if that reduces the memory hogging.
Looking at some of the posts on the Symantec site and such leads me to believe that is the culprit.

Lots of people are complaining about that product.

I've had my issues with Symantec in the past, and their products are sometimes just barely tolerable.

They have a nasty tendency of acquiring a decent product and ruining the support and QA.
mjkisic

ASKER

pgm554, it is sure easy enough to troubleshoot down that path. I will disable the dbsrv9 service for the week and see if that resolves the issue. Perhaps it is the combination of dbsrv9 hogging memory and the CPU usage of the ntbackup routine that is causing the perfect storm.
I will report back on server status in the next couple of days and cross my fingers that that is the culprit.

Thanks for all of your input!
ASKER CERTIFIED SOLUTION
pgm554

mjkisic

ASKER

Thanks for the logical approach.
mjkisic

ASKER

I disabled the dbsrv9 service altogether, and after 9 plus days the server has not locked up during a normal scheduled backup. I think pgm554 nailed that one. You would think a brand new, updated install of SEP would be a little better than that; I've run the program in one version or another on a lot of servers without that type of issue before.

Thanks!
Not a big Symantec fan.
I am trying to install BEX 2010 on an SBS 2008 box and am just having fun, fun, fun.
It's the same old, same old with Symantec:
Live updates not working or not finding installed products, cryptic errors, and interfering with service pack installs.
Man, between them and CA, I don't know who is more mediocre.
mjkisic

ASKER

I know what you mean, but I do use BESR on my bigger machines and networks, and I have to say that when it does work well it has saved many a system. I have been able to do a full restore of an entire SBS 2003 box supporting Exchange and an AD of 35, as well as about 200 GB of data, in less than 3 hours of down time. I have found BESR to be somewhat more reliable and use it instead of BEX; I haven't had any issues on SBS 2008.
>I have found the BESR to be somewhat more reliable

That's a PowerQuest product, thank them for the stable code base.

Tried it on SBS 2011 and was unable to do a restore; Symantec says it is still unsupported.

I like BEXSR when it works, and the dissimilar restore has worked pretty well on some upgrades, but when it screws up, it's usually the Symantec tweaks.

Live update breaks quite often in my experience.