intelliwyse


Black screen of death SBS 2003

I have a server which was running for years without any issues and suddenly the other day started freezing with a black screen. It's running SBS 2003. It happens at strange times, so far never at the same time, and not in any sort of pattern I can really define, except always in the middle of the night between 9:30 PM and 6 AM. I've disabled backup jobs, Shadow Protect on the C:\ drive, VSS and WSUS, as well as all jobs in the task scheduler. Still it locks up. I saw someone suggest that a heavily fragmented C:\ drive could cause this, so I checked and used Diskeeper to defrag my C:\ drive. It took multiple passes all day until all the red fragmentation turned blue. I even freed up some space on the C:\ drive so it went from having 15% free to 30%. The black screen persists.

The symptoms are like this - the server's running along great, until suddenly the screen on the monitor just turns black. No mouse, no icons. Num Lock on the keyboard does not work. The CD-ROM drive ejects OK. The server quits responding to PING and loses inbound and outbound network access. No blue screen, no error message. A reboot by holding down the power button on the front brings the server back up fine for a few hours, and then bam, it goes down at random again.

I checked the event logs, hoping to find a smoking gun like a common event ID right before the lockup, but the event logs don't show any errors or warnings around the times of the lockups, and the last event logged is the winhttpautoproxy service logging its standard information, which shows up in my event logs all the time. I looked for memory.dmp but it was empty also. Any advice?
comphil

If the event viewer isn't catching anything, it may well be hardware related, possibly something to do with heat.  Have you checked RAM, disk usage etc. at the time to see if it spikes just before a crash?
That sounds like a hardware issue to me.

My guess would be to check memory and processor.

Check that things like the processor and PSU fans are working OK, as the processor may be shutting down if it's overheating.

If it's SBS then there are a few processes that are run overnight, such as Exchange cleanup etc., and these can cause a big increase in processor utilisation and hence can cause the processor to get hot!

Wayne



intelliwyse

ASKER

Yeah, good points, the hardware is the next major thing I'm checking. The server's at a remote site, so to check the hardware I have to load a backup image up as a virtual machine so users can keep working, but then I can pull the hardware and strip it down.
Check power save and screen saver options. Some power save options shut down the NIC as well as the monitor. Hibernate, Sleep, or is it a coma?
So far I've moved the OS to a virtual machine so I can work on the physical machine. I ran extensive tests on memory, hard disk and CPU but didn't come back with anything conclusive. After swinging the OS from physical to virtual I haven't had any lockups either.
Any physical signs of failure?  Clogged fans etc?  What about temperature monitoring, can you see how well it's regulating CPU & chassis temperatures?
Yeah I'll do a full internal cleaning of the chassis tomorrow but it's not very dusty at all, so far everything's spinning well and I don't hear any bad bearings when I listen close to the fans... Not getting any heat alarms but I don't know yet what temp the CPU's running at exactly.
Tomorrow I'm swinging the virtual machine back over onto the physical hardware. I also figured out that the "black screen of death" I said I was getting is really just because the machine was hard-freezing while the monitor was in power save, and since the monitor was blank/off, it didn't wake up after the hardware locked up.
Are you certain this isn't the other way around? Could the power save features be causing the hardware problems? They did in Windows Vista. The USB drivers wouldn't allow the computer to come out of power save. They also conflicted with hardware, causing freezes. The resolution was to update the USB bus drivers. For Server 2003, maybe this will help:

http://msdn.microsoft.com/en-us/windows/hardware/gg463430
Ok I've been fighting this ever since I made this post, so about 2 weeks and here's what I've discovered:

We noticed the server would freeze in a pattern, at a random hour, between 6:30 PM and 6:00 AM only, except for 2 out of 20 days when it froze at about noon.
Most of the time it froze between 9:30 PM and 2:00 AM, though.

After eliminating the backup system as the problem (it's a Zenith Infotech BDR and I know they cause a VERY similar issue) I figured it had to be either Exchange, anti-virus, or hardware. I couldn't find anything wrong with the hardware, though that doesn't rule it out. So a couple of nights ago I disabled all the non-Microsoft services, the anti-virus software and even the entire Exchange system for one night. That night was the first time in weeks the server didn't freeze.

I decided to focus on the anti-virus first.

When I disabled deep scans in my anti-virus (Sunbelt Vipre), the lockups quit happening, now for 2 days. They had been consistent daily, and the pattern was trending toward only the hours when a deep scan would be running.
Deep scans were scheduled to start at 9:00 AM, but I think they are scanning all the SQL and Exchange databases because the scans are taking between 9 and 20 hours to complete! Except they rarely completed, because the server would freeze part way through. Then when it was rebooted the deep scans would resume, and sometimes by the end of the day they would freeze the system again, and I think that's where those 6:30 PM freezes were coming from.

I still have to wait and see if it stays stable for several more days before I give my blessing that it's fixed, but it's seriously looking like my anti-virus was causing the problem.

I've had Sunbelt Vipre installed on this server for over a year and it only started doing this Easter weekend and has been fairly relentless about it since. I have identical servers (Dell PowerEdge 1800, SBS 2003) running Sunbelt Vipre at other sites and I am not experiencing this issue with them yet.

Any thoughts?
Let me correct something: I said deep scans were scheduled to start at 9 AM, but I meant they are scheduled at 9 PM! They are scheduled to run at night. I also want to say the backup system actually takes incremental images of the server every hour throughout the daytime, and the backups do NOT run at night. So backups never conflicted with the AV, except when the backup server was taking backups during the daytime and the AV was also scanning because it had not completed from the night prior due to a lockup.
Well, this is INTERESTING. I had just made my posts above talking about the AV being the root of the issue, and I was wrapping up my day. I got a ticket from a user saying that they weren't getting e-mail for the last couple of days, and I knew this user was the only one using the "POP3 Mail Connector for Exchange" on the server.

When I concluded the AV was the issue, I had determined that by stopping all Exchange services and the AV services one night, and then finding that no lockups happened. I then had one of my techs restart the Exchange services and the anti-virus services, but then I disabled regular deep scans on the AV.
Another night passed (last night) and the server still didn't lock up, so I concluded "AV's the issue".

Today when I went into the server to look at this user's ticket, the first thing I checked was whether my tech had restarted the POP3 Mail Connector for Exchange service. It was in a stopped state. I went ahead and started it back up at about 4:00 PM and did a "retrieve now" in the POP3 Mail Connector manager.

15 minutes later the server was frozen up. It's never frozen up at that time or anywhere near that time before. It had not frozen up in 2 days, making it from my Wednesday 8 AM reboot all the way until Friday at 4 PM, minutes after I started this one Exchange service back up.

I've got the service stopped now and I'm leaving it that way until I can see the server lock up with it stopped.
What AV software and version are you using? There has to be a workaround for email to flow.
I've worked around the issue with the POP3 Mail Connector by rerouting some mail, but that seems to be the trigger here.

The Exchange Pop3 Mail Connector is causing the server to hard freeze.

What I do not yet understand is this: during the whole duration of troubleshooting this, several weeks, the POP3 Mail Connector for Exchange was running and triggering every 15 minutes - for 1 single mailbox. All the other users get their mail the traditional way; only this one mailbox uses the connector, so it's not like the connector was being overloaded. Furthermore, this mailbox isn't a high-volume box, it's just a basic, small-time end user.

Now, the freezes ONLY occurred in three windows: 75% of the time after 9:00 PM and before 6:00 AM, 20% of the time at exactly 6:30 PM, and 5% of the time at noon.

So, there must be some other associated function that's also contributing to the trigger.
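
A quick way to sanity-check a time-of-day breakdown like the one above is to bucket the recorded freeze times into windows and compute each window's share. This is only a minimal sketch in Python: the function names `window_for` and `window_shares` are made up for illustration, the "evening" window boundary (6:00 PM to 9:00 PM) is an assumption chosen to catch the 6:30 PM freezes, and the sample timestamps are hypothetical placeholders, not the actual freeze times from this server.

```python
from collections import Counter
from datetime import datetime


def window_for(ts: datetime) -> str:
    """Label a freeze time with a time-of-day window.

    The night window (9:00 PM-6:00 AM) comes from the post; the
    evening window (6:00 PM-9:00 PM) is an assumed boundary.
    """
    minutes = ts.hour * 60 + ts.minute
    if minutes >= 21 * 60 or minutes < 6 * 60:
        return "night"
    if 18 * 60 <= minutes < 21 * 60:
        return "evening"
    return "daytime"


def window_shares(freezes):
    """Return the fraction of freezes that fall in each window."""
    counts = Counter(window_for(ts) for ts in freezes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical freeze times, for illustration only.
sample = [
    datetime(2011, 5, 2, 22, 15),
    datetime(2011, 5, 3, 1, 40),
    datetime(2011, 5, 4, 18, 30),
    datetime(2011, 5, 5, 12, 5),
]

for label, share in sorted(window_shares(sample).items()):
    print(f"{label}: {share:.0%}")
```

Feeding in the real reboot or last-event timestamps from the event log, and comparing the windows against the AV scan schedule and the connector's 15-minute polling, may make it easier to see which scheduled activity overlaps with the freezes.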

My anti-virus is Sunbelt Vipre, and I have the latest version and definitions installed. I would go look up the exact version if it's really needed, but does anyone have any thoughts on what I should do from here?

I'm contemplating an Exchange reinstall on SBS 03.

ASKER CERTIFIED SOLUTION
comphil

Yes, it was the POP3 Mail Connector for sure. Very strange.