ITnavigators


Win2K3 / Exch2K3 server rebooting nightly with no event log entries.

Need help troubleshooting a server that reboots without any errors.  The reboots are almost nightly and usually occur within a two-hour window (11:00 pm to 1:00 am).  Lately there have been two reboots each night.  There are no entries in ANY of the event logs that indicate what is happening.

Details...  The customer is running a Wide Area Network that includes three identical Dell PE2850 servers running Windows Server 2003 R2 and Exchange 2003.  The three servers are located at different physical locations on the WAN.  Only one server is experiencing this problem.

The systems are all running Symantec Enterprise Edition 10 (including SMSMSE 5.0).  I have disabled my overnight Information Store scan but that didn't resolve the issue.  

All three servers are protected by TrippLite UPS (SMART2200RMXL2U) with extra batteries.  Power is reliable at the other two locations.  The UPS is logging all the power fluctuations at the problem site.  While there are numerous fluctuations, they are all well within UPS coverage and they don't seem to be related to the timing of the reboots.
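
One way to check that more rigorously is to line up the UPS log against the reboot times.  Below is a minimal Python sketch, assuming both logs have been exported as lists of timestamps.  The function name, the two-minute tolerance, and the sample timestamps are all made up for illustration and are not taken from the actual logs.

```python
from datetime import datetime, timedelta


def reboots_near_power_events(reboots, power_events,
                              tolerance=timedelta(minutes=2)):
    """Return the reboot times that have a UPS event within `tolerance`."""
    return [
        r for r in reboots
        if any(abs(r - p) <= tolerance for p in power_events)
    ]


# Hypothetical sample data, NOT from the real server or UPS logs.
reboots = [
    datetime(2007, 1, 10, 23, 41),
    datetime(2007, 1, 11, 0, 52),
]
power_events = [
    datetime(2007, 1, 10, 21, 5),
    datetime(2007, 1, 11, 0, 51),
    datetime(2007, 1, 11, 3, 30),
]

matched = reboots_near_power_events(reboots, power_events)
print(f"{len(matched)} of {len(reboots)} reboots had a UPS event nearby")
```

If most reboots have no UPS event close by, that supports the view that the fluctuations are not the trigger.  A wider tolerance catches more matches but also more coincidences.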

We have recently replaced the motherboard on the server as well as the power supplies.  

Any other thoughts?
Pber

I presume you do get event log entries that indicate the previous reboot was unexpected?

You could also try to eliminate the UPS.  Are other servers\devices on the UPS?  Since the 2850 is dual power supply capable, if you have dual supplies, plug one supply into normal power (non-protected) and the other into the UPS.  If the UPS is starting to fail, the server should stay up.  If the problem persists, it may be hardware on the server.

Do you have the Dell Server Administrator Installed?  Have you looked in the logs within the Dell Server Administrator?
ITnavigators

ASKER

Yes, we do get the previous reboot unexpected as well as the normal events for the various services starting.  What I meant was there were no 'oops I'm dying' kinds of events.  

I like the idea of splitting the power sources.  Granted, that exposes me to surges on the unprotected source, but we can throw a simple surge protector in to cover that.  In addition to the PE2850, there are three X servers, a Cisco router and a couple of switches.  None of them are complaining or having problems.  I would think that if the UPS was actually failing the X servers would all go offline too.  I never lose connection to any of them or the various network devices.

Dell Server Admin is installed.  The only log entries of note were talking about problems with the redundant power.  That was related to issues on the motherboard (since the 2850 has no power transfer board).  It has already been replaced.  There have been no further power log entries since that time.  But the nightly reboots have continued.
The fact that it happens at fairly consistent times definitely indicates some process is causing it.  Could be something internal on Windows (unlikely since no event entries).  Could be some self testing on the UPS.
The lack of Windows event entries leads me to believe it's hardware.
The lack of ESM logs from the System in Server Administrator worries me.  That usually can catch a bad supply as well as a loss of input or other issues.

The fact that you have several other devices hooked up leans me away from the UPS.

However, I don't find much quality in Dell servers.  Cheap isn't always better. (;  Anyhow, the Cisco and IBM stuff might have better tolerance, so they don't see power blips that the Dell might.  The fact that you replaced the MB and power supplies and it still happens throws me off.

I would try the one power cord going to non-UPSed power to see if the problem clears.  If so, I would lean towards the UPS starting to go.
We will try to sneak back up there and split the power today.  That should let us know within a day or so.  
Update:  Problem doesn't seem to be linked to the power.  

I did some experimenting over the weekend and may have a clue.  I attempted to perform an offline defrag of the Exchange Information Store.  It failed at roughly 6% -- after generating an extremely large temporary STM file.  Didn't happen to catch the error message and don't have a window of time to retry the defrag at this time.  A successful defrag will take approximately 1.5  to 2.5 hours.

I suspended the Exchange Maintenance window for a couple of days.  I noticed that the server did NOT reboot tonight.  Hmmmmm...

The IS passes the normal consistency checks.  We will look at either an ESEUTIL /P or possibly moving the mailboxes to another server and completely rebuilding the IS.

Any thoughts???
Update:

We disabled the maintenance window to prevent any online defrags from occurring.  The system still crashed during the time the maintenance window was disabled.

Hooked up an IP camera in front of the monitor that captured images at 1 sec intervals during the window when we normally experience the reboots.  There are a total of 14 frames from normal screensaver to the start of the hard boot.  Several are blank, but there are two frames of interest.  After the last screensaver frame there is a 1 second blank frame followed by two frames that are a black screen with what appears to be a white rectangle in the middle of the screen (about the size of the splash screen).

Since the camera is coming out of low light mode I can't be certain but it appears there MAY be writing on it.  I am going to reset the camera to zoom in on the box tonight.  Perhaps we will get more information.

Following the two box frames there are several seconds of blank screen followed by the hard boot (about 10 seconds later).  

We have not seen any contributions to this post lately, but will continue to update this post with further information that may help anyone looking to resolve this issue in the future.
Also of note is the lack of a minidump file (even though minidump is configured).  That points towards hardware.  But that makes no sense with the timing.
Captured the image with the white box.  Unfortunately it was not helpful.  Just the monitor going to sleep when whatever happened to the server happened to the server.
Sorry for not getting back to you.  This is a tough one.  

Have you disabled automatic restart on system failure?  That might work, although you mentioned you weren't getting minidumps.  
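
For reference, automatic restart is controlled by the AutoReboot value under the CrashControl registry key (the same setting as the "Automatically restart" checkbox under Startup and Recovery).  Here is a small Python sketch that generates the text of a .reg file to turn it off.  The helper name is hypothetical, and the output would still need to be saved as a .reg file and imported on the server by hand.

```python
REG_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"


def crash_control_reg_text(auto_reboot: bool) -> str:
    """Build .reg file text that sets AutoReboot (1 = restart, 0 = stay down)."""
    value = 1 if auto_reboot else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{REG_KEY}]",
        f'"AutoReboot"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


# Generate the text for disabling automatic restart and show it.
reg_text = crash_control_reg_text(auto_reboot=False)
print(reg_text)
```

With automatic restart off, a real STOP error should leave the blue screen up for the camera to catch.  If the box still hard-boots, that would point further towards hardware or power.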

Have you tried disabling DEP (Data Execution Prevention)?  (Try via the boot.ini - http://www.microsoft.com/technet/security/prodtech/windowsxp/depcnfxp.mspx)
DEP is an interesting thought.  I will look at that.  

Dell has decided to replace the MB and Power Supplies again.
ASKER CERTIFIED SOLUTION
ITnavigators

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
I work for a power company, and we see this at certain hydroelectric plants.  We usually use small APC UPSs for remote sites like generating stations, but for noisy power we will use Liebert UPSs.  The Lieberts essentially run from the battery all the time, whereas the APCs will run in bypass (with conditioning) as long as there is AC input.
We are using a TrippLite with Line Interactive in this installation.  Guarantees voltage with input voltage from 85 VAC to 143 VAC.  And it does this without running on the battery.  That is great for brownouts before the failure.  We get a lot of that in the land of ice and snow.

Personally I really like Sola and the CVT (Constant Voltage Transformer).  I'm sure that would have locked in the voltage +/- 1%, but the noise appears to have been the issue.  Not sure what that would have done with the noise.  Most likely nothing.  

How do the Lieberts price out compared with APC?  SWAG of course.
Closed, 500 points refunded.
Vee_Mod
Community Support Moderator