hrybko1

Server Reboots every Hour for no reason - Unexpected Shutdown

My setup: A domain of 10 Microsoft servers in a datacentre behind an ISA 2006 firewall. The servers are a mixture of web servers running IIS 6 and 2 database servers running SQL 2005. All servers receive the latest SUS updates and are protected by F-PROT antivirus as well as a weekly ClamWin scan.

The problem:
I have an intermittent problem whereby one or more of the Windows 2003 servers in the domain start rebooting without any warning or error message - usually once every hour. The reboots are almost exactly an hour apart. The event log shows only "unexpected shutdown" and no other errors.
This usually carries on for 2 or 3 days before getting "cured" - after which it disappears ...

The thing is that the problem starts on a particular server and then I go crazy trying to find the cause - run AV checks (have tried Norton, Trend and CA), check with PREVX CSI that no OS files are corrupted, and go through each running service. Nothing!
I know this could be a hardware problem, but the last 3 servers have all had different hardware configs.

And then suddenly the problem goes away .... until it hits another server - maybe after a week or so has passed. (I had the exact same problem at a customer site and could not solve it. We eventually bought a new server and reloaded, and the problem went away.)

I KNOW that other servers out there in the world must be having the same problem but I have searched far and wide without finding a solution.

Thank you
Howard
Syncrony.com

SOLUTION
purplepomegranite
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
I find that for these random reboot problems the Windows Event Viewer is typically useless.

Do you have memory dumps enabled? If so, there should be a recent file at %Systemroot%\Memory.dmp

Open it with the debugging tools from http://www.microsoft.com/whdc/devtools/debugging/default.mspx and hopefully it will point you to an offending driver or kernel-level issue, if that is the case.
hrybko1

ASKER

The data centre environment is very stable - largest in our country - full generator backup, air cons, security vault etc. So it is not an environment problem.

Memory Dump: How do I enable this, please?

Thank you
I wouldn't be so quick as to discount possibilities.  Large datacentres have been known to have issues too - but environment could also include the server PSU.

If the reboots are every hour, in my mind this points even more to environment - unless you have a task scheduled for every hour on the server.  If the timeframe is particularly regular, it is more likely to be an environment problem in my experience.

A memory dump is very unlikely to help if there isn't a crash event.  Windows will only dump memory if it knows it has crashed.  If you only have the "Previous shutdown was unexpected" event, Windows only found out about its shutdown when it booted back up - you will not have a memory dump.
hrybko1

ASKER

Memory Dump: I have checked, and it IS set to do that in System Properties - Startup & Recovery options.
All options are ticked, with this as the dump file - %SystemRoot%\MEMORY.DMP

BUT the dump file is old. It was created 2 months ago.

Thanks
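
The "is the dump file recent?" check can be made concrete by comparing the dump's modification time with the time of an unexpected reboot. This is only a minimal Python sketch, assuming Python is available on the server; the function names and the one-hour window are hypothetical choices, and the path is the default shown in the Startup & Recovery dialog.

```python
import os
from datetime import datetime, timedelta

def dump_age_at(dump_path, when):
    """Return how long before `when` the dump file was last written, or None if it is missing."""
    if not os.path.exists(dump_path):
        return None
    written = datetime.fromtimestamp(os.path.getmtime(dump_path))
    return when - written

def dump_matches_reboot(dump_path, reboot_time, window=timedelta(hours=1)):
    """True if the dump was written within `window` of the reboot (assumed tolerance)."""
    age = dump_age_at(dump_path, reboot_time)
    return age is not None and abs(age) <= window

if __name__ == "__main__":
    # On Windows, expandvars resolves %SystemRoot%; elsewhere the path is left unchanged.
    path = os.path.expandvars(r"%SystemRoot%\MEMORY.DMP")
    print(dump_matches_reboot(path, datetime.now()))
```

A dump that is months older than the reboot, as here, was written by some earlier crash and says nothing about the current reboots.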
Make a note of the times the server rebooted.  Ask the datacenter if they run any monitoring tools on a similar schedule.  Particularly tools that they only apply to certain parts of the datacenter during a specific timeframe.

No crash event usually means no crash.  In your case as the problem goes from one server to another, it is extremely unlikely to be internal hardware or software if none of the affected servers have reported a crash.
hrybko1

ASKER

PurplePom: OK, you are right, there can be no dump. But why did this affect my mail server (also a 2003 box) for 3 days last week, which has now stabilized? And why, since a reboot on Monday morning, has it now affected one of my SQL servers?

What do you suggest I do to isolate the problem?

Thanks
Are all the servers protected by UPSs? Are they all on a single UPS or on separate UPSs?
Are all your servers in the datacenter located in one cabinet, or are they in different parts?

It actually gets quite difficult to isolate problems when Windows doesn't report anything.  It IS possible that it is software causing it, but for software to cause a problem that Windows misses completely on more than one server is unusual to say the least.

Is there any particular piece of software present on all your servers?  Are they all the same hardware?  If so, is the BIOS on each up-to-date?
kadadi_v
If the server is three years old, the server's SMPS may be causing the problem (for example, a weak capacitor).
Can you change the server SMPS? Try a new power supply and, for testing purposes, disable the antivirus protection or firewall, if installed, and then check.


Regards,

Vijay Kadadi
hrybko1

ASKER

Thanks for all the suggestions - but if it is hardware, how come the same syndrome of hourly reboots has appeared on 3 totally different machines, AND how come I had the exact same problem in a customer's server room at the end of last year?

This must be happening to someone else....

Thank you
So what software are you running that is on all these servers?
hrybko1

ASKER

Update: I have changed the power supply as suggested by Kadadi.
BUT the problem remains.
The server came up with the new power supply at 17:30 and since then it has been rebooting every hour at half past: 18:30, 19:30 ....

Thus it cannot be a device in the Data Centre since the "reboot clock"  has been reset to half past the hour where it was on the hour before.
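
The reasoning above can be sketched in a few lines of Python: reboots driven by an external wall-clock schedule would stay at the same minute past the hour no matter when the server was powered on, while reboots driven by a timer that starts at boot land a whole number of hours after the boot time. The clock times are the ones from this post; the date, the function name and the two-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta

def follows_boot_timer(boot, reboots, period=timedelta(hours=1),
                       tolerance=timedelta(minutes=2)):
    """True if every reboot falls a whole number of `period`s after `boot`, within `tolerance`."""
    for t in reboots:
        remainder = (t - boot) % period  # distance past the last whole period
        if min(remainder, period - remainder) > tolerance:
            return False
    return True

# Arbitrary date; only the clock times come from the post.
boot = datetime(2009, 1, 1, 17, 30)
observed = [datetime(2009, 1, 1, 18, 30), datetime(2009, 1, 1, 19, 30)]
on_the_hour = [datetime(2009, 1, 1, 18, 0), datetime(2009, 1, 1, 19, 0)]

print(follows_boot_timer(boot, observed))     # True: reboots track the boot time
print(follows_boot_timer(boot, on_the_hour))  # False: an on-the-hour schedule would not
```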

PurplePom: As explained before the servers are running 2003 Std Edition - and this server has 2005 SQL running; it is also a secondary Domain Controller and otherwise it runs no other software.

Please does anyone have any ideas?

Thank you
SOLUTION
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Sorry, I wasn't clear in my question.

I meant, are you running any particular software on every one of the servers that has shown this problem e.g. a tool, a particular program, etc.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
hrybko1

ASKER

ok
For a few days now, one of my Windows 2003 servers has also been rebooting every hour. I tried uninstalling everything that was not important and stopping unnecessary services, but it still reboots every hour. In fact, every 60 minutes and about 30 seconds:

Boot Time   Difference
13:55:31
12:55:19    1:00:12
11:54:50    1:00:29
10:54:13    1:00:37
 9:53:42    1:00:31
 8:53:09    1:00:33
 7:52:35    1:00:34
 
The event log shows: "The previous system shutdown at 7:52:35 AM on 1/21/2009 was unexpected."
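
To collect these timestamps across many reboots, the message text can be parsed. A sketch in Python, assuming the messages have been exported as plain text in the US-style format quoted above (other locales format the date differently); the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches the wording of the "unexpected shutdown" message quoted in this post.
PATTERN = re.compile(
    r"shutdown at (\d{1,2}:\d{2}:\d{2})\s*(AM|PM) on (\d{1,2}/\d{1,2}/\d{4}) was unexpected"
)

def parse_unexpected_shutdown(message):
    """Return the shutdown time as a datetime, or None if the message does not match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    clock, meridiem, date = m.groups()
    return datetime.strptime(f"{date} {clock} {meridiem}", "%m/%d/%Y %I:%M:%S %p")

msg = "The previous system shutdown at 7:52:35 AM on 1/21/2009 was unexpected."
print(parse_unexpected_shutdown(msg))  # 2009-01-21 07:52:35
```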
 
I checked the BIOS but did not see anything strange. The power management had a setting to switch off the hard disk after one hour, but after setting that to Never, it still rebooted.
 
The only things running on this server (now) are:
- IIS (with ASP.NET)
- SQL Server
 
It is hard to do a virus scan as it does not complete in one hour, but so far I have scanned it in parts (program files, windows/system32), and I am still continuing to select more parts, but I assume that if there were an active worm it would have been detected as part of the memory scan.
 
The solution mentioned here (manual reboot) does not work for me.