Chris Dent

Windows 2003 File Server Cluster

If anyone has any insight into this it would be lovely; this is a bit of a long shot.

We have a Windows 2003 Cluster acting as a File Server (Active / Passive). The Nodes in this cluster have developed this rather nasty habit of shutting down. This can be either (or both) of the nodes (occasionally at exactly the same time).

I can't really say what it's up to when it shuts down, there's no consistency:

 - No Information messages / Warnings / Errors logged in the Event Viewer related to the shutdown
 - Cluster Log shows no errors (just the Heartbeat failure when a Node shuts down)
 - No excessive system activity
 - No Memory Dump files
 - No BSoD
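
For anyone wanting to double-check the System log for this kind of silent shutdown, here is a rough sketch in Python. It assumes the log has already been exported to (timestamp, event ID) pairs; the function name `classify_boots` and the sample data are made up for illustration. It relies on the EventLog service writing event 6005 when it starts at boot and 6006 when it stops at a clean shutdown, so a 6005 with no 6006 before it suggests the node went down uncleanly.

```python
from typing import Iterable, List, Tuple

LOG_START = 6005   # EventLog service started (written at boot)
CLEAN_STOP = 6006  # EventLog service stopped (written at a clean shutdown)


def classify_boots(events: Iterable[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Label each boot by how the previous session appears to have ended.

    `events` holds (ISO timestamp, event ID) pairs. The first boot in the
    export is labelled "unknown" because earlier history may have rolled off.
    """
    results = []
    state = "unknown"
    for when, event_id in sorted(events):
        if event_id == CLEAN_STOP:
            state = "clean"
        elif event_id == LOG_START:
            results.append((when, state))
            # Assume an unclean end until a 6006 shows otherwise.
            state = "unclean"
    return results


# Hypothetical sample data, not taken from the cluster in question.
sample = [
    ("2006-01-01T08:00", 6005),
    ("2006-01-01T18:00", 6006),
    ("2006-01-02T08:00", 6005),
    ("2006-01-03T03:00", 6005),
]
for when, verdict in classify_boots(sample):
    print(when, verdict)
```

Any boot reported as "unclean" is worth lining up against the Cluster Log heartbeat failures and the Dell RAC log for the same time.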

I even checked the WBEM logs on the off-chance they said something meaningful...

The storage for the Cluster is based on an EMC Clariion SAN using PowerPath to manage the HBA. We have one patch we can apply to PowerPath, just in case.

We ran Dell's hardware diagnostic tools to check the hardware; everything comes back clear there.

Can anyone think of anything / anywhere else to check?

Chris
SOLUTION
inbarasan
Chris Dent (ASKER)

It comes up as Unplanned shutdown.

Thermal shutdown is logged by the Dell RAC (as well as notification prior to the event).

We haven't done much digging for viruses, the AV software on the machine is up to date and hasn't reported anything. The remainder of the network is protected and constantly updated.

I'll try and sort out the auditing. It's going to take a while; the logs on there don't last more than a few hours at the moment.

Chris

Fun fun. No shutdown noticed by the Security log.

Chris
Still weird.

The only thing I can think of is someone enabling Crash on Audit Fail. But that should still leave some notice in the log of why it shut down.
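
To rule Crash on Audit Fail in or out, the setting can be read from the registry. The sketch below is only an illustration: it assumes Python is available on the node, and the helper names are invented here. The `CrashOnAuditFail` value lives under the Lsa key shown in the code; 0 means the feature is off, 1 means the system will halt if it cannot write an audit record, and 2 means it has already halted for that reason.

```python
from typing import Optional

LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"

MEANINGS = {
    0: "disabled",
    1: "enabled: the system will halt if an audit record cannot be written",
    2: "tripped: the system has already halted on an audit failure",
}


def describe_crash_on_audit_fail(value: Optional[int]) -> str:
    """Translate a CrashOnAuditFail value into plain words."""
    if value is None:
        return "value not present"
    return MEANINGS.get(value, "unexpected value %r" % (value,))


def read_crash_on_audit_fail() -> Optional[int]:
    """Read the value from the local registry (Windows only)."""
    import winreg  # imported here so the rest of the file loads anywhere

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
            value, _kind = winreg.QueryValueEx(key, "CrashOnAuditFail")
    except FileNotFoundError:
        return None
    return value
```

On a node you would call `describe_crash_on_audit_fail(read_crash_on_audit_fail())`; anything other than "disabled" would make the full Security log a suspect.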

Ho hum... typical really. As soon as we start the report tool it stops breaking.

I'll give it a few more days to reappear as a problem. It's scheduled to be replaced anyway, maybe I'll be lucky and it won't reboot again until after that's happened.

Chris
You definitely need to look at running a memory leak tool on the system to see if you are having a problem with paged pool memory. This will require a reboot to zero out the registers though. Once you have that information, you will want to run PerfWiz on the system to see if you are having other problems. Also, verify you have the latest version of your antivirus software on the system.

Also, verify you are NOT running the /3GB switch, since it reduces the amount of kernel memory available to the system. This matters especially on systems that consume a lot of I/O, e.g. Exchange, SQL, Terminal Services.
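
A quick way to check for the switch is to look at the [operating systems] entries in boot.ini. The following is a minimal sketch, assuming the file contents have been read into a string; the function name and the sample boot.ini are invented for illustration and are not from the servers discussed here.

```python
import re
from typing import List


def entries_with_3gb(boot_ini_text: str) -> List[str]:
    """Return the [operating systems] lines that carry the /3GB switch."""
    hits = []
    in_os_section = False
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_os_section = line.lower() == "[operating systems]"
            continue
        # Match /3GB only as a whole switch, ignoring case.
        if in_os_section and re.search(r"(?<!\S)/3GB(?!\S)", line, re.IGNORECASE):
            hits.append(line)
    return hits


# Hypothetical boot.ini contents for demonstration.
SAMPLE = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /3GB
"""

for entry in entries_with_3gb(SAMPLE):
    print("Found /3GB on:", entry)
```

An empty result means none of the boot entries use the switch.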

Still no breaks. Going to close this one. Thank you all for your input.

Chris
kNumberz

I don't know if you have resolved this in the 2+ years since, but I had a similar problem with the very same symptoms. Enabling the crash dump is outlined here (http://support.microsoft.com/kb/307973). If this is enabled and no dump file is generated, then this article may help (http://support.microsoft.com/kb/130536). To analyse the file, take a look at this (http://www.raymond.cc/blog/archives/2009/01/17/analyzing-windows-crash-dump-or-minidump-with-whocrashed/). You will have to download some prerequisites, but it should be pretty intuitive.
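
As a rough illustration of what the crash dump settings look like once read out, here is a small Python sketch. It assumes the values under the CrashControl registry key have been collected into a dictionary; the function name and the sample values are hypothetical. `CrashDumpEnabled` takes 0 for no dump, 1 for a complete dump, 2 for a kernel dump, and 3 for a small dump, and `AutoReboot` set to 1 means the machine restarts itself after a crash.

```python
from typing import Dict, List

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
}


def review_crash_control(values: Dict[str, int]) -> List[str]:
    """Summarise CrashControl settings that affect whether a dump is left."""
    findings = []
    dump = values.get("CrashDumpEnabled")
    if dump is None:
        findings.append("CrashDumpEnabled is not set")
    elif dump == 0:
        findings.append("memory dumps are disabled, so a crash leaves no dump file")
    else:
        findings.append("dump type: " + DUMP_TYPES.get(dump, "unrecognised value %r" % (dump,)))
    if values.get("AutoReboot") == 1:
        findings.append("automatic restart is on, so a blue screen may never be seen on the console")
    return findings


# Hypothetical values, as they might be copied out of regedit.
for finding in review_crash_control({"CrashDumpEnabled": 0, "AutoReboot": 1}):
    print(finding)
```

With dumps disabled and automatic restart on, a crash could plausibly look like a plain reboot with no dump file and no blue screen left on display.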

At the end of the day, there was an issue with the storage driver (http://support.microsoft.com/kb/932755). I was using a QLogic HBA at the time, connecting to a Clariion through an FC fabric switch. After consulting MS and EMC, the QLogic driver was updated (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=224) for that specific reason. That finally resolved the issue, and I no longer saw it.

Hope this helps even though it is late posting.