Windows 2003 File Server Cluster

This is a bit of a long shot, but if anyone has any insight into this it would be lovely.

We have a Windows 2003 Cluster acting as a File Server (Active / Passive). The Nodes in this cluster have developed this rather nasty habit of shutting down. This can be either (or both) of the nodes (occasionally at exactly the same time).

I can't really say what it's up to when it shuts down, as there's no consistency:

 - No Information messages / Warnings / Errors logged in the Event Viewer related to the shutdown
 - Cluster Log shows no errors (just the Heartbeat failure when a Node shuts down)
 - No excessive system activity
 - No Memory Dump files
 - No BSoD

I even checked the WBEM logs on the off-chance they said something meaningful...

The storage for the Cluster is based on an EMC Clariion SAN, using PowerPath to manage the HBA. We have one patch we can apply to PowerPath, just in case.

We ran Dell's hardware diagnostic tools to check the hardware, and everything comes back clear there.

Can anyone think of anything / anywhere else to check?

Chris
Asked by: Chris Dent (PowerShell Developer)

inbarasan commented:
http://www.hostingforum.ca/135681-how-audit-who-has-shutdown-server.html
Enable this audit and try to find out how it is getting shut down.
Hope this helps
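
Alongside the audit, the System log itself can be searched for shutdown-related events. The following is a minimal sketch, assuming the System log has been exported to CSV with `TimeGenerated`, `Source` and `EventID` columns (those column names and the sample rows are made up for illustration). Event 6008 is what Windows writes at the next boot after an unexpected shutdown, and 1074 is written when a user or process requests one.

```python
import csv
import io

# Event IDs in the Windows System log that relate to shutdowns.
SHUTDOWN_EVENT_IDS = {
    "6005": "Event Log service started (system booted)",
    "6006": "Event Log service stopped (clean shutdown)",
    "6008": "previous shutdown was unexpected",
    "1074": "shutdown or restart requested by a user or process",
}


def find_shutdown_events(csv_text):
    """Return (timestamp, event_id, meaning) for shutdown-related rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        event_id = row["EventID"].strip()
        if event_id in SHUTDOWN_EVENT_IDS:
            hits.append(
                (row["TimeGenerated"], event_id, SHUTDOWN_EVENT_IDS[event_id])
            )
    return hits


# Hypothetical export; the column layout is an assumption, not a real tool's format.
sample = """TimeGenerated,Source,EventID
2008-03-01 02:14:07,EventLog,6008
2008-03-01 02:14:09,EventLog,6005
2008-03-01 09:30:00,Service Control Manager,7036
"""

for hit in find_shutdown_events(sample):
    print(hit)
```

A 6008 with no matching 1074 or 6006 before it suggests the node went down without Windows being asked to shut down.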
AnthonyP9618 commented:
Hey Chris,

Wow... That's odd.  Ghost in the machine?

I guess my only question would be: is the machine shutting down gracefully, or is it asking for a reason why the shutdown was unplanned?  My only suggestions would be a thermal event, or a trojan or virus of some kind.

Definitely something weird going on there.
Chris Dent (PowerShell Developer, author) commented:

It comes up as Unplanned shutdown.

A thermal shutdown would be logged by the Dell RAC (along with a notification prior to the event).

We haven't done much digging for viruses. The AV software on the machine is up to date and hasn't reported anything. The remainder of the network is protected and constantly updated.

I'll try and sort out the auditing. It's going to take a while, as the logs on there don't last more than a few hours at the moment.

Chris
Chris Dent (PowerShell Developer, author) commented:

Fun fun. No shutdown recorded in the Security log.

Chris
AnthonyP9618 commented:
Still weird.

The only thing I can think of is someone enabling Crash on Audit Fail.  But that should still leave some notice in the log of why it shut down.
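
For reference, Crash on Audit Fail is stored as the `CrashOnAuditFail` value under `HKLM\SYSTEM\CurrentControlSet\Control\Lsa`. A minimal sketch of how to read the three values it can take, assuming the number has already been pulled out of the registry by some other means (the helper name is hypothetical):

```python
# Meaning of the CrashOnAuditFail registry value under
# HKLM\SYSTEM\CurrentControlSet\Control\Lsa.
CRASH_ON_AUDIT_FAIL = {
    0: "disabled",
    1: "enabled: the system halts if it cannot write a security audit",
    2: "tripped: the system already halted on an audit failure; "
       "only administrators can log on until the value is reset",
}


def describe_crash_on_audit_fail(value):
    """Translate the raw registry number into a readable description."""
    return CRASH_ON_AUDIT_FAIL.get(value, "unexpected value %r" % (value,))


print(describe_crash_on_audit_fail(0))
```

A value of 2 would be a strong hint that this setting has fired at least once.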
pmarquardt commented:
Microsoft offers MPSReport_Cluster available here: http://www.microsoft.com/downloads/details.aspx?FamilyID=cebf3c7c-7ca5-408f-88b7-f9c79b7306c0&displaylang=en

I would suggest running this tool and then taking a look at the cluster config saved by the tool. Antivirus programs, especially Symantec and McAfee, can cause random reboots. Have you considered running PerfWiz on the system to see if you are running out of resources on the local machine, causing random reboots? You could be running out of paged pool or non-paged pool memory for the kernel. Are you running the /3GB switch in the Boot.ini file?

Give me some more info, and I'll do my best to help you sort this out.
Chris Dent (PowerShell Developer, author) commented:

Ho hum... typical really. As soon as we start the report tool it stops breaking.

I'll give it a few more days to reappear as a problem. It's scheduled to be replaced anyway, maybe I'll be lucky and it won't reboot again until after that's happened.

Chris
pmarquardt commented:
You definitely need to look at running a memory leak tool on the system to see if you are having a problem with paged pool memory. This will require a reboot to zero out the registers though. Once you have that information, you will want to run PerfWiz on the system to see if you are having other problems. Also, verify you have the latest and greatest version of Antivirus on the system.
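
One rough way to spot a pool leak from Performance Monitor data is to fit a trend line to the `\Memory\Pool Paged Bytes` (or `\Memory\Pool Nonpaged Bytes`) counter over time. The following is a minimal sketch, assuming the counter has been sampled into `(hours_since_boot, bytes)` pairs; the sample numbers are invented for illustration and the function name is hypothetical.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


# Invented readings: paged pool growing steadily after a reboot.
paged_pool = [(0, 100_000_000), (1, 110_000_000), (2, 120_000_000)]
print(growth_per_hour(paged_pool))
```

A slope that stays clearly positive across many hours of steady load is the pattern to look for; a flat or sawtooth trend is less suspicious.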

Also, verify you are NOT running the /3GB switch, since it reduces the amount of kernel memory available to the system. This matters most on systems that do a lot of I/O, e.g. Exchange, SQL, Terminal Services.
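
Checking for the switch can be scripted. This is a minimal sketch, assuming the contents of Boot.ini have been read into a string; the sample file below is illustrative only and is not taken from the servers in question.

```python
def entries_with_3gb(boot_ini_text):
    """Return the [operating systems] entries that carry the /3GB switch."""
    in_os_section = False
    flagged = []
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track whether we are inside the [operating systems] section.
            in_os_section = line.lower() == "[operating systems]"
            continue
        if in_os_section and line:
            # Switches are whitespace-separated tokens after the description.
            if "/3gb" in line.lower().split():
                flagged.append(line)
    return flagged


# Illustrative Boot.ini, not a real server's file.
sample_boot_ini = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect /3GB
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Safe entry" /fastdetect
"""

for entry in entries_with_3gb(sample_boot_ini):
    print(entry)
```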
Chris Dent (PowerShell Developer, author) commented:

Still no breaks. Going to close this one, thank you all for your input.

Chris
kNumberz commented:
Don't know if you have resolved this in the 2+ years since. However, I had a similar problem with the very same symptoms. Enabling the crash dump is outlined here (http://support.microsoft.com/kb/307973). If this is enabled and no dump file is generated, then this article may help (http://support.microsoft.com/kb/130536). To analyse the file, take a look at this (http://www.raymond.cc/blog/archives/2009/01/17/analyzing-windows-crash-dump-or-minidump-with-whocrashed/). You will have to download some prereqs, but it should be pretty intuitive.
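
The crash dump settings live under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`. As a minimal sketch, assuming the `CrashDumpEnabled` and `AutoReboot` values have already been read into a dictionary (the function name and the sample dictionary are hypothetical), they can be interpreted like this:

```python
# CrashDumpEnabled values under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def describe_crash_control(values):
    """Summarise what the server will do after a STOP error."""
    dump_setting = values.get("CrashDumpEnabled")
    return {
        "dump_type": DUMP_TYPES.get(dump_setting, "unknown"),
        "writes_dump": dump_setting in (1, 2, 3),
        # AutoReboot=1 restarts the machine straight after the STOP error,
        # so the blue screen may never be seen by anyone.
        "auto_reboot": bool(values.get("AutoReboot", 0)),
    }


# Hypothetical settings: no dump configured, automatic restart enabled.
print(describe_crash_control({"CrashDumpEnabled": 0, "AutoReboot": 1}))
```

A server with no dump configured and automatic restart enabled could crash and come back with neither a dump file nor a visible BSoD, which would be consistent with the symptoms described in the question.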

At the end of the day there was an issue with the storage driver (http://support.microsoft.com/kb/932755). I was using a QLogic HBA at the time, connecting to a Clariion through an FC fabric switch. After consulting MS and EMC, the QLogic driver was updated (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=224) for that specific reason. That finally resolved the issue, and I no longer saw the problem.

Hope this helps, even though it is a late posting.