Solved

Windows 2003 File Server Cluster

Posted on 2007-04-05
Last Modified: 2008-03-06
If anyone has any insight into this it would be lovely; this is a bit of a long shot.

We have a Windows 2003 Cluster acting as a File Server (Active / Passive). The nodes in this cluster have developed a rather nasty habit of shutting down. It can be either node, or both (occasionally at exactly the same time).

I can't really say what it's up to when it shuts down; there's no consistency:

 - No Information messages / Warnings / Errors logged in the Event Viewer related to the shutdown
 - Cluster Log shows no errors (just the Heartbeat failure when a Node shuts down)
 - No excessive system activity
 - No Memory Dump files
 - No BSoD
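
For anyone repeating the Event Viewer check in the first bullet, here is a minimal sketch that scans an exported System log for the usual shutdown-related event IDs (1074, 1076, 6006, 6008). The CSV layout, the column names `TimeGenerated` and `EventID`, and the sample rows are assumptions made up for illustration, not something taken from this cluster.

```python
import csv
import io

# Well-known System log event IDs related to shutdowns on Windows Server 2003.
SHUTDOWN_EVENT_IDS = {
    1074: "shutdown or restart requested by a process or user",
    1076: "reason recorded for the previous unexpected shutdown",
    6006: "Event Log service stopped (clean shutdown)",
    6008: "previous shutdown was unexpected",
}


def find_shutdown_events(csv_text):
    """Return (timestamp, event_id, meaning) for each shutdown-related row.

    Assumes a hypothetical CSV export with 'TimeGenerated' and 'EventID' columns.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            event_id = int(row["EventID"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows without a usable event ID
        if event_id in SHUTDOWN_EVENT_IDS:
            hits.append(
                (row.get("TimeGenerated", ""), event_id, SHUTDOWN_EVENT_IDS[event_id])
            )
    return hits


# Made-up sample rows, for illustration only.
sample = """TimeGenerated,EventID,Source
2007-04-04 02:11:00,6005,EventLog
2007-04-04 09:30:12,6008,EventLog
2007-04-04 09:30:15,7036,Service Control Manager
"""

for hit in find_shutdown_events(sample):
    print(hit)
```

A 6008 with no preceding 1074 or 6006 would point towards a hard power-off or crash rather than a requested shutdown.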

I even checked the WBEM logs on the off-chance they said something meaningful...

The storage for the Cluster is based on an EMC Clariion SAN using PowerPath to manage the HBA. We have one patch we can apply to PowerPath, just in case.

We ran Dell's hardware diagnostic tools to check the hardware; everything comes back clear there.

Can anyone think of anything / anywhere else to check?

Chris
Question by:Chris Dent
 
LVL 14

Assisted Solution

by:inbarasan
inbarasan earned 166 total points
ID: 18856666
http://www.hostingforum.ca/135681-how-audit-who-has-shutdown-server.html
Enable this auditing and try to find out how it is being shut down.
Hope this helps
 
LVL 11

Accepted Solution

by:AnthonyP9618
AnthonyP9618 earned 167 total points
ID: 18856692
Hey Chris,

Wow... That's odd.  Ghost in the machine?

I guess my only question would be: is the machine shutting down gracefully, or does it ask for a reason why the shutdown was unplanned when it comes back up? My only suggestions would be either a thermal event or a trojan or virus of some kind.

Definitely something weird going on there.
 
LVL 70

Author Comment

by:Chris Dent
ID: 18856856

It comes up as Unplanned shutdown.

Thermal shutdown is logged by the Dell RAC (as well as notification prior to the event).

We haven't done much digging for viruses, the AV software on the machine is up to date and hasn't reported anything. The remainder of the network is protected and constantly updated.

I'll try and sort out the auditing. It's going to take a while; the logs on there don't last more than a few hours at the moment.

Chris
 
LVL 70

Author Comment

by:Chris Dent
ID: 18863779

Fun fun. No shutdown noticed by the Security log.

Chris
 
LVL 11

Expert Comment

by:AnthonyP9618
ID: 18864705
Still weird.

The only thing I can think of is someone enabling Crash on Audit Fail.  But that should still leave some notice in the log of why it shut down.
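
For reference, Crash on Audit Fail is controlled by the `CrashOnAuditFail` value under `HKLM\SYSTEM\CurrentControlSet\Control\Lsa`. A minimal sketch for checking it, assuming Python 3 is available on the node; the function names are made up for illustration:

```python
# Documented meanings of the CrashOnAuditFail registry value.
CRASH_ON_AUDIT_FAIL = {
    0: "disabled",
    1: "enabled: the system halts if it cannot write a security audit",
    2: "triggered: the system already halted on an audit failure; "
       "only administrators can log on",
}


def describe_crash_on_audit_fail(value):
    """Translate the raw registry value into plain text."""
    return CRASH_ON_AUDIT_FAIL.get(value, "unexpected value: %r" % (value,))


def read_crash_on_audit_fail():
    """Read the value from the local registry (Windows only)."""
    import winreg  # imported here so the rest of the sketch runs anywhere

    key_path = r"SYSTEM\CurrentControlSet\Control\Lsa"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashOnAuditFail")
    return value


# Illustration with a hard-coded value; on a node you would pass
# read_crash_on_audit_fail() instead.
print(describe_crash_on_audit_fail(0))
```

With the Security log only lasting a few hours, a value of 1 would be worth ruling out.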
 
LVL 4

Assisted Solution

by:pmarquardt
pmarquardt earned 167 total points
ID: 18885337
Microsoft offers MPSReport_Cluster available here: http://www.microsoft.com/downloads/details.aspx?FamilyID=cebf3c7c-7ca5-408f-88b7-f9c79b7306c0&displaylang=en

I would suggest running this tool and then taking a look at the cluster config saved by the tool. Antivirus programs, especially Symantec and McAfee, can cause random reboots. Have you considered running PerfWiz on the system to see if you are running out of resources on the local machine, causing random reboots? You could be running out of paged pool or non-paged pool memory for the kernel. Are you running the /3GB switch in the Boot.ini file?
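
As a rough way to judge the pool question from a counter log (for example the `\Memory\Pool Paged Bytes` or `\Memory\Pool Nonpaged Bytes` counters), the sketch below fits a straight line through the samples and reports the growth per hour. The sample figures are made up for illustration, and a steady positive slope is only a hint of a leak, not proof.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


# Made-up hourly samples of pool usage in bytes, for illustration only.
pool_samples = [
    (0, 100_000_000),
    (1, 104_000_000),
    (2, 108_000_000),
    (3, 112_000_000),
]

growth = slope_per_hour(pool_samples)
print("pool growth: %.0f bytes per hour" % growth)
```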

Give me some more info, and I'll do my best to help you sort this out.
 
LVL 70

Author Comment

by:Chris Dent
ID: 18896530

Ho hum... typical really. As soon as we start the report tool it stops breaking.

I'll give it a few more days to reappear as a problem. It's scheduled to be replaced anyway, maybe I'll be lucky and it won't reboot again until after that's happened.

Chris
 
LVL 4

Expert Comment

by:pmarquardt
ID: 18901281
You definitely need to look at running a memory leak tool on the system to see if you are having a problem with paged pool memory. This will require a reboot to zero out the registers though. Once you have that information, you will want to run PerfWiz on the system to see if you are having other problems. Also, verify you have the latest version of your antivirus software on the system.

Also, verify you are NOT running the /3GB switch, since this reduces the amount of kernel memory available to the system. That matters especially on systems that consume a lot of I/O, e.g. Exchange, SQL, Terminal Services.
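
A quick way to check for the switch is to scan the `[operating systems]` section of Boot.ini. This is only a sketch: the function name is made up, and the sample Boot.ini text is illustrative rather than taken from a real node.

```python
import re


def entries_with_3gb(boot_ini_text):
    """Return the [operating systems] entries that carry the /3GB switch."""
    flagged = []
    in_os_section = False
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track which section we are in; only OS entries carry switches.
            in_os_section = line.lower() == "[operating systems]"
            continue
        if in_os_section and re.search(r"/3GB\b", line, re.IGNORECASE):
            flagged.append(line)
    return flagged


# Illustrative Boot.ini contents, not copied from a real server.
sample_boot_ini = r'''[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /3GB
'''

for entry in entries_with_3gb(sample_boot_ini):
    print("uses /3GB:", entry)
```

On a real node you would read the text from the Boot.ini file on the system drive instead of the sample string.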
 
LVL 70

Author Comment

by:Chris Dent
ID: 18920527

Still no breaks. Going to close this one, thank you all for your input.

Chris
 

Expert Comment

by:kNumberz
ID: 26062639
I don't know if you have resolved this in the 2+ years since. However, I had a similar problem with the very same symptoms. Enabling the crash dump is outlined here (http://support.microsoft.com/kb/307973). If this is enabled and no dump file is generated, then this article may help (http://support.microsoft.com/kb/130536). To analyse the file, take a look at this (http://www.raymond.cc/blog/archives/2009/01/17/analyzing-windows-crash-dump-or-minidump-with-whocrashed/). You will have to download some prerequisites, but it should be pretty intuitive.
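
The crash dump settings themselves live under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`. As a rough sketch, the snippet below interprets two of those values, `CrashDumpEnabled` and `AutoReboot`. The function name and the input dict are hypothetical; the values would be copied by hand from regedit or read some other way.

```python
# Documented CrashDumpEnabled values under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def review_crash_control(values):
    """Return plain-text findings for a dict of CrashControl values.

    'values' is a hypothetical dict, e.g. copied by hand from regedit.
    """
    findings = []
    dump = values.get("CrashDumpEnabled")
    findings.append("dump type: " + DUMP_TYPES.get(dump, "unknown (%r)" % (dump,)))
    if dump == 0:
        findings.append("no dump file will be written on a stop error")
    if values.get("AutoReboot") == 1:
        findings.append("automatic restart is on, so a blue screen may flash by unseen")
    return findings


# Made-up values, for illustration only.
for finding in review_crash_control({"CrashDumpEnabled": 0, "AutoReboot": 1}):
    print(finding)
```

That combination would be consistent with the "no memory dump, no BSoD" symptoms in the original question, though it does not prove a stop error occurred.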

At the end of the day, there was an issue with the storage driver (http://support.microsoft.com/kb/932755). I was using a QLogic HBA at the time, connecting to a Clariion through an FC fabric switch. After consultation with MS and EMC, the driver was updated (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=224) for that specific reason; there was a QLogic update. That finally resolved the issue, and I have not seen it since.

Hope this helps, even though it is a late post.