Solved

Windows 2003 File Server Cluster

Posted on 2007-04-05
Last Modified: 2008-03-06
If anyone has any insight into this it would be lovely; this is a bit of a long shot.

We have a Windows 2003 Cluster acting as a File Server (Active / Passive). The Nodes in this cluster have developed this rather nasty habit of shutting down. This can be either (or both) of the nodes (occasionally at exactly the same time).

I can't really say what it's up to when it shuts down; there's no consistency:

 - No Information messages / Warnings / Errors logged in the Event Viewer related to the shutdown
 - Cluster Log shows no errors (just the Heartbeat failure when a Node shuts down)
 - No excessive system activity
 - No Memory Dump files
 - No BSoD

I even checked the WBEM logs on the off-chance they said something meaningful...

The storage for the Cluster is based on an EMC Clariion SAN using PowerPath to manage the HBA. We have one patch we can apply to PowerPath, just in case.

We ran Dell's hardware diagnostic tools to check the hardware; everything comes back clear there.

Can anyone think of anything / anywhere else to check?

Chris
Question by:Chris Dent
10 Comments
 
LVL 14

Assisted Solution

by:inbarasan
inbarasan earned 166 total points
ID: 18856666
http://www.hostingforum.ca/135681-how-audit-who-has-shutdown-server.html
Enable this audit and try to find out how it is being shut down.
Hope this helps
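
Alongside the audit trail, it may be worth filtering an exported System log for the event IDs Windows normally writes around a shutdown (1074, 1076, 6006, 6008). The snippet below is only a rough sketch in Python: it assumes the log has been exported to CSV with "TimeGenerated" and "EventID" columns, and those column names (plus the sample rows) are made up for illustration, so adjust them to whatever your export tool produces.

```python
import csv
import io

# Well-known System-log event IDs tied to shutdowns and restarts.
SHUTDOWN_EVENT_IDS = {
    1074: "shutdown or restart requested by a process or user",
    1076: "reason recorded for the previous unexpected shutdown",
    6006: "Event Log service stopped (clean shutdown)",
    6008: "previous shutdown was unexpected",
}


def find_shutdown_events(csv_text):
    """Return (timestamp, event_id, meaning) tuples for shutdown-related rows.

    Assumes a CSV export with 'TimeGenerated' and 'EventID' columns;
    these column names are hypothetical and depend on the export tool.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            event_id = int(row["EventID"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows without a usable event ID
        if event_id in SHUTDOWN_EVENT_IDS:
            hits.append(
                (row.get("TimeGenerated", ""), event_id, SHUTDOWN_EVENT_IDS[event_id])
            )
    return hits


# Made-up sample rows, for illustration only.
sample = """TimeGenerated,EventID
2007-04-05 09:00:01,6005
2007-04-05 11:42:17,6008
2007-04-05 11:42:30,1076
"""

for hit in find_shutdown_events(sample):
    print(hit)
```

If a node went down with a 6008 but no 1074 before it, that would point away from a user or process asking for the shutdown.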
 
LVL 11

Accepted Solution

by:AnthonyP9618
AnthonyP9618 earned 167 total points
ID: 18856692
Hey Chris,

Wow... That's odd.  Ghost in the machine?

I guess my only question would be: is the machine shutting down gracefully, or is it asking for a reason why the shutdown was unplanned?  My only suggestions would be a thermal event, or a trojan, virus or something similar.

Definitely something weird going on there.
 
LVL 70

Author Comment

by:Chris Dent
ID: 18856856

It comes up as Unplanned shutdown.

Thermal shutdown is logged by the Dell RAC (as well as notification prior to the event).

We haven't done much digging for viruses; the AV software on the machine is up to date and hasn't reported anything. The remainder of the network is protected and constantly updated.

I'll try and sort out the auditing. It's going to take a while, as the logs on there don't last more than a few hours at the moment.

Chris
 
LVL 70

Author Comment

by:Chris Dent
ID: 18863779

Fun fun. No shutdown noticed by the Security log.

Chris
 
LVL 11

Expert Comment

by:AnthonyP9618
ID: 18864705
Still weird.

The only thing I can think of is someone enabling Crash on Audit Fail.  But that should still leave some notice in the log of why it shut down.
 
LVL 4

Assisted Solution

by:pmarquardt
pmarquardt earned 167 total points
ID: 18885337
Microsoft offers MPSReport_Cluster available here: http://www.microsoft.com/downloads/details.aspx?FamilyID=cebf3c7c-7ca5-408f-88b7-f9c79b7306c0&displaylang=en

I would suggest running this tool and then taking a look at the cluster config saved by the tool. Antivirus programs, especially Symantec and McAfee, can cause random reboots. Have you considered running PerfWiz on the system to see if you are running out of resources on the local machine, causing random reboots? You could be running out of paged pool or non-paged pool memory for the kernel. Are you running the /3GB switch in the Boot.ini file?

Give me some more info, and I'll do my best to help you sort this out.
 
LVL 70

Author Comment

by:Chris Dent
ID: 18896530

Ho hum... typical really. As soon as we start the report tool it stops breaking.

I'll give it a few more days to reappear as a problem. It's scheduled to be replaced anyway, maybe I'll be lucky and it won't reboot again until after that's happened.

Chris
 
LVL 4

Expert Comment

by:pmarquardt
ID: 18901281
You definitely need to look at running a memory leak tool on the system to see if you are having a problem with paged pool memory. This will require a reboot to zero out the registers though. Once you have that information, you will want to run PerfWiz on the system to see if you are having other problems. Also, verify you have the latest and greatest version of Antivirus on the system.

Also, verify you are NOT running the /3GB switch, since it reduces the amount of kernel memory available to the system. This matters especially on systems that do a lot of I/O, e.g. Exchange, SQL, Terminal Services.
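
If you want a quick way to check for the switch across several nodes, something along these lines might do. It is a minimal sketch in Python that scans the text of a Boot.ini for /3GB on the [operating systems] entries; the function name and the sample file contents are made up for illustration, and you would feed it the real file contents yourself.

```python
def entries_with_3gb(boot_ini_text):
    """Return the [operating systems] lines of a Boot.ini that carry /3GB."""
    flagged = []
    in_os_section = False
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track which section we are in; only OS entries carry switches.
            in_os_section = line.lower() == "[operating systems]"
            continue
        if not in_os_section or not line:
            continue
        # Switches follow the quoted description, so look after the last quote.
        switches = line.rsplit('"', 1)[-1].split()
        if any(switch.lower() == "/3gb" for switch in switches):
            flagged.append(line)
    return flagged


# Made-up Boot.ini contents, for illustration only.
sample = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /3GB
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise (no 3GB)" /fastdetect
"""

for entry in entries_with_3gb(sample):
    print("Uses /3GB:", entry)
```

Only the text after the quoted description is checked, so an entry that merely mentions 3GB in its name is not flagged.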
 
LVL 70

Author Comment

by:Chris Dent
ID: 18920527

Still no breaks. Going to close this one, thank you all for your input.

Chris
 

Expert Comment

by:kNumberz
ID: 26062639
Don't know if you have resolved this in the 2+ years since. However, I had a similar problem with the very same symptoms. Enabling the crash dump is outlined here (http://support.microsoft.com/kb/307973). If this is enabled and no dump file is generated, then this article may help (http://support.microsoft.com/kb/130536). To analyse the file, take a look at this
(http://www.raymond.cc/blog/archives/2009/01/17/analyzing-windows-crash-dump-or-minidump-with-whocrashed/). You will have to download some prereqs, but it should be pretty intuitive.
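
For what it's worth, the crash dump settings live under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl, and a few of the values there explain the "no dump, no BSoD" symptom quite well. Below is a rough Python sketch that interprets those values; it assumes you have already read them into a dict (for example with reg.exe or Python's winreg module on the server), and the function name and sample values are made up for illustration.

```python
# Meanings of CrashDumpEnabled under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl on Windows Server 2003.
DUMP_MODES = {
    0: "no dump",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
}


def review_crash_control(values):
    """Turn a dict of CrashControl values into plain-language findings.

    'values' is assumed to have been read from the registry beforehand;
    this function does not touch the registry itself.
    """
    findings = []
    mode = values.get("CrashDumpEnabled")
    findings.append("Dump type: " + DUMP_MODES.get(mode, "unknown (%r)" % (mode,)))
    if mode == 0:
        findings.append("No dump file will be written on a stop error.")
    if values.get("AutoReboot") == 1:
        findings.append("Automatic restart is on, so a blue screen may flash by unseen.")
    if values.get("LogEvent") == 0:
        findings.append("Stop errors are not written to the System log.")
    return findings


# Made-up values, for illustration only.
for finding in review_crash_control({"CrashDumpEnabled": 0, "AutoReboot": 1, "LogEvent": 1}):
    print(finding)
```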

At the end of the day there was an issue with the storage driver (http://support.microsoft.com/kb/932755). I was using a QLogic HBA at the time, connecting to a Clariion through an FC fabric switch. After consulting MS and EMC, the driver was updated (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=224) for that specific reason; there was a QLogic update. That finally resolved the issue, and I no longer saw it.

Hope this helps, even though it is a late post.
Question has a verified solution.