Solved

Windows 2003 File Server Cluster

Posted on 2007-04-05
Last Modified: 2008-03-06
If anyone has any insight into this it would be lovely; this is a bit of a long shot.

We have a Windows 2003 Cluster acting as a File Server (Active / Passive). The Nodes in this cluster have developed this rather nasty habit of shutting down. This can be either (or both) of the nodes (occasionally at exactly the same time).

I can't really say what it's up to when it shuts down; there's no consistency:

 - No Information messages / Warnings / Errors logged in the Event Viewer related to the shutdown
 - Cluster Log shows no errors (just the Heartbeat failure when a Node shuts down)
 - No excessive system activity
 - No Memory Dump files
 - No BSoD

I even checked the WBEM logs on the off-chance they said something meaningful...

The storage for the Cluster is based on an EMC Clariion SAN using PowerPath to manage the HBA. We have one patch we can apply to PowerPath, just in case.

We ran Dell's hardware diagnostic tools to check the hardware; everything comes back clear there.

Can anyone think of anything / anywhere else to check?

Chris
Question by:Chris Dent
 
LVL 14

Assisted Solution

by:inbarasan
inbarasan earned 166 total points
ID: 18856666
http://www.hostingforum.ca/135681-how-audit-who-has-shutdown-server.html
Enable this audit and try to find how it is getting shutdown.
Hope this helps
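
Alongside the audit policy, the System log carries its own boot and shutdown markers: 6005 (Event Log service started), 6006 (Event Log service stopped, i.e. a clean shutdown), 6008 (previous shutdown was unexpected), plus 1074 and 1076 for shutdown requests and recorded reasons. Below is a rough sketch in Python for lining those up, assuming the System log has been exported and parsed into (timestamp, event ID) pairs. The function name and input shape are made up for illustration, not part of any tool mentioned here.

```python
from datetime import datetime

# Well-known System log event IDs related to shutdown and startup.
SHUTDOWN_EVENTS = {
    1074: "shutdown/restart requested by a user or process",
    1076: "reason recorded for an unexpected shutdown",
    6005: "Event Log service started (boot)",
    6006: "Event Log service stopped (clean shutdown)",
    6008: "previous shutdown was unexpected",
}


def find_unclean_boots(events):
    """Return timestamps of boots that were not preceded by a clean shutdown.

    `events` is a hypothetical list of (timestamp, event_id) tuples,
    e.g. parsed from a CSV export of the System log.
    """
    unclean = []
    # The first boot in the log is treated as clean: there is no evidence either way.
    clean_stop_seen = True
    for stamp, event_id in sorted(events):
        if event_id == 6006:
            clean_stop_seen = True
        elif event_id == 6005:
            if not clean_stop_seen:
                unclean.append(stamp)
            clean_stop_seen = False
    return unclean
```

Any boot it returns is a start-up with no 6006 before it, which is where to start looking in the other logs (RAC, cluster log) for the same timestamp.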
 
LVL 11

Accepted Solution

by:AnthonyP9618
AnthonyP9618 earned 167 total points
ID: 18856692
Hey Chris,

Wow... That's odd.  Ghost in the machine?

I guess my only question would be: is the machine shutting down gracefully, or is it asking for a reason why the shutdown was unplanned?  My only suggestions would be a thermal event, or a trojan or virus of some kind.

Definitely something weird going on there.
 
LVL 70

Author Comment

by:Chris Dent
ID: 18856856

It comes up as Unplanned shutdown.

Thermal shutdown is logged by the Dell RAC (as well as notification prior to the event).

We haven't done much digging for viruses, the AV software on the machine is up to date and hasn't reported anything. The remainder of the network is protected and constantly updated.

I'll try and sort out the auditing. It's going to take a while; the logs on there don't last more than a few hours at the moment.

Chris
 
LVL 70

Author Comment

by:Chris Dent
ID: 18863779

Fun fun. No shutdown noticed by the Security log.

Chris
 
LVL 11

Expert Comment

by:AnthonyP9618
ID: 18864705
Still weird.

The only thing I can think of is someone enabling Crash on Audit Fail.  But that should still leave some notice in the log of why it shut down.
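
To rule that out, the setting lives in the `CrashOnAuditFail` value under `HKLM\SYSTEM\CurrentControlSet\Control\Lsa`. Here is a small sketch in Python, assuming a Python interpreter is available on the node (reading the value with regedit works just as well). The helper names are made up for illustration.

```python
# Documented meanings of the CrashOnAuditFail registry value.
CRASH_ON_AUDIT_FAIL = {
    0: "disabled",
    1: "enabled: the system halts if a security audit cannot be logged",
    2: "tripped: the system already halted once; only administrators can log on",
}


def describe_crash_on_audit_fail(value):
    """Translate the raw registry value into a readable description."""
    return CRASH_ON_AUDIT_FAIL.get(value, "unexpected value %r" % (value,))


def read_crash_on_audit_fail():
    """Read the value from the local registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms

    key_path = r"SYSTEM\CurrentControlSet\Control\Lsa"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashOnAuditFail")
        return value
```

On the server itself, `describe_crash_on_audit_fail(read_crash_on_audit_fail())` gives a one-line answer. Anything other than "disabled" would be worth chasing, given that the Security log wraps every few hours.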
 
LVL 4

Assisted Solution

by:pmarquardt
pmarquardt earned 167 total points
ID: 18885337
Microsoft offers MPSReport_Cluster, available here: http://www.microsoft.com/downloads/details.aspx?FamilyID=cebf3c7c-7ca5-408f-88b7-f9c79b7306c0&displaylang=en

I would suggest running this tool and then taking a look at the cluster config saved by the tool. Antivirus programs, especially Symantec and McAfee, can cause random reboots. Have you considered running PerfWiz on the system to see if you are running out of resources on the local machine, causing random reboots? You could be running out of paged pool or non-paged pool memory for the kernel. Are you running the /3GB switch in the Boot.ini file?

Give me some more info, and I'll do my best to help you sort this out.
 
LVL 70

Author Comment

by:Chris Dent
ID: 18896530

Ho hum... typical really. As soon as we start the report tool it stops breaking.

I'll give it a few more days to reappear as a problem. It's scheduled to be replaced anyway, maybe I'll be lucky and it won't reboot again until after that's happened.

Chris
 
LVL 4

Expert Comment

by:pmarquardt
ID: 18901281
You definitely need to look at running a memory leak tool on the system to see if you are having a problem with paged pool memory. This will require a reboot to zero out the registers though. Once you have that information, you will want to run PerfWiz on the system to see if you are having other problems. Also, verify you have the latest version of your antivirus software on the system.

Also, verify you are NOT running the /3GB switch, since this reduces the amount of kernel memory available to the system. This matters especially on systems that do a lot of I/O, e.g. Exchange, SQL, Terminal Services.
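
On 32-bit Windows the /3GB switch gives user-mode processes 3 GB of virtual address space and leaves 1 GB for the kernel instead of 2 GB, which is why it squeezes the kernel pools. A quick way to check is to look at the `[operating systems]` entries in Boot.ini. The sketch below does that in Python, assuming the file contents have already been read into a string. The function name is made up for illustration.

```python
def boot_entries_with_switch(boot_ini_text, switch="/3GB"):
    """Return the [operating systems] entries that carry `switch`."""
    hits = []
    in_os_section = False
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track which section we are in; only OS entries carry switches.
            in_os_section = line.lower() == "[operating systems]"
            continue
        if in_os_section and "=" in line:
            # Switches follow the quoted description, separated by spaces.
            _, _, tail = line.rpartition('"')
            if switch.upper() in tail.upper().split():
                hits.append(line)
    return hits
```

An empty list means no boot entry uses the switch. Any entry it returns is a candidate for removing /3GB on a file server, which gains nothing from the larger user-mode address space.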
 
LVL 70

Author Comment

by:Chris Dent
ID: 18920527

Still no breaks. Going to close this one, thank you all for your input.

Chris
 

Expert Comment

by:kNumberz
ID: 26062639
Don't know if you have resolved this in the 2+ years since. However, I had a similar problem with the very same symptoms. Enabling the crash dump is outlined here (http://support.microsoft.com/kb/307973). If this is enabled and no dump file is generated, then this article may help (http://support.microsoft.com/kb/130536). To analyse the file, take a look at this (http://www.raymond.cc/blog/archives/2009/01/17/analyzing-windows-crash-dump-or-minidump-with-whocrashed/). You will have to download some prereqs, but it should be pretty intuitive.
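
The crash dump settings end up under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`, so it is quick to sanity-check them before waiting for the next shutdown. The sketch below is in Python and assumes the values have already been read into a dict (from a .reg export, say). The function name and the dict shape are made up for illustration.

```python
# Documented meanings of the CrashDumpEnabled registry value.
CRASH_DUMP_ENABLED = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (64 KB)",
}


def describe_crash_control(values):
    """Summarise a dict of CrashControl registry values as readable findings."""
    findings = []
    mode = values.get("CrashDumpEnabled")
    findings.append(
        "dump type: " + CRASH_DUMP_ENABLED.get(mode, "unknown (%r)" % (mode,))
    )
    if mode == 0:
        findings.append("no dump file will be written on a STOP error")
    if values.get("AutoReboot") == 1:
        findings.append("automatic restart is on, so a STOP screen may never be seen")
    return findings
```

The `AutoReboot` check matters for "no BSoD" reports: with automatic restart on, a bug check can come and go without anyone seeing the blue screen.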

At the end of the day there was an issue with the storage driver (http://support.microsoft.com/kb/932755). I was using a QLogic HBA at the time, connecting to a Clariion through an FC fabric switch. After consulting MS and EMC, the driver was updated (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=224) for that specific reason. There was a QLogic update, which finally resolved the issue; I have not seen it since.

Hope this helps, even though it is a late post.
