Solved

Win 2008 R2 cluster quorum disk 'disappeared' temporarily

Posted on 2013-12-17
Last Modified: 2014-01-26
Our cluster service was 'lost' for several minutes yesterday
(& this is a repeat incident): refer to the attached screenshots.

Q1:
What could be the cause of the issue?

Q2:
Any resolution or workaround for this issue?

Q3:
Currently the heartbeat goes through the Production network.
Will setting up a dedicated heartbeat (i.e. a direct crossover
cable between the 2 member servers of the cluster)
resolve this?

Q4:
Or can we tune/tweak the heartbeat interval & the number
of missed heartbeats to make the cluster more resilient,
i.e. to address this issue?  If so, can you point me to a link or
provide instructions on how to tune these?

Q5:
Can this be due to differences in the various firmware
versions (UEFI and others) between the 2 member servers?
IBM found the versions to be different but can't pinpoint
whether this issue is due to the difference in firmware
versions between the member nodes.

I prefer not to attach the Event Viewer logs here as they
contain sensitive info.
1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
10.jpg
ClusharedVols.jpg
Clustorage.jpg
Question by:sunhux
 
LVL 34

Assisted Solution

by:Seth Simmons
Seth Simmons earned 125 total points
What kind of shared storage are you using? Are there any issues on that side, either with the storage itself or with connectivity to it?
 

Author Comment

by:sunhux
They're just plain SAN LUNs that are presented to both servers.
We have a similar setup on clusters at a few other sites but
have not encountered this issue there.

The issue is intermittent, i.e. it happens every couple
of weeks or months.
 

Author Comment

by:sunhux
I've run "cluster /prop" & the output is below.  Compared with all our other
clusters, only one line (EnableSharedVolumes) is different
(except that one DR cluster also has it set to 1 but has never had
this issue):

Attached are the cluster.log files from both nodes of the cluster,
which I've sanitized: look for the 2013/12/16-11:50  to  2013/12/16-12:00
date & time range.

Listing properties for 'NNXXXPP1VIR03':

T  Cluster              Name                           Value
-- -------------------- ------------------------------ -----------------------
DR NNXXXPP1VIR03        FixQuorum                      0 (0x0)
DR NNXXXPP1VIR03        IgnorePersistentStateOnStartup 0 (0x0)
SR NNXXXPP1VIR03        SharedVolumesRoot              C:\ClusterStorage
D  NNXXXPP1VIR03        AddEvictDelay                  60 (0x3c)
D  NNXXXPP1VIR03        BackupInProgress               0 (0x0)
D  NNXXXPP1VIR03        ClusSvcHangTimeout             60 (0x3c)
D  NNXXXPP1VIR03        ClusSvcRegroupOpeningTimeout   5 (0x5)
D  NNXXXPP1VIR03        ClusSvcRegroupPruningTimeout   5 (0x5)
D  NNXXXPP1VIR03        ClusSvcRegroupStageTimeout     7 (0x7)
D  NNXXXPP1VIR03        ClusSvcRegroupTickInMilliseconds 300 (0x12c)
D  NNXXXPP1VIR03        ClusterGroupWaitDelay          30 (0x1e)
D  NNXXXPP1VIR03        ClusterLogLevel                3 (0x3)
D  NNXXXPP1VIR03        ClusterLogSize                 100 (0x64)
D  NNXXXPP1VIR03        CrossSubnetDelay               1000 (0x3e8)
D  NNXXXPP1VIR03        CrossSubnetThreshold           5 (0x5)
D  NNXXXPP1VIR03        DefaultNetworkRole             2 (0x2)
S  NNXXXPP1VIR03        Description
D  NNXXXPP1VIR03        EnableSharedVolumes            1 (0x1) <==
D  NNXXXPP1VIR03        HangRecoveryAction             3 (0x3)
D  NNXXXPP1VIR03        LogResourceControls            0 (0x0)
D  NNXXXPP1VIR03        PlumbAllCrossSubnetRoutes      0 (0x0)
D  NNXXXPP1VIR03        QuorumArbitrationTimeMax       20 (0x14)
D  NNXXXPP1VIR03        RequestReplyTimeout            60 (0x3c)
D  NNXXXPP1VIR03        RootMemoryReserved             4294967295 (0xffffffff)
D  NNXXXPP1VIR03        SameSubnetDelay                1000 (0x3e8)
D  NNXXXPP1VIR03        SameSubnetThreshold            5 (0x5)
B  NNXXXPP1VIR03        Security Descriptor            01 00 04 80 ... (280 bytes)
D  NNXXXPP1VIR03        SecurityLevel                  1 (0x1)
M  NNXXXPP1VIR03        SharedVolumeCompatibleFilters
M  NNXXXPP1VIR03        SharedVolumeIncompatibleFilters
D  NNXXXPP1VIR03        ShutdownTimeoutInMinutes       20 (0x14)
D  NNXXXPP1VIR03        WitnessDatabaseWriteTimeout    300 (0x12c)
D  NNXXXPP1VIR03        WitnessRestartInterval         15 (0xf)
DB1clusani.txt
DB2clusani.txt
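
The SameSubnetDelay and SameSubnetThreshold rows in the listing above are the heartbeat interval (in milliseconds) and the number of consecutive missed heartbeats tolerated between nodes on the same subnet. As a rough illustration only, the Python sketch below parses a few rows copied from that listing and multiplies the two values. The function names are made up for this example, and it only does arithmetic on the pasted text; it does not query or change a live cluster.

```python
import re

# Rows copied verbatim from the "cluster /prop" listing in this thread.
LISTING = """\
D  NNXXXPP1VIR03        CrossSubnetDelay               1000 (0x3e8)
D  NNXXXPP1VIR03        CrossSubnetThreshold           5 (0x5)
D  NNXXXPP1VIR03        EnableSharedVolumes            1 (0x1) <==
D  NNXXXPP1VIR03        SameSubnetDelay                1000 (0x3e8)
D  NNXXXPP1VIR03        SameSubnetThreshold            5 (0x5)
"""

# Type column (D or DR), cluster name, property name, decimal value, hex value.
ROW = re.compile(r"^DR?\s+\S+\s+(\S+)\s+(\d+)\s+\(0x")


def parse_dword_props(text):
    """Return {property name: integer value} for the numeric rows."""
    props = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            props[match.group(1)] = int(match.group(2))
    return props


def heartbeat_tolerance_ms(props, scope="SameSubnet"):
    """Heartbeat interval times missed-heartbeat threshold, in milliseconds."""
    return props[scope + "Delay"] * props[scope + "Threshold"]


props = parse_dword_props(LISTING)
print(heartbeat_tolerance_ms(props))  # 1000 ms * 5 = 5000
```

With the values shown, the cluster tolerates about five seconds of missed heartbeats on the same subnet before it treats a node as unreachable.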
 
LVL 18

Accepted Solution

by:Netflo
Netflo earned 375 total points
I would strongly recommend getting a dedicated, segmented heartbeat network up and running. Also, when you get downtime, validate your cluster, as that may highlight issues with your configuration.

I'm assuming the basics have been covered, such as rebooting the switch and swapping out network cables.

Also, are your NICs configured to go to sleep? This is a default setting under the NIC power management settings; turn it off.

 

Author Comment

by:sunhux
> are your NIC configured to go to sleep - this is a default setting under the
> NIC power management setting, turn that off.
No, certainly not; otherwise I would have received Tivoli alerts for ping
timeouts to the teamed NICs' IP addresses on both DB servers.

>  validate your cluster as that may highlight some issues
We have run this & there were only a couple of warnings, with the rest
Successful.  I can post it this Mon/Tue after sanitizing the
output.  For sure, it's the same result as when we first
installed the cluster & if any of the errors/warnings were serious,
the cluster installation would not have been allowed by Windows
in the 1st place, right?
 
LVL 18

Assisted Solution

by:Netflo
Netflo earned 375 total points
On many occasions I've seen IT admins tweak certain areas, rendering the cluster invalid, and then wonder why features such as Live Migration don't work.

Have all components in your cluster been rebooted?

Are the cluster hosts fully up to date? If yes, you may want to take a look at the following hotfixes: http://support.microsoft.com/kb/2545685
 

Author Comment

by:sunhux
Yes, all components have been rebooted.

Well, I'm not sure if it's really necessary to apply the hotfixes as the
other sites' clusters use exactly the same Win 2008 R2 with the same
patch level.

The only difference is the hardware type & firmware revision:
this site uses a slightly newer version of IBM x3850 hardware
compared to the other sites, which are stable.

Is there anything we can analyse from the logs to pin-point
the root cause so that we can come up with a sure-fire
resolution?
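
One way to start narrowing the logs down is to pull out only the warning and error lines from the ten-minute incident window mentioned earlier. The Python sketch below is a minimal example under stated assumptions: it assumes each cluster.log line looks like `pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message`, the sample lines are hypothetical rather than taken from the attached logs, and the function name is invented for illustration. Cluster log timestamps are normally recorded in UTC, so the window may need shifting from local time.

```python
import re
from datetime import datetime

# Assumed line layout: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message"
STAMP = re.compile(r"::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2})\.\d{3}\s+(\w+)")


def filter_window(lines, start, end, levels=("WARN", "ERR")):
    """Keep lines whose timestamp is within [start, end] and whose level matches."""
    hits = []
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip lines that do not carry a timestamp
        when = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S")
        if start <= when <= end and match.group(2) in levels:
            hits.append(line)
    return hits


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "00000a2c.00000b10::2013/12/16-11:49:58.101 INFO  [NM] hypothetical message before the window",
    "00000a2c.00000b10::2013/12/16-11:52:03.417 WARN  [RES] hypothetical warning inside the window",
    "00000a2c.00000b14::2013/12/16-11:52:04.002 ERR   [RES] hypothetical error inside the window",
    "00000a2c.00000b14::2013/12/16-11:53:10.250 INFO  [NM] hypothetical info inside the window",
    "00000a2c.00000b10::2013/12/16-12:00:30.000 ERR   [RES] hypothetical error after the window",
]

hits = filter_window(
    SAMPLE,
    datetime(2013, 12, 16, 11, 50, 0),
    datetime(2013, 12, 16, 12, 0, 0),
)
for line in hits:
    print(line)
```

Running the same filter over the logs from both nodes and comparing the first warning or error on each side can show which node noticed the problem first.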
 
LVL 18

Assisted Solution

by:Netflo
Netflo earned 375 total points
If you open a support case with Microsoft Support, they will tell you that it's recommended to apply the hotfixes - I've gone through the same issue recently.

The only advice I can offer is to get your hardware and software up to date. Cover the basics, such as changing cables. Plug in a switch to isolate the heartbeat network.

Lastly, disable 'large send offload' on your NICs; does that improve your situation?