sunhux
asked on
Win 2008 R2 cluster quorum disk 'disappeared' temporarily
We had a cluster service 'lost' for several minutes yesterday
(& this is a repeat incident): refer to attached screens.
Q1:
What could be the cause of the issue?
Q2:
Any resolution or workaround for this issue?
Q3:
Currently the heartbeat goes through the Production network.
Will setting up a dedicated heartbeat link (ie a direct crossover
cable between the 2 member servers of the cluster) resolve this?
Q4:
Or can we tune/tweak the heartbeat interval & the number
of missed heartbeats to make the cluster more resilient,
ie to address this issue? If so, can you point me to a link or
provide instructions on how to tune these?
Q5:
Can this be due to differences in firmware versions (UEFI, etc.)
between the 2 member servers? IBM found the versions to be
different but can't pinpoint that this issue is due to the
difference in firmware versions between the member nodes.
I prefer not to attach the Event Viewer logs here as they
contain sensitive info.
1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
10.jpg
ClusharedVols.jpg
Clustorage.jpg
ASKER
I ran "cluster /prop" & the output is below. Comparing it with all our
other clusters, only one line (EnableSharedVolumes) is different (although
I have one DR cluster where it is also 1 & which has never had this issue).
Attached are the cluster.log files from both nodes of the cluster,
which I've sanitized: look for the entries between 2003/12/16-11:50 and
2003/12/16-12:00.
Listing properties for 'NNXXXPP1VIR03':

T  Cluster        Name                             Value
-- -------------- -------------------------------- -----------------------
DR NNXXXPP1VIR03  FixQuorum                        0 (0x0)
DR NNXXXPP1VIR03  IgnorePersistentStateOnStartup   0 (0x0)
SR NNXXXPP1VIR03  SharedVolumesRoot                C:\ClusterStorage
D  NNXXXPP1VIR03  AddEvictDelay                    60 (0x3c)
D  NNXXXPP1VIR03  BackupInProgress                 0 (0x0)
D  NNXXXPP1VIR03  ClusSvcHangTimeout               60 (0x3c)
D  NNXXXPP1VIR03  ClusSvcRegroupOpeningTimeout     5 (0x5)
D  NNXXXPP1VIR03  ClusSvcRegroupPruningTimeout     5 (0x5)
D  NNXXXPP1VIR03  ClusSvcRegroupStageTimeout       7 (0x7)
D  NNXXXPP1VIR03  ClusSvcRegroupTickInMilliseconds 300 (0x12c)
D  NNXXXPP1VIR03  ClusterGroupWaitDelay            30 (0x1e)
D  NNXXXPP1VIR03  ClusterLogLevel                  3 (0x3)
D  NNXXXPP1VIR03  ClusterLogSize                   100 (0x64)
D  NNXXXPP1VIR03  CrossSubnetDelay                 1000 (0x3e8)
D  NNXXXPP1VIR03  CrossSubnetThreshold             5 (0x5)
D  NNXXXPP1VIR03  DefaultNetworkRole               2 (0x2)
S  NNXXXPP1VIR03  Description
D  NNXXXPP1VIR03  EnableSharedVolumes              1 (0x1) <==
D  NNXXXPP1VIR03  HangRecoveryAction               3 (0x3)
D  NNXXXPP1VIR03  LogResourceControls              0 (0x0)
D  NNXXXPP1VIR03  PlumbAllCrossSubnetRoutes        0 (0x0)
D  NNXXXPP1VIR03  QuorumArbitrationTimeMax         20 (0x14)
D  NNXXXPP1VIR03  RequestReplyTimeout              60 (0x3c)
D  NNXXXPP1VIR03  RootMemoryReserved               4294967295 (0xffffffff)
D  NNXXXPP1VIR03  SameSubnetDelay                  1000 (0x3e8)
D  NNXXXPP1VIR03  SameSubnetThreshold              5 (0x5)
B  NNXXXPP1VIR03  Security Descriptor              01 00 04 80 ... (280 bytes)
D  NNXXXPP1VIR03  SecurityLevel                    1 (0x1)
M  NNXXXPP1VIR03  SharedVolumeCompatibleFilters
M  NNXXXPP1VIR03  SharedVolumeIncompatibleFilters
D  NNXXXPP1VIR03  ShutdownTimeoutInMinutes         20 (0x14)
D  NNXXXPP1VIR03  WitnessDatabaseWriteTimeout      300 (0x12c)
D  NNXXXPP1VIR03  WitnessRestartInterval           15 (0xf)
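
Relating this to Q4: the heartbeat settings in the listing are SameSubnetDelay / SameSubnetThreshold (and the CrossSubnet pair). Below is a rough Python sketch, not part of any cluster tooling, that reads those lines and works out how long heartbeats can go missing before a node is treated as down. It assumes the tolerance is simply delay (ms) x threshold (missed heartbeats); the helper names and the 2000 / 10 "what if" values are only illustrative, and the permitted ranges should be checked against Microsoft's documentation for 2008 R2.

```python
import re

# Heartbeat-related lines copied from the "cluster /prop" listing above.
PROP_OUTPUT = """\
D  NNXXXPP1VIR03  CrossSubnetDelay      1000 (0x3e8)
D  NNXXXPP1VIR03  CrossSubnetThreshold  5 (0x5)
D  NNXXXPP1VIR03  SameSubnetDelay       1000 (0x3e8)
D  NNXXXPP1VIR03  SameSubnetThreshold   5 (0x5)
"""

# Matches DWORD ("D") property lines: type, cluster name, property, decimal (hex).
LINE_RE = re.compile(r"^D\s+\S+\s+(\w+)\s+(\d+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_dword_props(text):
    """Return {property name: integer value} for every DWORD line in the text."""
    props = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            props[match.group(1)] = int(match.group(2))
    return props


def tolerance_ms(delay_ms, threshold):
    """Assumed model: a node is declared down after `threshold` missed
    heartbeats sent every `delay_ms` milliseconds."""
    return delay_ms * threshold


props = parse_dword_props(PROP_OUTPUT)
current = tolerance_ms(props["SameSubnetDelay"], props["SameSubnetThreshold"])
relaxed = tolerance_ms(2000, 10)  # hypothetical values, for comparison only

print("Current same-subnet tolerance: %d ms" % current)
print("With delay=2000, threshold=10: %d ms" % relaxed)
```

With the values in the listing this works out to about 5 seconds of tolerance on the same subnet, so any network stall longer than that on the heartbeat path could be enough to trigger a regroup.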
DB1clusani.txt
DB2clusani.txt
ASKER
> are your NIC configured to go to sleep - this is a default setting under the
> NIC power management setting, turn that off.
No, it's certainly not; otherwise I would have received Tivoli alerts for
ping timeouts to the IP addresses of both DB servers' teamed NICs.
> validate your cluster as that may highlight some issues
We have run this & got only a couple of warnings, with the rest
Successful. I can post it this Mon/Tue after sanitizing the
outputs. For sure, it's the same result as when we first
installed the cluster & if any of the errors/warnings were serious,
cluster installation would not have been allowed by Windows
in the 1st place, right?
ASKER
Yes, all components have been rebooted.
Well, I'm not sure if it's really necessary to apply the hotfixes as the
other sites' clusters use exactly the same Win 2008 R2 with the same
patch level.
The only difference is the hardware type & firmware revision:
this site uses a slightly newer version of IBM x3850 hardware
compared to the other sites, which are stable.
Is there anything we can analyse from the logs to pin-point
the root cause so that we can come up with a sure-fire
resolution?
ASKER
We have a similar setup at a few other sites' clusters but
have not encountered this issue there.
This issue is intermittent, ie it happens every couple
of weeks or months.