Solved

How to tune the MSCS heartbeat interval & the number of missed heartbeats before a node is declared unavailable

Posted on 2013-06-03
Last Modified: 2013-06-21
http://blogs.technet.com/b/aevalshah/archive/2012/05/15/windows-server-2008-r2-failover-clustering-best-practice-guide.aspx

Refer to the URL above.  How do we tune the default Win 2008 R2 x64
Enterprise heartbeat to, say, every 5 secs for 2-3 mins?  Step-by-step,
screenshot-by-screenshot instructions, please.

Any caveats/shortcomings to tuning?  Say, will the cluster take longer
to fail over in the event of a genuine outage/crash of the active
node?  How do we get the best of both worlds, i.e. resiliency to network
slowness (during high backup traffic at night) & yet a short
failover duration?

We're running MS SQL 2008 R2 Enterprise server databases on
the pair of MSCS & ran into a few incidents of Quorum disk
(& 2 occasions of SAN disks) disappearing for several minutes.

There was also a case of a duplicate IP address, & we traced it
to both nodes becoming active at the same time, so the MS
SQL VIP became duplicated (owned by both nodes at the same time).


Heartbeat
========
By default, each cluster node sends a heartbeat every second, and five missed heartbeats are allowed before a node is deemed unavailable. Both the interval and the threshold can be changed. This is of particular interest on a high-latency network (WAN) when implementing a geographically dispersed cluster.
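
To make the arithmetic concrete, here is a minimal Python sketch of how the interval and threshold combine into a detection window. It assumes the window is simply interval times missed-heartbeat threshold; the function names are hypothetical and only illustrate the numbers mentioned in this question.

```python
def detection_window_s(delay_ms: int, threshold: int) -> float:
    """Seconds of missed heartbeats tolerated before a node is declared down."""
    return delay_ms * threshold / 1000.0


def heartbeats_needed(delay_ms: int, tolerance_s: float) -> int:
    """Smallest threshold that covers tolerance_s at the given interval."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    tolerance_ms = int(round(tolerance_s * 1000))
    return -(-tolerance_ms // delay_ms)


# Defaults quoted above: one heartbeat per second, five misses allowed.
print(detection_window_s(1000, 5))      # 5.0
# The target in the question: a heartbeat every 5 s, tolerated for 2 to 3 minutes.
print(heartbeats_needed(5000, 120))     # 24
print(heartbeats_needed(5000, 180))     # 36
```

So the defaults give roughly a 5-second window, while the target asked about here would need 24 to 36 missed heartbeats at a 5-second interval. Whether the product accepts values that large is a separate matter.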
Question by:sunhux
8 Comments
 
LVL 34

Assisted Solution

by:Paul MacDonald
Paul MacDonald earned 495 total points
ID: 39216817
"We're running MS SQL 2008 R2 Enterprise server databases on
the pair of MSCS & ran into a few incidents of Quorum disk
(& 2 occasions of SAN disks) disappearing for several minutes."

Heartbeat aside, this is a problem.

Also, it's not clear if you're running a geographically dispersed cluster or not.  If not, your best bet is to take the heartbeat off the common network and establish a dedicated cluster heartbeat connection - even if it's just a crossover cable from a dedicated NIC on one node to a dedicated NIC on the other.  My point being, if heartbeat latency is an issue, it's better to address the latency than to pretend it isn't there.

As to the loss of the cluster resources, again, there are ways to allow for temporary outages, but in the long run you'll be better off stopping the outages to begin with.

Changing the thresholds for any resource (including the heartbeat) will mean it takes longer for the cluster to detect (and therefore react to) the loss of a resource.
 

Author Comment

by:sunhux
ID: 39218739
It's not a geographically separated pair of cluster nodes;
they're just about 10m apart from each other, in different racks
within the same DC.

Having said that, can someone still give me step-by-step
instructions for tuning the heartbeat?

The customer currently also runs NetBackup over the Production
network, & I've seen what happens when that network is
congested: comparing a daily backup (when not many servers
are backing up over the Prod LAN) with month-end backups,
the backup of the same server (with the same amount of data)
can take up to 7 times longer at month-end.

I can't be certain that the LAN congestion is triggering
all this, but it's likely.
 
LVL 34

Assisted Solution

by:Paul MacDonald
Paul MacDonald earned 495 total points
ID: 39218919
But you're talking about two different things.  The loss of connectivity to the cluster resources is not the same as the heartbeat timeout.

Disk resources should be on a dedicated SAN (FC or iSCSI) - certainly not some iSCSI connection that sits on top of the internal network traffic.

Microsoft recommends (as do I) that you put the heartbeat on a dedicated network.  It's easy to do, and a much better solution than trying to "tune" the heartbeat.  That said, and while I cannot emphasize enough how bad an idea it is to muck about with these settings for a LAN, here's the information you're looking for:
http://technet.microsoft.com/en-us/library/af8540dd-25cc-418c-a693-8c3b556b83e4
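
For orientation only, here is a hedged Python sketch that builds the kind of `cluster.exe /prop` command lines used to change these settings. The property names `SameSubnetDelay` and `SameSubnetThreshold` are the cluster's same-subnet heartbeat properties; the numeric limits below are assumptions as I recall them for Windows Server 2008 R2 and should be confirmed against Microsoft's documentation. `MYCLUSTER` and `build_command` are hypothetical names.

```python
# Assumed limits for Windows Server 2008 R2; confirm against Microsoft's
# documentation for your build before applying anything.
LIMITS = {
    "SameSubnetDelay": (250, 2000),     # milliseconds between heartbeats
    "SameSubnetThreshold": (3, 10),     # missed heartbeats tolerated
}


def build_command(cluster_name: str, prop: str, value: int) -> str:
    """Return a cluster.exe command line that sets one heartbeat property."""
    low, high = LIMITS[prop]
    if not low <= value <= high:
        raise ValueError(f"{prop}={value} is outside the assumed range {low}-{high}")
    return f"cluster /cluster:{cluster_name} /prop {prop}={value}"


if __name__ == "__main__":
    name = "MYCLUSTER"  # hypothetical cluster name
    print(build_command(name, "SameSubnetDelay", 2000))
    print(build_command(name, "SameSubnetThreshold", 10))
    # Largest window under the assumed limits: 2000 ms x 10 = 20 s,
    # well short of the 2 to 3 minutes asked for.
    try:
        build_command(name, "SameSubnetDelay", 5000)
    except ValueError as err:
        print(err)
```

If those assumed limits hold, a 5-second interval sustained for minutes is not reachable through these properties, which is one more reason to fix the network path rather than stretch the timeout.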

 

Author Comment

by:sunhux
ID: 39219973
Attached is a warning message that I got when running the
MSCS test.  Is this related to temporary quorum disk loss?

Ok, if the heartbeat is not a related issue to the quorum
disk loss, what are the possible root causes & ways to
address it?

SAN vendor/support said there's no problem with the
SAN; otherwise they would have received alerts.
MSCSCluerr.jpg
 

Author Comment

by:sunhux
ID: 39219979
Loss of connectivity to the cluster resources is now our main
issue/concern.
 

Author Comment

by:sunhux
ID: 39219997
On 2 previous occasions when cluster disk (i.e. SAN disk) resources
were lost, we got a duplicate IP.  I'm guessing both nodes became active
at the same time & both fought to grab the cluster disk resources.
Correct me if my 'theory' is wrong.
 

Author Comment

by:sunhux
ID: 39220018
Below is an output of the cluster properties, if it's of any help:

C:\Windows\System32>cluster /cluster:NNXXXHH1VIR03.NHG.LOCAL res
Listing status for all available resources:

Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
Cluster Disk 1       Cluster Group        NNXXXHH1DBS02   Online
Cluster Disk 2       NNXXXHH1VIR05        NNXXXHH1DBS02   Online
Cluster Disk 3       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster Disk 4       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster Disk 5       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster IP Address   Cluster Group        NNXXXHH1DBS02   Online
Cluster Name         Cluster Group        NNXXXHH1DBS02   Online
FileServer-(NNXXXHH1VIR04)(Cluster Disk 4) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
IP Address 10.a.b.129.210 NNXXXHH1VIR05        NNXXXHH1DBS02   Online
MSDTC-NNXXXHH1VIR05  NNXXXHH1VIR05        NNXXXHH1DBS02   Online
NNXXXHH1VIR05        NNXXXHH1VIR05        NNXXXHH1DBS02   Online
SQL IP Address 1 (NNXXXHH1VIR04) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Network Name (NNXXXHH1VIR04) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server Agent (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
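
A listing like this can be checked mechanically for resources that are split across nodes or not Online. The Python sketch below is one way to do it, under two assumptions: the owner node and the status are the last two whitespace-separated fields on each row, and the status is a single word. `owners` and `SAMPLE` are hypothetical names, and the sample rows are copied from the output above.

```python
from collections import Counter

SAMPLE = """\
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
Cluster Disk 1       Cluster Group        NNXXXHH1DBS02   Online
Cluster Disk 3       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server Agent (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
"""


def owners(listing: str) -> Counter:
    """Count resources per (node, status) pair in 'cluster res' output."""
    lines = listing.splitlines()
    # Data rows follow the dashed separator; without one, treat every line as data.
    for i, line in enumerate(lines):
        if line.startswith("---"):
            lines = lines[i + 1:]
            break
    counts = Counter()
    for line in lines:
        # Long names overflow their columns, so split from the right instead.
        parts = line.rsplit(None, 2)
        if len(parts) == 3:
            counts[(parts[1], parts[2])] += 1
    return counts


counts = owners(SAMPLE)
print(sorted({node for node, _ in counts}))   # ['NNXXXHH1DBS02']
```

In the output posted here every resource is Online on the same node, so this snapshot shows no split ownership. It would need to be captured during an incident to say anything about the duplicate IP.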
 
LVL 34

Accepted Solution

by:Paul MacDonald
Paul MacDonald earned 495 total points
ID: 39220051
The cluster validation image you posted doesn't contain any errors - it simply notes you only have one network path from DBS01 to DBS02.  If that one path fails, the cluster will be in an undetermined state.  Neither node will know who owns the cluster.

Your theory about both nodes going "active" sounds right to me.  Does your SAN sit on top of your LAN, or does it have a dedicated set of hardware?  That is, can you get to your SAN when your LAN is disconnected?  They should be separate interfaces.


Question has a verified solution.
