Solved

How to tune the MSCS heartbeat interval & the number of missed heartbeats before a node is declared unavailable

Posted on 2013-06-03
Last Modified: 2013-06-21
http://blogs.technet.com/b/aevalshah/archive/2012/05/15/windows-server-2008-r2-failover-clustering-best-practice-guide.aspx

Refer to the URL above.  How do we tune the default heartbeat on Win 2008
R2 x64 Enterprise to, say, one heartbeat every 5 secs, tolerated for 2-3 mins?
Step-by-step instructions with screenshots, please.

Are there any caveats or shortcomings to tuning?  For example, will the
cluster take longer to fail over if there's a genuine outage/crash of the
active node?  How do we get the best of both worlds, i.e. resiliency to
network slowness (during high backup traffic at night) while still keeping
the failover duration short?

We're running MS SQL 2008 R2 Enterprise server databases on
the pair of MSCS & ran into a few incidents of Quorum disk
(& 2 occasions of SAN disks) disappearing for several minutes.

There was also a case of a duplicate IP address, which we traced to
both nodes becoming active at the same time, so the MS SQL VIP
was duplicated (owned by both nodes at the same time).


Heartbeat
=========
The heartbeat for the cluster nodes is set by default to send a heartbeat every second. The heartbeat configuration also by default will allow five missed heartbeats before a cluster node is deemed as unavailable. This can be configured to increase the interval and increase/decrease the threshold. This is of particular interest when using a high latency network (WAN) when implementing a geographically dispersed cluster.
Question by:sunhux
 
LVL 34

Assisted Solution

by:Paul MacDonald
Paul MacDonald earned 495 total points
ID: 39216817
"We're running MS SQL 2008 R2 Enterprise server databases on
the pair of MSCS & ran into a few incidents of Quorum disk
(& 2 occasions of SAN disks) disappearing for several minutes."

Heartbeat aside, this is a problem.

Also, it's not clear if you're running a geographically dispersed cluster or not.  If not, your best bet is to take the heartbeat off the common network and establish a dedicated cluster heartbeat connection - even if it's just a crossover cable from a dedicated NIC on one node to a dedicated NIC on the other.  My point being, if heartbeat latency is an issue, it's better to address the latency than it is to pretend it isn't there.

As to the loss of the cluster resources, again, there are ways to allow for temporary outages, but in the long run you'll be better off stopping the outages to begin with.

Changing the thresholds for any resource (including the heartbeat) will mean it takes longer for the cluster to detect (and therefore react to) the loss of a resource.
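
To put rough numbers on that trade-off: as a simplified model (an assumption, not something taken from the linked articles), the time to notice a dead node is about the heartbeat interval multiplied by the number of missed heartbeats allowed. The sketch below uses hypothetical helper names; the 1-second / 5-miss defaults come from the question, and the 5-second / 2-3 minute figures are the asker's proposal.

```python
import math


def detection_seconds(interval_s, missed_allowed):
    """Approximate time before a silent node is declared unavailable.

    Simplified model (assumption): interval multiplied by allowed misses.
    """
    return interval_s * missed_allowed


def threshold_for_window(interval_s, window_s):
    """Smallest missed-heartbeat count that rides out window_s of silence."""
    return math.ceil(window_s / interval_s)


if __name__ == "__main__":
    # Defaults quoted in the question: one heartbeat per second, five misses.
    print("default detection time (s):", detection_seconds(1, 5))

    # Asker's proposal: a heartbeat every 5 s, tolerated for 2 to 3 minutes.
    for window_s in (120, 180):
        misses = threshold_for_window(5, window_s)
        print(window_s, "s window ->", misses, "missed heartbeats,",
              "detection time (s):", detection_seconds(5, misses))
```

Under this model the default setup notices a dead node in about 5 seconds, whereas the proposed settings would leave a genuinely crashed node undetected for the full 2 to 3 minutes before failover even starts.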
 

Author Comment

by:sunhux
ID: 39218739
It's not a geographically separated pair of cluster nodes;
they're about 10m apart, in different racks within the
same DC.

Having said that, can someone still give me step-by-step
instructions on tuning the heartbeat?

The customer currently also runs NetBackup over the
Production network.  Comparing a daily backup (when not
many servers are backing up over the Prod LAN) with the
month-end backups, the backup of the same server, with
the same amount of data, can take up to 7 times longer
at month-end, when the Production network is congested.

I can't be certain that LAN congestion is triggering
all of this, but it's likely.
 
LVL 34

Assisted Solution

by:Paul MacDonald
Paul MacDonald earned 495 total points
ID: 39218919
But you're talking about two different things.  The loss of connectivity to the cluster resources is not the same as the heartbeat timeout.

Disk resources should be on a dedicated SAN (FC or iSCSI) - certainly not some iSCSI connection that sits on top of the internal network traffic.

Microsoft recommends (as do I) that you put the heartbeat on a dedicated network.  It's easy to do, and a much better solution than trying to "tune" the heartbeat.  That said, and while I cannot emphasize enough how bad an idea it is to muck about with these settings for a LAN, here's the information you're looking for:
http://technet.microsoft.com/en-us/library/af8540dd-25cc-418c-a693-8c3b556b83e4
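
For reference, a minimal sketch of what the change might look like from the command line. It assumes the same-subnet heartbeat is governed by the cluster common properties SameSubnetDelay (milliseconds between heartbeats) and SameSubnetThreshold (missed heartbeats tolerated), set through the /prop switch of cluster.exe. Those property names are not quoted anywhere in this thread, so check them, and the value ranges Windows Server 2008 R2 actually accepts, against the linked article before running anything. The helper name and MYCLUSTER are placeholders; the code only builds command strings and does not touch a cluster.

```python
def heartbeat_commands(cluster_name, delay_ms, threshold):
    """Build cluster.exe command lines for same-subnet heartbeat tuning.

    Assumption: SameSubnetDelay / SameSubnetThreshold are the relevant
    property names; verify them and their allowed ranges on your version.
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    base = f"cluster /cluster:{cluster_name} /prop"
    return [
        f"{base} SameSubnetDelay={delay_ms}",
        f"{base} SameSubnetThreshold={threshold}",
    ]


if __name__ == "__main__":
    # Hypothetical values: 2-second interval, 10 missed heartbeats allowed.
    for line in heartbeat_commands("MYCLUSTER", 2000, 10):
        print(line)
```

Running `cluster /cluster:MYCLUSTER /prop` with no assignment should list the current values, which is worth recording before any change. Note that the asker's proposed 5000 ms interval may fall outside the range the platform permits, in which case the command would be rejected.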
 

Author Comment

by:sunhux
ID: 39219973
Attached is a warning message that I got when running the
MSCS test.  Is this related to temporary quorum disk loss?

Ok, if the heartbeat is not a related issue to the quorum
disk loss, what are the possible root causes & ways to
address it?

SAN vendor/support said there's no problem with the
SAN; otherwise they would have received alerts.
MSCSCluerr.jpg
 

Author Comment

by:sunhux
ID: 39219979
Loss of connectivity to the cluster resources is now our main
issue/concern.
 

Author Comment

by:sunhux
ID: 39219997
Previously, on 2 occasions when we lost the cluster disk (i.e. SAN disk)
resources, we got a duplicate IP.  I'm guessing both nodes became active
at the same time & fought to grab the cluster disk resources.
Correct me if my 'theory' is wrong.
 

Author Comment

by:sunhux
ID: 39220018
Below is an output of the cluster properties, if it's of any help:

C:\Windows\System32>cluster /cluster:NNXXXHH1VIR03.NHG.LOCAL res
Listing status for all available resources:

Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
Cluster Disk 1       Cluster Group        NNXXXHH1DBS02   Online
Cluster Disk 2       NNXXXHH1VIR05        NNXXXHH1DBS02   Online
Cluster Disk 3       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster Disk 4       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster Disk 5       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster IP Address   Cluster Group        NNXXXHH1DBS02   Online
Cluster Name         Cluster Group        NNXXXHH1DBS02   Online
FileServer-(NNXXXHH1VIR04)(Cluster Disk 4) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
IP Address 10.a.b.129.210 NNXXXHH1VIR05        NNXXXHH1DBS02   Online
MSDTC-NNXXXHH1VIR05  NNXXXHH1VIR05        NNXXXHH1DBS02   Online
NNXXXHH1VIR05        NNXXXHH1VIR05        NNXXXHH1DBS02   Online
SQL IP Address 1 (NNXXXHH1VIR04) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Network Name (NNXXXHH1VIR04) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server Agent (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
 
LVL 34

Accepted Solution

by:Paul MacDonald
Paul MacDonald earned 495 total points
ID: 39220051
The cluster validation image you posted doesn't contain any errors - it simply notes you only have one network path from DBS01 to DBS02.  If that one path fails, the cluster will be in an undetermined state.  Neither node will know who owns the cluster.

Your theory about both nodes going "active" sounds right to me.  Does your SAN sit on top of your LAN, or does it have a dedicated set of hardware?  That is, can you get to your SAN when your LAN is disconnected?  They should be separate interfaces.
