Solved

How to tune the MSCS heartbeat interval & the number of missed heartbeats before a node is declared unavailable

Posted on 2013-06-03
Last Modified: 2013-06-21
http://blogs.technet.com/b/aevalshah/archive/2012/05/15/windows-server-2008-r2-failover-clustering-best-practice-guide.aspx

Refer to the link above.  How do we tune the default heartbeat on
Win 2008 R2 x64 Enterprise to, say, one heartbeat every 5 secs,
tolerating misses for 2-3 mins before a node is declared unavailable?
Please give step-by-step instructions with screenshots.

Any caveats/shortcomings to tuning?  For example, will the cluster
take longer to fail over in the event of a genuine outage/crash of
the active node?  How do we get the best of both worlds, i.e.
resiliency to network slowness (during high backup traffic at night)
& yet a short failover duration?

We're running MS SQL 2008 R2 Enterprise server databases on
the pair of MSCS & ran into a few incidents of Quorum disk
(& 2 occasions of SAN disks) disappearing for several minutes.

There was also a case of a duplicate IP address, which we traced to
both nodes becoming active at the same time, so the MS SQL VIP
became duplicated (owned by both nodes at the same time).


Heartbeat
========
By default, each cluster node sends a heartbeat every second, and five missed heartbeats are allowed before a node is deemed unavailable. Both values can be changed: the interval can be increased, and the threshold can be increased or decreased. This is of particular interest on a high-latency network (WAN) when implementing a geographically dispersed cluster.
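
The relationship between the two settings can be sketched with a little arithmetic. This is a simplified model, not the cluster service's actual logic: it assumes the detection window is simply interval multiplied by threshold. The function names are made up for illustration, and whether Windows Server 2008 R2 accepts values as large as the ones asked about is not confirmed here; check the documentation for the permitted ranges.

```python
def detection_window_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time before a silent node is declared unavailable.

    Simplified model: heartbeat interval (ms) times allowed missed heartbeats.
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


def misses_needed(delay_ms: int, window_seconds: float) -> int:
    """Smallest threshold whose detection window covers window_seconds."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    window_ms = int(round(window_seconds * 1000))
    # Integer ceiling division: round up to a whole number of heartbeats.
    return -(-window_ms // delay_ms)


if __name__ == "__main__":
    # Defaults described above: 1 heartbeat per second, 5 misses allowed.
    print(detection_window_seconds(1000, 5))
    # The request in the question: a heartbeat every 5 secs for 2 to 3 mins.
    print(misses_needed(5000, 120), misses_needed(5000, 180))
```

Under this model the defaults give a 5-second window, and a 5-second interval held for 2 to 3 minutes would need a threshold of 24 to 36 missed heartbeats.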
Question by:sunhux
LVL 33

Assisted Solution

by:paulmacd
paulmacd earned 495 total points
ID: 39216817
"We're running MS SQL 2008 R2 Enterprise server databases on
the pair of MSCS & ran into a few incidents of Quorum disk
(& 2 occasions of SAN disks) disappearing for several minutes."

Heartbeat aside, this is a problem.

Also, it's not clear if you're running a geographically dispersed cluster or not.  If not, your best bet is to take the heartbeat off the common network and establish a dedicated cluster heartbeat connection - even if it's just a crossover cable from a dedicated NIC on one node to a dedicated NIC on another.  My point being, if heartbeat latency is an issue, it's better to address the latency than it is to pretend it isn't there.

As to the loss of the cluster resources, again, there are ways to allow for temporary outages, but in the long run you'll be better off stopping the outages to begin with.

Changing the thresholds for any resource (including the heartbeat) will mean it takes longer for the cluster to detect (and therefore react to) the loss of a resource.
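
To make that trade-off concrete, here is a rough sketch comparing a few heartbeat settings against a network stall. The candidate values and the 8-second stall are hypothetical, chosen only for illustration. As far as I know the two values correspond to the SameSubnetDelay and SameSubnetThreshold cluster properties, but permitted ranges differ between Windows versions and are not checked here.

```python
from typing import NamedTuple


class HeartbeatSetting(NamedTuple):
    delay_ms: int    # interval between heartbeats
    threshold: int   # missed heartbeats allowed before the node is declared down

    @property
    def window_s(self) -> float:
        # Simplified model: detection window = interval * threshold.
        return self.delay_ms * self.threshold / 1000.0


DEFAULT = HeartbeatSetting(1000, 5)  # defaults quoted in the question


def evaluate(setting: HeartbeatSetting, stall_s: float) -> tuple[bool, float]:
    """Return (rides out the stall?, extra seconds to detect a real crash)."""
    rides_out = setting.window_s > stall_s
    extra_delay = setting.window_s - DEFAULT.window_s
    return rides_out, extra_delay


if __name__ == "__main__":
    stall_s = 8.0  # hypothetical congestion burst during backups
    for s in (DEFAULT, HeartbeatSetting(1000, 10), HeartbeatSetting(2000, 10)):
        rides_out, extra = evaluate(s, stall_s)
        print(f"{s.delay_ms} ms x {s.threshold}: window {s.window_s:.0f}s, "
              f"rides out stall: {rides_out}, extra failover delay: {extra:.0f}s")
```

Every second added to the window to survive congestion is a second added to the failover time after a genuine crash, which is why fixing the network is the better long-term option.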

Author Comment

by:sunhux
ID: 39218739
it's not a geographically separated pair of cluster nodes,
just about 10m apart from each other in different racks
within the same DC.

Having said that, can someone still give me the step by
step instruction on tuning the heartbeat?

On the Production network, the customer currently also
runs NetBackup, & I've seen how badly that network gets
congested: comparing a daily backup (when not many
servers are backing up over the Prod LAN) with monthend
backups, the backup of the same server (with the same
amt of data to back up) can take up to 7 times longer
during monthend.

I can't be certain that the LAN congestion is triggering
all of this, but it's likely.
LVL 33

Assisted Solution

by:paulmacd
paulmacd earned 495 total points
ID: 39218919
But you're talking about two different things.  The loss of connectivity to the cluster resources is not the same as the heartbeat timeout.

Disk resources should be on a dedicated SAN (FC or iSCSI) - certainly not some iSCSI connection that sits on top of the internal network traffic.

Microsoft recommends (as do I) that you put the heartbeat on a dedicated network.  It's easy to do, and a much better solution than trying to "tune" the heartbeat.  That said, and while I cannot emphasize enough how bad an idea it is to muck about with these settings for a LAN, here's the information you're looking for:
http://technet.microsoft.com/en-us/library/af8540dd-25cc-418c-a693-8c3b556b83e4

Author Comment

by:sunhux
ID: 39219973
Attached is a warning message that I got when running the
MSCS test.  Is this related to temporary quorum disk loss?

Ok, if the heartbeat is not a related issue to the quorum
disk loss, what are the possible root causes & ways to
address it?

The SAN vendor/support said there's no problem with the
SAN; otherwise they would have received alerts.
MSCSCluerr.jpg

Author Comment

by:sunhux
ID: 39219979
loss of connectivity to the cluster resources is now our main
issue/concern

Author Comment

by:sunhux
ID: 39219997
On 2 previous occasions when we lost the cluster disk (ie SAN disk)
resources, we got a duplicate IP.  I'm guessing both nodes became
active at the same time & both fought to grab the cluster disk
resources: correct me if my 'theory' is wrong.

Author Comment

by:sunhux
ID: 39220018
Below is an output of the cluster properties, if it's of any help:

C:\Windows\System32>cluster /cluster:NNXXXHH1VIR03.NHG.LOCAL res
Listing status for all available resources:

Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
Cluster Disk 1       Cluster Group        NNXXXHH1DBS02   Online
Cluster Disk 2       NNXXXHH1VIR05        NNXXXHH1DBS02   Online
Cluster Disk 3       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster Disk 4       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster Disk 5       SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
Cluster IP Address   Cluster Group        NNXXXHH1DBS02   Online
Cluster Name         Cluster Group        NNXXXHH1DBS02   Online
FileServer-(NNXXXHH1VIR04)(Cluster Disk 4) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
IP Address 10.a.b.129.210 NNXXXHH1VIR05        NNXXXHH1DBS02   Online
MSDTC-NNXXXHH1VIR05  NNXXXHH1VIR05        NNXXXHH1DBS02   Online
NNXXXHH1VIR05        NNXXXHH1VIR05        NNXXXHH1DBS02   Online
SQL IP Address 1 (NNXXXHH1VIR04) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Network Name (NNXXXHH1VIR04) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
SQL Server Agent (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
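
A listing like this can be checked quickly with a small script. The sketch below is only illustrative: it assumes the node name and status are the last two whitespace-separated tokens on each row and that the status is a single word, which holds for the rows shown here but may not hold for every state the tool can print. The function names are made up, and the sample contains only three of the rows above.

```python
def parse_resource_listing(text: str) -> list[tuple[str, str]]:
    """Return (owner_node, status) for each resource row in the listing."""
    rows = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("----"):
            in_table = True  # resource rows begin after the dashed separator
            continue
        if not in_table or not stripped:
            continue
        # Resource and group names contain spaces, so split from the right.
        parts = stripped.rsplit(None, 2)
        if len(parts) == 3:
            _name_and_group, node, status = parts
            rows.append((node, status))
    return rows


def summarize(rows: list[tuple[str, str]]) -> tuple[set[str], list[tuple[str, str]]]:
    """Return the set of owner nodes and any rows whose status is not Online."""
    owners = {node for node, _ in rows}
    not_online = [(node, status) for node, status in rows if status != "Online"]
    return owners, not_online


SAMPLE = """\
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
Cluster Disk 1       Cluster Group        NNXXXHH1DBS02   Online
Cluster IP Address   Cluster Group        NNXXXHH1DBS02   Online
SQL Server (XXXSQLHHHPRD) SQL Server (XXXSQLHHHPRD) NNXXXHH1DBS02   Online
"""

if __name__ == "__main__":
    owners, not_online = summarize(parse_resource_listing(SAMPLE))
    print("owner nodes:", sorted(owners))
    print("resources not online:", not_online)
```

In the output posted above, every resource is Online and owned by NNXXXHH1DBS02, so a check like this would report a single owner and nothing offline.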
LVL 33

Accepted Solution

by:paulmacd
paulmacd earned 495 total points
ID: 39220051
The cluster validation image you posted doesn't contain any errors - it simply notes you only have one network path from DBS01 to DBS02.  If that one path fails, the cluster will be in an undetermined state.  Neither node will know who owns the cluster.

Your theory about both nodes going "active" sounds right to me.  Does your SAN sit on top of your LAN, or does it have a dedicated set of hardware?  That is, can you get to your SAN when your LAN is disconnected?  They should be separate interfaces.
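
For context on why the quorum disk matters when that single path fails, here is a minimal sketch of majority voting. It assumes a two-node cluster with a disk witness where each node and the witness hold one vote; that configuration is a guess, not something confirmed in this thread. It is a simplified model and does not reproduce the duplicate-IP incident itself.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition keeps running only with a strict majority of all votes."""
    return votes_held * 2 > total_votes


TOTAL_VOTES = 3  # assumed: node 1 + node 2 + disk witness

if __name__ == "__main__":
    # Link between the nodes fails; the node that still holds the disk has 2 votes.
    print("node holding witness:", has_quorum(2, TOTAL_VOTES))
    print("node without witness:", has_quorum(1, TOTAL_VOTES))
    # Link fails AND the witness disk disappears: each node has only its own vote.
    print("either node, no witness:", has_quorum(1, TOTAL_VOTES))
```

In this model, with the witness reachable, only one side can win after the nodes lose contact with each other. If the witness disk disappears at the same moment, neither side has a majority, which is why losing the single network path and the quorum disk together is so disruptive.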