Our cluster service was 'lost' for several minutes
yesterday (and this is a repeat incident); please refer
to the attached screenshots.
What could be the cause of this issue?
Is there a resolution or workaround for it?
Currently the heartbeat goes through the production
network. Would setting up a dedicated heartbeat link (i.e.
a direct crossover cable between the 2 member servers of
the cluster) help?
Or can we tune the heartbeat interval and the number of
missed heartbeats allowed, to make the cluster more
resilient and so address this issue? If so, can you point
me to a link or provide instructions on how to tune these?
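To make the tuning question concrete, here is a rough
sketch of how I assume the two settings interact: a node
would be declared down after roughly (heartbeat interval x
allowed missed heartbeats) seconds of silence. The function
names and the numbers are hypothetical and for illustration
only; they are not taken from any cluster product's
documentation or defaults.

```python
def failure_detection_window(interval_s: float, missed_threshold: int) -> float:
    """Seconds without heartbeats before a node is declared down.

    Assumption: the window is simply interval x allowed missed heartbeats.
    """
    return interval_s * missed_threshold


def survives_outage(outage_s: float, interval_s: float, missed_threshold: int) -> bool:
    """True if a network blip of outage_s seconds is shorter than the window."""
    return outage_s < failure_detection_window(interval_s, missed_threshold)


# Hypothetical values: an 8-second blip on the production network.
tight = survives_outage(8, interval_s=1, missed_threshold=5)     # 5 s window
relaxed = survives_outage(8, interval_s=2, missed_threshold=10)  # 20 s window
print(f"tight settings survive: {tight}, relaxed settings survive: {relaxed}")
```

If this model is right, widening the window should ride out
short network blips, at the cost of slower detection of a
genuinely failed node.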
Could this be due to differences in the various firmware
versions (UEFI and others) between the 2 member servers?
IBM found the versions to be different but can't pinpoint
whether this issue is caused by the firmware version
differences between the member nodes.
I would prefer not to attach the Event Viewer logs here as
they contain sensitive info.