  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 5496

Cluster connection issues

I had an issue today where all my clusters (4) showed loss of communication on the "heartbeat" and "public" networks at roughly the same time, and for the same duration. This article (http://support.microsoft.com/kb/892422) states that it is likely not a network issue, but something else. Any ideas?
Asked:
ktpoitm
1 Solution
 
Acosta Technology ServicesCommented:
Are you using a switch to manage the cluster?  Are the heartbeat and LAN both running through the same switch?
 
ktpoitmAuthor Commented:
Both nodes are on blades that are in 2 different enclosures. They are both HP BL20p G3s. I found no hardware similarities; all nodes are spread across 5 different blade enclosures. The only common link they have is to our core switch, but I see no errors in the logs on the switch.
 
Acosta Technology ServicesCommented:
How are you performing the heartbeat for the clusters?  Did you dedicate a port on a specific enclosure fabric to be the heartbeat port, or are all of the enclosures only connected to the core switch?
 
ktpoitmAuthor Commented:
Each blade has 4 NICs: 3 are teamed for the public address (10.0.1.x) and 1 is for the heartbeat (192.168.75.x). They connect back to the integrated switch on the blade enclosure, but it's pretty much just a passthrough. The enclosure switches are then connected directly to the core. Each enclosure switch has two ports set up, one for vlan 10 (server vlan, 10.0.1.x) and one for vlan 9 (management vlan, 10.0.9.x).
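
Editor's sketch of the addressing described in this post, in case it helps when sorting log entries or captures by network. It only checks that the three subnets do not overlap and reports which one an address belongs to. The /24 masks are an assumption (the post gives only 10.0.1.x, 10.0.9.x and 192.168.75.x), and the names `NETWORKS`, `overlapping_pairs` and `network_of` are made up for illustration.

```python
import ipaddress
from itertools import combinations

# Assumption: /24 masks. The post only states 10.0.1.x, 10.0.9.x and 192.168.75.x.
NETWORKS = {
    "public": ipaddress.ip_network("10.0.1.0/24"),         # vlan 10, three teamed NICs
    "management": ipaddress.ip_network("10.0.9.0/24"),     # vlan 9
    "heartbeat": ipaddress.ip_network("192.168.75.0/24"),  # single dedicated NIC
}


def overlapping_pairs(networks):
    """Return the names of any two networks whose address ranges overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def network_of(address, networks=NETWORKS):
    """Return the name of the network containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, net in networks.items():
        if ip in net:
            return name
    return None


print(overlapping_pairs(NETWORKS))
print(network_of("10.0.1.25"))
```

With these assumed masks no pair overlaps, and an address such as 10.0.1.25 falls in the public range, so heartbeat and public traffic are at least logically separate even though both end up on the same core switch.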
 
ktpoitmAuthor Commented:
I spoke to the person who helped configure this. He says it's using the internal switch to switch communication.
 
arnoldCommented:
Check for events on the enclosures as well as on the core switch, as operationnos pointed out earlier.

A loss of communication across all interfaces would originate where they are all aggregated, and in your case they all aggregate on the core switch.
 
ktpoitmAuthor Commented:
Already checked the core switch...no event. There are 32 other blades that also use the same vlan (10.0.1.x) as the public adapters and no comm issues. Only issues were with the clusters and on both the public and heartbeat adapters. Idk...this one has me stumped. This incident was isolated to 4 clusters that are spread on 5 different blade enclosures.
 
arnoldCommented:
Two enclosures with two blade servers in each had a hiccup.
Do you have data collected on the network traffic to see whether there was network saturation?
Are the 32 other blades part of a cluster across enclosures?
One possibility is a hiccup among the nodes while they determine who is in charge; another is a node that temporarily stopped receiving requests.
 
ktpoitmAuthor Commented:
I have 5 blade enclosures and 4 clusters. All 4 clusters are on separate blades spread across the enclosures (no cluster has both sides in the same chassis). They all had a simultaneous hiccup. I don't have any network traffic logs yet to see if there was any saturation. Looking into it. The other blades are just standalone servers in the chassis, but they all share the same vlan.
 
ktpoitmAuthor Commented:
Are you saying they have an "intentional" hiccup to see who is the primary and who is the secondary?
 
ktpoitmAuthor Commented:
Below is a snapshot of the system log, in case my explanation of what happened isn't clear.

9/12/2011      11:26:35 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      11:24:41 AM      ClusSvc      Warning      Node Mgr       1135      N/A      KTPO16      Cluster node KTPO15 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
9/12/2011      11:24:35 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:24:35 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:23:17 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Symantec AntiVirus service entered the running state.
9/12/2011      11:23:00 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The SAVRT service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Windows Installer service entered the running state.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Windows Installer service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Network Location Awareness (NLA) service entered the running state.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Network Location Awareness (NLA) service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Altiris Kernel Driver service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The Distributed Transaction Coordinator service was successfully sent a stop control.
9/12/2011      11:22:46 AM      ClusSvc      Information      Startup/Shutdown       1062      N/A      KTPO15      Cluster service successfully joined the server cluster MESMCS.
9/12/2011      11:22:33 AM      W32Time      Information      None      35      N/A      KTPO15      The time service is now synchronizing the system time with the time source 10.0.1.6 (ntp.m|0x0|10.0.1.25:123->10.0.1.6:123).
9/12/2011      11:22:26 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO15      The time delta between node KTPO15 and node KTPO16 is 60696876(in 100 nanosecs).
9/12/2011      11:22:23 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      11:22:23 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      11:22:21 AM      Foundation Agents      Information      Service       400      N/A      KTPO15      The Foundation Agents service version 8.50.0.0 has started.
9/12/2011      11:22:20 AM      Storage Agents      Information      Service       400      N/A      KTPO15      The Storage Agents service version 8.50.0.0 has started.
9/12/2011      11:22:20 AM      Server Agents      Information      Service       400      N/A      KTPO15      The Server Agents service version 8.50.0.0 has started.
9/12/2011      11:22:19 AM      NIC Agents      Information      Service       277      N/A      KTPO15      The NIC Management Agent version 8.50.0.0 has started.
9/12/2011      11:22:18 AM      SNMP      Information      None      1001      N/A      KTPO15      The SNMP Service has started successfully.
9/12/2011      11:22:17 AM      HP System Management Homepage      Information      Service       9      N/A      KTPO15      The HP System Management Homepage Win32 service has been started successfully.
9/12/2011      11:22:13 AM      IPSec      Information      None      4294      N/A      KTPO15      The IPSec driver has entered Secure mode. IPSec policies, if they have been configured, are now being applied to this computer.
9/12/2011      11:22:04 AM      cpqriis      Information      None      105      N/A      KTPO15      The service was started.
9/12/2011      11:22:03 AM      AeLookupSvc      Information      None      3      N/A      KTPO15      The Application Experience Lookup service started successfully.
9/12/2011      11:21:48 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:48 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:48 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:47 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:45 AM      IPSec      Information      None      4295      N/A      KTPO15      The IPSec Driver is starting in Bypass mode. No IPSec security is being applied while this computer starts up. IPSec policies, if they have been assigned, will be applied to this computer after the IPSec services start.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:43 AM      CPQCISSE      Information      None      24685      N/A      KTPO15      "The Event Notification driver Cpqcisse.sys of
Array Controller [Embedded] has started."
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:21:42 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:21:42 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F300000900000860000; first path SCSI address Port 2 B0 T0 L2.
9/12/2011      11:21:42 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk1 created.
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F3000009000007D0000; first path SCSI address Port 2 B0 T0 L1.
9/12/2011      11:21:42 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk0 created.
9/12/2011      11:21:23 AM      hpeaadsm      Information      None      109      N/A      KTPO15      The DSM (version 2.1.2.130) has been started successfully.
9/12/2011      11:21:51 AM      DCOM      Information      None      10026      N/A      KTPO15      The COM sub system is suppressing duplicate event log entries for a duration of 86400 seconds.  The suppression timeout can be controlled by a REG_DWORD value named SuppressDuplicateDuration under the following registry key: HKLM\Software\Microsoft\Ole\EventLog.
9/12/2011      11:21:51 AM      EventLog      Information      None      6005      N/A      KTPO15      The Event log service was started.
9/12/2011      11:21:51 AM      EventLog      Information      None      6009      N/A      KTPO15      Microsoft (R) Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.
9/12/2011      11:22:34 AM      Foundation Agents      Warning      Events       1171      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has become degraded.
[SNMP TRAP: 15003 in CPQCLUS.MIB]"
9/12/2011      11:22:17 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Public' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:22:17 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:22:16 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:22:16 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:16:34 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      11:15:08 AM      ClusSvc      Warning      Node Mgr       1135      N/A      KTPO16      Cluster node KTPO15 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
9/12/2011      11:15:04 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:15:04 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:10:47 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Symantec AntiVirus service entered the running state.
9/12/2011      11:10:30 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The SAVRT service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Windows Installer service entered the running state.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Windows Installer service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Network Location Awareness (NLA) service entered the running state.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Network Location Awareness (NLA) service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Altiris Kernel Driver service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The Distributed Transaction Coordinator service was successfully sent a stop control.
9/12/2011      11:10:16 AM      ClusSvc      Information      Startup/Shutdown       1062      N/A      KTPO15      Cluster service successfully joined the server cluster MESMCS.
9/12/2011      11:10:04 AM      W32Time      Information      None      35      N/A      KTPO15      The time service is now synchronizing the system time with the time source 10.0.1.6 (ntp.m|0x0|10.0.1.25:123->10.0.1.6:123).
9/12/2011      11:10:02 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO15      The time delta between node KTPO15 and node KTPO16 is 58631404(in 100 nanosecs).
9/12/2011      11:09:52 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      11:09:52 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      11:09:52 AM      Foundation Agents      Information      Service       400      N/A      KTPO15      The Foundation Agents service version 8.50.0.0 has started.
9/12/2011      11:09:50 AM      Storage Agents      Information      Service       400      N/A      KTPO15      The Storage Agents service version 8.50.0.0 has started.
9/12/2011      11:09:50 AM      Server Agents      Information      Service       400      N/A      KTPO15      The Server Agents service version 8.50.0.0 has started.
9/12/2011      11:09:50 AM      NIC Agents      Information      Service       277      N/A      KTPO15      The NIC Management Agent version 8.50.0.0 has started.
9/12/2011      11:09:49 AM      SNMP      Information      None      1001      N/A      KTPO15      The SNMP Service has started successfully.
9/12/2011      11:09:48 AM      HP System Management Homepage      Information      Service       9      N/A      KTPO15      The HP System Management Homepage Win32 service has been started successfully.
9/12/2011      11:09:44 AM      IPSec      Information      None      4294      N/A      KTPO15      The IPSec driver has entered Secure mode. IPSec policies, if they have been configured, are now being applied to this computer.
9/12/2011      11:09:38 AM      cpqriis      Information      None      105      N/A      KTPO15      The service was started.
9/12/2011      11:09:37 AM      AeLookupSvc      Information      None      3      N/A      KTPO15      The Application Experience Lookup service started successfully.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:18 AM      IPSec      Information      None      4295      N/A      KTPO15      The IPSec Driver is starting in Bypass mode. No IPSec security is being applied while this computer starts up. IPSec policies, if they have been assigned, will be applied to this computer after the IPSec services start.
9/12/2011      11:09:18 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:18 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:17 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:17 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:17 AM      CPQCISSE      Information      None      24685      N/A      KTPO15      "The Event Notification driver Cpqcisse.sys of
Array Controller [Embedded] has started."
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F300000900000860000; first path SCSI address Port 2 B0 T0 L2.
9/12/2011      11:09:16 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk1 created.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F3000009000007D0000; first path SCSI address Port 2 B0 T0 L1.
9/12/2011      11:09:16 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk0 created.
9/12/2011      11:09:01 AM      hpeaadsm      Information      None      109      N/A      KTPO15      The DSM (version 2.1.2.130) has been started successfully.
9/12/2011      11:09:25 AM      DCOM      Information      None      10026      N/A      KTPO15      The COM sub system is suppressing duplicate event log entries for a duration of 86400 seconds.  The suppression timeout can be controlled by a REG_DWORD value named SuppressDuplicateDuration under the following registry key: HKLM\Software\Microsoft\Ole\EventLog.
9/12/2011      11:09:24 AM      EventLog      Information      None      6005      N/A      KTPO15      The Event log service was started.
9/12/2011      11:09:24 AM      EventLog      Information      None      6009      N/A      KTPO15      Microsoft (R) Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.
9/12/2011      11:10:10 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO16      The time delta between node KTPO16 and node KTPO15 is -57084471(in 100 nanosecs).
9/12/2011      11:09:48 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Public' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:09:48 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:09:46 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:09:46 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:08:34 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      11:07:20 AM      ClusSvc      Warning      Node Mgr       1135      N/A      KTPO16      Cluster node KTPO15 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
9/12/2011      11:07:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:07:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      10:43:48 AM      TermService      Error      None      1041      N/A      KTPO16      Autoreconnect failed to reconnect user to session because authentication failed. (0x0)
9/12/2011      10:11:52 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The WMI Performance Adapter service entered the stopped state.
9/12/2011      10:11:52 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The WMI Performance Adapter service entered the running state.
9/12/2011      10:11:52 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO16      The WMI Performance Adapter service was successfully sent a start control.
9/12/2011      10:11:32 AM      ClusSvc      Information      Failover Mgr       1204      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Group 0"" offline."
9/12/2011      10:11:32 AM      ClusSvc      Information      Failover Mgr       1203      N/A      KTPO15      "The Cluster Service is attempting to offline the Resource Group ""Group 0""."
9/12/2011      10:11:36 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The CIMPLICITY HMI Service service entered the running state.
9/12/2011      10:11:36 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO16      The CIMPLICITY HMI Service service was successfully sent a start control.
9/12/2011      10:11:33 AM      ClusSvc      Information      Failover Mgr       1205      N/A      KTPO16      "The Cluster Service failed to bring the Resource Group ""Group 0"" completely online or offline."
9/12/2011      10:11:32 AM      ClusSvc      Information      Failover Mgr       1200      N/A      KTPO16      "The Cluster Service is attempting to bring online the Resource Group ""Group 0""."
9/12/2011      10:11:26 AM      ClusSvc      Information      Failover Mgr       1201      N/A      KTPO16      "The Cluster Service brought the Resource Group ""Cluster Group"" online."
9/12/2011      10:11:17 AM      ClusSvc      Information      Failover Mgr       1204      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Cluster Group"" offline."
9/12/2011      10:11:17 AM      ClusSvc      Information      Failover Mgr       1203      N/A      KTPO15      "The Cluster Service is attempting to offline the Resource Group ""Cluster Group""."
9/12/2011      10:11:04 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The CIMPLICITY HMI Service service entered the stopped state.
9/12/2011      10:11:17 AM      ClusSvc      Information      Failover Mgr       1200      N/A      KTPO16      "The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."
9/12/2011      10:10:33 AM      Foundation Agents      Warning      Events       1167      N/A      KTPO16      "Cluster Agent: The cluster resource KUB1_AND1 has become degraded.
[SNMP TRAP: 15005 in CPQCLUS.MIB]"
9/12/2011      10:10:33 AM      Foundation Agents      Warning      Events       1167      N/A      KTPO16      "Cluster Agent: The cluster resource KUB1_QUL1 has become degraded.
[SNMP TRAP: 15005 in CPQCLUS.MIB]"
9/12/2011      10:10:33 AM      Foundation Agents      Warning      Events       1167      N/A      KTPO16      "Cluster Agent: The cluster resource KUB1_MCS1 has become degraded.
[SNMP TRAP: 15005 in CPQCLUS.MIB]"
9/12/2011      10:01:45 AM      ClusSvc      Information      Failover Mgr       1201      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Group 0"" online."
9/12/2011      10:01:39 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The WMI Performance Adapter service entered the stopped state.
9/12/2011      10:01:39 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The WMI Performance Adapter service entered the running state.
9/12/2011      10:01:39 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The WMI Performance Adapter service was successfully sent a start control.
9/12/2011      10:01:30 AM      ClusSvc      Information      Failover Mgr       1201      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Cluster Group"" online."
9/12/2011      10:01:25 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Public' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      10:01:25 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      10:01:23 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The CIMPLICITY HMI Service service entered the running state.
9/12/2011      10:01:23 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The CIMPLICITY HMI Service service was successfully sent a start control.
9/12/2011      10:01:22 AM      ClusSvc      Information      Failover Mgr       1200      N/A      KTPO15      "The Cluster Service is attempting to bring online the Resource Group ""Group 0""."
9/12/2011      10:01:26 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The Cluster Service service entered the running state.
9/12/2011      10:01:26 AM      ClusSvc      Information      Startup/Shutdown       1062      N/A      KTPO16      Cluster service successfully joined the server cluster MESMCS.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      10:01:08 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The CIMPLICITY HMI Service service entered the stopped state.
9/12/2011      10:00:33 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO16 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Service Control Manager      Error      None      7031      N/A      KTPO16      The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:21 AM      ClusNet      Error      None      1118      N/A      KTPO16      Cluster service was terminated as requested by Node 1.
9/12/2011      10:00:21 AM      ClusNet      Error      None      1118      N/A      KTPO16      Cluster service was terminated as requested by Node 1.
9/12/2011      9:59:52 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:52 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:43 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:42 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      10:00:09 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO16      The time delta between node KTPO16 and node KTPO15 is -2883631(in 100 nanosecs).
9/12/2011      9:59:27 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:26 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:25 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:25 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:19 AM      ClusSvc      Information      Node Mgr       1128      N/A      KTPO15      Cluster network 'Heart Beat' is operational (up). All available server cluster nodes attached to the network can communicate using it.
9/12/2011      9:59:19 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO15' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      9:59:19 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      9:59:17 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:17 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      10:00:09 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO16      The time delta between node KTPO16 and node KTPO15 is 317891460(in 100 nanosecs).
9/12/2011      10:00:06 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      10:00:04 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      10:00:03 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      10:00:02 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:40 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:39 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:38 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:38 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:34 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:32 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:28 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:27 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1130      N/A      KTPO15      Cluster network 'Public' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO15' on network 'Public' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO15. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Public' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO16. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1130      N/A      KTPO15      Cluster network 'Heart Beat' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO15' on network 'Heart Beat' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO15. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Heart Beat' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO16. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:20 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:20 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      1:00:00 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Performance Logs and Alerts service entered the running state.
 
arnoldCommented:
The heartbeat is the only service that is self-monitoring, i.e. each node expects to receive an event from the "active node". As soon as the active node seems to become inaccessible, the other nodes, depending on the policy setting, will either assert that they are now the active node or attempt to poll to establish which node is to become the active node.
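
As a rough illustration of that idea, here is a minimal Python sketch of heartbeat-timeout tracking. It is not how the Windows Cluster Service is implemented; the class name `HeartbeatMonitor`, its methods, and the 5-second timeout are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer on each network."""
    timeout_s: float = 5.0  # hypothetical value, not a real ClusSvc setting
    last_seen: dict = field(default_factory=dict)

    def record(self, peer: str, network: str, now: float) -> None:
        # Called whenever a heartbeat arrives from `peer` over `network`.
        self.last_seen[(peer, network)] = now

    def lost_networks(self, peer: str, now: float) -> list:
        # Networks on which `peer` has been silent for longer than the timeout.
        return sorted(
            net for (p, net), seen in self.last_seen.items()
            if p == peer and now - seen > self.timeout_s
        )

    def peer_unreachable(self, peer: str, now: float) -> bool:
        # The peer counts as unreachable only if every known network is silent.
        nets = [net for (p, net) in self.last_seen if p == peer]
        return bool(nets) and len(self.lost_networks(peer, now)) == len(nets)
```

In this sketch a node would only start the "who becomes active" decision once `peer_unreachable` returns true, which is why losing the heartbeat and public networks at the same moment matters more than losing one of them.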

The errors seem to suggest that, once the 'Heart Beat' communications were lost, the heartbeat connection was re-established over the 'Public' network.

The log is between KTPO15 and KTPO16.
In reality this may mean that the problem is with the heartbeat connection between KTPO15 and KTPO16 on the 192.168.75.x network.
The error on KTPO15 seems to suggest that the system panicked after a disk issue.
9/12/2011      11:26:35 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
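
One way to check this reading against the pasted log is to count the "lost" (event 1123) and "(re)established" (event 1122) entries per reporting node and network. The following is a small Python sketch under the assumption that the export keeps the column order shown in this thread; the regular expression and the `tally` helper are written for this example and are not part of any cluster tooling.

```python
import re
from collections import Counter

# Matches the 1122/1123 Node Mgr lines as they appear in the pasted export.
LINE_RE = re.compile(
    r"\s(?P<event>1122|1123)\s+N/A\s+(?P<reporter>\S+)\s+"
    r"The node .*? cluster node '(?P<peer>[^']+)' on network '(?P<network>[^']+)'"
)


def tally(lines):
    """Count lost (1123) and re-established (1122) events per reporter and network."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            kind = "lost" if m.group("event") == "1123" else "reestablished"
            counts[(m.group("reporter"), m.group("network"), kind)] += 1
    return counts


# Three lines copied from the log above, used here only as sample input.
sample = [
    "9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.",
    "9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.",
    "9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.",
]

for key, n in sorted(tally(sample).items()):
    print(key, n)
```

Running the same function over the full export from each cluster would show whether both networks dropped together on every node, which would point away from a single heartbeat link.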
 
ktpoitmAuthor Commented:
It wasn't the entire issue.