Solved

Cluster connection issues

Posted on 2011-09-12
Last Modified: 2012-05-12
I had an issue today where all my clusters (4) showed loss of communication on the "heartbeat" and "public" networks at roughly the same time, and for the same duration. This article (http://support.microsoft.com/kb/892422) states that it is likely not a network issue, but something else. Any ideas?
Question by:ktpoitm
13 Comments
 
LVL 8

Expert Comment

by:Acosta Technology Services
ID: 36524420
Are you using a switch to manage the cluster?  Are the heartbeat and LAN both running through the same switch?
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36524443
Both nodes are on blades that are in 2 different enclosures. They are both HP BL20p G3s. I found no hardware in common; all nodes are spread across 5 different blade enclosures. The only common link they have is to our core switch, but I see no errors in the logs on the switch.
0
 
LVL 8

Expert Comment

by:Acosta Technology Services
ID: 36524473
How are you performing the heartbeat for the clusters?  Did you dedicate a port on a specific enclosure fabric to be the heartbeat port, or are all of the enclosures only connected to the core switch?
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36524530
Each blade has 4 NICs: 3 are teamed for the public address (10.0.1.x) and 1 is for the heartbeat (192.168.75.x). They connect back to the integrated switch on the blade enclosure, but it's pretty much just a pass-through. The enclosure switches are then connected directly to the core. Each enclosure switch has two ports set up, one for VLAN 10 (server VLAN, 10.0.1.x) and one for VLAN 9 (management VLAN, 10.0.9.x).
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36524958
I spoke to the person who helped configure this. He says it's using the internal switch to switch communication.
0
 
LVL 76

Expert Comment

by:arnold
ID: 36525175
Check for events on the enclosures as well as on the core switch, as operationnos pointed out earlier.

A loss of communication across all interfaces would originate where they are all aggregated, and in your case they all aggregate on the core switch.
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36525214
Already checked the core switch...no event. There are 32 other blades that also use the same vlan (10.0.1.x) as the public adapters and no comm issues. Only issues were with the clusters and on both the public and heartbeat adapters. Idk...this one has me stumped. This incident was isolated to 4 clusters that are spread on 5 different blade enclosures.
0
 
LVL 76

Expert Comment

by:arnold
ID: 36525258
Two enclosures with two blade servers in each had a hiccup.
Do you have any collected data on the network traffic to see whether there was network saturation?
Are the 32 other blades part of a cluster across enclosures?
One possibility is a hiccup among the nodes over which one is in charge; another is a node that temporarily stopped receiving requests.
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36525300
I have 5 blade enclosures and 4 clusters. All 4 clusters are on separate blades spread across the enclosures (no cluster has both sides in the same chassis). They all had a simultaneous hiccup. I don't have any network traffic logs yet to see if there was any saturation. Looking into it. The other blades are just standalone servers in the chassis, but they all share the same VLAN.
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36525311
Are you saying they have an "intentional" hiccup to see who is the primary and who is the secondary?
0
 
LVL 1

Author Comment

by:ktpoitm
ID: 36525396
Below is a snapshot of the system log, in case my explanation of what happened isn't clear.

9/12/2011      11:26:35 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      11:24:41 AM      ClusSvc      Warning      Node Mgr       1135      N/A      KTPO16      Cluster node KTPO15 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
9/12/2011      11:24:35 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:24:35 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:23:17 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Symantec AntiVirus service entered the running state.
9/12/2011      11:23:00 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The SAVRT service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Windows Installer service entered the running state.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Windows Installer service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Network Location Awareness (NLA) service entered the running state.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Network Location Awareness (NLA) service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Altiris Kernel Driver service was successfully sent a start control.
9/12/2011      11:22:49 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The Distributed Transaction Coordinator service was successfully sent a stop control.
9/12/2011      11:22:46 AM      ClusSvc      Information      Startup/Shutdown       1062      N/A      KTPO15      Cluster service successfully joined the server cluster MESMCS.
9/12/2011      11:22:33 AM      W32Time      Information      None      35      N/A      KTPO15      The time service is now synchronizing the system time with the time source 10.0.1.6 (ntp.m|0x0|10.0.1.25:123->10.0.1.6:123).
9/12/2011      11:22:26 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO15      The time delta between node KTPO15 and node KTPO16 is 60696876(in 100 nanosecs).
9/12/2011      11:22:23 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      11:22:23 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      11:22:21 AM      Foundation Agents      Information      Service       400      N/A      KTPO15      The Foundation Agents service version 8.50.0.0 has started.
9/12/2011      11:22:20 AM      Storage Agents      Information      Service       400      N/A      KTPO15      The Storage Agents service version 8.50.0.0 has started.
9/12/2011      11:22:20 AM      Server Agents      Information      Service       400      N/A      KTPO15      The Server Agents service version 8.50.0.0 has started.
9/12/2011      11:22:19 AM      NIC Agents      Information      Service       277      N/A      KTPO15      The NIC Management Agent version 8.50.0.0 has started.
9/12/2011      11:22:18 AM      SNMP      Information      None      1001      N/A      KTPO15      The SNMP Service has started successfully.
9/12/2011      11:22:17 AM      HP System Management Homepage      Information      Service       9      N/A      KTPO15      The HP System Management Homepage Win32 service has been started successfully.
9/12/2011      11:22:13 AM      IPSec      Information      None      4294      N/A      KTPO15      The IPSec driver has entered Secure mode. IPSec policies, if they have been configured, are now being applied to this computer.
9/12/2011      11:22:04 AM      cpqriis      Information      None      105      N/A      KTPO15      The service was started.
9/12/2011      11:22:03 AM      AeLookupSvc      Information      None      3      N/A      KTPO15      The Application Experience Lookup service started successfully.
9/12/2011      11:21:48 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:48 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:48 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:47 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:21:45 AM      IPSec      Information      None      4295      N/A      KTPO15      The IPSec Driver is starting in Bypass mode. No IPSec security is being applied while this computer starts up. IPSec policies, if they have been assigned, will be applied to this computer after the IPSec services start.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:21:45 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:21:45 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:44 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:21:43 AM      CPQCISSE      Information      None      24685      N/A      KTPO15      "The Event Notification driver Cpqcisse.sys of
Array Controller [Embedded] has started."
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:21:42 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:21:42 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F300000900000860000; first path SCSI address Port 2 B0 T0 L2.
9/12/2011      11:21:42 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk1 created.
9/12/2011      11:21:42 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F3000009000007D0000; first path SCSI address Port 2 B0 T0 L1.
9/12/2011      11:21:42 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk0 created.
9/12/2011      11:21:23 AM      hpeaadsm      Information      None      109      N/A      KTPO15      The DSM (version 2.1.2.130) has been started successfully.
9/12/2011      11:21:51 AM      DCOM      Information      None      10026      N/A      KTPO15      The COM sub system is suppressing duplicate event log entries for a duration of 86400 seconds.  The suppression timeout can be controlled by a REG_DWORD value named SuppressDuplicateDuration under the following registry key: HKLM\Software\Microsoft\Ole\EventLog.
9/12/2011      11:21:51 AM      EventLog      Information      None      6005      N/A      KTPO15      The Event log service was started.
9/12/2011      11:21:51 AM      EventLog      Information      None      6009      N/A      KTPO15      Microsoft (R) Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.
9/12/2011      11:22:34 AM      Foundation Agents      Warning      Events       1171      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has become degraded.
[SNMP TRAP: 15003 in CPQCLUS.MIB]"
9/12/2011      11:22:17 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Public' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:22:17 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:22:16 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:22:16 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:16:34 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      11:15:08 AM      ClusSvc      Warning      Node Mgr       1135      N/A      KTPO16      Cluster node KTPO15 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
9/12/2011      11:15:04 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:15:04 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:10:47 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Symantec AntiVirus service entered the running state.
9/12/2011      11:10:30 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The SAVRT service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Windows Installer service entered the running state.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Windows Installer service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Network Location Awareness (NLA) service entered the running state.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Network Location Awareness (NLA) service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      NT AUTHORITY\SYSTEM      KTPO15      The Altiris Kernel Driver service was successfully sent a start control.
9/12/2011      11:10:18 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The Distributed Transaction Coordinator service was successfully sent a stop control.
9/12/2011      11:10:16 AM      ClusSvc      Information      Startup/Shutdown       1062      N/A      KTPO15      Cluster service successfully joined the server cluster MESMCS.
9/12/2011      11:10:04 AM      W32Time      Information      None      35      N/A      KTPO15      The time service is now synchronizing the system time with the time source 10.0.1.6 (ntp.m|0x0|10.0.1.25:123->10.0.1.6:123).
9/12/2011      11:10:02 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO15      The time delta between node KTPO15 and node KTPO16 is 58631404(in 100 nanosecs).
9/12/2011      11:09:52 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      11:09:52 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      11:09:52 AM      Foundation Agents      Information      Service       400      N/A      KTPO15      The Foundation Agents service version 8.50.0.0 has started.
9/12/2011      11:09:50 AM      Storage Agents      Information      Service       400      N/A      KTPO15      The Storage Agents service version 8.50.0.0 has started.
9/12/2011      11:09:50 AM      Server Agents      Information      Service       400      N/A      KTPO15      The Server Agents service version 8.50.0.0 has started.
9/12/2011      11:09:50 AM      NIC Agents      Information      Service       277      N/A      KTPO15      The NIC Management Agent version 8.50.0.0 has started.
9/12/2011      11:09:49 AM      SNMP      Information      None      1001      N/A      KTPO15      The SNMP Service has started successfully.
9/12/2011      11:09:48 AM      HP System Management Homepage      Information      Service       9      N/A      KTPO15      The HP System Management Homepage Win32 service has been started successfully.
9/12/2011      11:09:44 AM      IPSec      Information      None      4294      N/A      KTPO15      The IPSec driver has entered Secure mode. IPSec policies, if they have been configured, are now being applied to this computer.
9/12/2011      11:09:38 AM      cpqriis      Information      None      105      N/A      KTPO15      The service was started.
9/12/2011      11:09:37 AM      AeLookupSvc      Information      None      3      N/A      KTPO15      The Application Experience Lookup service started successfully.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:21 AM      q57w2k      Information      None      11      N/A      KTPO15      HP NC7781 Gigabit Server: Network controller configured for 1Gb full-duplex link.
9/12/2011      11:09:18 AM      IPSec      Information      None      4295      N/A      KTPO15      The IPSec Driver is starting in Bypass mode. No IPSec security is being applied while this computer starts up. IPSec policies, if they have been assigned, will be applied to this computer after the IPSec services start.
9/12/2011      11:09:18 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:18 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:17 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:17 AM      q57w2k      Information      None      15      N/A      KTPO15      HP NC7781 Gigabit Server: Driver initialized successfully.
9/12/2011      11:09:17 AM      CPQCISSE      Information      None      24685      N/A      KTPO15      "The Event Notification driver Cpqcisse.sys of
Array Controller [Embedded] has started."
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 3 B0 T0 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L2) was added to existing multipath capable disk 600508B400105F300000900000860000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      102      N/A      KTPO15      A new path (SCSI address Port 2 B0 T1 L1) was added to existing multipath capable disk 600508B400105F3000009000007D0000.
9/12/2011      11:09:16 AM      mpio      Information      None      2      N/A      KTPO15      Added device to \Device\MPIODisk0. DumpData contains the current number of paths.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F300000900000860000; first path SCSI address Port 2 B0 T0 L2.
9/12/2011      11:09:16 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk1 created.
9/12/2011      11:09:16 AM      hpeaadsm      Information      None      101      N/A      KTPO15      Discovered a new multipath capable disk with serial number 600508B400105F3000009000007D0000; first path SCSI address Port 2 B0 T0 L1.
9/12/2011      11:09:16 AM      mpio      Information      None      1      N/A      KTPO15      \Device\MPIODisk0 created.
9/12/2011      11:09:01 AM      hpeaadsm      Information      None      109      N/A      KTPO15      The DSM (version 2.1.2.130) has been started successfully.
9/12/2011      11:09:25 AM      DCOM      Information      None      10026      N/A      KTPO15      The COM sub system is suppressing duplicate event log entries for a duration of 86400 seconds.  The suppression timeout can be controlled by a REG_DWORD value named SuppressDuplicateDuration under the following registry key: HKLM\Software\Microsoft\Ole\EventLog.
9/12/2011      11:09:24 AM      EventLog      Information      None      6005      N/A      KTPO15      The Event log service was started.
9/12/2011      11:09:24 AM      EventLog      Information      None      6009      N/A      KTPO15      Microsoft (R) Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.
9/12/2011      11:10:10 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO16      The time delta between node KTPO16 and node KTPO15 is -57084471(in 100 nanosecs).
9/12/2011      11:09:48 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Public' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:09:48 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO16      The interface for cluster node 'KTPO15' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      11:09:46 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:09:46 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      11:08:34 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      11:07:20 AM      ClusSvc      Warning      Node Mgr       1135      N/A      KTPO16      Cluster node KTPO15 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
9/12/2011      11:07:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      11:07:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      10:43:48 AM      TermService      Error      None      1041      N/A      KTPO16      Autoreconnect failed to reconnect user to session because authentication failed. (0x0)
9/12/2011      10:11:52 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The WMI Performance Adapter service entered the stopped state.
9/12/2011      10:11:52 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The WMI Performance Adapter service entered the running state.
9/12/2011      10:11:52 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO16      The WMI Performance Adapter service was successfully sent a start control.
9/12/2011      10:11:32 AM      ClusSvc      Information      Failover Mgr       1204      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Group 0"" offline."
9/12/2011      10:11:32 AM      ClusSvc      Information      Failover Mgr       1203      N/A      KTPO15      "The Cluster Service is attempting to offline the Resource Group ""Group 0""."
9/12/2011      10:11:36 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The CIMPLICITY HMI Service service entered the running state.
9/12/2011      10:11:36 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO16      The CIMPLICITY HMI Service service was successfully sent a start control.
9/12/2011      10:11:33 AM      ClusSvc      Information      Failover Mgr       1205      N/A      KTPO16      "The Cluster Service failed to bring the Resource Group ""Group 0"" completely online or offline."
9/12/2011      10:11:32 AM      ClusSvc      Information      Failover Mgr       1200      N/A      KTPO16      "The Cluster Service is attempting to bring online the Resource Group ""Group 0""."
9/12/2011      10:11:26 AM      ClusSvc      Information      Failover Mgr       1201      N/A      KTPO16      "The Cluster Service brought the Resource Group ""Cluster Group"" online."
9/12/2011      10:11:17 AM      ClusSvc      Information      Failover Mgr       1204      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Cluster Group"" offline."
9/12/2011      10:11:17 AM      ClusSvc      Information      Failover Mgr       1203      N/A      KTPO15      "The Cluster Service is attempting to offline the Resource Group ""Cluster Group""."
9/12/2011      10:11:04 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The CIMPLICITY HMI Service service entered the stopped state.
9/12/2011      10:11:17 AM      ClusSvc      Information      Failover Mgr       1200      N/A      KTPO16      "The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."
9/12/2011      10:10:33 AM      Foundation Agents      Warning      Events       1167      N/A      KTPO16      "Cluster Agent: The cluster resource KUB1_AND1 has become degraded.
[SNMP TRAP: 15005 in CPQCLUS.MIB]"
9/12/2011      10:10:33 AM      Foundation Agents      Warning      Events       1167      N/A      KTPO16      "Cluster Agent: The cluster resource KUB1_QUL1 has become degraded.
[SNMP TRAP: 15005 in CPQCLUS.MIB]"
9/12/2011      10:10:33 AM      Foundation Agents      Warning      Events       1167      N/A      KTPO16      "Cluster Agent: The cluster resource KUB1_MCS1 has become degraded.
[SNMP TRAP: 15005 in CPQCLUS.MIB]"
9/12/2011      10:01:45 AM      ClusSvc      Information      Failover Mgr       1201      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Group 0"" online."
9/12/2011      10:01:39 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The WMI Performance Adapter service entered the stopped state.
9/12/2011      10:01:39 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The WMI Performance Adapter service entered the running state.
9/12/2011      10:01:39 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The WMI Performance Adapter service was successfully sent a start control.
9/12/2011      10:01:30 AM      ClusSvc      Information      Failover Mgr       1201      N/A      KTPO15      "The Cluster Service brought the Resource Group ""Cluster Group"" online."
9/12/2011      10:01:25 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Public' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      10:01:25 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      10:01:23 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The CIMPLICITY HMI Service service entered the running state.
9/12/2011      10:01:23 AM      Service Control Manager      Information      None      7035      S-1-5-21-675632585-1759720205-4280849243-1346      KTPO15      The CIMPLICITY HMI Service service was successfully sent a start control.
9/12/2011      10:01:22 AM      ClusSvc      Information      Failover Mgr       1200      N/A      KTPO15      "The Cluster Service is attempting to bring online the Resource Group ""Group 0""."
9/12/2011      10:01:26 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The Cluster Service service entered the running state.
9/12/2011      10:01:26 AM      ClusSvc      Information      Startup/Shutdown       1062      N/A      KTPO16      Cluster service successfully joined the server cluster MESMCS.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      10:01:24 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      10:01:08 AM      Service Control Manager      Information      None      7036      N/A      KTPO16      The CIMPLICITY HMI Service service entered the stopped state.
9/12/2011      10:00:33 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO16 has failed.
[SNMP TRAP: 15004 in CPQCLUS.MIB]"
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:24 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Service Control Manager      Error      None      7031      N/A      KTPO16      The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:22 AM      Ftdisk      Warning      Disk       57      N/A      KTPO16      The system failed to flush data to the transaction log. Corruption may occur.
9/12/2011      10:00:21 AM      ClusNet      Error      None      1118      N/A      KTPO16      Cluster service was terminated as requested by Node 1.
9/12/2011      10:00:21 AM      ClusNet      Error      None      1118      N/A      KTPO16      Cluster service was terminated as requested by Node 1.
9/12/2011      9:59:52 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:52 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:43 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:42 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      10:00:09 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO16      The time delta between node KTPO16 and node KTPO15 is -2883631(in 100 nanosecs).
9/12/2011      9:59:27 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:26 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:25 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:25 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:19 AM      ClusSvc      Information      Node Mgr       1128      N/A      KTPO15      Cluster network 'Heart Beat' is operational (up). All available server cluster nodes attached to the network can communicate using it.
9/12/2011      9:59:19 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO15' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      9:59:19 AM      ClusSvc      Information      Node Mgr       1125      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Heart Beat' is operational (up). The node can communicate with all other available cluster nodes on the network.
9/12/2011      9:59:17 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:17 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      10:00:09 AM      ClusSvc      Information      Event Logger       1202      N/A      KTPO16      The time delta between node KTPO16 and node KTPO15 is 317891460(in 100 nanosecs).
9/12/2011      10:00:06 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      10:00:04 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      10:00:03 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      10:00:02 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:40 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:39 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:38 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:38 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:34 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:32 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:28 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:27 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:16 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1130      N/A      KTPO15      Cluster network 'Public' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO15' on network 'Public' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO15. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Public' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO16. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1130      N/A      KTPO15      Cluster network 'Heart Beat' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO15' on network 'Heart Beat' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO15. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1126      N/A      KTPO15      The interface for cluster node 'KTPO16' on network 'Heart Beat' is unreachable by at least one other cluster node attached to the network. the server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node KTPO16. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.
9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.
9/12/2011      9:59:20 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:20 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO16      The node (re)established communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Public'.
9/12/2011      9:59:10 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO16      The node lost communication with cluster node 'KTPO15' on network 'Heart Beat'.
9/12/2011      1:00:00 AM      Service Control Manager      Information      None      7036      N/A      KTPO15      The Performance Logs and Alerts service entered the running state.
 
LVL 76

Accepted Solution

by:
arnold earned 500 total points
ID: 36526204
The heartbeat is the only service that is self-monitoring: each node expects to receive an event from the "active node". As soon as the active node appears to be inaccessible, the other nodes will, depending on the policy setting, either assert that they are now the active node or poll to establish which node is to become the active node.
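
To illustrate the idea, here is a minimal sketch of heartbeat timeout detection. It is not how the Cluster Service is implemented; the class name, the 5 second timeout, and the timestamps are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks when each peer node was last heard from (illustrative only)."""
    timeout_s: float  # hypothetical threshold, not an actual cluster setting
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat arrives from `node`.
        self.last_seen[node] = now

    def unreachable(self, now: float) -> list:
        # A node counts as unreachable once it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.record("KTPO15", now=0.0)
monitor.record("KTPO16", now=0.0)
monitor.record("KTPO15", now=4.0)   # KTPO15 keeps sending, KTPO16 goes quiet
print(monitor.unreachable(now=6.0))
```

What the surviving nodes do once a peer shows up as unreachable (take over, or poll each other first) is the policy decision described above.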

The errors seem to suggest that once the 'Heart Beat' communications were lost, the heartbeat connection was reestablished over the 'Public' network.
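
One way to check this against the posted log is to tally the 1123 (lost) and 1122 (reestablished) events per reporting node and network. The sketch below assumes the whitespace-separated export format shown in the log above; the regular expression and the `tally` helper are illustrative, and only four of the posted lines are included as sample input.

```python
import re
from collections import Counter

# Matches Node Mgr events 1122/1123 in the exported event log text.
EVENT = re.compile(
    r"\s(?P<event_id>1122|1123)\s+N/A\s+(?P<reporter>\S+)\s+.*"
    r"cluster node '(?P<peer>[^']+)' on network '(?P<network>[^']+)'"
)


def tally(lines):
    """Count lost/reestablished events per (reporting node, network, state)."""
    counts = Counter()
    for line in lines:
        m = EVENT.search(line)
        if m:
            state = "lost" if m["event_id"] == "1123" else "reestablished"
            counts[(m["reporter"], m["network"], state)] += 1
    return counts


sample = [
    "9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Heart Beat'.",
    "9/12/2011      9:59:08 AM      ClusSvc      Warning      Node Mgr       1123      N/A      KTPO15      The node lost communication with cluster node 'KTPO16' on network 'Public'.",
    "9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Heart Beat'.",
    "9/12/2011      9:59:11 AM      ClusSvc      Information      Node Mgr       1122      N/A      KTPO15      The node (re)established communication with cluster node 'KTPO16' on network 'Public'.",
]

counts = tally(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

Running it over the full export from each node should show whether the two networks dropped and recovered together or one after the other.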

The log is between KTPO15 and KTPO16.
In reality this may point to the heartbeat connection between KTPO15 and KTPO16 on the 192.168.75.x network.
The error on KTPO15 seems to suggest that the system panicked after a disk issue.
9/12/2011      11:26:35 AM      Foundation Agents      Error      Events       1172      N/A      KTPO16      "Cluster Agent: The cluster service on KTPO15 has failed.
 
LVL 1

Author Closing Comment

by:ktpoitm
ID: 36550822
It wasn't the entire issue.
