thesurg3on

Windows Server 2003 R2 SP2 Cluster Reboots and is Unstable

I have a two-node cluster, and one of the nodes is very unstable. Each node has 4 network adapters, one of which is disabled. I am going to follow the steps in http://support.microsoft.com/default.aspx?scid=kb;EN-US;258750, which explains how to order the network connections. The disabled adapter is still in the list of connections, and I should change that. Even so, I would not expect it to make the server reboot. Below are some of the error messages from my system event log. Please let me know if you want to see more event logs.

Event Type:      Warning
Event Source:      ClusSvc
Event Category:      Node Mgr
Event ID:      1123
Date:            6/5/2008
Time:            11:43:55 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The node lost communication with cluster node 'SERVER-NODE-A' on network 'Heartbeat'.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.



-------------------------------------


Event Type:      Warning
Event Source:      ClusSvc
Event Category:      Node Mgr
Event ID:      1123
Date:            6/5/2008
Time:            11:43:55 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The node lost communication with cluster node 'SERVER-NODE-A' on network 'Backup(2)'.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.



---------------------


Event Type:      Warning
Event Source:      ClusSvc
Event Category:      Node Mgr
Event ID:      1123
Date:            6/5/2008
Time:            11:43:55 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The node lost communication with cluster node 'SERVER-NODE-A' on network 'Internal'.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.



---------------------


Event Type:      Error
Event Source:      ClusDisk
Event Category:      None
Event ID:      1209
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
Cluster service is requesting a bus reset for device \Device\ClusDisk0.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 00 00 01 00 5a 00   ......Z.
0008: 00 00 00 00 b9 04 00 00   ....¹...
0010: 00 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........



------------


Event Type:      Error
Event Source:      ClusDisk
Event Category:      None
Event ID:      1209
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
Cluster service is requesting a bus reset for device \Device\ClusDisk0.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 00 00 01 00 5a 00   ......Z.
0008: 00 00 00 00 b9 04 00 00   ....¹...
0010: 00 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........



----------------------


Event Type:      Warning
Event Source:      elxstor
Event Category:      None
Event ID:      118
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The driver for device \Device\RaidPort0 performed a bus reset upon request.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0f 00 18 00 01 00 68 00   ......h.
0008: 00 00 00 00 76 00 04 80   ....v..€
0010: 01 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 00 02 00 76 00 04 80   ....v..€
0038: 00 00 00 00 00 00 00 00   ........



-----------------------


Event Type:      Warning
Event Source:      elxstor
Event Category:      None
Event ID:      118
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The driver for device \Device\RaidPort0 performed a bus reset upon request.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0f 00 18 00 01 00 68 00   ......h.
0008: 00 00 00 00 76 00 04 80   ....v..€
0010: 01 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 01 02 00 76 00 04 80   ....v..€
0038: 00 00 00 00 00 00 00 00   ........



----------------


Event Type:      Warning
Event Source:      elxstor
Event Category:      None
Event ID:      118
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The driver for device \Device\RaidPort1 performed a bus reset upon request.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0f 00 18 00 01 00 68 00   ......h.
0008: 00 00 00 00 76 00 04 80   ....v..€
0010: 01 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 00 02 00 76 00 04 80   ....v..€
0038: 00 00 00 00 00 00 00 00   ........



------------------------


Event Type:      Warning
Event Source:      elxstor
Event Category:      None
Event ID:      118
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The driver for device \Device\RaidPort1 performed a bus reset upon request.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0f 00 18 00 01 00 68 00   ......h.
0008: 00 00 00 00 76 00 04 80   ....v..€
0010: 01 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 01 02 00 76 00 04 80   ....v..€
0038: 00 00 00 00 00 00 00 00   ........



-----------


Event Type:      Warning
Event Source:      elxstor
Event Category:      None
Event ID:      118
Date:            6/5/2008
Time:            11:44:10 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The driver for device \Device\RaidPort0 performed a bus reset upon request.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0f 00 18 00 01 00 68 00   ......h.
0008: 00 00 00 00 76 00 04 80   ....v..€
0010: 01 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 00 02 00 76 00 04 80   ....v..€
0038: 00 00 00 00 00 00 00 00   ........



---------------


Event Type:      Warning
Event Source:      ClusSvc
Event Category:      Node Mgr
Event ID:      1135
Date:            6/5/2008
Time:            11:44:17 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
Cluster node SERVER-NODE-A was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.



----------------


Event Type:      Error
Event Source:      ClusDisk
Event Category:      None
Event ID:      1209
Date:            6/5/2008
Time:            11:44:17 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
Cluster service is requesting a bus reset for device \Device\ClusDisk0.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 00 00 01 00 5a 00   ......Z.
0008: 00 00 00 00 b9 04 00 00   ....¹...
0010: 00 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........



-------------


Event Type:      Warning
Event Source:      elxstor
Event Category:      None
Event ID:      118
Date:            6/5/2008
Time:            11:44:17 AM
User:            N/A
Computer:      SERVER-NODE-B
Description:
The driver for device \Device\RaidPort0 performed a bus reset upon request.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0f 00 18 00 01 00 68 00   ......h.
0008: 00 00 00 00 76 00 04 80   ....v..€
0010: 01 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 00 01 00 76 00 04 80   ....v..€
0038: 00 00 00 00 00 00 00 00   ........
arnold

Starting from the beginning, make sure the Heartbeat network is set up correctly.
What are the heartbeat IPs for both systems?
Which node is active?

Are you using a crossover cable to connect the heartbeat of the two systems, or are they connected to a switch?

The disk reset could be the result of node B asserting that it is now the active node. Taking over requires the SAN disks to become active on that node, which is accomplished through a bus reset. That reset shows up in the elxstor events; elxstor is the Emulex HBA driver.

Which server reboots? Is there a kernel, memory, or mini dump in %systemroot%\Minidump or %systemroot%\memory.DMP on the system that reboots?
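
As a rough way to check for those dump files, here is a minimal Python sketch. It assumes the default dump locations named above and that Python is available on the node; the function name find_crash_dumps is only illustrative. If the dump paths were changed under Startup and Recovery, adjust them accordingly.

```python
import os

def find_crash_dumps(system_root):
    """Return crash dump files found under the given Windows directory.

    Assumes the default locations: <system_root>\\Minidump\\*.dmp for
    minidumps and <system_root>\\MEMORY.DMP for a kernel or complete dump.
    """
    found = []
    minidump_dir = os.path.join(system_root, "Minidump")
    if os.path.isdir(minidump_dir):
        for name in sorted(os.listdir(minidump_dir)):
            if name.lower().endswith(".dmp"):
                found.append(os.path.join(minidump_dir, name))
    full_dump = os.path.join(system_root, "MEMORY.DMP")
    if os.path.isfile(full_dump):
        found.append(full_dump)
    return found

if __name__ == "__main__":
    # SystemRoot is set on Windows; the fallback path is an assumption.
    root = os.environ.get("SystemRoot", r"C:\Windows")
    for path in find_crash_dumps(root):
        print(path)
```

An empty result on the node that reboots would suggest the system is resetting before Windows can write a dump, which points away from an ordinary bugcheck.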
 
r_panos

The portion of the cluster log covering the time of the reset could be useful for troubleshooting.

In case you have applied hotfix 911030, check the following article:

http://support.microsoft.com/kb/923424

Losing connectivity on the NICs is rather common, although it usually lasts only a very short time. Check your event log and see whether the connections are reestablished within 1-2 seconds. The fact that all NICs lose connectivity at once could point to a problem with the switches (if all of them are connected to switches).
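
To make that check less tedious, here is a minimal Python sketch that measures how long each network stayed down. It assumes the events were copied out as text in the same layout as the entries posted above, separated by lines of dashes. It also assumes that ClusSvc logs event ID 1122 when communication is reestablished; verify that ID against your own log before relying on it. The function names are only illustrative.

```python
import re
from datetime import datetime

LOST_ID = 1123      # "lost communication", as seen in the posted events
RESTORED_ID = 1122  # assumption: "(re)established communication" event

def parse_events(text):
    """Extract (timestamp, event_id, network) from pasted event log text."""
    events = []
    for block in re.split(r"\n\s*-{5,}\s*\n", text):
        fields = dict(re.findall(r"^(Event ID|Date|Time):\s+(.+?)\s*$", block, re.M))
        net = re.search(r"on network '([^']+)'", block)
        if len(fields) == 3 and net:
            when = datetime.strptime(
                fields["Date"] + " " + fields["Time"], "%m/%d/%Y %I:%M:%S %p")
            events.append((when, int(fields["Event ID"]), net.group(1)))
    # Stable sort by time only, so same-second events keep their log order.
    events.sort(key=lambda e: e[0])
    return events

def outage_durations(events):
    """Pair each loss with the next restore on the same network.

    Returns (list of (network, seconds_down), networks never restored).
    """
    pending = {}
    results = []
    for when, event_id, network in events:
        if event_id == LOST_ID:
            pending.setdefault(network, when)
        elif event_id == RESTORED_ID and network in pending:
            lost_at = pending.pop(network)
            results.append((network, (when - lost_at).total_seconds()))
    return results, sorted(pending)
```

Calling outage_durations(parse_events(text)) on the exported log gives the outage length per network, plus the networks that never came back. Outages of a second or two fit the "common and harmless" pattern; networks that never recover before the node is removed from the membership are the ones worth chasing.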
thesurg3on

ASKER

Arnold,

B is the unstable one that reboots all the time. Unfortunately, it will not produce a minidump. I turned off ASR on my HP servers, but there is still no minidump or memory.dmp. A is active all the time. Our heartbeat is on a private network, 192.168.x.x with no gateway, connected to a switch.
We are on SP2, so the KB that you supplied doesn't apply.
ASKER CERTIFIED SOLUTION
thesurg3on

This solution is only available to members.
Which NMI issue are you referring to here?
Is this a configuration issue or a hardware one?
Please provide more information on what was actually done to resolve the issue.