cluster failure

Posted on 2011-09-02
Last Modified: 2014-05-12
About a month ago, I updated all firmware and drivers on my two clusters.

Fast forward to now: I had a strange glitch.

I thought a power outage had taken everything offline, but that turned out not to be the case.

Neither cluster rebooted, but both storage controllers (port 1 and port 2) on each of the two clusters went offline at the same time and in much the same way.

Was this caused by a bad UPS, or by a bad switch or bad switch ports?


Checking the event log shows that the private interface (crossover heartbeat) went down first, then the public interface, then the slot 1 storage controller, then the slot 2 storage controller:

Event Type:      Warning
Event Source:      ClusSvc
Event Category:      Node Mgr
Event ID:      1123
Date:            9/1/2011
Time:            12:54:53 AM
User:            N/A
Computer:      SERVER
Description:
The node lost communication with cluster node 'Active node' on network 'Private'.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.



Event Type:      Warning
Event Source:      ClusSvc
Event Category:      Node Mgr
Event ID:      1123
Date:            9/1/2011
Time:            12:54:53 AM
User:            N/A
Computer:      SERVER
Description:
The node lost communication with cluster node 'Active node' on network 'Public'.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


Event Type:      Error
Event Source:      Storage Agents
Event Category:      Events
Event ID:      1151
Date:            9/1/2011
Time:            12:55:30 AM
User:            N/A
Computer:      SERVER
Description:
External Array Controller Status Change.  The external controller in I/O slot 1 of array "ZWVTMT423R" has a new status of 4.
(Controller status values: 1=other, 2=ok, 3=failed, 4=offline, 5=redundantPathOffline, 6=notConnected)
[SNMP TRAP: 16020 in CPQFCA.MIB]
Data:


Event Type:      Error
Event Source:      Storage Agents
Event Category:      Events
Event ID:      1151
Date:            9/1/2011
Time:            12:55:30 AM
User:            N/A
Computer:      SERVER
Description:
External Array Controller Status Change.  The external controller in I/O slot 2 of array "ZWVTMT423R" has a new status of 4.
(Controller status values: 1=other, 2=ok, 3=failed, 4=offline, 5=redundantPathOffline, 6=notConnected)
[SNMP TRAP: 16020 in CPQFCA.MIB]
Data:
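
As a rough aid for reading the four entries above in order, here is a minimal Python sketch that sorts them by timestamp and prints the offset from the first event. The tuple layout and the `build_timeline` helper are illustrative assumptions, not part of any Windows tooling; the timestamps, sources, and event IDs are copied from the entries above.

```python
from datetime import datetime

# (timestamp, source, event ID, summary) copied from the entries above.
# Listed out of order on purpose; the helper sorts them.
EVENTS = [
    ("9/1/2011 12:55:30 AM", "Storage Agents", 1151, "controller slot 1 -> status 4 (offline)"),
    ("9/1/2011 12:54:53 AM", "ClusSvc", 1123, "lost communication on 'Private'"),
    ("9/1/2011 12:55:30 AM", "Storage Agents", 1151, "controller slot 2 -> status 4 (offline)"),
    ("9/1/2011 12:54:53 AM", "ClusSvc", 1123, "lost communication on 'Public'"),
]

# Matches the Date/Time fields as they appear in the event viewer text.
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def build_timeline(events):
    """Return (seconds after first event, source, event ID, summary), oldest first."""
    parsed = [
        (datetime.strptime(ts, TIME_FORMAT), src, eid, msg)
        for ts, src, eid, msg in events
    ]
    # Sort on the timestamp only, so entries with equal times keep their input order.
    parsed.sort(key=lambda entry: entry[0])
    start = parsed[0][0]
    return [
        (int((when - start).total_seconds()), src, eid, msg)
        for when, src, eid, msg in parsed
    ]


if __name__ == "__main__":
    for offset, src, eid, msg in build_timeline(EVENTS):
        print(f"+{offset:>3}s  {src:<15} {eid:<5} {msg}")
```

With these timestamps, both network warnings land at the same second and both controller status changes are logged 37 seconds later. Note that the logged time of a Storage Agents event is when the agent reported the change, which is not necessarily the instant the controller went offline.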



The other errors that bother me are:

Event Type:      Warning
Event Source:      Ftdisk
Event Category:      Disk
Event ID:      57
Date:            9/1/2011
Time:            12:54:55 AM
User:            N/A
Computer:      SERVER
Description:
The system failed to flush data to the transaction log. Corruption may occur.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 00 00 01 00 be 00   ......¾.
0008: 02 00 00 00 39 00 04 80   ....9..¿
0010: 00 00 00 00 10 00 00 80   .......¿
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........

Event Type:      Warning
Event Source:      Srv
Event Category:      None
Event ID:      2012
Date:            9/1/2011
Time:            12:54:57 AM
User:            N/A
Computer:      Server
Description:
While transmitting or receiving data, the server encountered a network error. Occassional errors are expected, but large amounts of these indicate a possible error in your network configuration.  The error status code is contained within the returned data (formatted as Words) and may point you towards the problem.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 04 00 01 00 54 00   ......T.
0008: 00 00 00 00 dc 07 00 80   ....Ü..¿
0010: 00 00 00 00 84 01 00 c0   ....¿..À
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 7b 09 00 00               {...    
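
The Srv message says the status code is in the returned data "formatted as Words", so it can help to regroup the byte dumps above into 32-bit little-endian values. The following is a minimal Python sketch of that regrouping; `dump_to_dwords` and the two constants are hypothetical names, and the hex bytes are copied from the two dumps above with the ASCII column left out.

```python
import struct

# Hex bytes copied from the Ftdisk (event 57) dump; ASCII column omitted.
FTDISK_57 = """
0000: 00 00 00 00 01 00 be 00
0008: 02 00 00 00 39 00 04 80
0010: 00 00 00 00 10 00 00 80
0018: 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00
"""

# Hex bytes copied from the Srv (event 2012) dump; ASCII column omitted.
SRV_2012 = """
0000: 00 00 04 00 01 00 54 00
0008: 00 00 00 00 dc 07 00 80
0010: 00 00 00 00 84 01 00 c0
0018: 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00
0028: 7b 09 00 00
"""


def dump_to_dwords(dump):
    """Convert an 'offset: bytes' dump into a list of little-endian 32-bit values."""
    raw = bytearray()
    for line in dump.strip().splitlines():
        # Drop the offset before the colon; bytes.fromhex ignores the spaces.
        _, _, hex_part = line.partition(":")
        raw.extend(bytes.fromhex(hex_part))
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}I", bytes(raw[: count * 4])))


if __name__ == "__main__":
    for name, dump in (("Ftdisk 57", FTDISK_57), ("Srv 2012", SRV_2012)):
        print(name)
        for index, value in enumerate(dump_to_dwords(dump)):
            print(f"  word {index:>2} (offset 0x{index * 4:02x}): 0x{value:08x}")
```

In both dumps, the low 16 bits of the word at offset 0x0c equal the event ID (0x39 is 57, 0x7dc is 2012), which suggests the data follows the usual driver error log layout. The word at offset 0x14 then looks like an NTSTATUS value: 0x80000010 for Ftdisk and 0xc0000184 for Srv. As far as I can tell these correspond to STATUS_DEVICE_OFF_LINE and STATUS_INVALID_DEVICE_STATE, but treat that mapping as an assumption and check it against Microsoft's NTSTATUS reference before relying on it.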


After rebooting both clusters, everything came online fine.

I am extremely bothered by this.  Everything was stable for about a month, and then the clusters both puke for no reason at all.

So, back to my original question:  What do you think caused this?  Bad network switch\ports, bad UPS, or bad drivers?

The curve ball for me is that no other servers show their network interfaces being disconnected during the above times.

Thanks in advance for your help!

Cluster 1 is a file cluster: StorageWorks MSA 500 G1 and two DL380 G3s.
Cluster 2 is an Exchange cluster: StorageWorks MSA 500 G2 and two DL380 G4s.

Both are running Windows 2003 server enterprise, sp2.
Question by:seitech229

Expert Comment

by:Philip Elder
ID: 36478224
Replace _all_ of the network cables with new ones.

Philip

Expert Comment

by:Philip Elder
ID: 36478229
Then verify in the firmware's ReadMe.TXT file that there are no Known Issues identical to yours. Firmware updates can sometimes do more harm than good if done without cause.

Philip

Author Comment

by:seitech229
ID: 36479319
I just have a hard time understanding how a network disconnect would also take the storage controllers offline.

Expert Comment

by:Philip Elder
ID: 36479343
There are a number of factors involved. Since the firmware change, the components at each bus location (slot), whether add-in or onboard, may now have issues with each other.

If there is a DMA or IRQ conflict (yeah, I know ... old school) or some other clash between two components, then yes, you could see two different components go offline together.

We used to see this type of behaviour a lot back in the day.

Philip

Author Comment

by:seitech229
ID: 36502985
My approach is to isolate the problem one step at a time. My hunch is that it's the network switch, or bad cables or ports.

For now, I moved the PSUs to different UPSes. It's been about a week without any hiccups. Our ultimate plan is to swap out the switches anyhow, as there are other issues related to them.

If the problem doesn't repeat itself, it'll be safe to assume that it was the UPSes or the network switch or cabling. If I replace the switch and the problem comes back, it's probably safe to assume that it's bad hardware or a driver or firmware issue.

If the problem happens again between now and the time we replace the switches, the next step will be to change switch ports, cabling, etc.

Thanks!

Accepted Solution

by:seitech229
ID: 40048972
For anybody that sees this, the issue was a bad UPS unit.  Replaced the whole unit, and the issue went away.

Author Closing Comment

by:seitech229
ID: 40058383
This fixed the problem.