Solved

Link down on IBM DS3400 double controller disk array

Posted on 2014-10-21
Last Modified: 2014-11-04
Our dual-controller disk array has a link down to the B controller.  This is evident from a cfgmgr error (0514-061 Cannot find a child device) on the server, from the fiber cable port LEDs on the B controller, and from the Storage Manager 10 service information files.  I diffed a 2012 storageArrayProfile against the current file, and I see this for the 2012 "B" controller:
      Host interface:                 Fibre                          
         Channel:                     1                              
         Current ID:                  2/0xE4                          
         Preferred ID:                2/0xE4                          
         NL-Port ID:                  0x0000E4                        
         Maximum data rate:           4 Gbps                          
         Current data rate:           4 Gbps                          
         Data rate control:           Auto                            
         Link status:                 Up                              
         Topology:                    Arbitrated Loop - Private      
         World-wide port identifier:  20:25:00:a0:b8:68:85:79        
         World-wide node identifier:  20:04:00:a0:b8:68:85:79        
         Part type:                   HPFC-5700           revision 5  
and this for the current "B" controller:
      Host interface:                 Fibre                          
         Channel:                     1                              
         Current ID:                  Not applicable/0xFFFFFFFF      
         Preferred ID:                2/0xE4                          
         NL-Port ID:                  0xFFFFFF                        
         Maximum data rate:           4 Gbps                          
         Current data rate:           4 Gbps                          
         Data rate control:           Auto                            
         Link status:                 Down                            
         Topology:                    Not Available                  
         World-wide port identifier:  20:25:00:a0:b8:68:85:79        
         World-wide node identifier:  20:04:00:a0:b8:68:85:79        
         Part type:                   HPFC-5700           revision 5  
In both cases the active "A" controller data is identical:
      Host interface:                 Fibre                          
         Channel:                     1                              
         Current ID:                  0/0xEF                          
         Preferred ID:                0/0xEF                          
         NL-Port ID:                  0x0000EF                        
         Maximum data rate:           4 Gbps                          
         Current data rate:           4 Gbps                          
         Data rate control:           Auto                            
         Link status:                 Up                              
         Topology:                    Arbitrated Loop - Private      
         World-wide port identifier:  20:24:00:a0:b8:68:85:79        
         World-wide node identifier:  20:04:00:a0:b8:68:85:79        
         Part type:                   HPFC-5700           revision 5  
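For anyone repeating this comparison, here is a minimal Python sketch of the same diff, assuming the two "B" controller excerpts above have been pasted in as plain text. The names `parse_block` and `changed_fields` are hypothetical helpers written for this illustration; they are not part of Storage Manager or any IBM tool.

```python
# Host interface excerpts for controller "B", copied from the post above.
B_2012 = """\
Host interface:                 Fibre
   Channel:                     1
   Current ID:                  2/0xE4
   Preferred ID:                2/0xE4
   NL-Port ID:                  0x0000E4
   Maximum data rate:           4 Gbps
   Current data rate:           4 Gbps
   Data rate control:           Auto
   Link status:                 Up
   Topology:                    Arbitrated Loop - Private
   World-wide port identifier:  20:25:00:a0:b8:68:85:79
   World-wide node identifier:  20:04:00:a0:b8:68:85:79
   Part type:                   HPFC-5700           revision 5
"""

B_CURRENT = """\
Host interface:                 Fibre
   Channel:                     1
   Current ID:                  Not applicable/0xFFFFFFFF
   Preferred ID:                2/0xE4
   NL-Port ID:                  0xFFFFFF
   Maximum data rate:           4 Gbps
   Current data rate:           4 Gbps
   Data rate control:           Auto
   Link status:                 Down
   Topology:                    Not Available
   World-wide port identifier:  20:25:00:a0:b8:68:85:79
   World-wide node identifier:  20:04:00:a0:b8:68:85:79
   Part type:                   HPFC-5700           revision 5
"""


def parse_block(text):
    """Turn 'Label:   value' lines into a dict, splitting at the first colon
    so that colon-separated WWNs stay intact in the value."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def changed_fields(old, new):
    """Return {label: (old_value, new_value)} for every label whose value differs."""
    labels = list(old) + [label for label in new if label not in old]
    return {
        label: (old.get(label), new.get(label))
        for label in labels
        if old.get(label) != new.get(label)
    }


changes = changed_fields(parse_block(B_2012), parse_block(B_CURRENT))
for label, (before, after) in changes.items():
    print(f"{label}: {before!r} -> {after!r}")
```

Only four fields differ between the two excerpts: Current ID, NL-Port ID, Link status, and Topology. The WWNs, part type, and data rate settings are unchanged.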
I have replaced the fiber link, the SFP module on the controller, then the whole controller and the corresponding power supply.  We first noticed the cfgmgr error while swapping the CD-ROM drive back and forth between our production and test servers, so some event occurred that caused the change.  This was long after we upgraded to AIX 5.3 TL10, and there have been no OS software or firmware upgrades since.  I think a configuration bit must have flipped, or something similar.  I am in the process of getting the unit back on its support contract.  The server has been on contract continuously, and IBM support assured me that the Emulex HBA was not defective.  It is the only thing I have not replaced.  Can you guide me in diagnosing this problem?  Thanks, Tom
Question by:mhcbbcadmin
8 Comments
 

Author Comment

by:mhcbbcadmin
Adding an event log entry to the description above.  It makes it appear that the "good" A controller is actually the problem.
Date/Time: 10/18/14 10:17:39 PM
Sequence number: 3342
Event type: 1208
Description: Data rate negotiation failed
Event specific codes: 0/0/0
Event category: Error
Component type: Channel
Component location: Host-side: controller in slot , port 1
Logged by: Controller in slot A
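If more of the event log needs to be searched for entries like this one, a small Python sketch along these lines could help. It assumes an exported log in which every record starts with a `Date/Time:` line and uses the `Label: value` layout shown above; the function name `parse_events` is hypothetical and written only for this illustration.

```python
# One record, copied from the event log entry above.
SAMPLE_LOG = """\
Date/Time: 10/18/14 10:17:39 PM
Sequence number: 3342
Event type: 1208
Description: Data rate negotiation failed
Event specific codes: 0/0/0
Event category: Error
Component type: Channel
Component location: Host-side: controller in slot , port 1
Logged by: Controller in slot A
"""


def parse_events(text):
    """Split log text into a list of dicts, one per event record.

    Assumption: each record begins with a 'Date/Time:' line. Each line is
    split at its first colon, so values that contain colons are preserved.
    """
    events = []
    current = None
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or lines without a label
        label, value = label.strip(), value.strip()
        if label == "Date/Time":
            current = {}
            events.append(current)
        if current is not None:
            current[label] = value
    return events


events = parse_events(SAMPLE_LOG)
errors = [e for e in events if e.get("Event category") == "Error"]
for e in errors:
    print(e["Date/Time"], e["Event type"], e["Description"], "|", e["Logged by"])
```

Note that the slot letter in "Component location" is blank in this record, so the entry itself only says which controller logged the event, not which slot the affected port is in.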
 
LVL 61

Expert Comment

by:gheist
Most likely somebody sneezed at optical fibre...
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 500 total points
You say you have replaced the cable, but there is no mention of replacing the HBA in the server (I don't think there is a switch, since the working links show FC-AL as the topology).

Accepted Solution

by:mhcbbcadmin
mhcbbcadmin earned 0 total points
It turns out to be okay to disconnect the fiber cables when the link is down.  There was one light-emitting SFP port on Controller B and no corresponding HBA ports emitting light.  Using the light-emitting port, I was able to show that both cables were good, and so, most likely, were the SFPs, Controller B, and the power supply.  Our IBM support contract was renewed, and IBM agreed that the host bus adapter was the most likely remaining culprit; it will be replaced shortly.  Right in the middle of all this a drive went down, so that was convenient.  I am going to assume that the new HBA will fix the problem.  Performance of the array is decreased under heavy load, so it will be nice to have it fixed.  I'll try to append a comment later if the cause turns out to be something other than the HBA.
 

Author Comment

by:mhcbbcadmin
I've requested that this question be closed as follows:

Accepted answer: 0 points for mhcbbcadmin's comment #a40407412

for the following reason:

We ruled out all but one likely candidate for the problem, with a cause that IBM could not detect directly.  There could still be a problem with supporting server hardware.  An "A" answer would wait for the new HBA to be tested, but we need to be able to make other posts here, so I want this one closed.
 
LVL 55

Expert Comment

by:andyalder
Told you it was the HBA.
 
LVL 61

Expert Comment

by:gheist
That is exactly the prime example of an invalid reason for closing a question.