Solved

RAID 5 - Swapping failed disk is not working with HP Smart Array Controller P600

Posted on 2012-03-26
Last Modified: 2012-06-27
Hi,

A disk has failed on our server, an HP ProLiant DL380 G5.

Configuration is:

2 x Logical drives.

Logical Drive 1 (Array A) -
1 x 72GB 1 Port SAS Port 2I Box 1 Bay 1
1 x 72GB 1 Port SAS Port 2I Box 1 Bay 2

Logical Drive 2 (Array B)
1 x 146GB 2 Port SAS Port 1I Box 1 Bay 5
1 x 146GB 1 Port SAS Port 1I Box 1 Bay 6  ( Failed / Problem )
1 x 146GB 1 Port SAS Port 2I Box 1 Bay 3
1 x 146GB 2 Port SAS Port 2I Box 1 Bay 4

The disk is flagged as failed in Event Viewer and in the HP Array Configuration Utility (HP ACU).

All other disks working as usual.  All users working on the server as normal.

I took out the failed Bay 6 disk, and replaced it with a brand new spare.  

I'm expecting to see some kind of acknowledgement from the status lights to suggest to me that a rebuild is taking place, BUT the lights remain the same ( Off, solid Red ).

So reading up on this, this may mean that the new disk is faulty, so I try another brand new one, and the same thing happens.

During this time I am checking and refreshing the HP ACU, and this is still reporting that the disk in Bay 6 has failed.
I can't see anything that tells me a rebuild is taking place, and no way of forcing one.  

Ideally, I'd like to see the system recognise the disk and automatically start the rebuild, so that I'm confident everything is OK, but what are anyone else's thoughts?

We've had a RAID controller go down in the past on another server, so in the back of my mind I'm wondering whether this is the start of an issue with this controller.

Is it possible for a specific bay to fail at controller level?

The HP ACU is telling me that no other array configuration/change can take place without resolving this problem.
I understand that with RAID 5 I don't want to be messing with the working disks, but when considering my options, I was wondering whether I could somehow extend the array to, say, Bay 7, and take Bay 6 out of the picture at the same time (if a bay failure is a possibility).

Like I said, I don't want to take additional risks unnecessarily. It would be nice for a RAID 5 issue to be simple for once. I'm sure they are 99.9% of the time, but .......

Help and advice please.

Thanks
Question by:EBIZ-Mark
19 Comments
 
LVL 30

Expert Comment

by:IanTh
"Is it possible for a specific bay to fail at controller level?"

Of course. Take the failing drive out and plug it back in; if the drive then works, that may point at the slot.

Can't you use a hot spare?
 
LVL 16

Expert Comment

by:Syed_M_Usman
Dear,

"Is it possible for a specific bay to fail at controller level"
This could be the case. I ran into a similar problem with an HP DL380 G5 and a P410 (512MB BBWC). I put in a 2-port disk and it showed red; then I managed to get a 1-port disk, but it was still the same. I reported it to HP and they said it could be the controller or the bay. Finally they brought a bay and an extra controller, and we found the bay was faulty. In fact, the old disk was fine.

"I'm wondering whether this is the start of an issue with this controller"
It can be, but in my experience, if there is any problem in the RAID, the RAID controller will not work.

"I was wondering whether I could somehow extend the array to say, Bay 7, and take Bay 6 out of the picture at the same time"
Not recommended, as this could lead to system failure.
 
LVL 55

Expert Comment

by:andyalder
Need an ADU report to be sure of what is happening. Two possibilities I see, one is that someone powered it off* and it is in interim recovery mode, the other is that there are unrecoverable read errors on another disk (although it should start to rebuild and then fail to complete).

*which is a big no-no unless it's not hot-swap.
 

Author Comment

by:EBIZ-Mark
Thanks for the 3 responses so far.

IanTh: I've tried two disks in, and I've tried them both in and out several times.  This would lead me to believe there is an issue with Bay 6 ?
No Hot spare (!)

Syed_M_Usman: This sounds like a specific bay problem could be a possibility. You recommend against utilising Bay 7 instead of Bay 6. Is this just not feasible at all? Basically, if I can't get Bay 6 to work, is my only alternative to start from scratch with the RAID setup and recover from a backup afterwards?

andyalder:  I've attached a report.
The array is reported to be in 'Interim Recovery Mode' (?).
ADUReport.txt
 
LVL 55

Expert Comment

by:andyalder
The ADU report says there's no disk in 1I:1:6, so it's probably a duff backplane, assuming the lever is latched in properly.
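
For anyone reading the ADU report alongside the ACU view, the address 1I:1:6 appears to follow a port:box:bay layout, matching "Port 1I Box 1 Bay 6" in the original configuration list. A minimal Python sketch of that mapping, where `DriveAddress` and `parse_drive_address` are hypothetical names used only for illustration and are not part of any HP tool:

```python
from typing import NamedTuple


class DriveAddress(NamedTuple):
    """Hypothetical container for a port:box:bay drive address."""
    port: str
    box: int
    bay: int


def parse_drive_address(text: str) -> DriveAddress:
    """Split an address such as '1I:1:6' into port, box and bay.

    Assumes the port:box:bay layout inferred from this thread.
    """
    port, box, bay = text.strip().split(":")
    return DriveAddress(port=port, box=int(box), bay=int(bay))


addr = parse_drive_address("1I:1:6")
print(f"Port {addr.port} Box {addr.box} Bay {addr.bay}")
```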
 

Author Comment

by:EBIZ-Mark
Hi,

Just taken the disk out again, and there is a red light at the back of the Bay.  Is this just another visual prompt to where the error has occurred, or is this an indication that the card has a fault, not the disk ?

I've been looking at the original fault in the server's Event Viewer. Here are five consecutive events, which were the first logs of anything being wrong.
Note:  The disk was not physically removed, as the server is in a secure location.
All five logs are within 11 seconds of each other.

Any thoughts ?   Thanks again for any help or advice.

---------------------------------------------------------------------------------------------------------------------------------
Event Type:      Information
Event Source:      Cissesrv
Event Category:      None
Event ID:      24598
Date:            3/22/2012
Time:            5:41:17 AM
User:            N/A
Computer:      WILAP07NOTES
Description:
Logical drive 2 of array controller P600 located in server slot 5 has encountered a status change from:  

Status: OK  
to  
Status: INTERIM RECOVERY MODE

------------------------------------------------------------------------------------------------------------------------------------------
Event Type:      Error
Event Source:      Cissesrv
Event Category:      None
Event ID:      24597
A drive failure notification has been received for the SAS physical drive located in bay 6.  This drive can be found in box 1 which is connected to port 1I of the array controller P600 located in server slot 5.  The failure reason received from the HP Smart Array firmware is: REMOVED_IN_HOT_PLUG.

---------------------------------------------------------------------------------------------------------------------------------------------

Event Type:      Information
Event Source:      Cissesrv
Event Category:      None
Event ID:      24581
Date:            3/22/2012
Time:            5:41:17 AM
User:            N/A
Computer:      WILAP07NOTES
Description:
A SAS physical drive located in bay 6 was removed. The drive can be found in box 1 which is attached  to port 1I of array controller P600 located in server slot 5.

---------------------------------------------------------------------------------------------------------------------------------------

Event Type:      Warning
Event Source:      Storage Agents
Event Category:      Events
Event ID:      1200
Date:            3/22/2012
Time:            5:41:28 AM
User:            N/A
Computer:      WILAP07NOTES
Description:
Drive Array Logical Drive Status Change.  Logical drive number 2 on the array controller in Slot 5 has a new status of 5.
(Logical Drive status values: 1=other, 2=ok, 3=failed, 4=unconfigured, 5=recovering, 6=readyForRebuild, 7=rebuilding, 8=wrongDrive, 9=badConnect, 10=overheating, 11=shutdown, 12=expanding, 13=notAvailable, 14=queuedForExpansion)
[SNMP TRAP: 3034 in CPQIDA.MIB]

--------------------------------------------------------------------------------------------------------------------------------------------
Event Type:      Error
Event Source:      Storage Agents
Event Category:      Events
Event ID:      1216
Date:            3/22/2012
Time:            5:41:28 AM
Description:
Drive Array Physical Drive Status Change.  The physical drive in Slot 5, Port 1I Box 1 Bay 6 with serial number "3NM10CVW00009740V2GX", has a new status of 3.
(Drive status values: 1=other, 2=ok, 3=failed, 4=predictiveFailure)
[SNMP TRAP: 3046 in CPQIDA.MIB]
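
As a reading aid, the numeric status codes in the two Storage Agents events can be decoded with the value lists quoted in the event text itself. The sketch below is illustrative Python, not an HP tool; `decode_status` is a hypothetical helper, and the two timestamps are simply the first and last event times shown above.

```python
from datetime import datetime

# Status tables copied from the two Storage Agents events above.
LOGICAL_DRIVE_STATUS = {
    1: "other", 2: "ok", 3: "failed", 4: "unconfigured", 5: "recovering",
    6: "readyForRebuild", 7: "rebuilding", 8: "wrongDrive", 9: "badConnect",
    10: "overheating", 11: "shutdown", 12: "expanding", 13: "notAvailable",
    14: "queuedForExpansion",
}
PHYSICAL_DRIVE_STATUS = {1: "other", 2: "ok", 3: "failed", 4: "predictiveFailure"}


def decode_status(table, code):
    """Return the label for a status code, or a placeholder for unknown codes."""
    return table.get(code, f"unknown({code})")


# Event 1200: logical drive 2 has a new status of 5.
print(decode_status(LOGICAL_DRIVE_STATUS, 5))
# Event 1216: the physical drive in bay 6 has a new status of 3.
print(decode_status(PHYSICAL_DRIVE_STATUS, 3))

# Time between the first and last of the five events.
fmt = "%m/%d/%Y %I:%M:%S %p"
first = datetime.strptime("3/22/2012 5:41:17 AM", fmt)
last = datetime.strptime("3/22/2012 5:41:28 AM", fmt)
span_seconds = (last - first).total_seconds()
print(span_seconds)
```

Status 5 for logical drive 2 decodes to "recovering", which appears to line up with the INTERIM RECOVERY MODE change logged by Cissesrv at 5:41:17, and status 3 for the bay 6 drive decodes to "failed".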
 
LVL 55

Expert Comment

by:andyalder
You sure the eject lever is clipped in right? Sounds like it's connecting and then disconnecting again.
 

Author Comment

by:EBIZ-Mark
Hi,

I'm certain. Each time you ask, you make me want to check again, but I've swapped them in and out so many times today. Also, today's ins and outs aren't showing in the event log at all.

The logs above are from before we knew we had an issue, so the event viewer showing that the disk has been removed, has happened in an empty server room at 5am- ish, when no-one would have been onsite.  

Are we at the stage where we think it's a bay problem? Can anyone confirm the meaning of the red light behind the disk, i.e. physically on the 'backplane'? Is this acknowledgement of a previous disk problem, or is it telling us of a controller problem?

Thanks
 
LVL 55

Accepted Solution

by:
andyalder earned 500 total points
The LED is showing a disk problem for that bay. The controller doesn't know whether the disk has been physically removed, just that it can't see it. It could be a controller or cable fault, but after the disk itself, the backplane is the most likely cause.

No way to add a spare into another bay though since array changes are disabled when degraded (I know, that's a pain in the butt in this instance).
 

Author Comment

by:EBIZ-Mark
Thanks.

Am I correct in thinking that, if I replace the P600 controller, it won't recognise the disks anyway?
This is something that an IT colleague has always assured me is true, but if it is, I think that's really poor too.

Basically, can I replace the P600 easily, and know quite quickly whether the issue is resolved?

What do you think ?

I'm offline until tomorrow morning (UK) now, so apologies in advance for the delay in my next response.
 
LVL 55

Expert Comment

by:andyalder
You can swap any HP Smart array controller with any other one within reason* and the controller will read the config off the disks and it'll work.

Your colleague is right for some brands, but not for HP Smart Array controllers. You can even go up a generation or two (e.g. Smart Array P412) and it'll read them fine, because they all run HP's own RAID stack rather than the RAID stack that's supplied by default with the RAID on Chip. It's a major advantage of the Smart Array range.

* a few caveats on using an older controller or one with older firmware, for example new controllers support 3TB disks but older ones don't and you can't plug parallel SCSI disks into a SAS controller.
 
LVL 55

Expert Comment

by:andyalder
If you're 100% sure you have a good backup and you want to identify the fault, then you can prove that it's not a backplane/cable problem without buying any spare parts. Swap the two 4-lane SAS cables going to the backplane with each other, and put disks 1-4 in bays 5-8 and vice versa. If it's the backplane or cable, then disk 2 will go offline and it won't stand a chance of booting; if it's a controller port, then it'll still see the same disks as before and boot, albeit in degraded mode. It'll probably complain about the drive swap, but it should either reconcile the bay changes or ask you to swap them back, unless it's a controller fault.
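
The logic of that swap test can be sketched as a small decision function. This is only an illustration of the reasoning above; `interpret_swap_test` and its parameter names are hypothetical, and it assumes the cables and disks have been swapped exactly as described.

```python
def interpret_swap_test(disk2_online: bool, sees_same_disks_as_before: bool) -> str:
    """Map the outcome of the cable/disk swap test to the likely faulty part.

    disk2_online: whether the disk originally in bay 2 is still seen
        after it has been moved into the suspect half of the backplane.
    sees_same_disks_as_before: whether the controller still reports the
        same set of disks (same one missing) and boots in degraded mode.
    """
    if not disk2_online:
        # The fault followed the backplane/cable, not the controller port.
        return "backplane or cable"
    if sees_same_disks_as_before:
        # The fault stayed with the controller port.
        return "controller port"
    # Any other combination is not covered by the reasoning above.
    return "inconclusive"


print(interpret_swap_test(disk2_online=False, sees_same_disks_as_before=False))
print(interpret_swap_test(disk2_online=True, sees_same_disks_as_before=True))
```

Anything that comes back as "inconclusive" would need the ADU report or Insight Diagnostics to take further.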
 

Author Comment

by:EBIZ-Mark
andyalder -  Thanks for your continuing support.

OK, if I did want to carry out the 'drive swap test' what would the risks be to data ?  

I'm happy with the Array B backup as it's data only, but I'm pretty certain I'd get 3rd party support in if I needed to reinstall Array A, as this has our Lotus Notes install on it. As it's a high usage server, I'd be concerned about the additional downtime I might add if I start dabbling myself.

The compatibility of the HP array controllers is really good to know. I'm sure not realising this has caused us additional grief in the past.

I'll discuss our options here then, and update again shortly.

Thanks
 

Author Comment

by:EBIZ-Mark
Is there any additional risk if I power down the server in its current state?

Just wondered if there was likely to be any more information as part of the POST startup tests ?

Also, I've got to put it down at some point soon anyway. So it would be good to know whether I should be expecting this to get worse on startup.

Thanks
 
LVL 55

Expert Comment

by:andyalder
It's never good to power down a server with the array in a degraded state, but in this case I don't see another option, although I'd back up first.

Rather than using the ACU off the SmartStart CD you can use Insight Diagnostics off it instead and under an inventory of the server you can see what disks are connected. Since the RIS metadata contains the disk/array configuration you can even power it on with disks removed and use a known good disk to test each slot. Unplugging live does add one to the SMART error stats so to preserve the quality of the test disk you'll have to power off each time you test a slot.
 
LVL 30

Expert Comment

by:IanTh
I have seen backplanes fail in my career
 

Expert Comment

by:TricksGuide
As per the comment "Just taken the disk out again, and there is a red light at the back of the Bay", it seems that the hard drive backplane is faulty, since the LEDs are always on the hard drive backplane, not on the HDD :)

Replacing the HDD backplane will solve the issue here.

Regards
SiRu
 

Author Comment

by:EBIZ-Mark
Thanks.

We have a 3rd party support company coming on site in the next day or two to sort this.

I have updated them with the thought processes so far, but I think they're planning on bringing in a temporary server just in case.

The plan at the moment is that the possible spares, (backplane and controller) will be at hand so that we can swap each out and see which fixes the problem.

Thanks again for your support.  I'll allocate points shortly.
 

Author Comment

by:EBIZ-Mark
Issue appeared to be the backplane.

Swapped this, and the disks were fine with no data loss.

Discarded the disk that originally flagged as failed.

Added new disk, and the rebuild started immediately.

Thanks