Solved

HP MSA70  HDD BAY 3 Keeps Failing (Replaced entire MSA70 and still happens)

Posted on 2014-10-13
Medium Priority
657 Views
Last Modified: 2014-10-27
I have two HP MSA70s connected with dual SAS I/O controllers (dual domain) for redundancy. They are attached to an HP Smart Array P812 controller.

About 2 months ago, I started getting the following message:
"Code 274:  72 GB 2-Port SAS Drive at Port 3E : Box 1 : Bay 3 is bad or missing.

To correct this problem, check the data and power connections to the physical drive.

For more information, generate a diagnostics report under the Diagnostics tab.
"



So I replaced the hard drive in box 1 bay 3. The RAID rebuilt just fine. Then a couple of hours later, the EXACT SAME message appeared again for the same box 1 bay 3. I figured the replacement hard drive was bad too, so I swapped it out again, and then, SAME THING. I've replaced it five times and it continued to happen, at which point I suspected that the MSA70's backplane might be bad.

So I replaced the entire MSA70 box (but I am still using the same original SAS I/O Controllers).

Put a 6th hard-drive in... and holy cow, SAME PROBLEM!

Any thoughts as to what could be happening? Could it be that one of the SAS I/O controllers is bad?
Question by:jroozee
13 Comments
 
LVL 17

Expert Comment

by:pjam
ID: 40377094
What is in Windows Logs?  Event Viewer\System
 
LVL 1

Author Comment

by:jroozee
ID: 40377164
Here's what I found....  Going from most recent to oldest (so read from bottom up)..  You'll see the RAID rebuilds and then seconds later the Hard drive is marked as Failed.

However,  what is with this "WRITE_RETRIES_FAILED" error?



Date:          10/12/2014 2:25:51 PM
Description:
Drive Array Physical Drive Status Change.  The physical drive in Slot 1, Port 3E Box 1 Bay 3 with serial number "3PD15K1M00009823J0AC", has a new status of 3.
(Drive status values: 1=other, 2=ok, 3=failed, 4=predictiveFailure, 5=erasing, 6=eraseDone, 7=eraseQueued, 8=ssdWearOut, 9=notAuthenticated)


Date:          10/12/2014 2:25:51 PM
Description:
Drive Array Logical Drive Status Change.  Logical drive number 1 on the array controller in Slot 1 has a new status of 2.
(Logical Drive status values: 1=other, 2=ok, 3=failed, 4=unconfigured, 5=recovering, 6=readyForRebuild, 7=rebuilding, 8=wrongDrive, 9=badConnect, 10=overheating, 11=shutdown, 12=expanding, 13=notAvailable, 14=queuedForExpansion, 15=multipathAccessDegraded,  16=erasing, 17=predictiveSpareRebuildReady, 18=rapidParityInitInProgress, 19=rapidParityInitPending, 20=noAccessEncryptedNoCntlrKey,  21=unencryptedToEncryptedInProgress, 22=newLogDrvKeyRekeyInProgress,  23=noAccessEncryptedCntlrEncryptnNotEnbld, 24=unencryptedToEncryptedNotStarted, 25=newLogDrvKeyRekeyRequestReceived)


Date:          10/12/2014 2:25:41 PM
Description:
Logical drive 1 of array controller P812 located in server slot 1 has encountered a status change from:  

Status: RECOVERING  
to  
Status: OK



Date:          10/12/2014 2:25:41 PM
Description:
A drive failure notification has been received for the SAS physical drive located in bay 3.  This drive can be found in box 1 which is connected to port 3E of the array controller P812 located in server slot 1.  The failure reason received from the HP Smart Array firmware is: WRITE_RETRIES_FAILED.


Date:          10/12/2014 2:24:09 PM
Description:
Logical drive 1 of array controller P812 located in server slot 1 has encountered a status change from:  

Status: READY FOR RECOVERY  
to  
Status: RECOVERING


Date:          10/12/2014 2:24:09 PM
Description:
Logical drive 1 of array controller P812 located in server slot 1 has encountered a status change from:  

Status: OK  
to  
Status: READY FOR RECOVERY


Date:          10/12/2014 2:24:09 PM
Description:
A SAS physical drive located in bay 3 was inserted. The drive can be found in box 1 which is attached  to port 3E of array controller P812 located in server slot 1.


Date:          10/12/2014 2:23:51 PM
Description:
Drive Array Physical Drive Status Change.  The physical drive in Slot 1, Port 3E Box 1 Bay 3 with serial number "3PD15K1M00009823J0AC", has a new status of 2.
(Drive status values: 1=other, 2=ok, 3=failed, 4=predictiveFailure, 5=erasing, 6=eraseDone, 7=eraseQueued, 8=ssdWearOut, 9=notAuthenticated)


Date:          10/12/2014 2:23:51 PM
Description:
Drive Array Logical Drive Status Change.  Logical drive number 1 on the array controller in Slot 1 has a new status of 7.
(Logical Drive status values: 1=other, 2=ok, 3=failed, 4=unconfigured, 5=recovering, 6=readyForRebuild, 7=rebuilding, 8=wrongDrive, 9=badConnect, 10=overheating, 11=shutdown, 12=expanding, 13=notAvailable, 14=queuedForExpansion, 15=multipathAccessDegraded,  16=erasing, 17=predictiveSpareRebuildReady, 18=rapidParityInitInProgress, 19=rapidParityInitPending, 20=noAccessEncryptedNoCntlrKey,  21=unencryptedToEncryptedInProgress, 22=newLogDrvKeyRekeyInProgress,  23=noAccessEncryptedCntlrEncryptnNotEnbld, 24=unencryptedToEncryptedNotStarted, 25=newLogDrvKeyRekeyRequestReceived)
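The numeric status codes in these events can be decoded mechanically. Below is a minimal Python sketch, assuming event descriptions shaped like the ones above. The function name, the regular expression, and the truncated logical-drive table are illustrative assumptions, not part of any HP tool.

```python
import re

# Status tables copied from the event descriptions above.
PHYSICAL_DRIVE_STATUS = {
    1: "other", 2: "ok", 3: "failed", 4: "predictiveFailure", 5: "erasing",
    6: "eraseDone", 7: "eraseQueued", 8: "ssdWearOut", 9: "notAuthenticated",
}
# Only the first 15 logical drive values are reproduced here.
LOGICAL_DRIVE_STATUS = {
    1: "other", 2: "ok", 3: "failed", 4: "unconfigured", 5: "recovering",
    6: "readyForRebuild", 7: "rebuilding", 8: "wrongDrive", 9: "badConnect",
    10: "overheating", 11: "shutdown", 12: "expanding", 13: "notAvailable",
    14: "queuedForExpansion", 15: "multipathAccessDegraded",
}

# Matches the "Drive Array ... Status Change" events quoted in this thread.
STATUS_RE = re.compile(
    r"Drive Array (Physical|Logical) Drive Status Change\."
    r".*?has a new status of (\d+)\.",
    re.S,
)


def decode_status(description):
    """Return (drive kind, status name) for a status-change event, else None."""
    match = STATUS_RE.search(description)
    if match is None:
        return None
    kind, code = match.group(1), int(match.group(2))
    table = PHYSICAL_DRIVE_STATUS if kind == "Physical" else LOGICAL_DRIVE_STATUS
    return kind.lower(), table.get(code, f"unknown({code})")
```

Applied to the 2:25:51 PM physical drive event, this yields the "failed" status, which matches the WRITE_RETRIES_FAILED notification logged ten seconds earlier.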
 
LVL 1

Author Comment

by:jroozee
ID: 40377182
(PS) The Smart Array P812 controller card has also been replaced with a brand-new one since this started happening, and the problem still exists. So it's certainly not the controller card.
 
LVL 1

Author Comment

by:jroozee
ID: 40377218
Here's something I noticed. I just checked a random sample of the drives in the array: they are model DH072BAAKN and are running firmware HPD3 (which is the most current firmware for that model).

However, the drive in the failed slot, bay 3, is a different model (DH072BB978) running firmware HPD7, which is newer than HPD3 but not the most current for this model, which is HPD9.


Could this be a problem of mismatched model numbers or firmware versions?
 
LVL 1

Author Comment

by:jroozee
ID: 40377231
Then again, I found a HDD Model #IBM-ESXSCBRBA073C3ETS0   F/W C49C  in the array...
 
LVL 56

Expert Comment

by:andyalder
ID: 40377328
Can you use the Array Diagnostic Utility to generate a report and upload it as an attachment, please? There certainly shouldn't be an IBM drive in the array; that would have a different capacity than the HP-branded drives and also possibly different TLER/CCTL settings.
 
LVL 1

Author Comment

by:jroozee
ID: 40377349
0
 
LVL 56

Accepted Solution

by:
andyalder earned 2000 total points
ID: 40377741
It's not mirrored to the IBM disk in the second box but to the one next to it: "10       Physical Drive (72 GB SAS) 3E:1:3   Physical Drive (72 GB SAS) 3E:2:4". I can't see anything wrong with 3E:2:4. I do see hot-removed as the last failure reason for almost all the disks; that's unexpected unless you had a bus disconnect.

3E:1:3 has a high number of bus faults. Since the backplane and disk have been eliminated, I would suspect the SAS expander, which I think is on the I/O module. You have two of them, so I would try one at a time in non-redundant mode. I'm not sure how much dual-path is used as far as individual disks are concerned: if the controller-to-box cable fails, that's compensated for, but if the expander-to-disk link fails, that may not be covered except by removing the I/O module or the main SAS cable.
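The drive addresses used here follow a port:box:bay pattern (3E:1:3 is port 3E, box 1, bay 3). Below is a minimal Python sketch for pulling mirror partners out of a report line. It assumes the line format matches the single line quoted above; the function name and regular expression are hypothetical, not part of the ADU.

```python
import re

# Matches addresses such as "3E:1:3" (port, box, bay).
ADDRESS_RE = re.compile(r"\b(\d+[A-Z]):(\d+):(\d+)\b")


def parse_drive_addresses(line):
    """Return every port:box:bay address found in one report line."""
    return [
        {"port": port, "box": int(box), "bay": int(bay)}
        for port, box, bay in ADDRESS_RE.findall(line)
    ]


# The mirror-pair line quoted from the report in this thread.
pair_line = ("10       Physical Drive (72 GB SAS) 3E:1:3   "
             "Physical Drive (72 GB SAS) 3E:2:4")
for address in parse_drive_addresses(pair_line):
    print(address)
```

Listing both members of each pair this way makes it easier to check the surviving partner (here 3E:2:4) as well as the drive that keeps failing.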
 
LVL 1

Author Comment

by:jroozee
ID: 40378155
Sounds good. I'm going to try that as well as make sure all hard-drives are of the same model #.

I'll update you once done (may be a few days before I can try this)
 
LVL 56

Expert Comment

by:andyalder
ID: 40378185
There's no need to have the same model number as long as the HP spare part number matches; see the HP SAS matrix.
 
LVL 1

Author Comment

by:jroozee
ID: 40378315
I thought you said that the IBM drive shouldn't be in there? Also, what did you mean when you said it's not mirrored to the IBM drive?
 
LVL 56

Expert Comment

by:andyalder
ID: 40378677
HP uses drives from various manufacturers and then puts its own firmware on them to make them all behave the same. The HP SAS/SATA matrix I linked above lists DH072BAAKN as spare part number 418398-001, and it lists DH072BB978 under the same spare part number, so those two drives can be considered identical even though HP got them from different disk manufacturers.

That matrix also lists the latest firmware for the drives. Note that for each drive model HP buys in, the first firmware release is HPD0, the next release is HPD1, and so on. An ancient U160 SCSI disk may have HPDx firmware and a brand-new 6Gb SAS disk may also have HPDx firmware, but those two firmware files have no relation to each other.
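Put differently, an HPDx revision is only meaningful when compared against the latest revision for the same drive model. Below is a minimal Python sketch of that check. The table holds only the models and revisions mentioned in this thread, the function names are hypothetical, and it assumes the revision suffix is a plain decimal number.

```python
# Latest firmware per model, as reported in this thread (not an official list).
LATEST_FIRMWARE = {
    "DH072BAAKN": "HPD3",
    "DH072BB978": "HPD9",
}


def revision_number(firmware):
    """Return n from an 'HPDn' string. Assumes a decimal suffix."""
    if not firmware.startswith("HPD") or not firmware[3:].isdigit():
        raise ValueError(f"unexpected firmware string: {firmware!r}")
    return int(firmware[3:])


def firmware_status(model, installed):
    """Compare installed firmware with the latest known for the SAME model."""
    latest = LATEST_FIRMWARE.get(model)
    if latest is None:
        return "unknown model"
    if revision_number(installed) < revision_number(latest):
        return f"outdated (latest is {latest})"
    return "current"


print(firmware_status("DH072BAAKN", "HPD3"))
print(firmware_status("DH072BB978", "HPD7"))
```

The lookup is keyed by model on purpose: comparing HPD7 on one model with HPD3 on another says nothing about which drive is more up to date.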

The IBM disk isn't listed in the HP SAS matrix, and I agree you should get rid of it when you can. It isn't causing your current problem, though.

You have a RAID 10 array: each disk is mirrored to another disk for redundancy, then all the mirrored pairs are striped to increase capacity and overall performance. The adureport.txt file lists which disks are paired with each other. Sometimes rebuild problems are caused by the so-called working disk rather than by the replacement for its failed partner, but not in this case by the looks of it.
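The mirror-then-stripe idea can be sketched in a few lines of Python. This is an illustration only: the 3E:1:3 / 3E:2:4 pairing comes from the report line quoted earlier, while the other pairs, the function names, and the simple round-robin stripe mapping are assumptions. The Smart Array's actual on-disk layout may differ.

```python
# Mirrored pairs. Only the first pairing is taken from the ADU report;
# the remaining pairs are hypothetical placeholders.
MIRROR_PAIRS = [
    ("3E:1:3", "3E:2:4"),
    ("3E:1:4", "3E:2:5"),  # hypothetical
    ("3E:1:5", "3E:2:6"),  # hypothetical
]


def disks_for_stripe(stripe_index, pairs=MIRROR_PAIRS):
    """Stripes rotate across pairs; both disks in a pair hold the same data."""
    return pairs[stripe_index % len(pairs)]


def array_survives(failed_disks, pairs=MIRROR_PAIRS):
    """The array stays up as long as no pair has lost both of its members."""
    failed = set(failed_disks)
    return all(not (a in failed and b in failed) for a, b in pairs)


print(disks_for_stripe(0))
print(array_survives(["3E:1:3"]))
print(array_survives(["3E:1:3", "3E:2:4"]))
```

This is why the health of the partner disk matters during a rebuild: while 3E:1:3 is out, 3E:2:4 is the only remaining copy of that pair's data.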
 
LVL 1

Author Comment

by:jroozee
ID: 40406392
I determined that one of the I/O modules was bad. I swapped it out and the problem went away.

Thanks for your help!


Question has a verified solution.
