
Solved

RAID Array Rebuild failed: controller 1, logical drive 1 ("Primary")

Posted on 2011-09-21
Last Modified: 2014-07-28
System: HP ProLiant ML 110 Storage Server (Windows Storage Server 2003), RAID Controller: Adaptec 2410SA (4X 160GB).

Prompted by an audible disk status alarm, I investigated with HP Storage Manager v. 3.10.00 (4242). Port 0 (Physical Drive) and the Logical Drive ("Primary") were marked with a "Bang" (exclamation mark in a yellow triangle), and the Logical Drive showed "Degraded" status.

Port 0 (Physical Drive) was marked "Set to Fail" and a new OEM HP replacement disk was installed. The rebuild commenced automatically and the Data Logical Drive is now "Optimal". However, Port 0 (Physical Drive), which was replaced, and the Logical Drive ("Primary") remain marked with a "Bang", and the Logical Drive ("Primary") remains "Degraded" (Ports 1-3 are all "Optimal").

I ran the POST Adaptec (CTRL-A) BIOS utility and used "VERIFY & FIX" on the new Port 0 (Physical Drive), followed by the "Scan for New Disks" utility. Nevertheless, the Logical Drive ("Primary") remains degraded, with Port 0 (Physical Drive) marked with a "Bang".

The most obvious explanation is that the replacement disk is itself at fault. However, I suspect that this is just too easy, especially since we ran the "VERIFY & FIX" utility. (A replacement is, however, on order.) I am also skeptical that "power" or "cable" problems are the cause, as we were able to run the "VERIFY & FIX" utility.

Obviously, one needs to stay ahead on RAID array failures.
Question by:cekatz
13 Comments
 
LVL 47

Expert Comment

by:David
ID: 36575185
You could have a bad block on the surviving disk.
 

Author Comment

by:cekatz
ID: 36575904
When you refer to the "surviving disk", are you referring to Port 0 (Physical Drive), i.e. the replaced drive? Would not the "VERIFY & FIX" Storage Manager utility have eliminated any issue that would otherwise have been caused by such bad block(s)?

FYI, I should note for clarity's sake that, at the outset, Logical Drive 3 ("DATA") was similarly "Degraded", but after the rebuild it was marked "Optimal".
 
LVL 47

Expert Comment

by:David
ID: 36575936
No, the drive that did not fail has a bad block.  If there is a bad block on the replacement disk, it just gets repaired automatically when the disk is synced up.  If you have a bad block on one of the other disks, then the controller doesn't know what to put on the replacement drive.

You have to repair the degraded LUN by writing zeros to the bad area.  Run the repair there.
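
To illustrate the point above, here is a minimal Python sketch of a rebuild, not the Adaptec controller's actual logic. It assumes the array is either a two-disk mirror or single-parity striping (the thread does not state the RAID level), in which case the replaced disk's block is the XOR of the surviving members' blocks. The names `rebuild_block`, `rebuild` and `zero_fill` are hypothetical. An unreadable block on a survivor is modeled as `None`.

```python
from functools import reduce
from typing import List, Optional, Sequence, Tuple

Block = Optional[bytes]  # None models an unreadable (bad) block on a survivor


def rebuild_block(survivors: Sequence[Block]) -> Block:
    """Reconstruct the replaced disk's block from the surviving members.

    XOR covers a two-disk mirror (one survivor, so the XOR is a plain copy)
    and single-parity striping. Returns None if any survivor is unreadable,
    because there is then nothing reliable to write to the new disk.
    """
    if any(b is None for b in survivors):
        return None
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*survivors))


def rebuild(stripes: Sequence[Sequence[Block]]) -> Tuple[List[Block], List[int]]:
    """Rebuild every stripe; also return the indexes of stripes that failed."""
    rebuilt, failed = [], []
    for index, survivors in enumerate(stripes):
        block = rebuild_block(survivors)
        if block is None:
            failed.append(index)
        rebuilt.append(block)
    return rebuilt, failed


def zero_fill(survivors: Sequence[Block], block_size: int) -> List[Block]:
    """Model the 'write zeros to the bad area' repair: the stripe becomes
    readable again, but the original contents of the bad block are lost."""
    return [b if b is not None else bytes(block_size) for b in survivors]


if __name__ == "__main__":
    stripes = [[b"\x01\x02", b"\x03\x04"], [b"\x0f\x0f", None]]
    print(rebuild(stripes))
```

In this model a bad block on the new disk never matters, since the new disk is only written to. A bad block on a survivor stalls that stripe until it is zeroed, at which point the rebuild can finish but the data in that stripe is no longer trustworthy.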

 

Author Comment

by:cekatz
ID: 36576039
Neglected to mention that Port 0 (Physical Disk) is still displaying "Rebuilding" as its status in Storage Manager. However, there is clearly no activity occurring: when a rebuild is in progress, the Logical Drives display (i) a % completion figure and (ii) an animated disk graphic. Further, this display has been static for several days, despite the "Rebuild failed" entry in the Event Log.
 

Author Comment

by:cekatz
ID: 36576264
dlethe:

Got it!

That makes good sense.

I notice that on the Windows (GUI) version of Storage Manager the "Verify with fix" is not selectable, i.e. "greyed out". Am I correct to intuit that we can use the Storage Manager BIOS Utility to "Verify with fix" on the LUN, or is the correct process to "Verify with fix" each of the other three Physical Disks, Port(s) 1-3?
 

Author Comment

by:cekatz
ID: 36581999
We ran the Storage Manager BIOS Utility (Disk utility) "Verify". Of the four Physical Disks, Port 1 returned "Verification failed". The other three disks all verified. There was no "Fix" done for Port 1, just the error message. How can the "Fix" be accomplished? This does seem to be at the root of the array problem.
 

Author Comment

by:cekatz
ID: 36583420
Update: "Verify with fix" is "greyed out" in the Storage Manager GUI. This is the case even when logged on directly to the server with a local rather than a domain account.
 
LVL 47

Expert Comment

by:David
ID: 36583591
It takes 10-60 seconds for disks to deal with just ONE bad block.  You have billions of blocks.  It's working.  You need to face facts, however: your online disk is under great stress now and could die at any moment.  If it was MY data and was worth $1000+, then I would strongly consider a data recovery firm.  If that is a little steep, make sure you take a full backup NOW, kill every process you can and let the system just crank away.  I've seen such things take days.  Don't rock the boat.

No matter what you DO have some data corruption / loss.  Be prepared for backup software to crap out on you, and for the files you do get to have some corruption.  You can't trust any of your data.
 

Author Comment

by:cekatz
ID: 36583792
Ok, so what you are telling me is that despite the fact that seemingly nothing is happening, the "Rebuilding" status on Port 0 is a reliable indication that an actual rebuild process is underway (even though the Logical Disk is not showing the usual "rebuild animation") ? And what of the fact that the BIOS Utility "Verify" returned "Verification failed" on Port 1, without fixing that issue?

Our issue is not data protection/recovery, we have multiple tiers of back-up prior to this issue, with no exposure to critical data loss. The primary issue is that the Storage Server is compromised and out-of-production for the duration.

Finally, and not to be cynical, but... how many days? Here is the array: Port 0 (465 GB), Port 1 (149 GB), Port 2 (149 GB), and Port 3 (232 GB). FYI, we've been dumping the old Maxtor 160 GB drives as they have failed.
 
LVL 47

Accepted Solution

by:
David earned 2000 total points
ID: 36584049
To be politically incorrect, your controller is a piece of crap. It could very well be hung. But statistically it is more likely you have a chunk of bad blocks.  If you have just 1 MB worth of bad blocks (2048 blocks), then in an imperfect world it could take some controllers & disks 2000 minutes to get through that. Granted, that is a worst-case scenario.  If you don't have any diagnostics that let you see what is going on (and that controller won't let you monitor individual disks to confirm block numbers and status), then you just have to make a mental coin flip to

1. wait it out
2. shut the box down, build a fresh raid, restore and move on.

You could run some scanning software to see how many bad blocks you have, but that puts you at even further risk.   My suggestion is that you invest in SAS drives.   By any chance are these desktop/consumer disks?   That would explain the lockups.  Enterprise class SATA disks will typically guarantee a rebuild/reassignment in around 5 seconds. Desktop/consumer disks are the ones that can go from 30-60 seconds each.
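
The timing figures quoted in this answer can be checked with a little arithmetic. The Python sketch below is only an estimate under stated assumptions: 512-byte sectors (which is what makes 1 MB equal 2048 blocks) and bad blocks handled strictly one after another. The function names are hypothetical, and the per-block times are the ones given in this thread, not measurements.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors, so 1 MB is 2048 blocks


def bad_block_count(bad_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Number of sectors needed to cover bad_bytes (ceiling division)."""
    return -(-bad_bytes // sector_bytes)


def stall_minutes(bad_blocks: int, seconds_per_block: float) -> float:
    """Total minutes spent on bad blocks if each is handled sequentially."""
    return bad_blocks * seconds_per_block / 60


if __name__ == "__main__":
    blocks = bad_block_count(1024 * 1024)  # 1 MB of bad area
    # Per-block recovery times as quoted in the thread.
    for label, seconds in [
        ("enterprise SATA, about 5 s", 5),
        ("desktop disk, 30 s", 30),
        ("desktop disk, 60 s", 60),
    ]:
        minutes = stall_minutes(blocks, seconds)
        print(f"{label}: {blocks} blocks, {minutes:.0f} min ({minutes / 60:.1f} h)")
```

At 60 seconds per block, 2048 blocks come to 2048 minutes, roughly 34 hours, which is where the "2000 minutes" worst case comes from. At 5 seconds per block the same damage costs under three hours.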

If it was me, I would first check to see if you have the latest firmware, drivers, BIOS, and enterprise class disks.  You can't risk changing anything while it is degraded, but if your Adaptec software is ancient, then look at release notes and updates to see if any address such behavior.  If everything is current, and you have the right kind of disks, then shut it down and restart and hopefully it will be OK. Otherwise prepare to restore after rebuilding the config on 2 fresh known good disks.

Sorry I can't tell you what is going on, but your hardware just doesn't have the premium features necessary to provide details such as that, so .. just figure out how long you are willing to wait, and be prepared to do a restore.  
 

Author Comment

by:cekatz
ID: 36589454
dlethe,

Again, thank you for the insight. The system in question is an HP ProLiant ML 110 Storage Server (ancient). Interestingly enough, as we are backed up, the value of investing substantial "brain power" into bringing this bit of "low-end" tin back on line is "negligible". Initially we implemented this server because it was "cheap" and had all the functionality we needed. Over the course of its humble but highly productive and long-lived life cycle we learned that the following capabilities are indispensable for any new storage servers: (1) 1U-2U Form Factor (avoids back sprains), (2) Hot Swappable (avoids dirty fingernails and wasted time), (3) On-Line Spare (avoids all manner of problems), (4) Within OEM "Life Cycle" Support (including Extended Contract). I can see that the better part of valor here is the ol' "get your check book out" play, as reestablishing this server in production would be a Pyrrhic victory, with no benefit to our IT infrastructure.
