Solved

RAID Array Rebuild failed: controller 1, logical drive 1 ("Primary")

Posted on 2011-09-21
Last Modified: 2014-07-28
System: HP ProLiant ML 110 Storage Server (Windows Storage Server 2003); RAID Controller: Adaptec 2410SA (4x 160 GB).

Prompted by an audible disk-status alarm, I investigated with HP Storage Manager v. 3.10.00 (4242). Port 0 (Physical Drive) and the Logical Drive ("Primary") were marked with a "Bang" (exclamation point in a yellow triangle), and the Logical Drive showed "Degraded" status.

Port 0 (Physical Drive) was marked "Set to Fail", and a new OEM HP replacement disk was installed. The rebuild commenced automatically and the "DATA" Logical Drive is now "Optimal"; however, Port 0 (Physical Drive), which was replaced, and the Logical Drive ("Primary") remain marked with a "Bang", and the Logical Drive ("Primary") remains "Degraded" (Ports 1-3 are all "Optimal").

I ran the Adaptec POST (CTRL-A) BIOS utility and used "VERIFY & FIX" on the new Port 0 (Physical Drive), as well as the "Scan for New Disks" utility. Notwithstanding, the Logical Drive ("Primary") remains "Degraded", with Port 0 (Physical Drive) marked with a "Bang".

The most obvious explanation is that the replacement disk is itself at fault; however, I suspect that this is just too easy, especially since we ran the "VERIFY & FIX" utility. (A replacement is, however, on order.) I am also skeptical that power or cable issues are to blame, as we were able to run the "VERIFY & FIX" utility.

Obviously, one needs to stay ahead of RAID array failures.
Question by:cekatz
13 Comments
 
LVL 47

Expert Comment

by:dlethe
ID: 36575185
You could have a bad block on the surviving disk.
 

Author Comment

by:cekatz
ID: 36575904
When you refer to the "surviving disk", are you referring to Port 0 (Physical Drive), i.e., the replaced drive? Would not the "VERIFY & FIX" Storage Manager utility have eliminated any issue that would otherwise have been caused by such bad block(s)?

FYI, for clarity's sake, I should note that at the onset Logical Drive 3 ("DATA") was, of course, similarly "Degraded", but after the rebuild it was marked "Optimal".
 
LVL 47

Expert Comment

by:dlethe
ID: 36575936
No, the drive that did not fail has a bad block. If there is a bad block on the replacement disk, it just gets repaired automatically when the disk is synced up. If there is a bad block on one of the other disks, then the controller doesn't know what to put on the replacement drive.

You have to repair the degraded LUN by writing zeros to the bad area.  Run the repair there.
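To illustrate the point: a RAID 5 rebuild regenerates each chunk of the replacement disk by XOR-ing the corresponding chunks from all surviving members, so a single unreadable block on any survivor leaves the controller with nothing valid to write. Here is a minimal Python sketch of that logic; the chunk size and the zero-fill "repair" step are illustrative assumptions, not the Adaptec firmware's actual algorithm:

```python
from functools import reduce

CHUNK = 64 * 1024  # illustrative stripe-unit size; the controller's real setting may differ


class BadBlockError(Exception):
    """A surviving member could not return a chunk (unrecoverable read error)."""


def xor_chunks(chunks):
    """XOR equal-length byte strings together (RAID 5 parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild(survivors, stripe_count, read_chunk, write_chunk, zero_fill_bad=False):
    """Regenerate the replacement disk, stripe by stripe, from the surviving members.

    read_chunk(disk, stripe_no) must return CHUNK bytes or raise BadBlockError;
    write_chunk(stripe_no, data) writes the regenerated chunk to the new disk.
    """
    for stripe_no in range(stripe_count):
        try:
            data = xor_chunks([read_chunk(disk, stripe_no) for disk in survivors])
        except BadBlockError:
            if not zero_fill_bad:
                # The situation described above: one unreadable block on a
                # survivor means the controller has nothing valid to write,
                # so the rebuild stops ("Rebuild failed").
                raise
            # "Repair by writing zeros": sacrifice that stripe's contents so
            # the array can become consistent again; the data there is lost.
            data = bytes(CHUNK)
        write_chunk(stripe_no, data)
```

That is also why the zero-fill repair costs you data: whatever lived in the bad area is gone, but the array can finish syncing.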
 

Author Comment

by:cekatz
ID: 36576039
I neglected to mention that Port 0 (Physical Disk) is still displaying "Rebuilding" as its status in Storage Manager. However, there is clearly no activity occurring: when a rebuild is in process, the Logical Drives display (i) a % completion figure and (ii) an animated disk graphic. Further, this display has been static for several days, despite the "Rebuild failed" entry in the Event Log.
 

Author Comment

by:cekatz
ID: 36576264
dlethe:

Got it!

That makes good sense.

I notice that on the Windows (GUI) version of Storage Manager the "Verify with fix" is not selectable, i.e. "greyed out". Am I correct to intuit that we can use the Storage Manager BIOS Utility to "Verify with fix" on the LUN, or is the correct process to "Verify with fix" each of the other three Physical Disks, Port(s) 1-3?
 

Author Comment

by:cekatz
ID: 36581999
We ran the Storage Manager BIOS utility (Disk Utilities) "Verify". Of the four Physical Disks, Port 1 returned "Verification failed"; the other three disks all verified. There was no "Fix" done for Port 1, just the error message. How can the "Fix" be accomplished? This does seem to be at the root of the array problem.
 

Author Comment

by:cekatz
ID: 36583420
Update: "Verify with fix" is "greyed out" in the Storage Manager GUI, including when logged on to the server directly with a local account rather than a Domain account.
 
LVL 47

Expert Comment

by:dlethe
ID: 36583591
It takes 10-60 seconds for a disk to deal with just ONE bad block. You have billions of blocks. It's working. You need to face facts, however: your online disk is under great stress now and could die at any moment. If it was MY data and it was worth $1000+, then I would strongly consider a data recovery firm. If that is a little steep, make sure you take a full backup NOW, kill every process you can, and let the system just crank away. I've seen such things take days. Don't rock the boat.

No matter what, you DO have some data corruption/loss. Be prepared for the backup software to crap out on you, and for the files you do get to have some corruption. You can't trust any of your data.
 

Author Comment

by:cekatz
ID: 36583792
OK, so what you are telling me is that despite the fact that seemingly nothing is happening, the "Rebuilding" status on Port 0 is a reliable indication that an actual rebuild process is underway (even though the Logical Disk is not showing the usual rebuild animation)? And what of the fact that the BIOS utility "Verify" returned "Verification failed" on Port 1 without fixing that issue?

Our issue is not data protection/recovery; we have multiple tiers of backup predating this issue, with no exposure to critical data loss. The primary issue is that the Storage Server is compromised and out of production for the duration.

Finally, and not to be cynical, but... how many days? Here is the array: Port 0 (465 GB), Port 1 (149 GB), Port 2 (149 GB), and Port 3 (232 GB). FYI, we've been dumping the old Maxtor 160 GB drives as they have failed.
 
LVL 47

Accepted Solution

by: dlethe (earned 500 total points)
ID: 36584049
To be politically incorrect, your controller is a piece of crap. It could very well be hung, but statistically it is more likely that you have a chunk of bad blocks. If you have just 1 MB's worth of bad blocks (2,048 sectors), then with some controllers and disks, in an imperfect world, it could take 2,000 minutes to get through that. Granted, that is a worst-case scenario. If you don't have any diagnostics that let you see what is going on (and that controller won't let you monitor individual disks to confirm block numbers and status), then you just have to make a mental coin flip:

1. wait it out
2. shut the box down, build a fresh raid, restore and move on.

You could run some scanning software to see how many bad blocks you have, but that puts you at even further risk. My suggestion is that you invest in SAS drives. By any chance, are these desktop/consumer disks? That would explain the lockups. Enterprise-class SATA disks will typically guarantee a rebuild/reassignment in around 5 seconds; desktop/consumer disks are the ones that can take 30-60 seconds each.
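To put rough numbers on the "how many days" question, the worst case scales with the number of bad sectors times the per-sector error-recovery timeout, which is where the 2,000-minutes-for-1-MB figure above comes from. A back-of-the-envelope Python sketch (the bad-block count and timeouts are illustrative assumptions, not measurements from this array):

```python
SECTOR_BYTES = 512

def worst_case_stall_minutes(bad_bytes, seconds_per_bad_sector):
    """Pessimistic estimate: every bad sector costs one full error-recovery timeout."""
    bad_sectors = bad_bytes // SECTOR_BYTES
    return bad_sectors * seconds_per_bad_sector / 60

# 1 MB of bad blocks = 2,048 sectors.
# Desktop/consumer disk with ~60 s error recovery: roughly 2,048 minutes (~1.4 days).
print(worst_case_stall_minutes(1024 * 1024, 60))  # 2048.0
# Enterprise-class disk limited to ~5 s of recovery: roughly 171 minutes (~3 hours).
print(worst_case_stall_minutes(1024 * 1024, 5))   # ~170.7
```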

If it were me, I would first check that you have the latest firmware, drivers, and BIOS, and enterprise-class disks. You can't risk changing anything while it is degraded, but if your Adaptec software is ancient, then look at the release notes for updates to see whether any address such behavior. If everything is current and you have the right kind of disks, then shut it down, restart, and hopefully it will be OK. Otherwise, prepare to restore after rebuilding the config on two fresh, known-good disks.

Sorry I can't tell you exactly what is going on, but your hardware just doesn't have the premium features necessary to provide details such as that, so... just figure out how long you are willing to wait, and be prepared to do a restore.
 

Author Comment

by:cekatz
ID: 36589454
dlethe,

Again, thank you for the insight. The system in question is an HP ProLiant ML 110 Storage Server (ancient). Interestingly enough, since we are backed up, the value of investing substantial "brain power" into bringing this bit of "low-end" tin back online is negligible. Initially we implemented this server because it was "cheap" and had all the functionality we needed. Over the course of its humble but highly productive and long-lived life cycle, we learned that the following capabilities are indispensable for any new storage server: (1) 1U-2U form factor (avoids back sprains), (2) hot-swappable drives (avoids dirty fingernails and wasted time), (3) an online spare (avoids all manner of problems), and (4) being within OEM "life cycle" support (including an extended contract). I can see that the better part of valor here is the ol' "get your checkbook out" play, as reestablishing this server in production would be a Pyrrhic victory, with no benefit to our IT infrastructure.