cekatz asked:
RAID Array Rebuild failed: controller 1, logical drive 1 ("Primary")

System: HP ProLiant ML110 Storage Server (Windows Storage Server 2003); RAID controller: Adaptec 2410SA (4 × 160 GB).

Prompted by an audible disk-status alarm, I investigated with HP Storage Manager v3.10.00 (4242). Port 0 (physical drive) and the logical drive ("Primary") were marked with a "bang" (exclamation point in a yellow triangle), and the logical drive showed "Degraded" status.

Port 0 (physical drive) was marked "Set to Fail" and a new HP OEM replacement disk was installed. The rebuild commenced automatically and the "Data" logical drive is now "Optimal"; however, Port 0 (the replaced physical drive) and the logical drive ("Primary") remain marked with a "bang", and "Primary" remains "Degraded" (Ports 1-3 are all "Optimal").

I ran the Adaptec POST BIOS utility (CTRL-A), used "Verify & Fix" on the new Port 0 physical drive, and ran the "Scan for New Disks" utility. Notwithstanding, the logical drive ("Primary") remains degraded, with Port 0 still marked with a "bang".

The most obvious explanation is that the replacement disk is itself at fault, but I suspect that is too easy, especially since we ran the "Verify & Fix" utility. (A second replacement is nevertheless on order.) I am also skeptical that power or cabling is the issue, since "Verify & Fix" ran to completion.

Obviously, one needs to stay ahead on RAID array failures.
David replied:

You could have a bad block on the surviving disk.
cekatz (Asker):

When you refer to the "surviving disk," are you referring to Port 0 (the physical drive that was replaced)? Would not the "Verify & Fix" Storage Manager utility have eliminated any issue that such bad block(s) would otherwise have caused?

FYI, for clarity's sake I should note that, at the onset, logical drive 3 ("DATA") was similarly "Degraded", but after the rebuild it was marked "Optimal".
David replied:

No, the drive that did not fail has a bad block. If there is a bad block on the replacement disk, it simply gets repaired automatically when the disk is synced up. But if there is a bad block on one of the other disks, the controller doesn't know what to write to the replacement drive.

You have to repair the degraded LUN by writing zeros to the bad area. Run the repair there.
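For what it's worth, the mechanics behind this can be sketched in a few lines. In a parity RAID set (RAID 5, by assumption here), a lost block is reconstructed by XOR-ing the corresponding blocks from every surviving member, so a single unreadable block on any surviving disk makes that stripe impossible to rebuild. This is an illustrative Python sketch, not the controller's actual firmware logic:

```python
from functools import reduce

def rebuild_block(surviving_blocks):
    """Reconstruct the missing block of one RAID 5 stripe.

    surviving_blocks: the corresponding blocks from every remaining
    member (data + parity). An unreadable block is passed as None.
    """
    if any(b is None for b in surviving_blocks):
        # One bad block on a surviving disk means this stripe cannot
        # be reconstructed -- the rebuild stalls or fails here.
        raise IOError("unreadable block on a surviving member")
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  surviving_blocks)

# Three surviving members of a 4-disk RAID 5; the fourth is being rebuilt.
d1, d2 = b"\x0f\x0f", b"\xf0\xf0"
parity = bytes(x ^ y ^ z for x, y, z in zip(d1, d2, b"\xaa\xaa"))
print(rebuild_block([d1, d2, parity]))  # recovers the lost block b"\xaa\xaa"
```

With all surviving blocks readable, the XOR recovers the lost data exactly; swap any member for `None` and the function raises, which is the software analogue of the stalled rebuild described above.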
cekatz (Asker):

I neglected to mention that Port 0 (physical disk) still displays "Rebuilding" as its status in Storage Manager. However, there is clearly no activity occurring: when a rebuild is in process, the logical drives display (i) a % completion figure and (ii) an animated disk graphic. This display has been static for several days, despite the "Rebuild failed" entry in the event log.
cekatz (Asker):

dlethe:

Got it!

That makes good sense.

I notice that in the Windows (GUI) version of Storage Manager, "Verify with fix" is not selectable, i.e. it is greyed out. Am I correct to intuit that we can use the Storage Manager BIOS utility to run "Verify with fix" on the LUN, or is the correct process to run "Verify with fix" on each of the other three physical disks (Ports 1-3)?
cekatz (Asker):

We ran the Storage Manager BIOS utility's disk "Verify". Of the four physical disks, Port 1 returned "Verification failed"; the other three verified. No "Fix" was performed on Port 1, just the error message. How can the "Fix" be accomplished? This does seem to be at the root of the array problem.
cekatz (Asker):

Update: "Verify with fix" is greyed out in the Storage Manager GUI, including when logged on to the server directly with a local account rather than a domain account.
David replied:

It takes 10-60 seconds for a disk to deal with just ONE bad block, and you have billions of blocks. It's working. You need to face facts, however: your online disk is under great stress now and could die at any moment. If it were MY data and it was worth $1000+, I would strongly consider a data recovery firm. If that is a little steep, take a full backup NOW, kill every process you can, and let the system just crank away. I've seen such things take days. Don't rock the boat.

No matter what, you DO have some data corruption/loss. Be prepared for backup software to crap out on you, and for the files you do get to have some corruption. You can't trust any of your data.
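To put those numbers in perspective, a quick back-of-the-envelope calculation helps (assuming the classic 512-byte sectors of drives from that era, and taking the 10-60 second figure as the retry cost per bad sector; the bad-sector count below is purely hypothetical):

```python
# Sector count on one 160 GB member, and the worst-case retry time a
# modest run of pending bad sectors could add to a rebuild.
SECTOR_BYTES = 512                    # assumption: 512-byte-sector drives
disk_bytes = 160 * 10**9              # one 160 GB member
sectors = disk_bytes // SECTOR_BYTES
print(f"{sectors:,} sectors per 160 GB disk")      # 312,500,000

bad_sectors = 500                     # hypothetical count
lo_hours = bad_sectors * 10 / 3600    # at 10 s per bad sector
hi_hours = bad_sectors * 60 / 3600    # at 60 s per bad sector
print(f"{bad_sectors} bad sectors -> {lo_hours:.1f} to {hi_hours:.1f} hours of retries")
```

Even a few hundred slow sectors out of 300+ million can translate into hours of apparent "stall" with no visible progress indicator.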
cekatz (Asker):

OK, so what you are telling me is that, despite the fact that seemingly nothing is happening, the "Rebuilding" status on Port 0 is a reliable indication that an actual rebuild is underway (even though the logical disk is not showing the usual rebuild animation)? And what of the fact that the BIOS utility "Verify" returned "Verification failed" on Port 1 without fixing that issue?

Our issue is not data protection/recovery, we have multiple tiers of back-up prior to this issue, with no exposure to critical data loss. The primary issue is that the Storage Server is compromised and out-of-production for the duration.

Finally, and not to be cynical, but... how many days? Here is the array: Port 0 (465 GB), Port 1 (149 GB), Port 2 (149 GB), and Port 3 (232 GB). FYI, we've been dumping the old Maxtor 160 GB drives as they have failed.
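As an aside on the mixed drive sizes listed above: in a single parity set, each member can only contribute as much capacity as the smallest disk. Assuming one RAID 5 set across all four ports (which may not match the actual logical-drive layout on this controller), a quick sketch of the arithmetic:

```python
# Usable RAID 5 capacity with mixed-size members: every member
# contributes only the capacity of the smallest disk, and one
# disk's worth of that goes to parity.
drives_gb = [465, 149, 149, 232]      # Ports 0-3 as reported
usable = min(drives_gb) * (len(drives_gb) - 1)
stranded = sum(drives_gb) - min(drives_gb) * len(drives_gb)
print(f"usable: {usable} GB; stranded above the smallest member: {stranded} GB")
```

On these figures only 149 GB of each member is usable, so the larger replacement drives strand a substantial amount of raw capacity.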
ASKER CERTIFIED SOLUTION
David
This solution is only available to members of Experts Exchange.
cekatz (Asker):

dlethe,

Again, thank you for the insight. The system in question is an HP ProLiant ML110 Storage Server (ancient). Interestingly enough, since we are backed up, the value of investing substantial brain power into bringing this bit of low-end tin back online is negligible. We originally implemented this server because it was cheap and had all the functionality we needed. Over the course of its humble but highly productive and long-lived life cycle, we learned that the following capabilities are indispensable for any new storage server: (1) 1U-2U form factor (avoids back sprains), (2) hot-swappable drives (avoids dirty fingernails and wasted time), (3) an online spare (avoids all manner of problems), and (4) remaining within the OEM life-cycle support window (including an extended contract). I can see that the better part of valor here is the ol' "get your checkbook out" play, as re-establishing this server in production would be a Pyrrhic victory, with no benefit to our IT infrastructure.