cekatz
asked on
RAID Array Rebuild failed: controller 1, logical drive 1 ("Primary")
System: HP ProLiant ML 110 Storage Server (Windows Storage Server 2003), RAID Controller: Adaptec 2410SA (4X 160GB).
Prompted by the audible disk status alarm, I investigated with HP Storage Manager v. 3.10.00 (4242). Port 0 (Physical Drive) and the Logical Drive ("Primary") were marked with a "Bang" (exclamation mark in a yellow triangle), and the Logical Drive showed "Degraded" status.
Port 0 (Physical Drive) was marked "Set to Fail" and a new OEM HP replacement disk was installed. The Rebuild commenced automatically and the Data Logical Drive is now "Optimal". However, Port 0 (Physical Drive), which was replaced, and the Logical Drive ("Primary") remain marked with a "Bang", and the Logical Drive ("Primary") remains "Degraded" (Ports 1-3 are all "Optimal").
I ran the POST Adaptec (CTRL-A) BIOS utility and used "VERIFY & FIX" on the new Port 0 (Physical Drive), followed by the "Scan for New Disks" utility. Nevertheless, the Logical Drive ("Primary") remains degraded with Port 0 (Physical Drive) marked with a "Bang".
The most obvious explanation is that the replacement disk is itself at fault. However, I suspect that this is too easy, especially since we ran the "VERIFY & FIX" utility. (A replacement is, however, on order.) I am also skeptical that "power" or "cable" problems are the cause, as we were able to run the "VERIFY & FIX" utility.
Obviously, one needs to stay ahead on RAID array failures.
You could have a bad block on the surviving disk.
ASKER
When you refer to the "surviving disk", are you referring to Port 0 (Physical Drive), i.e. the replaced drive? Would the "VERIFY & FIX" Storage Manager utility not have eliminated any issue that would otherwise have been caused by such bad block(s)?
FYI, I should note, for clarity's sake, that at the onset Logical Drive 3 ("DATA") was similarly "Degraded", but after the Rebuild it was marked "Optimal".
No, a drive that did not fail has a bad block. If there is a bad block on the replacement disk, it is automatically repaired when the disk is synced up. If you have a bad block on one of the other disks, then the controller doesn't know what to put on the replacement drive.
You have to repair the degraded LUN by writing zeros to the bad area. Run the repair there.
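To illustrate why a bad block on a surviving disk stalls the rebuild, here is a minimal sketch. It assumes the logical drive is RAID 5 (the thread does not state the RAID level), and the helper names `xor_blocks` and `rebuild_block` are hypothetical, not part of any Adaptec or HP tool. The missing member's block is the XOR of the blocks at the same stripe position on every other member, so one unreadable input leaves nothing to compute from.

```python
from functools import reduce
from typing import Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (how RAID 5 parity is formed)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving: Sequence[Optional[bytes]]) -> bytes:
    """Reconstruct the replaced member's block for one stripe.

    `surviving` holds the blocks read from the other members (data and
    parity). None marks a block that the disk could not read (a bad block).
    """
    if any(block is None for block in surviving):
        # With an input missing, the XOR has no defined result, so the
        # controller cannot know what to write to the replacement drive.
        raise IOError("unreadable block on a surviving disk; stripe cannot be rebuilt")
    return xor_blocks(surviving)
```

Writing zeros over the bad area gives the controller a readable, if wrong, input again, which is why the repair lets the rebuild finish at the cost of whatever data lived in that stripe.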
ASKER
I neglected to mention that Port 0 (Physical Disk) is still displaying "Rebuilding" as its status in Storage Manager. However, there is clearly no activity occurring: when a rebuild is in process, the Logical Drives display (i) % Completion and (ii) an Animated Disk Graphic. Further, this display has been static for several days, despite the "Rebuild failed" entry in the Event Log.
ASKER
dlethe:
Got it!
That makes good sense.
I notice that on the Windows (GUI) version of Storage Manager the "Verify with fix" is not selectable, i.e. "greyed out". Am I correct to intuit that we can use the Storage Manager BIOS Utility to "Verify with fix" on the LUN, or is the correct process to "Verify with fix" each of the other three Physical Disks, Port(s) 1-3?
ASKER
We ran the Storage Manager BIOS Utility (Disk utility) "Verify". Of the four Physical Disks, Port 1 returned "Verification failed". The other three Disks all verified. There was no "Fix" done for Port 1, just the error message. How can the "Fix" be accomplished? This does seem to be at the root of the array problem.
ASKER
Update: "Verify with fix" is "greyed out" on the Storage Manager GUI. This is the case even when logged on directly to the Server with a local rather than a Domain account.
It takes 10-60 seconds for disks to deal with just ONE bad block. You have billions of blocks. It's working. You need to face facts, however: your online disk is under great stress now and could die at any moment. If it was MY data and was worth $1000+, then I would strongly consider a data recovery firm. If that is a little steep, make sure you take a full backup NOW, kill every process you can and let the system just crank away. I've seen such things take days. Don't rock the boat.
No matter what, you DO have some data corruption / loss. Be prepared for backup software to crap out on you, and for the files you do get to have some corruption. You can't trust any of your data.
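As a rough worked example of how the 10-60 seconds per bad block adds up, here is a minimal sketch. The function name `delay_range_hours` and the bad-block counts are hypothetical, since the thread never says how many bad blocks the disk has, and the estimate covers only the retry delay, not the normal copy time.

```python
from typing import Tuple


def delay_range_hours(bad_blocks: int,
                      low_seconds: float = 10.0,
                      high_seconds: float = 60.0) -> Tuple[float, float]:
    """Best-case and worst-case hours spent on bad-block retries alone.

    The 10 s and 60 s defaults come from the per-block figure quoted above;
    `bad_blocks` is a guess that the caller has to supply.
    """
    seconds_per_hour = 3600.0
    return (bad_blocks * low_seconds / seconds_per_hour,
            bad_blocks * high_seconds / seconds_per_hour)


if __name__ == "__main__":
    for count in (360, 5000):
        low, high = delay_range_hours(count)
        print(f"{count} bad blocks: {low:.1f} to {high:.1f} hours of retries")
```

Under these assumptions, 360 bad blocks add between 1 and 6 hours, while 5,000 bad blocks at 60 seconds each come to roughly 83 hours, which is how a rebuild ends up taking days.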
ASKER
OK, so what you are telling me is that, despite the fact that seemingly nothing is happening, the "Rebuilding" status on Port 0 is a reliable indication that an actual rebuild process is underway (even though the Logical Disk is not showing the usual "rebuild animation")? And what of the fact that the BIOS Utility "Verify" returned "Verification failed" on Port 1, without fixing that issue?
Our issue is not data protection/recovery; we have multiple tiers of back-up that predate this issue, with no exposure to critical data loss. The primary issue is that the Storage Server is compromised and out of production for the duration.
Finally, and not to be cynical, but... how many days? Here is the array: Port 0 (465 GB), Port 1 (149 GB), Port 2 (149 GB), and Port 3 (232 GB). FYI, we've been dumping the old Maxtor 160 GB drives as they have failed.
ASKER CERTIFIED SOLUTION
ASKER
dlethe,
Again, thank you for the insight. The system in question is an HP ProLiant ML 110 Storage Server (ancient). Interestingly enough, as we are backed up, the case for investing substantial "brain power" into bringing this bit of "low-end" tin back online is "negligible". Initially we implemented this server because it was "cheap" and had all the functionality we needed. Over the course of its humble but highly productive and long-lived life cycle, we learned that the following capabilities are indispensable for any new storage server: (1) 1U-2U Form Factor (avoids back sprains), (2) Hot Swappable (avoids dirty fingernails and wasted time), (3) On-Line Spare (avoids all manner of problems), (4) Within OEM "Life Cycle" Support (including Extended Contract). I can see that the better part of valor here is the ol' "get your checkbook out" play, as reestablishing this server in production would be a Pyrrhic victory, with no benefit to our IT infrastructure.