Asked by ampranti
3ware raid - degraded raid 5
I have a RAID 5 array in a degraded state. The current state of the array is:
server:~# tw_cli info c0
Unit  UnitType  Status    %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------
u0    RAID-5    DEGRADED  -       -       64K     465.641   OFF    OFF

Port  Status       Unit  Size       Blocks     Serial
--------------------------------------------------------------
p0    OK           u0    233.76 GB  490234752  WD-WCANY3473512
p1    ECC-ERROR    u0    233.76 GB  490234752  WD-WCANY3522205
p2    DEGRADED     u0    233.76 GB  490234752  WD-WCANY3473475
p3    NOT-PRESENT  -     -          -          -
After using the commands:
tw_cli maint remove c0 p2
tw_cli maint rescan c0
tw_cli maint rebuild c0 u0 p2
The rebuild starts but never finishes; it stops and leaves the unit in this state:
server:~# tw_cli info c0 u0
Unit  UnitType  Status     %RCmpl  %V/I/M  Port  Stripe  Size(GB)
-----------------------------------------------------------------
u0    RAID-5    DEGRADED*  -       -       -     64K     465.641
u0-0  DISK      DEGRADED   -       -       p2    -       232.82
u0-1  DISK      OK         -       -       p0    -       232.82
u0-2  DISK      WARNING    -       -       p1    -       232.82
Any idea how I can fix this problem and restore my RAID functionality?
Thank you
If it fails at any point, that indicates to me that the drive may have a problem with it, though it may be intermittent. Trying a new drive will save you a lot of time.
ASKER
In /var/log/messages, when the rebuild stops, I get these errors:
May 3 17:00:56 skilla kernel: [94981.224368] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=1, unit=0.
May 3 17:00:56 skilla kernel: [94981.224637] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x002D): Source drive error occurred:unit=0, port=1.
May 3 17:00:56 skilla kernel: [94981.224899] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
May 3 17:00:56 skilla kernel: [94981.225160] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=2.
According to this site:
https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskPrbTw
A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure.
So I have to replace the disk.
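To make it easier to see which port each error belongs to, the AEN lines can be grouped by port. The following is a minimal Python sketch, not a 3ware tool: the function names `parse_aen` and `messages_by_port` are made up for illustration, and the regular expression only assumes the line layout seen in the four log lines quoted in this thread.

```python
import re
from collections import defaultdict

# The four kernel log lines quoted in this thread, one event per line.
LOG_LINES = [
    "May 3 17:00:56 skilla kernel: [94981.224368] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=1, unit=0.",
    "May 3 17:00:56 skilla kernel: [94981.224637] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x002D): Source drive error occurred:unit=0, port=1.",
    "May 3 17:00:56 skilla kernel: [94981.224899] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.",
    "May 3 17:00:56 skilla kernel: [94981.225160] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=2.",
]

# Assumed layout: "AEN: <severity> (<code>): <message>:<key=value, ...>."
AEN_RE = re.compile(
    r"AEN: (?P<severity>\w+) \((?P<code>[0-9A-Fa-fx:]+)\): "
    r"(?P<message>[^:]+):(?P<fields>.*?)\.?$"
)


def parse_aen(line):
    """Return a dict describing one AEN line, or None if it does not match."""
    match = AEN_RE.search(line)
    if match is None:
        return None
    event = {"code": match.group("code"), "message": match.group("message")}
    for pair in match.group("fields").split(","):
        key, _, value = pair.strip().partition("=")
        event[key] = int(value)
    return event


def messages_by_port(lines):
    """Group AEN messages by the port number they mention."""
    by_port = defaultdict(list)
    for line in lines:
        event = parse_aen(line)
        if event and "port" in event:
            by_port[event["port"]].append(event["message"])
    return dict(by_port)


if __name__ == "__main__":
    for port, messages in sorted(messages_by_port(LOG_LINES).items()):
        print(f"port {port}: {', '.join(messages)}")
```

Grouped this way, both the ECC error and the source drive error point at port 1, while port 2 only appears as the degraded member.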
ASKER
skilla:~# tw_cli info c0 u0
Unit  UnitType  Status     %RCmpl  %V/I/M  Port  Stripe  Size(GB)
-----------------------------------------------------------------
u0    RAID-5    DEGRADED*  -       -       -     64K     465.641
u0-0  DISK      DEGRADED   -       -       p2    -       232.82
u0-1  DISK      OK         -       -       p0    -       232.82
u0-2  DISK      WARNING    -       -       p1    -       232.82
skilla:~# tw_cli info c0
Unit  UnitType  Status    %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------
u0    RAID-5    DEGRADED  -       -       64K     465.641   OFF    OFF

Port  Status       Unit  Size       Blocks     Serial
--------------------------------------------------------------
p0    OK           u0    233.76 GB  490234752  WD-WCANY3473512
p1    ECC-ERROR    u0    233.76 GB  490234752  WD-WCANY3522205
p2    DEGRADED     u0    233.76 GB  490234752  WD-WCANY3473475
p3    NOT-PRESENT  -     -          -          -
Which disk do I have to replace: p1 or p2?
The one with the ECC error - p1.
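As a rough aid for reading the port table, here is a small Python sketch that lists the ports whose status is not OK. It is only an illustration: `port_statuses` and `ports_needing_attention` are hypothetical helper names, the sample text is the port table pasted earlier in this thread, and the column layout of `tw_cli` output on other firmware or CLI versions is assumed, not verified.

```python
import re

# Port table as pasted earlier in this thread (separator line omitted).
PORT_TABLE = """\
Port Status Unit Size Blocks Serial
p0 OK u0 233.76 GB 490234752 WD-WCANY3473512
p1 ECC-ERROR u0 233.76 GB 490234752 WD-WCANY3522205
p2 DEGRADED u0 233.76 GB 490234752 WD-WCANY3473475
p3 NOT-PRESENT - - - -
"""

# A port row starts with "p<number>" followed by the status column.
PORT_ROW = re.compile(r"^(p\d+)\s+(\S+)")


def port_statuses(text):
    """Map each port name to the status string shown in the table."""
    statuses = {}
    for line in text.splitlines():
        match = PORT_ROW.match(line.strip())
        if match:
            statuses[match.group(1)] = match.group(2)
    return statuses


def ports_needing_attention(text, ignore=("OK", "NOT-PRESENT")):
    """Return the ports whose status is neither OK nor an empty slot."""
    return {
        port: status
        for port, status in port_statuses(text).items()
        if status not in ignore
    }


if __name__ == "__main__":
    for port, status in ports_needing_attention(PORT_TABLE).items():
        print(f"{port}: {status}")
```

The script only reports status strings; deciding which disk to replace still takes the kind of judgment given in the answer above, where the ECC error on p1 is what keeps the rebuild from completing.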
ASKER
After removing p2, disk p1 changed state to "OK":
tw_cli info c0 u0
Unit  UnitType  Status    %RCmpl  %V/I/M  Port  Stripe  Size(GB)
----------------------------------------------------------------
u0    RAID-5    DEGRADED  -       -       -     64K     465.641
u0-0  DISK      DEGRADED  -       -       -     -       232.82
u0-1  DISK      OK        -       -       p0    -       232.82
u0-2  DISK      OK        -       -       p1    -       232.82
I hope the data will be regenerated after replacing the disk...
ASKER
Service checked disk "p2" and found that it was ok.
I put back disk "p2" and now all disks are OK
# tw_cli info c0
Unit  UnitType  Status    %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------
u0    RAID-5    DEGRADED  -       -       64K     465.641   OFF    OFF

Port  Status       Unit  Size       Blocks     Serial
--------------------------------------------------------------
p0    OK           u0    233.76 GB  490234752  WD-WCANY3473512
p1    OK           u0    233.76 GB  490234752  WD-WCANY3522205
p2    OK           -     233.76 GB  490234752  WD-WCANY3473475
p3    NOT-PRESENT  -     -          -          -
However, if I remove disk p1 (the one with the ECC-ERROR), the "/" partition is remounted read-only:
/dev/sda5 on / type ext3 (rw,errors=remount-ro)
Of course, all services then malfunction.
Is there a way to check that, if I replace disk p1, I will be able to regenerate its data?
If you are sure p0 and p2 are in good working order, replacing p1 should work. However, you've had a series of problems that involved more than one drive, so I'm not sure everything will turn out ok. You DO have backups, don't you?
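Why the other two drives have to be readable can be shown with a toy model of RAID 5 parity. The Python sketch below is a simplified illustration, not how the 3ware controller is implemented: it treats one stripe as two data blocks plus their XOR parity, ignores parity rotation and stripe size, and uses made-up names (`xor_blocks`, `rebuild_block`). A block that cannot be read is represented as `None`.

```python
from functools import reduce
from typing import Optional, Sequence


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_block(survivors: Sequence[Optional[bytes]]) -> bytes:
    """Rebuild the missing block of a stripe from all surviving blocks.

    Every surviving block must be readable; None stands for a read error.
    """
    if any(block is None for block in survivors):
        raise OSError("a source block is unreadable; this stripe cannot be rebuilt")
    return reduce(xor_blocks, survivors)


# One stripe of a three-disk array: two data blocks and their parity.
block_a = b"RAID"
block_b = b"five"
parity = xor_blocks(block_a, block_b)

# Losing any one block is fine as long as the other two can be read.
assert rebuild_block([block_a, parity]) == block_b
assert rebuild_block([block_b, parity]) == block_a
assert rebuild_block([block_a, block_b]) == parity

if __name__ == "__main__":
    try:
        # One block is missing and a second one returns a read error.
        rebuild_block([block_a, None])
    except OSError as exc:
        print(f"rebuild failed: {exc}")
```

In this model, a rebuild onto one drive reads every stripe from the remaining two, so a single unreadable sector on a source drive is enough to stop it. That is consistent with the "Source drive error occurred" and "Rebuild failed" messages in the log, and it is why a backup matters before swapping more hardware.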
ASKER
I can still read the disk and have a recent backup of everything.
If I take an image of the hard disk with Clonezilla (with the system offline), then replace both bad disks and restore the system from the image, should I be OK?
Thanks
Yes - if that doesn't work, that might mean the controller is no good. Either way, a backup is how you recover from these situations.
ASKER
The drive seems to work: the rebuild starts but later fails.