3ware raid - degraded raid 5

I have a RAID 5 unit in a degraded state. The current state of the array is:

server:~# tw_cli info c0

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - 64K 465.641 OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 233.76 GB 490234752 WD-WCANY3473512
p1 ECC-ERROR u0 233.76 GB 490234752 WD-WCANY3522205
p2 DEGRADED u0 233.76 GB 490234752 WD-WCANY3473475
p3 NOT-PRESENT - - - -

After using the commands:

tw_cli maint remove c0 p2
tw_cli maint rescan c0
tw_cli maint rebuild c0 u0 p2

The rebuild starts but never finishes, and it fails leaving the unit in this state:

server:~# tw_cli info c0 u0

Unit UnitType Status %RCmpl %V/I/M Port Stripe Size(GB)
------------------------------------------------------------------------
u0 RAID-5 DEGRADED* - - - 64K 465.641
u0-0 DISK DEGRADED - - p2 - 232.82
u0-1 DISK OK - - p0 - 232.82
u0-2 DISK WARNING - - p1 - 232.82

Any idea how I can fix this problem and restore my RAID functionality?
Thank you

ampranti

ASKER

Does "degraded" mean that the drive is faulty?
The drive seems to work: the rebuild starts but fails later.

Callandor

If it fails at any point, that indicates to me that the drive may have a problem with it, though it may be intermittent. Trying a new drive will save you a lot of time.

ampranti

ASKER

In /var/log/messages, when the rebuild stops, I get these errors:

May 3 17:00:56 skilla kernel: [94981.224368] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=1, unit=0.
May 3 17:00:56 skilla kernel: [94981.224637] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x002D): Source drive error occurred:unit=0, port=1.
May 3 17:00:56 skilla kernel: [94981.224899] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
May 3 17:00:56 skilla kernel: [94981.225160] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=2.
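To see at a glance which port each event refers to, the AEN lines above can be picked apart with a small script. This is only a rough sketch: the regular expression is my assumption based on the four lines pasted here, not a documented 3w-9xxx log format, and `parse_aen` is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the log lines in this post:
#   3w-9xxx: scsi0: AEN: ERROR (0x04:0x0026): <message>:<key=value, ...>.
AEN_RE = re.compile(
    r"3w-9xxx: scsi\d+: AEN: (?P<severity>\w+) "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): "
    r"(?P<message>[^:]+):(?P<fields>.*?)\.?\s*$"
)

def parse_aen(line):
    """Return a dict describing one AEN line, or None if it does not match."""
    match = AEN_RE.search(line)
    if match is None:
        return None
    event = {
        "severity": match.group("severity"),
        "code": match.group("code"),
        "message": match.group("message").strip(),
    }
    # Trailing fields look like "port=1, unit=0"; keep the numeric ones.
    for pair in match.group("fields").split(","):
        key, _, value = pair.strip().partition("=")
        if key and value.isdigit():
            event[key] = int(value)
    return event

sample = (
    "May 3 17:00:56 skilla kernel: [94981.224637] 3w-9xxx: scsi0: AEN: "
    "ERROR (0x04:0x002D): Source drive error occurred:unit=0, port=1."
)
event = parse_aen(sample)
print(event["message"], "on port", event["port"])
```

Run over the four lines, this shows that both the ECC error and the source drive error point at port 1, while port 2 only appears in the "Degraded unit" event.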

According to this site:
https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskPrbTw
A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure.

So I have to replace the disk.

ampranti

ASKER

skilla:~# tw_cli info c0 u0

Unit UnitType Status %RCmpl %V/I/M Port Stripe Size(GB)
------------------------------------------------------------------------
u0 RAID-5 DEGRADED* - - - 64K 465.641
u0-0 DISK DEGRADED - - p2 - 232.82
u0-1 DISK OK - - p0 - 232.82
u0-2 DISK WARNING - - p1 - 232.82

skilla:~# tw_cli info c0

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - 64K 465.641 OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 233.76 GB 490234752 WD-WCANY3473512
p1 ECC-ERROR u0 233.76 GB 490234752 WD-WCANY3522205
p2 DEGRADED u0 233.76 GB 490234752 WD-WCANY3473475
p3 NOT-PRESENT - - - -

Which disk do I have to replace, p1 or p2?

Callandor

The one with the ECC error - p1.
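If you want to script that check, something like the following would list the ports that are not reporting OK. It is a minimal sketch that assumes the port table looks like the `tw_cli info c0` output pasted in this thread; `suspect_ports` is a hypothetical helper, not part of tw_cli.

```python
import re

def suspect_ports(tw_cli_output):
    """Return (port, status) pairs for ports that are present but not OK."""
    suspects = []
    for line in tw_cli_output.splitlines():
        fields = line.split()
        # Port rows start with p<number>, e.g. "p1 ECC-ERROR u0 233.76 GB ..."
        if len(fields) >= 2 and re.fullmatch(r"p\d+", fields[0]):
            status = fields[1]
            if status not in ("OK", "NOT-PRESENT"):
                suspects.append((fields[0], status))
    return suspects

output = """\
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 233.76 GB 490234752 WD-WCANY3473512
p1 ECC-ERROR u0 233.76 GB 490234752 WD-WCANY3522205
p2 DEGRADED u0 233.76 GB 490234752 WD-WCANY3473475
p3 NOT-PRESENT - - - -
"""
print(suspect_ports(output))
```

Both p1 and p2 show up in the result, so the status column alone does not settle it. The ECC-ERROR on p1, together with the "Source drive error" in your log, is what points at p1 as the disk with the actual read problem.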

ampranti

ASKER

After removing p2, disk p1 changed state to "OK":

tw_cli info c0 u0

Unit UnitType Status %RCmpl %V/I/M Port Stripe Size(GB)
------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - - 64K 465.641
u0-0 DISK DEGRADED - - - - 232.82
u0-1 DISK OK - - p0 - 232.82
u0-2 DISK OK - - p1 - 232.82

I hope the data will be regenerated after replacing the disk.

ampranti

ASKER

Service checked disk "p2" and found that it was OK.
I put disk "p2" back, and now all disks report OK:

# tw_cli info c0

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - 64K 465.641 OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 233.76 GB 490234752 WD-WCANY3473512
p1 OK u0 233.76 GB 490234752 WD-WCANY3522205
p2 OK - 233.76 GB 490234752 WD-WCANY3473475
p3 NOT-PRESENT - - - -

However, if I remove disk p1 (the one with the ECC-ERROR), the "/" partition is remounted read-only:
/dev/sda5 on / type ext3 (rw,errors=remount-ro)
Of course, all services then malfunction.

Can I somehow check whether, if I replace disk "p1", I will be able to regenerate the data for p1?

Callandor

If you are sure p0 and p2 are in good working order, replacing p1 should work. However, you've had a series of problems that involved more than one drive, so I'm not sure everything will turn out ok. You DO have backups, don't you?
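To illustrate why both remaining drives matter, here is a toy sketch of RAID 5 style parity. It is not how the 3ware firmware is implemented, and the block contents are made up; it only shows the XOR idea for a single stripe across three disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: two data blocks plus their parity (made-up contents).
block_p0 = b"data-on-p0"
block_p1 = b"data-on-p1"
block_p2 = xor_blocks([block_p0, block_p1])  # parity for this stripe

# p1 is replaced: its block is recomputed from the two surviving members.
rebuilt_p1 = xor_blocks([block_p0, block_p2])
print(rebuilt_p1 == block_p1)
```

The rebuild has to read every surviving member for every stripe. If one of them returns a read error partway through, that stripe cannot be recomputed, which matches the "Source drive error" followed by "Rebuild failed" in your log.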

ampranti

ASKER

I can still read the disk and have a recent backup of everything.

If I take a backup using Clonezilla (with the system offline, imaging the hard disk), then replace both bad disks and restore the system from the image, should I be OK?

Thanks

Callandor

Yes - if that doesn't work, that might mean the controller is no good. Either way, a backup is how you recover from these situations.