ampranti asked:

3ware raid - degraded raid 5

I have a RAID 5 array in a degraded state. The current state of the array is:

server:~# tw_cli info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     465.641   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANY3473512    
p1     ECC-ERROR        u0     233.76 GB   490234752     WD-WCANY3522205    
p2     DEGRADED         u0     233.76 GB   490234752     WD-WCANY3473475    
p3     NOT-PRESENT      -      -           -             -

After using the commands:

tw_cli maint remove c0 p2
tw_cli maint rescan c0
tw_cli maint rebuild c0 u0 p2

The rebuild starts but never finishes; it stops and leaves the unit in this state:

server:~# tw_cli info c0 u0

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-5    DEGRADED*      -       -       -     64K     465.641  
u0-0     DISK      DEGRADED       -       -       p2    -       232.82    
u0-1     DISK      OK             -       -       p0    -       232.82    
u0-2     DISK      WARNING        -       -       p1    -       232.82    

Any idea how I can fix this problem and restore my RAID functionality?
Thank you
Asker-certified solution by Callandor (available to Experts Exchange members only)

ampranti (ASKER):

Does DEGRADED mean that the drive is faulty?
The drive seems to work: the rebuild starts but later fails.
If it fails at any point, that indicates to me that the drive may have a problem with it, though it may be intermittent.  Trying a new drive will save you a lot of time.
In /var/log/messages, when the rebuild stops, I get these errors:

May  3 17:00:56 skilla kernel: [94981.224368] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=1, unit=0.
May  3 17:00:56 skilla kernel: [94981.224637] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x002D): Source drive error occurred:unit=0, port=1.
May  3 17:00:56 skilla kernel: [94981.224899] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
May  3 17:00:56 skilla kernel: [94981.225160] 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=2.
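
To make the sequence in that log easier to read, here is a minimal sketch that boils each AEN line down to its code, message and port/unit details. The function name `aen_summary` and the variable `sample_log` are made up for illustration; the sample text is the four events from the excerpt above with the timestamps trimmed.

```shell
#!/bin/sh
# Reads kernel log text on stdin and prints one line per 3w-9xxx AEN error:
#   code | message | details
aen_summary() {
  grep -o 'AEN: ERROR ([^)]*): [^:]*:[^.]*' |
    sed -E 's/^AEN: ERROR \(([^)]*)\): ([^:]*):(.*)$/\1 | \2 | \3/'
}

# Sample input: the four events from the log excerpt (timestamps trimmed).
sample_log='3w-9xxx: scsi0: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=1, unit=0.
3w-9xxx: scsi0: AEN: ERROR (0x04:0x002D): Source drive error occurred:unit=0, port=1.
3w-9xxx: scsi0: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=2.'

printf '%s\n' "$sample_log" | aen_summary
```

On a live system the same function could be fed with something like `grep 3w-9xxx /var/log/messages | aen_summary`. Read in order, the events suggest the rebuild onto port 2 stopped because port 1, the drive being read from, returned an ECC error.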


According to this site:
https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskPrbTw
A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure.

So I have to replace the disk.
skilla:~# tw_cli info c0 u0    

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-5    DEGRADED*      -       -       -     64K     465.641  
u0-0     DISK      DEGRADED       -       -       p2    -       232.82    
u0-1     DISK      OK             -       -       p0    -       232.82    
u0-2     DISK      WARNING        -       -       p1    -       232.82    

skilla:~# tw_cli info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     465.641   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANY3473512    
p1     ECC-ERROR        u0     233.76 GB   490234752     WD-WCANY3522205    
p2     DEGRADED         u0     233.76 GB   490234752     WD-WCANY3473475    
p3     NOT-PRESENT      -      -           -             -


Which disk do I have to replace, p1 or p2?
The one with the ECC error - p1.
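
For anyone who wants to pull the suspect ports out of the `tw_cli info c0` listing automatically, a rough sketch follows. The function name `suspect_ports` and the variable `sample_info` are made up for illustration, and the sample text is copied from the port table posted earlier in this thread.

```shell
#!/bin/sh
# Reads `tw_cli info c0` output on stdin and prints "port status" for every
# populated port whose status is something other than OK.
suspect_ports() {
  awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" && $2 != "NOT-PRESENT" { print $1, $2 }'
}

# Sample input: the port table as posted in the question.
sample_info='Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANY3473512
p1     ECC-ERROR        u0     233.76 GB   490234752     WD-WCANY3522205
p2     DEGRADED         u0     233.76 GB   490234752     WD-WCANY3473475
p3     NOT-PRESENT      -      -           -             -'

printf '%s\n' "$sample_info" | suspect_ports
```

On the controller itself this would be `tw_cli info c0 | suspect_ports`. It only reports what the controller says about each port; it does not decide which drive to pull.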
After removing p2, the status of p1 within the unit changed to "OK":

tw_cli info c0 u0

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-5    DEGRADED       -       -       -     64K     465.641  
u0-0     DISK      DEGRADED       -       -       -     -       232.82    
u0-1     DISK      OK             -       -       p0    -       232.82    
u0-2     DISK      OK             -       -       p1    -       232.82    

I hope the data will be regenerated after replacing the disk.
Service checked disk "p2" and found that it was OK.
I put disk "p2" back and now all ports report OK:

# tw_cli info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     465.641   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     WD-WCANY3473512    
p1     OK               u0     233.76 GB   490234752     WD-WCANY3522205    
p2     OK               -      233.76 GB   490234752     WD-WCANY3473475    
p3     NOT-PRESENT      -      -           -             -


However, if I remove disk p1 (the one with ECC-ERROR), the "/" partition is remounted read-only:
/dev/sda5 on / type ext3 (rw,errors=remount-ro)
Of course, all services then malfunction.
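
Note that `errors=remount-ro` in that line is the error policy, not the current state; the current state is the `rw` or `ro` flag at the start of the option list. A small sketch for reading that flag from `mount`-style output is below. The function name `root_mount_mode` and the second sample line (the `ro` variant) are made up for illustration.

```shell
#!/bin/sh
# Reads `mount`-style lines on stdin ("DEV on DIR type FS (opts)") and prints
# the rw/ro flag of the root filesystem, ignoring errors=remount-ro.
root_mount_mode() {
  awk '$2 == "on" && $3 == "/" {
         opts = $6
         gsub(/[()]/, "", opts)
         n = split(opts, o, ",")
         for (i = 1; i <= n; i++)
           if (o[i] == "ro" || o[i] == "rw") print o[i]
       }'
}

# The line from this thread, then a hypothetical read-only variant.
echo '/dev/sda5 on / type ext3 (rw,errors=remount-ro)' | root_mount_mode
echo '/dev/sda5 on / type ext3 (ro,errors=remount-ro)' | root_mount_mode
```

On a live system this could be run as `mount | root_mount_mode`. One caveat, stated as an assumption about older setups: `mount` may read /etc/mtab, which can be stale after an emergency remount, so /proc/mounts (which has a different column layout) is probably the more reliable source.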

Is there a way to check beforehand that, if I replace disk "p1", I will be able to regenerate its data?
If you are sure p0 and p2 are in good working order, replacing p1 should work.  However, you've had a series of problems that involved more than one drive, so I'm not sure everything will turn out ok.  You DO have backups, don't you?
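
To illustrate why the other two drives have to be good, here is a toy sketch of RAID 5 parity on a single stripe. The byte values and variable names are made up, and real controllers rotate parity across the drives, so treat this as a picture of the arithmetic only, not of how the 3ware firmware lays out data.

```shell
#!/bin/sh
# Toy RAID 5 stripe over three disks: two data bytes plus one parity byte.
d_p0=$((0x5A))                  # data block on p0 (made-up value)
d_p1=$((0x3C))                  # data block on p1 (made-up value)
parity_p2=$(( d_p0 ^ d_p1 ))    # parity block on p2 is the XOR of the data

# Lose p1 only: its block is recoverable from the two surviving blocks.
rebuilt_p1=$(( d_p0 ^ parity_p2 ))

echo "original p1 block: $d_p1"
echo "rebuilt  p1 block: $rebuilt_p1"
```

The rebuild needs a valid block from every surviving drive. If a second drive cannot be read at the same stripe, there is nothing left to XOR against, which is consistent with a rebuild that stops when the source drive reports an error.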
I can still read the disk and have a recent backup of everything.

If I take a backup with Clonezilla (with the system offline, imaging the hard disk), then replace both bad disks and restore the system from the image, should I be OK?


Thanks
Yes - if that doesn't work, that might mean the controller is no good.  Either way, a backup is how you recover from these situations.