Xetroximyn (United States of America) asked:

AIX 5.1 HD failure?

I was checking errpt and found all of this. It seems to be just from 6/17... does it look like a hard disk went bad, or does it look like a temporary issue that hasn't come back, since the only errpt entries I see are from 6/17?



ibm1:/> errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
D8CF8401   0617184114 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617184114 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617184114 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183914 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183914 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183914 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183914 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183914 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183814 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183814 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183814 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183814 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183814 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183714 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183714 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183714 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183714 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183714 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183714 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183714 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183714 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183714 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183714 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183614 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183614 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183614 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183614 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183614 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183514 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183514 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183514 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183514 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183514 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183514 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183514 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183514 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183514 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183514 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183414 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183414 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183414 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
D8CF8401   0617183414 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183414 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183414 P H hdisk4         DISK OPERATION ERROR
9811EB50   0617183314 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183314 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183314 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183314 T H hdisk4         DISK OPERATION ERROR
613E5F38   0617183314 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183314 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183214 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183214 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183214 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183214 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183214 T H hdisk4         DISK OPERATION ERROR
D8CF8401   0617183214 U H LVDD           SOFTWARE DISK BLOCK RELOCATION ACHIEVED
9811EB50   0617183214 U H LVDD           HARDWARE DISK BLOCK RELOCATION FAILED
21F54B38   0617183214 P H hdisk4         DISK OPERATION ERROR
613E5F38   0617183114 P H LVDD           I/O ERROR DETECTED BY LVM
1581762B   0617183114 T H hdisk4         DISK OPERATION ERROR




LABEL:          LVM_SWREL
IDENTIFIER:     D8CF8401

Date/Time:       Tue Jun 17 18:32:54 EDT
Sequence Number: 9192
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            UNKN
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Description
SOFTWARE DISK BLOCK RELOCATION ACHIEVED

Probable Causes
NONE

Failure Causes
NONE

        Recommended Actions
        REVIEW RECENT HISTORY FOR THIS DEVICE
        MULTIPLE RELOCATIONS INDICATE DEGRADATION OF MEDIA

Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0015 0000 0005
BLOCK NUMBER
              10938208
RELOCATION BLOCK NUMBER
              35548064
SENSE DATA
0001 5051 A529 9FDF 0000 0000 0000 0000 0001 5051 E2E7 D077 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_HWFAIL
IDENTIFIER:     9811EB50

Date/Time:       Tue Jun 17 18:32:54 EDT
Sequence Number: 9191
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            UNKN
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Description
HARDWARE DISK BLOCK RELOCATION FAILED

Probable Causes
DEVICE DOES NOT SUPPORT HW RELOCATION
DASD DEVICE

Failure Causes
DEVICE DOES NOT SUPPORT HW RELOCATION
DASD MEDIA
DISK DRIVE
DISK DRIVE ELECTRONICS

        Recommended Actions
        IF HW RELOCATION NOT SUPPORTED ON DEVICE NO ACTION REQUIRED
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0015 0000 0005
BLOCK NUMBER
              10938208
SENSE DATA
0001 5051 A529 9FDF 0000 0000 0000 0000 0001 5051 E2E7 D077 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          DISK_ERR1
IDENTIFIER:     21F54B38

Date/Time:       Tue Jun 17 18:32:54 EDT
Sequence Number: 9190
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            PERM
Resource Name:   hdisk4
Resource Class:  disk
Resource Type:   scsd
Location:        10-60-00-12,0
VPD:
        Manufacturer................IBM
        Machine Type and Model......ST318305LW
        FRU Number..................09P4429
        ROS Level and ID............43353039
        Serial Number...............0002B610
        EC Level....................H11936
        Part Number.................09P4428
        Device Specific.(Z0)........000003129F00013E
        Device Specific.(Z1)........0211C509
        Device Specific.(Z2)........1000
        Device Specific.(Z3)........02041
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........162870 C

Description
DISK OPERATION ERROR

Probable Causes
MEDIA

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
060C 0000 0700 0000 0000 0000 0000 0000 0102 0000 F000 0300 A6E7 600A 0000 0000
3201 018F 0004 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0004 D0C3 000C 0B40

Diagnostic Analysis
Diagnostic Log sequence number: 14322
Resource tested:        hdisk4
Resource Description:   16 Bit LVD SCSI Disk Drive
Location:               10-60-00-12,0
SRN:                    60B-128
Description:            Error log analysis indicates a hardware failure.
Possible FRUs:
    hdisk4           FRU: 09P4429              10-60-00-12,0
                     16 Bit LVD SCSI Disk Drive

---------------------------------------------------------------------------
LABEL:          LVM_IO_FAIL
IDENTIFIER:     613E5F38

Date/Time:       Tue Jun 17 18:32:24 EDT
Sequence Number: 9189
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            PERM
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Description
I/O ERROR DETECTED BY LVM

Probable Causes
POWER, DRIVE, ADAPTER, OR CABLE FAILURE

        Recommended Actions
        RUN DIAGNOSTICS AGAINST THE FAILING DEVICE

Detail Data
PHYSICAL VOLUME DEVICE MAJOR/MINOR
8000 0015 0000 0005
ERROR CODE AS DEFINED IN sys/errno.h
         111
BLOCK NUMBER
              10938208
LOGICAL VOLUME DEVICE MAJOR/MINOR
8000 002C 0000 0002
PHYSICAL BUFFER TRANSACTION TIME
                     6
SENSE DATA
0000 0000 0000 A6E7 0001 5051 A529 9FDF 0000 0000 0000 0000 0001 5051 E2E7 D077
0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          DISK_ERR4
IDENTIFIER:     1581762B

Date/Time:       Tue Jun 17 18:32:24 EDT
Sequence Number: 9188
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            TEMP
Resource Name:   hdisk4
Resource Class:  disk
Resource Type:   scsd
Location:        10-60-00-12,0
VPD:
        Manufacturer................IBM
        Machine Type and Model......ST318305LW
        FRU Number..................09P4429
        ROS Level and ID............43353039
        Serial Number...............0002B610
        EC Level....................H11936
        Part Number.................09P4428
        Device Specific.(Z0)........000003129F00013E
        Device Specific.(Z1)........0211C509
        Device Specific.(Z2)........1000
        Device Specific.(Z3)........02041
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........162870 C

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A0C 0000 2800 00A6 E760 0000 0800 0000 0102 0000 F000 0300 A6E7 600A 0000 0000
1104 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0004 D0C3 000C 0B40
---------------------------------------------------------------------------
LABEL:          LVM_SWREL
IDENTIFIER:     D8CF8401

Date/Time:       Tue Jun 17 18:32:14 EDT
Sequence Number: 9187
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            UNKN
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Description
SOFTWARE DISK BLOCK RELOCATION ACHIEVED

Probable Causes
NONE

Failure Causes
NONE

        Recommended Actions
        REVIEW RECENT HISTORY FOR THIS DEVICE
        MULTIPLE RELOCATIONS INDICATE DEGRADATION OF MEDIA

Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0015 0000 0005
BLOCK NUMBER
              10623096
RELOCATION BLOCK NUMBER
              35548063
SENSE DATA
0001 5051 A529 9FDF 0000 0000 0000 0000 0001 5051 E2E7 D077 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_HWFAIL
IDENTIFIER:     9811EB50

Date/Time:       Tue Jun 17 18:32:14 EDT
Sequence Number: 9186
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            UNKN
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Description
HARDWARE DISK BLOCK RELOCATION FAILED

Probable Causes
DEVICE DOES NOT SUPPORT HW RELOCATION
DASD DEVICE

Failure Causes
DEVICE DOES NOT SUPPORT HW RELOCATION
DASD MEDIA
DISK DRIVE
DISK DRIVE ELECTRONICS

        Recommended Actions
        IF HW RELOCATION NOT SUPPORTED ON DEVICE NO ACTION REQUIRED
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0015 0000 0005
BLOCK NUMBER
              10623096
SENSE DATA
0001 5051 A529 9FDF 0000 0000 0000 0000 0001 5051 E2E7 D077 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          DISK_ERR1
IDENTIFIER:     21F54B38

Date/Time:       Tue Jun 17 18:32:14 EDT
Sequence Number: 9185
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            PERM
Resource Name:   hdisk4
Resource Class:  disk
Resource Type:   scsd
Location:        10-60-00-12,0
VPD:
        Manufacturer................IBM
        Machine Type and Model......ST318305LW
        FRU Number..................09P4429
        ROS Level and ID............43353039
        Serial Number...............0002B610
        EC Level....................H11936
        Part Number.................09P4428
        Device Specific.(Z0)........000003129F00013E
        Device Specific.(Z1)........0211C509
        Device Specific.(Z2)........1000
        Device Specific.(Z3)........02041
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........162870 C

Description
DISK OPERATION ERROR

Probable Causes
MEDIA

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
060C 0000 0700 0000 0000 0000 0000 0000 0102 0000 F000 0300 A218 780A 0000 0000
3201 018F 0004 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0004 D0C3 000B 7B40

Diagnostic Analysis
Diagnostic Log sequence number: 14320
Resource tested:        hdisk4
Resource Description:   16 Bit LVD SCSI Disk Drive
Location:               10-60-00-12,0
SRN:                    60B-128
Description:            Error log analysis indicates a hardware failure.
Possible FRUs:
    hdisk4           FRU: 09P4429              10-60-00-12,0
                     16 Bit LVD SCSI Disk Drive

---------------------------------------------------------------------------
LABEL:          LVM_IO_FAIL
IDENTIFIER:     613E5F38

Date/Time:       Tue Jun 17 18:31:43 EDT
Sequence Number: 9184
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            PERM
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Description
I/O ERROR DETECTED BY LVM

Probable Causes
POWER, DRIVE, ADAPTER, OR CABLE FAILURE

        Recommended Actions
        RUN DIAGNOSTICS AGAINST THE FAILING DEVICE

Detail Data
PHYSICAL VOLUME DEVICE MAJOR/MINOR
8000 0015 0000 0005
ERROR CODE AS DEFINED IN sys/errno.h
         111
BLOCK NUMBER
              10623088
LOGICAL VOLUME DEVICE MAJOR/MINOR
8000 002C 0000 0002
PHYSICAL BUFFER TRANSACTION TIME
                     4
SENSE DATA
0000 0000 0000 A218 0001 5051 A529 9FDF 0000 0000 0000 0000 0001 5051 E2E7 D077
0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          DISK_ERR4
IDENTIFIER:     1581762B

Date/Time:       Tue Jun 17 18:31:43 EDT
Sequence Number: 9183
Machine Id:      000150514C00
Node Id:         ibm1
Class:           H
Type:            TEMP
Resource Name:   hdisk4
Resource Class:  disk
Resource Type:   scsd
Location:        10-60-00-12,0
VPD:
        Manufacturer................IBM
        Machine Type and Model......ST318305LW
        FRU Number..................09P4429
        ROS Level and ID............43353039
        Serial Number...............0002B610
        EC Level....................H11936
        Part Number.................09P4428
        Device Specific.(Z0)........000003129F00013E
        Device Specific.(Z1)........0211C509
        Device Specific.(Z2)........1000
        Device Specific.(Z3)........02041
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........162870 C

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A0C 0000 2800 00A2 1870 0000 1800 0000 0102 0000 F000 0300 A218 780A 0000 0000
1104 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0004 D0C3 000B 7B40


ASKER CERTIFIED SOLUTION
David (United States of America)
(This solution is only available to Experts Exchange members.)
Xetroximyn (ASKER):

it isn't a RAID array. So I should be good as long as it is replaced before another disk fails, right? Not that I want to wait to replace it, but I just want to confirm my understanding. Thanks!
You mean it is a RAID array? I assume the answer is yes....

The failed relocation means it tried to remap a failed write. We know, then, that this disk has incorrect parity. Let's say you had a RAID1 to make the math easy, and this disk is half of the RAID1. If you lose the other disk you have data loss: the data it could not write. If this were a 3-disk RAID5 and you lost another disk, you would still have data loss.

RAID 1, 10, and 5 do not protect against MULTIPLE data losses on a stripe; they only protect against a single loss per stripe. You already had your data loss on that stripe.
SOLUTION
(This solution is only available to Experts Exchange members.)
hdisk4 (well, hdisk0 through hdisk11) is listed under "List All Physical Volumes in System".

I believe it is a 12-disk RAID5.

We are 'supposed' to be under hardware warranty... though it's expired, so accounting is working that out now...

Though it's worth noting it is ONLY HW support. IBM stopped supporting AIX 5.1 a year or two ago... so it's up to me to handle actually removing the HD from the array before they swap it, and adding the new one back in.... woolmilkporc was kind enough to walk me through it last time... (though there were a lot of odd hurdles that time, which he had to sort out first, so hopefully this time things are in better shape).

Anyway - any guidance on that part would be greatly appreciated! (Or, if it's possible, for me to remove hdisk4 in the meantime and tell the system to make it an 11-disk array for now... which I assume would reduce available space but would at least give redundancy back until we can get the drive replaced.)

FYI - Here is the thread where wmp helped me last time: https://www.experts-exchange.com/questions/27811932/AIX-5-1-remove-drive-from-volume-group-so-it-can-be-replaced.html
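
Before touching anything, a few read-only LVM queries will show how hdisk4 is actually used. This is just a sketch; the usr1vg name is assumed from the old thread:

ibm1:/> lspv -l hdisk4           # which logical volumes have partitions on hdisk4
ibm1:/> lsvg -p usr1vg           # state and free PPs of every disk in the volume group (VG name assumed)
ibm1:/> lsvg -l usr1vg           # LPs vs PPs per LV; PPs being twice the LPs means two copies (mirrored)

These only read state, so they are safe to run on the live system.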
Then call IBM, tell them about the error, and have them send a replacement ASAP. Then set that new disk up as a hot spare if you have a free bay. They'll tell you what to do if you have no free bays. But best practice on a RAID5 with ancient disks is to always do a full backup before a rebuild. You could very well have other errors or parity issues, and the stress of a rebuild alone could fail another disk.
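
If the nightly tapes are the only copy, a fresh backup right before any disk work is cheap insurance. A minimal sketch using standard AIX tools; the tape device name is an assumption:

ibm1:/> mksysb -i /dev/rmt0              # bootable backup of rootvg (-i regenerates /image.data first)
ibm1:/> savevg -i -f /dev/rmt0 usr1vg    # back up the user volume group to tape (VG name assumed)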
No spare bays. I will get a new disk ASAP... but that might be a few days, given that our support contract lapsed.

We do nightly full backups to tape.  

In the meantime, is it possible to remove hdisk4 from the array and tell the system to make it an 11-disk array for now... which I assume would reduce available space, but would at least give redundancy back until we can get the drive replaced?

Thanks!
You could do that, but doing so profoundly increases the risk of data loss. Murphy's law and all that.

Q. What would happen if ANOTHER drive died after you pulled this  disk out?

A. 100% data loss of everything not backed up, and need to rebuild the system and restore from a mksysb or whatever you use.

Plus it will run slower even if you don't have another failure. The system is running on borrowed time, but at least this disk is helping with all read I/O. It just can't be trusted to deal with an unrecoverable write error. Reads are fine.
Ah - so this disk basically hasn't fully failed - it's just starting to go out?

You mentioned that if I remove this and another disk fails, we lose all data. So you are saying that currently, if another disk fails, I don't lose all data? (Because I was thinking I was already down one disk, so a loss of a second means data loss whether I remove this one or not.)

Not questioning you... just trying to understand.  

Thanks!
Yes, it is sick, and the redundancy is keeping you alive. This disk is injecting errors, and if any other disks have errors then you could end up with partial data loss no matter what. You have incorrect data on this drive in places, so this one disk is your margin of error. If you have another drive somewhere that has an unrecoverable write error, then you will end up with some data loss.

Heck, you might already have unrecoverable write errors on other disks. So don't delete any old backups until you do a rebuild, then do an fsck and check all error logs when you rebuild the array with the replacement.

You are only protected against data loss at this instant if no blocks on any other disk are also stale with the same error. Look at all the logs and see if there are any errors on any other disks that were not recovered. If you have any of these same unable-to-write-block errors on other disks, then you WILL have data loss when you replace this drive. That is because this disk has the missing chunk of data that the other disk couldn't write.

Just leave things alone, and replace this drive ASAP. Do a full backup, if you have not done so, before doing a rebuild. Do NOT remove this drive until you have the replacement. You are probably OK and will not have data loss once you replace this drive. The only way to know for sure is to look at all the disks and make sure there were no other errors. You can probably run a manual parity check (sometimes called a data consistency check); that will check for bad data on all disks. If ONLY the one drive (this one) has errors on it, then you are safe to replace this disk without any data loss.
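
A few errpt invocations make that "look at all the logs" check easier. These are standard errpt flags; the date window below is a sketch matching the timestamps in the log (mmddhhmmyy format, so the trailing 14 is the year):

ibm1:/> errpt -N hdisk4                      # summary entries for hdisk4 only
ibm1:/> errpt -a -N hdisk4 | more            # full detail for hdisk4
ibm1:/> errpt -s 0617000014 -e 0618000014    # everything logged on 17 June, any resource
ibm1:/> errpt | grep -v hdisk4               # anything NOT involving hdisk4, e.g. other disks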
A question... I ask this because we don't need a lot of space on this server and the company president wants to save some money... (if it were my choice I'd be having the disk replaced ASAP....)

Currently we have these 12 disks in the array and have 60GB of available space on the logical partition...

Obviously, if we set this server up from scratch we could have an 11-disk RAID array, and it would probably just be roughly 55GB instead of 60GB.

Is there NO WAY to convert the 12-disk array to an 11-disk one? I realize the answer is probably that there is no way, but I just want to make sure...

If it IS possible at all.... I realize there would be risk. I am curious whether this risk would be any greater than that of replacing the disk. Either way redundancy is fully removed and a rebuild must take place; until that rebuild is done, if anything else fails then data gets lost.

Thoughts?  

Thanks!
Oh - one more thing... I am not even entirely sure it is a RAID5... There is a slight possibility it is RAID1.... if that were the case, would that change the answer to my last question?

And if so, how can I check which kind of RAID it is?

thanks!
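
One way to tell hardware RAID from LVM mirroring on AIX, as a rough sketch: a hardware array is normally presented to the OS as a single large hdisk behind a RAID adapter, while LVM mirroring shows up as two copies on the logical volume itself. The usr1 LV name is an assumption here:

ibm1:/> lsdev -Cc adapter            # look for a SCSI RAID adapter vs plain SCSI adapters
ibm1:/> lsdev -Cc disk               # twelve individual scsd hdisks means no hardware array is presented
ibm1:/> lslv usr1 | grep -i copies   # COPIES: 2 means the LV is mirrored by LVM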
Well, what happens if you pull the wrong drive out, or shake it and it resets....

I would suggest going into diag, finding the RAID/disk task, leaving the bad disk permanently lit up, and trying to convince IBM to send a tech to do the replacement (it may fail).

Remember that antistatic strap? Use it. And call somebody to watch over you too.
Confused... I am not talking about removing a drive physically at all... I am asking whether it is possible to logically remove the drive from the array and then tell the array to rebuild with the 11 remaining disks, so that it has redundancy again. (This would of course reduce the available space, as it would be 11-1 instead of 12-1.)

Is something like that possible to do?
No, it won't have your desired effect.   Nice idea, however.

Rebuilding the RAID with fewer disks can be done on some controllers, but you never want to do that on a system under stress anyway. Too much risk.
Long ago, Mylex RAID controllers could do it; now you need software RAID for that (think LVM).
To confirm - is this not possible even if it's RAID1 instead of RAID5? (And how can I check which kind of RAID it is?) Thanks!
Confirmed - not possible. Even if it were, you would end up with a volume that is smaller than what you started with, so you would then have to deal with LVM.
Only if you had LVM with mirrors could you do it. But then again, everywhere the bad disk was present, the other side of the mirror may be corrupt too (from bad data read before the disk failure was detected).
While you are in "diag", schedule all the periodic self-tests so that you get timely predictive notice of hardware failures.
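
For reference, diag can also be driven non-interactively against a single disk; this is a sketch, and on AIX 5.1 the exact menu paths may differ:

ibm1:/> diag -d hdisk4 -c        # unattended diagnostics/error-log analysis against hdisk4

The periodic diagnostics and the disk identify/certify tasks are under the interactive menu (diag, then Task Selection).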
A plan "B", purely to think outside the box: you have what, 12 x 18GB SCSI-2 disks? Do you have some POWER-based systems you can use for virtualization? A single pair of decent SAS drives, mirrored, will blow that away in terms of performance, if you can migrate this system to a virtual machine.
No - we have no other POWER systems and no software support to help with this.

So hopefully we will get this disk replaced soon... assuming so, is there anyone here who can walk me through the commands to remove the disk, add the new one, rebuild, etc., and make sure all is good?

woolmilkporc did this last time in thread https://www.experts-exchange.com/questions/27811932/AIX-5-1-remove-drive-from-volume-group-so-it-can-be-replaced.html

but there were so many side issues that it's hard to know what is needed this time vs. what was just sorting out the mess that time.
Sorry, I'd rather not have that responsibility. I haven't touched an AIX 5.x system in years, so I wouldn't even be able to pull up a man page if something bad happened. I suggest contracting with IBM professional services to take care of you, if this system is that important. A few hundred bucks would be well worth it for the peace of mind.
IBM stopped supporting AIX 5.1 at all a year or two ago.  (And even a year or two before that when they still technically let us have a software support contract for it, the techs were not very helpful... constantly running into issues where the commands they would run didn't have the options they were trying to use....)

WoolMilkPorc was actually more helpful than IBM was...

Anyway - the point is IBM won't even support 5.1 any more... so the fact is I am on my own, and there's no responsibility on you... :-) I just appreciate any input I can get!!

Can you tell me if this looks about right?  

- run "rmlvcopy usr1 1 hdisk4"
- run "reducevg usr1vg hdisk4" 
- run "rmdev -dl hdisk4"
- replace the disk
- run "cfgmgr"
- find out the name of the new disk (probably again hdisk4, but I'll call it hdiskx below)
- run "extendvg usr1vg hdiskx"
- run "mklvcopy usr1 2 hdiskx"
- run "syncvg -v usr1vg"

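
As a sketch of how to verify the result before and after that sequence (reusing the names from the plan above, and assuming loglv00 has no copy on hdisk4; if it does, it needs its own rmlvcopy/mklvcopy as well):

ibm1:/> lspv -l hdisk4           # before starting: confirm exactly which LVs still have copies on hdisk4
ibm1:/> lsvg -l usr1vg           # after syncvg: LV STATE should be back to open/syncd, not open/stale
ibm1:/> lspv hdiskx              # STALE PARTITIONS should read 0 on the new disk
ibm1:/> errpt | more             # confirm no new hdisk errors after the resync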


Looking back at the old thread, it looks like this was going to be the basic process before we ran into issues (I think the issues were one-time issues... and that the steps we did were to "clean up" the mess so it wouldn't be a problem next time).

Below is what wmp said about the situation.  We went with option 2.


the primary partitions of usr1 are widely scattered over 7 disks instead of the 5 disks one should expect. I assume that the volume often had to be enlarged, and that this was done without any regard for partition placement.

Now we have two options:

- We can ignore any strictness-of-placement policy and mirror the logical volume to wherever there is space to accommodate the copies. This means keeping your current chaotic distribution. The drawback is that loss of a single disk can imply loss of the whole usr1 volume.

- We can do a kind of tidying up, by moving certain parts of usr1 to other disks, so that the primary partitions occupy just 5 of the available 10 disks, and we can later create the mirrors on the remaining disks.






ibm1:/> lspv
hdisk0          00015051814ca2c5                    rootvg
hdisk1          000150514226fc44                    usr1vg
hdisk2          000150519965a2bb                    usr1vg
hdisk3          0001505115c7dbce                    usr1vg
hdisk4          000c925d02a3b3b2                    usr1vg
hdisk5          000c925d822f5eda                    usr1vg
hdisk6          00011784d15410dc                    rootvg
hdisk7          000150512ffa6367                    usr1vg
hdisk8          000150519965a4eb                    usr1vg
hdisk9          000150512b7bdb40                    usr1vg
hdisk10         000c925d02a3a8c3                    usr1vg
hdisk11         000c924d87206941                    usr1vg
ibm1:/>
ibm1:/> lsvg -l usr1vg
usr1vg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
loglv00             jfslog     16    32    4    open/syncd    N/A
usr1                jfs        3943  7886  10   open/syncd    /usr1
ibm1:/> 

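Side note on reading that lsvg output: PPs being exactly double the LPs (7886 vs 3943 for usr1, 32 vs 16 for loglv00) is the LVM signature of two copies of every logical partition, i.e. the LVs are mirrored. A sketch of how to see where those copies actually sit, which is what wmp's "scattering" comment was about:

ibm1:/> lslv -l usr1             # per-PV copy counts and distribution for usr1
ibm1:/> lslv usr1                # the COPIES field should show 2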

Look, you can google AIX consultants and find them easily enough. I told you what the error meant and what you have to do, but my AIX is too rusty to take it further, I don't have access to a local machine to futz around with, and I sure as heck don't remember all the best practices if something goes bad.

Clearly this is an important computer to you, so just hire somebody for a phone consult, or even to remote-connect and do it. Then you can sleep knowing you have somebody you paid to do this, who is obligated to see it through and do it right. Remember, we're uncompensated volunteers here. You need to pay somebody to help you do this.
Thanks!   I will see what I can find in the way of consultants.

One last question... I see wmp said "the LVs in usr1vg are mirrored."  (there are 10 disks in that VG)

So would that actually mean that we could have another disk fail, and as long as a disk pair does not fail we would still be OK? Like in a RAID1 with 10 disks, are there generally specific disk pairs mirroring each other?
Cool - I'm glad you understand.

No, that does not necessarily mean another disk can fail. Two disks make up a mirror pair; if you yank one of this pair, and then the other disk in the pair dies, you have 100% data loss.
Thanks - yes - I understand the situation if 2 disks are mirrored... trying to better understand 10 disks in a mirror, though...

Is that like five 2-disk mirrors, so they are paired off? Like 1-2, 3-4, 5-6, 7-8, 9-10. So since 4 is bad, any other disk except 3 could fail without losing data. (Hypothetically.... I am not trying to directly relate this to my situation as to exactly which disk is mirroring which other....)

OR

Are they striped in some way such that it's sort of like a 2-disk mirror, with 5 disks being one "set" and the other 5 being the other "set"? So like 1-5 and 6-10, such that if a single disk fails in both sets you lose data? And it doesn't matter... it could be 1 and 5, or 1 and 9, or 3 and 6, and data would be lost....

In the first case I would think that if another disk fails I have a 1 in 9 chance of it being the wrong disk. In the second case I would think I have about a 1 in 2 chance that the next disk to fail will be in the other "set" and thus cause data loss.
Yes, they are all paired off. Lose a pair and you are screwed. Lose the left half of every pair and you are still online, even though you lost half of your disks.
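
If you want to see the exact layout rather than guess, the logical partition map lists which physical volume holds copy 1 and copy 2 of every partition. A minimal sketch, using the usr1 LV from the lsvg output above:

ibm1:/> lslv -m usr1 | more      # LP map: the PV1 and PV2 columns are the two copies of each logical partition

Any disk that shows up opposite hdisk4 in that map is holding the only good copy of those partitions right now.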