acubedec15
asked on
IBM AIX disk Numbering Issues
We have a VIO client on POWER7 running the AIX 7 operating system. This is an Oracle server running ASM. As we know, ASM disks are not controlled by ODM, so there is no "PVID", and the ownership on /dev/rhdisk# is oracle:dba.
We recently upgraded this server from AIX 7.1 TL2 to AIX 7.1 TL3. After the server was rebooted, the hdisk numbers of the Oracle disks had changed for some reason. For example, one of the ASM disks, hdisk71, became hdisk19. Because hdisk19 is a new device name, its ownership reverted to root:system, and since it is an ASM disk, ASM failed to start because of the disk permissions.
Can someone explain why this happened? I have read online that the PCM (Path Control Module) might be responsible. If so, how can it affect the disk numbering?
Here are the disk attributes:
csc06tst09:/
root> lsattr -El hdisk19
PCM             PCM/friend/MSYMM_VRAID Path Control Module              True
PR_key_value    none                   Persistant Reserve Key Value     True
algorithm       round_robin            Algorithm                        True
clr_q           yes                    Device CLEARS its Queue on error True
hcheck_interval 60                     Health Check Interval            True
hcheck_mode     nonactive              Health Check Mode                True
location                               Location Label                   True
lun_id          0x218000000000000      Logical Unit Number ID           False
lun_reset_spt   yes                    FC Forced Open LUN               True
max_coalesce    0x10000                Maximum Coalesce Size            True
max_retries     5                      Maximum Number of Retries        True
max_transfer    0x40000                Maximum TRANSFER Size            True
node_name       0x5000097208439c00     FC Node Name                     False
pvid            none                   Physical volume identifier       False
q_err           no                     Use QERR bit                     True
q_type          simple                 Queue TYPE                       True
queue_depth     16                     Queue DEPTH                      True
reserve_policy  no_reserve             Reserve Policy                   True
rw_timeout      40                     READ/WRITE time out value        True
scsi_id         0x1a440                SCSI ID                          False
start_timeout   180                    START UNIT time out value        True
ww_name         0x5000097208439ad8     FC World Wide Name               False
-Thanks.
This can happen if the SCSI ID changed. If you plugged in another SCSI controller, or used a different initiator port, then that would explain the problem.
ASKER
Actually, these are NPIV disks, and the initiator ports on the storage were the same, so I don't know why this happened.
As a matter of fact, I have seen this anomaly on a few servers.
ASKER
Also, the multipathing software we use is IBM MPIO.
ASKER
Or, to put the whole question another way: what determines the hdisk# when the AIX server comes up?
Typically AIX assigns a new hdisk number when it thinks it sees a new disk. In your case, perhaps the devices were originally configured out of sequence (like hdisk1, hdisk2, hdisk7, hdisk4, etc.), which can happen if a bad disk is replaced. After the new install, cfgmgr runs, finds a gap in the sequence (like hdisk3 in my example), and assigns that number to one of the disks from the top end.
Have you had disks replaced, or out of sequence numbering?
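To make the gap-filling idea concrete, here is a toy model in Python. It only illustrates the "lowest free number" behavior described above; it is not cfgmgr's actual code, and the function name `next_free_hdisk` and the device list are made up for the example.

```python
import re

def next_free_hdisk(existing):
    """Return the lowest unused hdisk name, given the names already defined."""
    used = set()
    for name in existing:
        m = re.fullmatch(r"hdisk(\d+)", name)
        if m:
            used.add(int(m.group(1)))
    n = 0
    while n in used:  # walk up from 0 until a free number is found
        n += 1
    return f"hdisk{n}"

# Hypothetical device list with a gap at hdisk3
defined = ["hdisk0", "hdisk1", "hdisk2", "hdisk4", "hdisk7"]
print(next_free_hdisk(defined))  # hdisk3
```

Under this model, a disk that is seen as "new" after an upgrade lands in the first gap rather than keeping its old high number.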
Sequence numbering is arbitrary, but reasonably consistent, based on power-on discovery time.
There are settings in some SAF-TE enclosures that stagger spin-up delays so that all drives don't power up at the same time and blow a circuit breaker or overtax a power supply.
There may be a configurable setting in the BIOS of your HBAs as well.
As such, you should never hardwire device paths into any scripts. You could also have an "intelligent" SCSI enclosure that will attempt to reconcile two devices that have the same target ID. (So make sure that this hasn't been going on from the beginning by checking the jumper settings, if your enclosure doesn't have SCSI ID switches that override the jumpers.)
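One way to avoid hardwired paths is to look a disk up by attributes that do not depend on its hdisk number. The Python sketch below parses saved `lsattr -El` output and finds the device by its ww_name and lun_id. Treating that pair as a stable identity is an assumption about this environment, the function names are hypothetical, and the hdisk20 sample values are invented; only the hdisk19 lines come from the output posted in the question.

```python
def disk_identity(lsattr_text):
    """Return (ww_name, lun_id) parsed from `lsattr -El hdiskN` output text."""
    attrs = {}
    for line in lsattr_text.splitlines():
        parts = line.split()
        # Only these two attributes are read; both always carry a value.
        if len(parts) >= 2 and parts[0] in ("ww_name", "lun_id"):
            attrs[parts[0]] = parts[1]
    return attrs.get("ww_name"), attrs.get("lun_id")

def find_device(identity, outputs_by_device):
    """Return the device name whose lsattr output matches identity, or None."""
    for device, text in outputs_by_device.items():
        if disk_identity(text) == identity:
            return device
    return None

# hdisk19 lines are taken from the question; hdisk20 is a made-up second disk.
outputs = {
    "hdisk20": "lun_id 0x219000000000000 Logical Unit Number ID False\n"
               "ww_name 0x5000097208439ad8 FC World Wide Name False\n",
    "hdisk19": "lun_id 0x218000000000000 Logical Unit Number ID False\n"
               "ww_name 0x5000097208439ad8 FC World Wide Name False\n",
}
wanted = ("0x5000097208439ad8", "0x218000000000000")
print(find_device(wanted, outputs))  # hdisk19
```

A script built this way keeps working if the disk comes back under a different hdisk number, because the lookup never mentions the number.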
ASKER
The server was rebooted and the disk numbers changed. These are SAN disks presented through NPIV.
Note that the numbers have never changed on an ordinary reboot; they do change when we migrate, because of the new kernel.
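Since the renumbering only shows up after a migration, one possible safeguard is to record which LUN sits behind which hdisk before the upgrade and compare afterwards. The Python sketch below assumes such a mapping has been saved by some other means (for example from lsattr output); the function names and the second LUN are hypothetical. It only builds the chown commands as strings and does not run them.

```python
def renamed_disks(before, after):
    """Compare two {lun identity: hdisk name} maps; return (identity, old, new) for renames."""
    changes = []
    for ident in sorted(before):
        old, new = before[ident], after.get(ident)
        if new is not None and new != old:
            changes.append((ident, old, new))
    return changes

def ownership_commands(changes, owner="oracle:dba"):
    """Build (not execute) chown commands for the raw devices of renamed disks."""
    return [f"chown {owner} /dev/r{new}" for _, _, new in changes]

# The first LUN mirrors the hdisk71 -> hdisk19 case from the question;
# the second LUN is invented and keeps its name.
before = {"0x218000000000000": "hdisk71", "0x219000000000000": "hdisk5"}
after = {"0x218000000000000": "hdisk19", "0x219000000000000": "hdisk5"}

for cmd in ownership_commands(renamed_disks(before, after)):
    print(cmd)  # chown oracle:dba /dev/rhdisk19
```

Reviewing the printed commands before running them keeps a wrong mapping from changing ownership on a non-ASM disk.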
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
It's not a full solution, but I can search for more info.