acubedec15

IBM AIX disk Numbering Issues

We have a VIO client on POWER7 running AIX 7. It is an Oracle server running ASM. As we know, ASM disks are not controlled by ODM, so there is no PVID, and the /dev/rhdisk# devices are owned by oracle:dba.

We recently upgraded this server from AIX 7.1 TL2 to AIX 7.1 TL3. After the reboot, the hdisk numbers of the Oracle disks had changed for some reason. For example, one of the ASM disks, hdisk71, became hdisk19. Because hdisk19 was a new device name, its permissions reverted to root:system, and ASM failed to start because of the disk permissions.

Can someone explain why this happened? I have read online that the PCM (Path Control Module) might be responsible. If so, how can it affect the disk numbering?

Here are the disk attributes:
csc06tst09:/
root> lsattr -El hdisk19
PCM             PCM/friend/MSYMM_VRAID Path Control Module              True
PR_key_value    none                   Persistant Reserve Key Value     True
algorithm       round_robin            Algorithm                        True
clr_q           yes                    Device CLEARS its Queue on error True
hcheck_interval 60                     Health Check Interval            True
hcheck_mode     nonactive              Health Check Mode                True
location                               Location Label                   True
lun_id          0x218000000000000      Logical Unit Number ID           False
lun_reset_spt   yes                    FC Forced Open LUN               True
max_coalesce    0x10000                Maximum Coalesce Size            True
max_retries     5                      Maximum Number of Retries        True
max_transfer    0x40000                Maximum TRANSFER Size            True
node_name       0x5000097208439c00     FC Node Name                     False
pvid            none                   Physical volume identifier       False
q_err           no                     Use QERR bit                     True
q_type          simple                 Queue TYPE                       True
queue_depth     16                     Queue DEPTH                      True
reserve_policy  no_reserve             Reserve Policy                   True
rw_timeout      40                     READ/WRITE time out value        True
scsi_id         0x1a440                SCSI ID                          False
start_timeout   180                    START UNIT time out value        True
ww_name         0x5000097208439ad8     FC World Wide Name               False


-Thanks.
David

This can happen if the SCSI ID changed. If you plugged in another SCSI controller, or used a different initiator port, then that would explain the problem.
acubedec15

ASKER

Actually these are NPIV disks, and the initiator ports on the storage were the same, so I don't know why this happened.

As a matter of fact, I have seen this anomaly on a few servers.
Also, the multipathing software we use is IBM MPIO.
The underlying question is: what determines the hdisk number while the AIX server is coming up?
Typically AIX assigns a new hdisk number when it thinks it sees a new disk. In your case, perhaps the devices were originally configured out of sequence (for example hdisk1, hdisk2, hdisk7, hdisk4), which can happen if a bad disk is replaced. After the new install, cfgmgr runs and, finding a gap in the sequence (hdisk3 in my example), assigns that number to a disk from the top end.

Have you had disks replaced, or out of sequence numbering?
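
A minimal sketch of the gap-filling behaviour described in this reply, assuming the lowest unused instance number is the one that gets handed out. `next_free_hdisk` is a hypothetical helper written for illustration, not an AIX command, and it only works on a list of names; it does not query the system.

```shell
# Hypothetical helper: print the lowest unused hdisk name.
# Reads device names on stdin, one per line (on AIX such a list
# could come from something like `lsdev -Cc disk -F name`).
next_free_hdisk() {
    sed -n 's/^hdisk\([0-9][0-9]*\)$/\1/p' | sort -n | awk '
        BEGIN { want = 0 }
        # Walk the sorted numbers; advance past each one that is in use.
        $1 == want { want++ }
        END { print "hdisk" want }
    '
}

# Out-of-sequence example from the reply above: hdisk3 is the gap.
printf '%s\n' hdisk0 hdisk1 hdisk2 hdisk7 hdisk4 | next_free_hdisk
```

If a disk is rediscovered as "new" under this assumption, it would land in the first gap rather than keep its old, higher number.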
Sequence numbering is arbitrary, but reasonably consistent, based on power-on discovery time.

There are settings in some SAF-TE enclosures that stagger spinup delays so all drives don't power up at same time and blow a circuit breaker or overtax a power supply.  

There may be a configurable setting on the BIOS for your HBA adapters as well.

As such, you should never hardwire paths into any scripts. You could also have an "intelligent" SCSI enclosure that attempts to reconcile two devices that have the same target ID. (So make sure this hasn't been going on from the beginning by checking the jumper settings, if your enclosure doesn't have SCSI ID switches that override the jumpers.)
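
One way to avoid hardwiring hdisk names is to key scripts on attributes that belong to the LUN rather than to the device name. The sketch below builds such a key from the `ww_name` and `lun_id` fields that appear in the `lsattr -El` output in the question. It assumes that this pair stays the same for a given LUN across reboots in this setup, which should be verified locally; `disk_key` is a hypothetical helper name.

```shell
# Hypothetical helper: read `lsattr -El hdiskN` output on stdin and
# print "ww_name:lun_id". Exits non-zero if either field is missing.
disk_key() {
    awk '
        $1 == "ww_name" { wwn = $2 }
        $1 == "lun_id"  { lun = $2 }
        END {
            if (wwn != "" && lun != "") print wwn ":" lun
            else exit 1
        }
    '
}

# On AIX this might be used as:  lsattr -El hdisk19 | disk_key
# Demo with two lines copied from the lsattr output in the question:
disk_key <<'EOF'
lun_id          0x218000000000000      Logical Unit Number ID           False
ww_name         0x5000097208439ad8     FC World Wide Name               False
EOF
```

A startup script could then look up the current hdisk name for each known key and apply the oracle:dba ownership to whichever /dev/rhdisk# it maps to, instead of assuming a fixed number.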
The server was rebooted and the disk numbers changed. These are SAN disks presented through NPIV.

Note that these are NPIV disks. The numbers have never changed on an ordinary reboot; they do change when we migrate, because of the new kernel.
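
Since the renumbering shows up after a migration, it may help to record the mapping beforehand and compare it afterwards. This is only a sketch: `renamed_disks` is a hypothetical helper, the snapshot format ("stable key, then hdisk name" per line) is an assumption, and `keyA`/`keyB` are placeholders for whatever stable identifier is used (for example a value derived from `lsattr` attributes).

```shell
# Hypothetical helper: compare two snapshot files whose lines are
# "<stable-key> <hdisk-name>" and print "<old> -> <new>" for every
# key whose hdisk name differs between the two.
renamed_disks() {
    awk '
        FILENAME == ARGV[1] { before[$1] = $2; next }
        ($1 in before) && before[$1] != $2 { print before[$1] " -> " $2 }
    ' "$1" "$2"
}

# Demo with placeholder keys, mirroring the hdisk71 -> hdisk19 case.
tmp=$(mktemp -d)
printf '%s\n' 'keyA hdisk71' 'keyB hdisk5' > "$tmp/before"
printf '%s\n' 'keyA hdisk19' 'keyB hdisk5' > "$tmp/after"
renamed_disks "$tmp/before" "$tmp/after"
rm -rf "$tmp"
```

Each reported pair identifies a /dev/rhdisk# device that needs its oracle:dba ownership restored before ASM starts. AIX 7.1 also provides a `rendev` command for renaming devices; check its man page before relying on it to put the old names back.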
ASKER CERTIFIED SOLUTION
David
It's not a full solution, but I can search for more info.