Replacing failed HD in AIX 4.3.3

Posted on 2009-04-04
I have an RS6000 unit with AIX 4.3.3 and two drives, 4.6G and 18.3G.  Some time back it would no longer boot, and when I ran standalone diagnostics from install CD 1, I found that the 18.3G drive had failed (in system configuration it showed '????' instead of an identifier).  I had two problems to solve then: a system that would not boot, and a bad drive.  I decided to purchase a replacement 18.3G drive.  The following is the output after installing the drive and running diagnostics:

Volume group 000766351f330bc4 contains these disks:
hdisk1 4303 10-80-00-4, 0

Volume group 000766351f330bc4 includes the following logical volumes:
hd5 hd6 hd8 hd4 hd2 hd9var hd3 hd1 lv00

When I choose the option to access rootvg before mounting filesystems, I get the following output:
PV Status:      hdisk1      PVACTIVE

varyonvg: Volume group rootvg is varied on.
0516-510 updatevg: Physical volume not found for physical volume identifier 00076635403389c7.
0516-548 synclvodm: Partially successful with updating volume group rootvg.
0516-622 updatelv: Warning, cannot write lv control block data.
0516-782 importvg: Partially successful importing of hdisk1.
Checking the /filesystem.
log redo processing for /dev/rhd4
Syncpt record at 13f028
end of log 13f028
Syncpt record at 13f028
Syncpt address 13f028
Number of log records=1
Number of data blocks=0
Number of nodo blocks=0
/dev/rhd4 (/): ** Unmounted cleanly - Check suppressed
Checking the /usr filesystem
/dev/rhd2 (usr): ** Unmounted cleanly - Check suppressed

If I try to access rootvg and mount filesystems, it goes into an infinite loop trying to load some module.

I've tried a number of things from the 4.3 vintage LVM manual and a 5.3 troubleshooting guide.

From the # prompt after accessing rootvg before mounting filesystems:

Any smit commands, cfgmgr, rmlvcopy, rmdev, reducevg all fail with /usr/bin/ksh not found.

lsdev -Cc disk results in

hdisk0 Available 10-80-00-00, 0 N/A
hdisk1 Available 10-80-00-04, 0 N/A

extendvg is functional, but I haven't run it yet since it appears that some boot files are missing.

The 5.3 troubleshooting guide recommends doing a system restore from an image backup.  The client that uses this machine (and wants it running again) did do data and image backups but the tapes are not labeled clearly.  I do have a tape labeled 'Image backup set #1'.  I inserted this tape, booted from CD #1, and selected restore from backup tape.

After some time I got the message 'Invalid disk found'.  Upon researching this further I concluded that the original disks were a mirrored set.  The posts I found related to this (not on this forum) suggested restoring to two disks.  I have however not been able to find how exactly to configure the system to restore to two disks - the option is to select one or the other, but not both.

When I pull up the option to change disks, I get the following:

hdisk1 - (what looks like a valid identifier)
hdisk0 - 0000000000000000


1. Does anyone know how to restore to two disks?
2. Since the new disk is now hdisk0, will that cause problems?  The LVM guide suggested creating a dummy hd identifier and letting the system renumber the new drive to be higher than the boot drive.
3. Do I need to do anything else to initialize the new drive?

At this point I'm wondering if I should just do a new install.  I appreciate any assistance.
Question by:acort
Accepted Solution

For a moment, ignore the new disk.
Looking at the boot sequence in SMS (System Management Services, the menu you reach with F1),
which is the boot disk, the 18 GB one or the 4 GB one?
I had thought the 4 GB disk was the boot disk and the 18 GB disk held data (historically hdisk0 being the 4 GB and hdisk1 the 18 GB), but your report seems
to show otherwise. Could you explain this?
What does (or did) the system do? What is important to restore: data or the OS? And on which disk is each located?

When you access the volume group before mounting filesystems, have you tried running
fsck -y /
fsck -y /usr
fsck -y /var
fsck -y /tmp
fsck -y /dev/lv00 ?
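
The same checks could be scripted as a loop. This is only a sketch: it assumes a Bourne-compatible shell is usable at the maintenance prompt, and the function name check_all and the FSCK_RUN switch are made up for illustration. By default it only prints the commands; it runs fsck only when FSCK_RUN=1 is set.

```shell
# Sketch only: targets copied from the fsck commands listed above.
TARGETS="/ /usr /var /tmp /dev/lv00"

# check_all is a hypothetical helper name, not an AIX command.
# With FSCK_RUN=1 it executes fsck; otherwise it only prints each command.
check_all() {
    for fs in $TARGETS; do
        if [ "${FSCK_RUN:-0}" = "1" ]; then
            fsck -y "$fs" || echo "fsck reported problems on $fs"
        else
            echo "fsck -y $fs"
        fi
    done
}

check_all
```

Printing first makes it easy to confirm the list before anything touches the disks.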

Supposing you are in the disaster situation where you cannot boot from the old disk at all, I know two ways.

One is to boot from CD 1 and choose the option (I don't remember its exact name) that preserves your data
while rebuilding a new base OS. You lose your network configuration, hostname, and programs under /usr, but you keep the user filesystems
(if they are available and not on a failed disk). If the data matters more than the system configuration, this may be the way. If the system configuration is what matters, a restore from a system backup will usually do.

A way that I follow in some cases is to take out both old disks, perform a new install on the new disk
(first choose a SCSI ID different from the old two), and get to a minimal working system.
Then, once the new system boots and reboots reliably, attach the two old disks and, using smitty vg,
import the VG from both old disks, naming it phtmvg.
You will see errors because /, /usr, /var... already exist; it then imports them into the new VG and puts the LVs
in the ODM, but does not update your /etc/filesystems.
Looking at the output in smitty (if you lose the screen output, look in the /*smit* files), work out
the mapping between names like /dev/fs001 and the old names. For example, the system says: I found an LV hd4 but I cannot import it because it already exists, so it names it /dev/fs003. Rename the LV to ohd4 and create an entry
in /etc/filesystems mounting it under /restore.
Do the same with /dev/ousr on /restore/usr,
and for the others.
Take care to use the second jfslog for these filesystems, the one from the imported VG and not the one belonging to rootvg.
You will see a phantom system under /restore... with all the old filesystems, mounted in
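
The import-and-rename steps above might look roughly like this on the command line instead of through smitty. Treat it as a sketch under assumptions: hdisk2 and /dev/OLDJFSLOG are placeholders (the real disk name and the imported VG's jfslog name have to come from your own output), fs003 is just the example name from this post, run is a hypothetical helper, and it mounts directly rather than adding an /etc/filesystems entry. By default it only prints the commands; set DO_IT=1 to execute them.

```shell
# Sketch only: every value below is a placeholder to replace with what
# smitty / importvg actually reports on your system.
OLD_DISK="hdisk2"          # assumed name of one old disk after re-attaching
NEW_VG="phtmvg"            # VG name used in this post
TMP_LV="fs003"             # name the import gave the old hd4 (example above)
NEW_LV="ohd4"              # name to rename it to
OLD_LOG="/dev/OLDJFSLOG"   # placeholder: jfslog of the imported VG, not rootvg's
MNT="/restore"

# run is a hypothetical helper: print the command, execute only if DO_IT=1.
run() {
    echo "$*"
    if [ "${DO_IT:-0}" = "1" ]; then
        "$@"
    fi
}

recover_old_root() {
    run importvg -y "$NEW_VG" "$OLD_DISK"              # import the old VG under a new name
    run chlv -n "$NEW_LV" "$TMP_LV"                    # rename the auto-named LV
    run mkdir -p "$MNT"
    run mount -o "log=$OLD_LOG" "/dev/$NEW_LV" "$MNT"  # mount using the old VG's log
}

recover_old_root
```

The same pattern would be repeated for the old /usr LV on /restore/usr and for the other filesystems.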



What message do you get when you try to restore the system to hdisk0? Does it allow you to or you just get an error message?

Maybe you need an fsck to be run on all file systems since the shared libraries might not be accessible.

Can you paste the output of the results?

hdisk1 is the boot disk (4G), hdisk0 is the new data disk.  I think the replacement disk was assigned hdisk0 when I installed it.  The 4.3 LVM guide said something about creating a dummy hdisk0 and letting the system assign a new ID (which I assume would become hdisk2).

System restore will allow me to choose one disk or the other, but not both.  A post I read on another forum describing the 'invalid disk found' recommended restoring the image to both disks, but I don't see how you can do that.

I think two things happened: I ran out of space on hdisk1 (the boot disk), looking at the 'cannot write LV control block data' message.  Shortly thereafter, the 18G disk died.

I really need the data which is on a backup tape, so installing a new OS would be acceptable.

I will try the commands you suggested and post the output - thanks.

Hi, a warning:
mksysb saves only the rootvg VG to tape,
and I see from your output that rootvg contains just hdisk1.
hdisk1 contains the standard LVs and lv00, which may contain data, but most of the data
was probably on the other disk, inside another VG.

(If I had installed the system, I would have created a second VG on the big disk, named something like datavg,
and backed up that data periodically, while using mksysb to copy rootvg so the system could be restarted
in case of failure.)
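
A split backup like that could be sketched as below. The tape device /dev/rmt0 and the VG name datavg are assumptions for illustration, and backup_plan is a made-up helper that only prints the two commands, so nothing is written to tape by running it.

```shell
# Sketch only: prints the two backup commands, does not run them.
TAPE="/dev/rmt0"    # assumed tape device name
DATA_VG="datavg"    # example VG name from this post

# backup_plan is a hypothetical helper name.
backup_plan() {
    echo "mksysb -i $TAPE"              # bootable image of rootvg only
    echo "savevg -i -f $TAPE $DATA_VG"  # separate backup of the data VG
}

backup_plan
```

Each command would go to its own tape, since writing both to the same rewinding device would overwrite the first backup.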

If rootvg contains just the 4 GB disk and for some reason it is corrupted, it may be that the 18 GB disk is good
and, not being fully compatible with the RISC machine, does not show its size in SMS.
Are you sure the box still has the 4 GB disk as its default boot device?
Sometimes these 43Ps (I don't know your hardware) forget the boot sequence.



