Failure bringing a resource group offline (AIX HACMP)


I have created a resource group (proftpdrg) and assigned it a service IP and an application server.
When I try to bring it offline (or move it to the other node), I get the following error in the smit screen:

Before command completion, additional instructions may appear below.
Attempting to bring group proftpdrg offline on node x1.
Waiting for the cluster to process the resource group movement request....
Waiting for the cluster to stabilize....
ERROR: Event processing has failed for the requested resource
group movement.  The cluster is unstable and requires manual intervention
to continue processing.

It then puts the cluster in the "UNSTABLE" state (as seen in the clstat output):
                State: UP               Nodes: 2
                SubState: UNSTABLE

However, if I do not assign an application server to the resource group, I can bring it offline or
move it to the other node without any problem.

What could be the reason for this?

Interesting, indeed, but not explicable.
 In virtually all of my HACMP stop scripts one or even several 'kills' can be found,
 and I never had such problems (well, at least not because of using kill).
 Sorry for the question - is your exit 0 at the right place? Perhaps more than one exit?

If kill is the culprit - perhaps you should try /usr/bin/kill instead of the shell builtin.

I fear I will be running out of ideas soon ...


This problem is almost always caused by clstrmgr not being able to unmount a filesystem belonging to the volume group assigned to your resource group.

You can verify this while the cluster is in its 'unstable' state by issuing 'mount' and checking
whether one of the filesystems in question is still mounted.
If so, try to umount it by hand. You'll probably see a message indicating that the FS is busy. Find out why it is busy! Often there is an error in the stop script of your application, or someone (probably you)
has one of the directories of the FS as their current directory. It's also possible that your application doesn't shut down cleanly and is still 'active'.

Besides the above, you should check your hacmp.out and clstrmgr.debug files.
By default they are located in /tmp, but another location may have been chosen in your environment. Check this by issuing
/usr/es/sbin/cluster/utilities/cllog -g hacmp.out
/usr/es/sbin/cluster/utilities/cllog -g clstrmgr.debug

The second-to-last field contains the actual location of the logfile.

Those logs are somewhat 'hard' to read, but try it! Search e.g. for '! ERROR !'
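
A minimal sketch of such a search, in case it helps. The function name show_log_errors is made up for illustration, and the log path is passed as an argument because its location depends on your installation (see the cllog commands above). It simply greps for the string ERROR and prints matching lines with their line numbers:

```shell
# Hypothetical helper: list every line containing "ERROR" in a log file,
# prefixed with its line number. The caller supplies the path, since the
# log location differs between installations.
show_log_errors() {
    logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    # grep returns 1 when nothing matches, 0 when at least one line matches
    grep -n 'ERROR' "$logfile"
}
```

Call it as show_log_errors followed by whatever path cllog reported for hacmp.out. From the reported line numbers you can then read the surrounding lines in an editor.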

Once you have found the reason for the problem and corrected it, you can force the cluster to continue stopping the RG with
/usr/es/sbin/cluster/utilities/clruncmd '[yournodename]'

Managing HACMP problems is not easy stuff, so please come back here and ask, should there be further questions.

Good luck!


AnkCBSAuthor Commented:
Hi Wmp,

First of all, I would like to thank you for your quick reply.

As for the problem, there is no VG (or filesystem) in my RG. There is only one application server and a service IP associated with it. Since the filesystems of proftpd on both hosts are mirrored with rsync, there is no need to include a VG in the RG. Could this be the reason for the problem? Isn't it possible to move an RG that includes no VG (or FS)?

I have just tried again and collected the logs of this attempt only. I have gone over the log files (they are attached) line by line. There are "error" lines, but I couldn't find any clues in them.

Another problem: when my cluster is in the "unstable" state, the log says:
" Failure occurred while processing Resource Group proftpdrg. Manual intervention required"

I cannot make the cluster "stable" again using smit-cspoc-start/stop hacmp services. My only option is to reboot the host. Is there any way to make it stable without rebooting?

Best Regards...

OK, before I look at the logfiles:

1) a VG (or a FS) is not needed for a resource group.

2) I assume there is something blocking the service IP from being (re)moved (proftpd still running? Don't know if this might be a problem)

3) Try to get your cluster stable again by issuing the above-mentioned

/usr/es/sbin/cluster/utilities/clruncmd '[yournodename]'
or use
'smitty hacmp' -> 'Problem Determination ...' -> 'Recover from HACMP script failures' -> '[yournode]'

--- perhaps more than once ...
See you soon


AnkCBSAuthor Commented:
Hello again,

1) OK

2) There is no proftpd process on either host after the move operation. I didn't suspect the service IP, since if I exclude the application server from the RG, everything is fine. The service IP can be moved between the participating nodes successfully.

3) It works, great :) Thank you for your close interest.

AnkCBSAuthor Commented:

When I issue the command

/usr/es/sbin/cluster/utilities/clruncmd '[node name]'

on both hosts after moving the RG, while the cluster is in the unstable state, everything is fine. The move operation completes successfully and proftpd is started on the other node.

Glad to hear.

One very simple, but common thing:
Do you terminate your start/stop scripts for the application server with exit 0?
HACMP tends to propagate a return code != 0 through all its steps and will complain about it at the end.
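
To see what return code a script actually hands back, you can run it by hand through a small wrapper. This is only a sketch: check_rc is a made-up helper name, not an HACMP utility, and it just runs whatever command you give it and reports the exit status.

```shell
# Hypothetical helper: run a command and report its exit status.
# Prints "OK rc=0: <command>" or "FAILED rc=<n>: <command>" and returns
# the same status the command returned.
check_rc() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAILED rc=$rc: $*"
    else
        echo "OK rc=0: $*"
    fi
    return "$rc"
}
```

For example, running check_rc with the full path of your stop script as its argument (on a node where stopping the application is acceptable) shows whether the script really ends with status 0. A script that is killed before it reaches its exit 0 will show up here with a non-zero status.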

There is a tiny hint in hacmp.out pointing to wpar processing. Do you use WLM or PLM? Perhaps you could look there for inconsistencies. I, for one, don't know much about that, because I don't use it.


AnkCBSAuthor Commented:
I think we are approaching a possible solution, though it's an even more interesting situation :)

First of all, my start/stop scripts looked like this:

sleep 1

propid=`ps -ef -o %a,%p | grep proftpd | grep -v grep | cut -d "," -f 2`
kill $propid
sleep 1

Following your recommendation, I added "exit 0" to the end of both start/stop scripts on both hosts. However, the result was the same.

However, when I removed "kill $propid" from the stop scripts on both hosts, I could successfully move the RG between nodes.


AnkCBSAuthor Commented:
By the way, neither WLM nor PLM is in use.

AnkCBSAuthor Commented:
Hello again,

Finally, I have found the reason :)

Look at my stop script. It searches for processes whose command line includes "proftpd". That is correct, but the cluster process responsible for killing the proftpd process also has the string "proftpd" in its command line. So the stop script kills itself, which means that it never reaches "exit 0". In other words, you were on the right track :)
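
For anyone hitting the same thing, here is one way to avoid the self-kill, as a sketch rather than a tested HACMP script. It assumes the daemon's command name is exactly "proftpd" and that ps accepts the POSIX -o pid= -o comm= options. The function names select_pids and stop_proftpd are made up for illustration. Instead of grepping the full argument list, it compares the command name exactly and also skips the script's own PID:

```shell
# Read "PID COMMAND" lines on stdin and print the PIDs whose command name
# equals $1 exactly, skipping this shell's own PID ($$). An exact match on
# the command name does not hit a script that merely has "proftpd" in its
# file name or arguments.
select_pids() {
    awk -v name="$1" -v self="$$" '$2 == name && $1 != self { print $1 }'
}

# Assumption: the daemon shows up in ps with the command name "proftpd".
stop_proftpd() {
    pids=$(ps -e -o pid= -o comm= | select_pids proftpd)
    # Only call kill when something matched; a failing kill is ignored
    if [ -n "$pids" ]; then
        kill $pids 2>/dev/null
    fi
    return 0
}
```

The stop script itself would then just call stop_proftpd and finish with an explicit exit 0, so the cluster always gets a zero return code once the script runs to the end.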

Thank you dude, I have fixed it by your directions.
Nasty catch!
Glad you've found it, thanks for the points!
Cheers and good luck,
Question has a verified solution.
