Failure bringing a resource group offline (AIX HACMP)

Posted on 2009-02-22
Last Modified: 2013-11-17

I have created a resource group (proftpdrg) and assigned it a service IP and an application server.
When I try to take it offline (or move it to the other node), I get the following error in the smit screen:

Before command completion, additional instructions may appear below.
Attempting to bring group proftpdrg offline on node x1.
Waiting for the cluster to process the resource group movement request....
Waiting for the cluster to stabilize....
ERROR: Event processing has failed for the requested resource
group movement.  The cluster is unstable and requires manual intervention
to continue processing.

The cluster is then put into the "UNSTABLE" state (as seen in the clstat output):
                State: UP               Nodes: 2
                SubState: UNSTABLE

However, if I do not assign an application server to the resource group, I can take it offline or
move it to the other node without any problem.

What could be the reason for this?

Question by: AnkCBS
    LVL 68

    Expert Comment

    This problem is almost always caused by clstrmgr not being able to unmount a filesystem belonging to the volume group assigned to your resource group.

    You can verify this, while the cluster is in its 'unstable' state, by issuing 'mount' and checking
    whether one of the filesystems in question is still mounted.
    If yes, try to umount it by hand. You'll probably see a message indicating that the FS is busy. Find out why it is busy! Often there is an error in the stop script of your application, or someone (probably you)
    has one of the directories of the FS as their current directory. It's also possible that your application doesn't manage to come to a regular end and is still 'active'.

    You should, besides the above, check your hacmp.out and clstrmgr.debug files.
    They are located, by default, in /tmp, but it's possible that another location has been chosen in your environment. Check this by issuing
    /usr/es/sbin/cluster/utilities/cllog -g hacmp.out
    /usr/es/sbin/cluster/utilities/cllog -g clstrmgr.debug

    The second-to-last field contains the actual location of the logfile.

    Those logs are somewhat 'hard' to read, but try it! Search e.g. for '! ERROR !'
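
    For example, a quick check sequence could look like this (just a sketch - the mount point /proftpd_data is a placeholder for whatever filesystem belongs to your RG's VG, and the awk part merely extracts the second-to-last field mentioned above):

    # is one of the RG's filesystems still mounted?
    mount | grep proftpd_data

    # if so, which processes (and users) are holding the mount point busy?
    fuser -cu /proftpd_data

    # find out where hacmp.out really lives, then scan it for errors
    LOG=`/usr/es/sbin/cluster/utilities/cllog -g hacmp.out | awk '{print $(NF-1)}'`
    grep -n 'ERROR' $LOG | tail -20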

    Once you have found the reason for the problem and corrected it, you can force the cluster to continue stopping the RG with
    /usr/es/sbin/cluster/utilities/clruncmd '[yournodename]'

    Managing HACMP problems is not easy stuff, so please come back here and ask, should there be further questions.

    Good luck!



    Author Comment

    Hi Wmp,

    First of all, I would like to thank for your quick reply.

    As for the problem, there is no VG (or filesystem) in my RG. There is only one application server and a service IP associated with it. Since the filesystems of proftpd on both hosts are mirrored with rsync, there is no need to include a VG in the RG. Could that be the reason for the problem? Isn't it possible to move an RG that includes no VG (or FS)?

    I have just given it another try and collected only this attempt's output in the log files. I have gone over the log files (they are attached) line by line. There are "error" lines, but I couldn't find any clues in them.

    And another problem: when my cluster is in the "unstable" state, the log says:
    "Failure occurred while processing Resource Group proftpdrg. Manual intervention required"

    I cannot make the cluster "stable" again using smit -> C-SPOC -> start/stop HACMP services. My only option is to reboot the host. Is there any way to make it stable without rebooting?

    Best Regards...
    LVL 68

    Expert Comment

    OK, before I look at the logfiles:

    1) a VG (or a FS) is not needed for a resource group.

    2) I assume there is something blocking the service IP from being (re)moved (proftpd still running? Don't know if this might be a problem)

    3) Try to get your cluster stable again by issuing the above-mentioned

    /usr/es/sbin/cluster/utilities/clruncmd '[yournodename]'
    or use
    'smitty hacmp' -> 'Problem Determination ...' -> 'Recover from HACMP script failures' -> '[yournode]'

    --- perhaps more than once ...
    à bientôt



    Author Comment

    Hello again,

    1) OK

    2) There is no proftpd process on either host after the move operation. I hadn't suspected the service IP, since if I exclude the application server from the RG, everything is fine. The service IP can be moved between the participating nodes successfully.

    3) It works, great :) Thank you for your close interest.


    Author Comment


    When I issue the command on both hosts

    /usr/es/sbin/cluster/utilities/clruncmd '[node name]'

    after moving the RG while the cluster is in the unstable state, everything is fine. The move operation completes successfully and proftpd is started on the other node.

    LVL 68

    Expert Comment

    Glad to hear.

    One very simple, but common thing:
    Do you terminate your start/stop scripts for the application server with exit 0?
    HACMP tends to propagate a return code != 0 through all its steps and will complain about it at the end.
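
    As a minimal skeleton (just a sketch - the stop logic itself is of course application-specific):

    #!/bin/ksh
    # HACMP application server stop script - skeleton

    # ... your application-specific stop commands go here ...

    # always hand a zero return code back to the cluster manager -
    # a non-zero code would be propagated through the event processing
    exit 0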

    There is a tiny hint in hacmp.out pointing to WPAR processing. Do you use WLM or PLM? Perhaps you could look there for inconsistencies. I, for one, don't know much about that, because I don't use it.



    Author Comment

    I think we are getting closer to a possible solution, though it's an even more interesting situation :)

    First of all, my stop script looked like this:

    sleep 1

    # grab the PID(s) of every process whose command line contains "proftpd"
    propid=`ps -ef -o %a,%p | grep proftpd | grep -v grep | cut -d "," -f 2`
    # and stop them
    kill $propid
    sleep 1

    Following your recommendation, I added "exit 0" at the end of both the start and stop scripts on both hosts. However, the result was the same.

    However, when I removed "kill $propid" from the stop scripts on both hosts, I could successfully move the RG between the nodes.



    Author Comment

    By the way, neither WLM nor PLM is in use.

    LVL 68

    Accepted Solution

    Interesting indeed, but hard to explain.
    In virtually all of my HACMP stop scripts one or even several 'kill's can be found,
    and I have never had such problems (well, at least not because of using kill).
    Sorry for the question - is your exit 0 in the right place? Perhaps there is more than one exit?

    If kill is the culprit - perhaps you should try /usr/bin/kill instead of the shell builtin.

    I fear I will be running out of ideas soon ...



    Author Comment

    Hello again,

    Finally, I have found the reason :)

    Look at my stop script. It searches for processes whose command line contains "proftpd". That is correct in itself, but the cluster process responsible for stopping proftpd also contains the string "proftpd". So the stop script kills itself, which means it never reaches "exit 0". In other words, you were on the right track :)
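
    For reference, a corrected stop script could look roughly like this (just a sketch - the essential change is excluding this script's own PID, $$, from the kill; /usr/bin/kill is used as suggested above):

    #!/bin/ksh
    # stop script: kill proftpd, but never this script itself

    # list "command,pid" for every process whose command line mentions proftpd
    propids=`ps -ef -o %a,%p | grep proftpd | grep -v grep | cut -d "," -f 2`

    for pid in $propids
    do
        # $$ is this script's own PID - it matches "proftpd" too, so skip it
        if [ $pid -ne $$ ]; then
            /usr/bin/kill $pid
        fi
    done

    exit 0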

    Thank you dude, I have fixed it following your directions.
    LVL 68

    Expert Comment

    Nasty catch!
    Glad you've found it, thanks for the points!
    Cheers and good luck,
