Failing in bringing a resource group off-line (AIX-HACMP)

Posted on 2009-02-22
Last Modified: 2013-11-17
Hello,

I have created a resource group (proftpdrg) and assigned it a service IP and an application server.
When I try to take it offline (or move it to the other node), I get the following error on the smit screen:

Before command completion, additional instructions may appear below.
Attempting to bring group proftpdrg offline on node x1.
Waiting for the cluster to process the resource group movement request....
Waiting for the cluster to stabilize....
ERROR: Event processing has failed for the requested resource
group movement.  The cluster is unstable and requires manual intervention
to continue processing.

The cluster then goes into the "UNSTABLE" state (as seen in the clstat output):
            ...
                State: UP               Nodes: 2
                SubState: UNSTABLE
            ...

However, if I do not assign an application server to the resource group, I can easily take it offline or
move it to the other node.

What could be the reason for that?

Thanks
Question by:AnkCBS
11 Comments
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 23704431
Hi,
this problem is almost always caused by clstrmgr not being able to unmount a filesystem belonging to the volume group assigned to your resource group.

You can verify this while the cluster is in its 'unstable' state by issuing 'mount' and checking
whether one of the filesystems in question is still mounted.
If so, try to umount it by hand. You'll probably see a message indicating that the FS is busy. Find out why it is busy! Often there is an error in the stop script of your application, or someone (probably you)
has one of the directories of the FS as their current directory. It's also possible that your application doesn't manage to shut down cleanly and is still 'active'.

In addition to the above, you should check your hacmp.out and clstrmgr.debug files.
They are located in /tmp by default, but another location may have been chosen in your environment. Check this by issuing
/usr/es/sbin/cluster/utilities/cllog -g hacmp.out
or
/usr/es/sbin/cluster/utilities/cllog -g clstrmgr.debug

The second-to-last field contains the actual location of the logfile.

Those logs are somewhat 'hard' to read, but try it! Search e.g. for '! ERROR !'
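
The log search suggested above can be wrapped in a small helper. This is only a sketch: the function name show_log_errors is hypothetical (it is not part of HACMP), and the log path has to be whatever cllog -g reports on your system; /tmp/hacmp.out is merely the default mentioned above.

```shell
#!/bin/sh
# show_log_errors LOGFILE [PATTERN]
# Hypothetical helper (not part of HACMP): print every line of LOGFILE that
# contains PATTERN as a fixed string, prefixed with its line number.
# PATTERN defaults to the '! ERROR !' marker suggested in the post above.
show_log_errors() {
    grep -n -F -e "${2:-! ERROR !}" -- "$1"
}

# Example call (the path is an assumption; use the location cllog reports):
#   show_log_errors /tmp/hacmp.out
```

The line numbers in the output make it easier to jump to the surrounding event in an editor and read what happened just before the error.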

Once you have found the reason for the problem and corrected it, you can force the cluster to continue stopping the RG with
/usr/es/sbin/cluster/utilities/clruncmd '[yournodename]'


Managing HACMP problems is not easy stuff, so please come back here and ask, should there be further questions.


Good luck!

wmp



 

Author Comment

by:AnkCBS
ID: 23704773
Hi Wmp,

First of all, I would like to thank you for your quick reply.

As for the problem, there is no VG (or filesystem) in my RG. There is only one application server and a service IP associated with it. Since the proftpd filesystems on both hosts are mirrored with rsync, there is no need to include a VG in the RG. Could that be the reason for the problem? Isn't it possible to move an RG that includes no VG (or FS)?

I have just given another try and collected the logs of only this try in the log files. I have gone over the log files (they are attached) line by line. There are "error" lines, but I couldn't find any clues from these.

And another problem, when my cluster is in "unstable" state, it is written in the log that:
" Failure occurred while processing Resource Group proftpdrg. Manual intervention required"

I cannot make the cluster "stable" again using smit-cspoc-start/stop hacmp services. The only way is to reboot the host. Is there any way to make it stable without rebooting?

Best Regards...
hacmp-out.log
clstrmgr-debug.log
cluster.log
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 23704812
OK, before I look at the logfiles:

1) a VG (or a FS) is not needed for a resource group.

2) I assume there is something blocking the service IP from being (re)moved (proftpd still running? Don't know if this might be a problem)

3) Try to get your cluster stable again by issuing the above-mentioned

/usr/es/sbin/cluster/utilities/clruncmd '[yournodename]'
or use
'smitty hacmp' -> 'Problem Determination ...' -> 'Recover from HACMP script failures' -> '[yournode]'

--- perhaps more than once ...
 
See you soon

wmp

 

Author Comment

by:AnkCBS
ID: 23704936
Hello again,

1) OK

2) There is no proftpd process on either host after the move operation. I didn't suspect the service IP, since if I exclude the application server from the RG, everything is fine. The service IP can be moved between the participating nodes successfully.

3) It works, great :) Thank you for your close interest.

Regards.
 

Author Comment

by:AnkCBS
ID: 23704990
Additionally,

When I issue the command on both hosts

/usr/es/sbin/cluster/utilities/clruncmd '[node name]'

after moving the RG, when the cluster is in unstable mode, everything is fine. The move operation completes successfully, and proftpd is started on the other node.

 
LVL 68

Expert Comment

by:woolmilkporc
ID: 23705115
Glad to hear.

One very simple, but common thing:
Do you terminate your start/stop scripts for the application server with exit 0?
HACMP tends to propagate a return code != 0 through all its steps and will complain about it at the end.
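
A minimal sketch of why the explicit exit 0 matters: without it, a shell script (or function) returns the status of its last command, whatever that happens to be. The two function names are hypothetical, and false merely stands in for a final command that fails; inside a function, return 0 plays the role that exit 0 plays at the end of a standalone script.

```shell
#!/bin/sh
# Hypothetical stand-ins for the body of an application stop script.

# No explicit status: the caller sees the status of the last command.
stop_without_exit() {
    false          # stands in for a final command that fails
}

# Explicit status: the caller always sees 0 once this line is reached.
stop_with_exit() {
    false          # the same failing command
    return 0       # in a standalone script this would be: exit 0
}

# Checking the status by hand, as one might from the command line:
#   stop_without_exit; echo "rc=$?"
#   stop_with_exit;    echo "rc=$?"
```

Note that this only helps if the script actually reaches its last line; a script that is terminated earlier never gets to report 0.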

There is a tiny hint in hacmp.out pointing to wpar processing. Do you use WLM or PLM? Perhaps you could look there for inconsistencies. I, for one, don't know much about that, because I don't use it.

wmp






 

Author Comment

by:AnkCBS
ID: 23705486
I think we are approaching a possible solution, though it's a more interesting situation :)

First of all, my start/stop scripts looked like this:

--start--
/usr/sbin/proftpd
sleep 1

--stop--
propid=`ps -ef -o %a,%p | grep proftpd | grep -v grep | cut -d "," -f 2`
kill $propid
sleep 1

After your recommendation, I added "exit 0" to the end of both the start and stop scripts on both hosts. However, the result was the same.

However, when I removed "kill $propid" from the stop scripts on both hosts, I could successfully move the RG between nodes.

Interesting?

 

Author Comment

by:AnkCBS
ID: 23705494
By the way, neither WLM nor PLM is in use.

 
LVL 68

Accepted Solution

by:
woolmilkporc earned 1000 total points
ID: 23705715
Interesting indeed, but I can't explain it.
 
 In virtually all of my HACMP stop scripts one or even several 'kills' can be found,
 and I never had such problems (well, at least not because of using kill).
 
 Sorry for the question - is your exit 0 at the right place? Perhaps more than one exit?

If kill is the culprit - perhaps you should try /usr/bin/kill instead of the shell builtin.

I fear I will be running out of ideas soon ...

wmp





 
 

Author Comment

by:AnkCBS
ID: 23708889
Hello again,

Finally, I have found the reason :)

Look at my stop script. It searches for processes whose command line includes "proftpd". That is right, but the cluster process which is responsible for killing the proftpd process also includes the string "proftpd". So the stop script kills itself, which means that it never reaches "exit 0". In other words, you were on the right track :)

Thank you dude, I have fixed it by your directions.
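
One possible way to avoid the self-match described above, given as a minimal sketch; the thread does not show the fix that was actually applied. The helper names select_pids and stop_daemon are hypothetical. The idea is to match on the executable name only (ps "comm") instead of grepping the full argument list, and to skip the script's own PID as an extra safeguard. It assumes a ps that accepts the POSIX-style -o pid= -o comm= format; check this on your AIX level before relying on it.

```shell
#!/bin/sh
# select_pids NAME SELF_PID
# Hypothetical helper: read "PID COMMAND" pairs on stdin and print only the
# PIDs whose command name is exactly NAME, skipping SELF_PID.
select_pids() {
    awk -v name="$1" -v self="$2" '$2 == name && $1 != self { print $1 }'
}

# stop_daemon NAME
# List processes by executable name rather than by full argument list, so a
# script whose path merely contains NAME no longer matches itself.
stop_daemon() {
    pids=$(ps -e -o pid= -o comm= | select_pids "$1" "$$")
    # $pids is left unquoted on purpose: it may hold several PIDs.
    [ -n "$pids" ] && kill $pids
    return 0
}

# A stop script built on this would call:
#   stop_daemon proftpd
# and then end with: exit 0
```

With sample input such as "100 proftpd" and "200 stop_proftpd.sh", only PID 100 is selected, because the second entry's command name is not exactly proftpd.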
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 23709128
Nasty catch!
Glad you've found it, thanks for the points!
Cheers and good luck,
wmp
 
 
 