kill -9 won't kill multithreaded process on DEC Alphaserver 1200 dual cpu

wadedl
We have a multithreaded mail server application, PMDF, running
on a dual-processor DEC AlphaServer 1200 running Compaq
Tru64 UNIX 5.1A. The PMDF dispatcher process, which is
multithreaded and responsible for dispatching SMTP and POP3
mail transport agents, hangs occasionally on an (otherwise normal)
restart.

When the PMDF dispatcher process hangs, the process will not
die with a kill signal, i.e., sending kill -9 doesn't cause the
process to die.

We would like to know if anyone has seen this or can suggest
why a UNIX process would not die when sent SIGKILL.

Commented:
Take a look at the parent process ID (PPID) and see what it is.  On some Unices, when a child process's parent dies, the child attaches itself to "init" (PID 1), and once that happens the offending child process cannot be killed, which means it is time to reboot the system.  If the offending process has been adopted by init, you need to determine why the spawning parent died and left the child process an orphan.

Did you try to kill the (child) threads first?

Commented:
Typically, a process cannot be killed when at least one of the following applies:
1. An OS bug (hard to believe).
2. The OS is still doing something while stopping the process
(for example, dumping a big core file or marking swap pages as free).
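
Both items in the list are kernel-side causes, and that is no accident: a process cannot catch, block, or ignore SIGKILL, so the application itself cannot be what refuses to die. A minimal sketch in Python, assuming a POSIX system; the child script and the name `sigkill_demo` are made up for illustration and have nothing to do with PMDF:

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a stubborn server process.
CHILD = r"""
import signal, time
signal.signal(signal.SIGTERM, signal.SIG_IGN)       # ignoring SIGTERM is allowed
try:
    signal.signal(signal.SIGKILL, signal.SIG_IGN)   # the kernel rejects this
    print("accepted", flush=True)
except OSError:
    print("refused", flush=True)
time.sleep(60)
"""


def sigkill_demo():
    """Return (answer, survived_sigterm, returncode) for the child above."""
    child = subprocess.Popen([sys.executable, "-c", CHILD],
                             stdout=subprocess.PIPE, text=True)
    # Reading the line also guarantees the child has installed SIG_IGN.
    answer = child.stdout.readline().strip()

    child.send_signal(signal.SIGTERM)        # plain `kill <pid>`
    try:
        child.wait(timeout=0.5)
        survived_sigterm = False
    except subprocess.TimeoutExpired:
        survived_sigterm = True

    child.send_signal(signal.SIGKILL)        # `kill -9 <pid>`
    child.wait()
    child.stdout.close()
    return answer, survived_sigterm, child.returncode


if __name__ == "__main__":
    print(sigkill_demo())
```

A negative return code from `subprocess` means the child was terminated by that signal number. If a real process stays around after this kind of kill, the delay is inside the kernel, not in a signal handler.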

The situation described by sppalser is very strange,
and it does not explain why the process is still impossible to kill (by that reasoning, it would also be impossible to kill any daemon process, since daemons typically have a PPID of 1).
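
The point about daemons can be checked directly. The sketch below, assuming a POSIX system with `fork`, deliberately orphans a process, waits until it has been re-parented, and then sends it SIGKILL. The function name `orphan_then_kill` and the pipe-based liveness check are illustrative choices, not anything taken from Tru64 or PMDF:

```python
import os
import select
import signal
import time


def orphan_then_kill(timeout=5.0):
    """Orphan a process, SIGKILL it, and return (adopter_pid, died)."""
    info_r, info_w = os.pipe()    # the orphan reports "pid ppid" here
    alive_r, alive_w = os.pipe()  # reaches EOF once the orphan is gone

    middle = os.fork()
    if middle == 0:
        try:
            middle_pid = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait until the intermediate parent has
                # exited and the system has re-parented us.
                os.close(info_r)
                os.close(alive_r)
                while os.getppid() == middle_pid:
                    time.sleep(0.01)
                os.write(info_w, f"{os.getpid()} {os.getppid()}".encode())
                os.close(info_w)
                time.sleep(60)    # stand-in for a long-running daemon
        finally:
            os._exit(0)           # neither child may return into the caller

    os.close(info_w)
    os.close(alive_w)
    os.waitpid(middle, 0)         # the intermediate parent exits at once
    orphan_pid, adopter_pid = map(int, os.read(info_r, 64).split())
    os.close(info_r)

    os.kill(orphan_pid, signal.SIGKILL)
    # The orphan holds the last write end of the pipe, so the read end
    # signals EOF as soon as the kernel tears the process down.
    ready, _, _ = select.select([alive_r], [], [], timeout)
    died = bool(ready) and os.read(alive_r, 1) == b""
    os.close(alive_r)
    return adopter_pid, died


if __name__ == "__main__":
    adopter, died = orphan_then_kill()
    print(f"orphan adopted by PID {adopter}; died after SIGKILL: {died}")
```

The adopter is PID 1 on a classic UNIX system; some newer systems hand orphans to a different reaper process, which is why the sketch reports the PID instead of assuming 1.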

Try using "top" to find out what is happening on the CPU,
"vmstat" to check disk activity, and "lsof" on that specific process to find out which resources it holds.

Also, when debugging threaded programs on Tru64, I recommend first setting the default behaviour for SIGSEGV/SIGBUS/SIGFPE to disable stack unwinding (in C++)
and thread cancellation; you will then get a normal core file that can tell you where the signal was *really* raised.

Commented:
Some programs, especially mail servers, have watchdog processes.  A watchdog will restart or protect its processes, so you cannot kill them directly; you have to kill the watchdog process first.  You could also try finding the PPID and sending that process a hangup with kill -HUP {PID#}.
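
One way to tell a watchdog restart from a process that truly ignores the kill is to compare PIDs before and after. The following is only a sketch of the restart pattern, assuming a POSIX system; `WORKER` and `watchdog_demo` are hypothetical names, and PMDF's own supervision logic is not shown here:

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a supervised worker process.
WORKER = [sys.executable, "-c", "import time; time.sleep(60)"]


def watchdog_demo():
    """Kill a supervised worker once and return (old_pid, new_pid)."""
    worker = subprocess.Popen(WORKER)
    old_pid = worker.pid

    worker.send_signal(signal.SIGKILL)     # what `kill -9 <pid>` does
    worker.wait()                          # the watchdog notices the exit ...
    assert worker.returncode == -signal.SIGKILL

    worker = subprocess.Popen(WORKER)      # ... and starts a replacement
    new_pid = worker.pid

    worker.kill()                          # clean up the replacement
    worker.wait()
    return old_pid, new_pid


if __name__ == "__main__":
    old, new = watchdog_demo()
    print(f"worker {old} was killed; watchdog started worker {new}")
```

If the process name is still listed after kill -9 but under a different PID, the kill worked and something restarted the process. If the PID is unchanged, a watchdog is not the explanation.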

Commented:
When a kill -9 (sent with sufficient rights) has no effect and gives no error message, it means the process is inside a system call that does not terminate: a system call is considered atomic in a UNIX system and cannot be interrupted, so your program receives the SIGKILL only when it returns from the system call.
Normally a system call cannot run forever; if it does, there is a bug in the system.
I have already seen this problem, and it was fixed only by a patch in a later release.
Use sar (the system activity reporter) to try to get more information (perhaps...).
AFAIK the only system calls that are not interrupted by kill -9 are disk I/O and some network calls (like NFS, which is disk I/O again, somehow).
You can identify these processes by the D flag in ps' output.
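
Picking such processes out of a long listing can be scripted. The sketch below assumes ps output with PID, PPID, state, and command columns in that order. The state letter is a parameter because it differs between systems (D is what Linux and the BSDs print), and the sample listing with its process names is invented for illustration:

```python
def find_by_state(ps_output, flag="D"):
    """Return (pid, ppid, command) for rows whose state starts with `flag`."""
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header and any line that is not a full process row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, ppid, state, command = fields
        if state.startswith(flag):
            matches.append((int(pid), int(ppid), command))
    return matches


# Invented sample listing; PIDs and names are hypothetical.
SAMPLE = """\
  PID  PPID S COMMAND
    1     0 S init
 4321     1 D dispatcher
 4400  4321 S smtp_worker
"""

if __name__ == "__main__":
    for pid, ppid, command in find_by_state(SAMPLE):
        print(pid, ppid, command)
```

On Linux, text in this shape can be produced with `ps -e -o pid,ppid,state,comm`. Whether the ps on Tru64 accepts the same format keywords and uses the same state letter is an assumption to check against its reference page.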
No comment has been added lately, so it's time to clean up this TA.
I will leave the following recommendation for this question in the Cleanup topic area:

Accept: ahoffmann {http:#7477345}

Please leave any comments here within the next four days.
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

jmcg
EE Cleanup Volunteer
