rejeeth

asked on

Process hung - Futex wait

One of our application processes went into a hung state for a few hours and we had to terminate it. Although this was a one-off issue, we would like to know the cause.
 OS: RHEL 4.6
Kernel: 2.6.9-67.0.1.ELsmp

~]$ strace -p 10366
Process 10366 attached - interrupt to quit
[ Process PID=10366 runs in 32 bit mode. ]
clock_gettime(CLOCK_REALTIME, {1285087816, 337203000}) = 0
futex(0xa5051f4, FUTEX_WAIT, 1, {1604, 133451000}
--------------------------------------------------------------------------------------------------------------------
Name:   app
State:  T (stopped)
SleepAVG:       88%
Tgid:   10366
Pid:    10366
PPid:   8407
TracerPid:      0
Uid:    17247   17247   17247   17247
Gid:    17247   17247   17247   17247
FDSize: 64
Groups: 17247 17248 30266 30267 172248
VmSize:   743096 kB
VmLck:         0 kB
VmRSS:    400300 kB
VmData:   417344 kB
VmStk:       140 kB
VmExe:     20210 kB
VmLib:    375306 kB
StaBrk: 0947f000 kB
Brk:    1e7c3000 kB
StaStk: ffff4f20 kB
Threads:        6
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000091020
SigCgt: 00000001c18024cf
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
--------------------------------------------------------------------------------------------------------------
Thread 6 (Thread 3912346528 (LWP 10367)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0062fbf1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3  0xf7b357d3 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events_i () from <app path>/libRCE-5.6.so
#4  0xf7b358c9 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events () from <app path>/libRCE-5.6.so
#5  0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#6  0xf7de7305 in ResultCollectFunc ()
#7  0x00000000 in ?? ()
Thread 5 (Thread 3826920352 (LWP 10378)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x005f9456 in __nanosleep_nocancel () from /lib/tls/libc.so.6
#2  0x006304ba in usleep () from /lib/tls/libc.so.6
#3  0x0855dee6 in timerNS::Timer::threadEnrty ()
#4  0x0855e8db in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, timerNS::Timer>, boost::_bi::list1<boost::reference_wrapper<timerNS::Timer> > >, void>::invoke ()
#5  0xf775d749 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
#6  0xf775cc8d in boost::thread_group::join_all ()
#7  0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#8  0x006371ae in clone () from /lib/tls/libc.so.6
Thread 4 (Thread 3816430496 (LWP 10679)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0062fbf1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3  0xf7bde8a7 in RCE_TP_Reactor::get_event_for_dispatching ()
#4  0xf7bdf5e5 in RCE_TP_Reactor::dispatch_i ()
#5  0xf7bdf6fb in RCE_TP_Reactor::handle_events ()
#6  0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#7  0xf7e1b178 in RceTransport::process ()
#8  0x00000000 in ?? ()
Thread 3 (Thread 3805940640 (LWP 10683)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0xf5d2b386 in MLA2::TaskExecutor::threadFunction ()
#3  0xf5d2cb42 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::TaskExecutor>, boost::_bi::list1<boost::_bi::value<MLA2::TaskExecutor*> > > >::run () from <MLAlibspath>/libmla2rt.so
#4  0xf5d87b79 in thread_proxy ()
#5  0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#6  0x006371ae in clone () from /lib/tls/libc.so.6
Thread 2 (Thread 3795450784 (LWP 11536)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0xf5d17218 in MLA2::ObserverNotificationManager::getNextMessage ()
#3  0xf5d17fe5 in MLA2::ObserverNotificationManager::threadFunction ()
#4  0xf5d1b382 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::ObserverNotificationManager>, boost::_bi::list1<boost::_bi::value<MLA2::ObserverNotificationManager*> > > >::run ()
#5  0xf5d87b79 in thread_proxy ()
#6  0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x006371ae in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 4130949344 (LWP 10366)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3f7c in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#2  0xf7eef62e in RCE_Condition<RCE_Thread_Mutex>::wait ()
#3  0x0a5051f0 in ?? ()
#4  0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#5  0xffff22c8 in ?? ()
#6  0xf65c230e in std::string::assign ()
#7  0x1c0d9c30 in ?? ()
#8  0x4c98e88c in ?? ()
#9  0x1c0d9c30 in ?? ()
#10 0xffff22f8 in ?? ()
#11 0x0a5051f0 in ?? ()
#12 0xf7e0d220 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#13 0xf7e0d210 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#14 0xf7e0d1c0 in guard variable for MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#15 0xf7eef6ee in RCE_Condition<RCE_Thread_Mutex>::wait ()
#16 0x0a5051f0 in ?? ()
#17 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#18 0xffff2348 in ?? ()
#19 0xf7df6020 in MMClientProxy::waitWithId ()
#20 0x1ce00488 in ?? ()
#21 0x1e35bf00 in ?? ()
#22 0xf7e0d1f0 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#23 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#24 0x0a0973d4 in ?? ()
#25 0xffff266c in ?? ()
#26 0xffff3080 in ?? ()
#27 0xf65c4714 in std::string::_S_construct<char const*> ()
--------------------------------------------------------------------------------------------------------------


Hugh Fraser

It's the userspace mutex, used by pthreads to synchronize activity among threads. Threads 1, 2, and 3 are in a wait state on the futex. Most likely, one thread has obtained the mutex and not released it.
rejeeth

ASKER

This process runs thousands of times a day without issue; only in this particular case did it get stuck. Would it be possible to explain the behavior now? I had seen a similar issue reported in Bugzilla, but it was claimed to have been fixed in the kernel version on which this failure occurred.
Hugh Fraser

It's tough to say exactly where it's hung, since it could be an application problem. Thread 5 is in a short sleep, likely a loop waiting for work. Threads 4 and 6 are waiting on multiple events. And threads 1, 2, and 3 are waiting on the futex.

It's easy to get into a deadlock with threads waiting on resources. Thread 1 holds resource A, and is waiting for B. Thread 2 holds B, and is waiting for A. But I believe the process status will reflect if it's in a deadlock situation, and the kernel will detect the deadlock and return the appropriate status to the threads waiting.

So unfortunately, it's tough to tell much more from a process snapshot, except that there's nothing obviously wrong. Are there event logs from the process indicating what's happening?
rejeeth

ASKER

What the process does is generate a large number of requests (say 1000) and submit them to a server, from which it expects a computed result for each of the 1000 requests. Once it receives all 1000 results, the process is supposed to aggregate and publish them.

From the application logs, it is not clear whether it received all 1000 results, but the server that sends the results says it returned all 1000.
ASKER CERTIFIED SOLUTION
Hugh Fraser

This solution is only available to Experts Exchange members.
rejeeth

ASKER

Thanks hfraser for the suggestion. We will try to implement it as you've suggested.

We are also following up with Red Hat, because a fix for a similar issue was added in the kernel version we use.