rejeeth
asked on
Process hung - Futex wait
One of our application processes went into a hung state for a few hours and we had to terminate it. Though this was a one-off issue, we would like to know the cause.
OS: RHEL 4.6
Kernel: 2.6.9-67.0.1.ELsmp
~]$ strace -p 10366
Process 10366 attached - interrupt to quit
[ Process PID=10366 runs in 32 bit mode. ]
clock_gettime(CLOCK_REALTIME, {1285087816, 337203000}) = 0
futex(0xa5051f4, FUTEX_WAIT, 1, {1604, 133451000}
------------------------------------------------------------
Name: app
State: T (stopped)
SleepAVG: 88%
Tgid: 10366
Pid: 10366
PPid: 8407
TracerPid: 0
Uid: 17247 17247 17247 17247
Gid: 17247 17247 17247 17247
FDSize: 64
Groups: 17247 17248 30266 30267 172248
VmSize: 743096 kB
VmLck: 0 kB
VmRSS: 400300 kB
VmData: 417344 kB
VmStk: 140 kB
VmExe: 20210 kB
VmLib: 375306 kB
StaBrk: 0947f000 kB
Brk: 1e7c3000 kB
StaStk: ffff4f20 kB
Threads: 6
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000091020
SigCgt: 00000001c18024cf
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
------------------------------------------------------------
Thread 6 (Thread 3912346528 (LWP 10367)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x0062fbf1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2 0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3 0xf7b357d3 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events_i () from <app path>/libRCE-5.6.so
#4 0xf7b358c9 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events () from <app path>/libRCE-5.6.so
#5 0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#6 0xf7de7305 in ResultCollectFunc ()
#7 0x00000000 in ?? ()
Thread 5 (Thread 3826920352 (LWP 10378)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x005f9456 in __nanosleep_nocancel () from /lib/tls/libc.so.6
#2 0x006304ba in usleep () from /lib/tls/libc.so.6
#3 0x0855dee6 in timerNS::Timer::threadEnrty ()
#4 0x0855e8db in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, timerNS::Timer>, boost::_bi::list1<boost::reference_wrapper<timerNS::Timer> > >, void>::invoke ()
#5 0xf775d749 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
#6 0xf775cc8d in boost::thread_group::join_all ()
#7 0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#8 0x006371ae in clone () from /lib/tls/libc.so.6
Thread 4 (Thread 3816430496 (LWP 10679)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x0062fbf1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2 0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3 0xf7bde8a7 in RCE_TP_Reactor::get_event_for_dispatching ()
#4 0xf7bdf5e5 in RCE_TP_Reactor::dispatch_i ()
#5 0xf7bdf6fb in RCE_TP_Reactor::handle_events ()
#6 0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#7 0xf7e1b178 in RceTransport::process ()
#8 0x00000000 in ?? ()
Thread 3 (Thread 3805940640 (LWP 10683)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2 0xf5d2b386 in MLA2::TaskExecutor::threadFunction ()
#3 0xf5d2cb42 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::TaskExecutor>, boost::_bi::list1<boost::_bi::value<MLA2::TaskExecutor*> > > >::run () from <MLAlibspath>/libmla2rt.so
#4 0xf5d87b79 in thread_proxy ()
#5 0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#6 0x006371ae in clone () from /lib/tls/libc.so.6
Thread 2 (Thread 3795450784 (LWP 11536)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2 0xf5d17218 in MLA2::ObserverNotificationManager::getNextMessage ()
#3 0xf5d17fe5 in MLA2::ObserverNotificationManager::threadFunction ()
#4 0xf5d1b382 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::ObserverNotificationManager>, boost::_bi::list1<boost::_bi::value<MLA2::ObserverNotificationManager*> > > >::run ()
#5 0xf5d87b79 in thread_proxy ()
#6 0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#7 0x006371ae in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 4130949344 (LWP 10366)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x006a3f7c in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#2 0xf7eef62e in RCE_Condition<RCE_Thread_Mutex>::wait ()
#3 0x0a5051f0 in ?? ()
#4 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#5 0xffff22c8 in ?? ()
#6 0xf65c230e in std::string::assign ()
#7 0x1c0d9c30 in ?? ()
#8 0x4c98e88c in ?? ()
#9 0x1c0d9c30 in ?? ()
#10 0xffff22f8 in ?? ()
#11 0x0a5051f0 in ?? ()
#12 0xf7e0d220 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#13 0xf7e0d210 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#14 0xf7e0d1c0 in guard variable for MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#15 0xf7eef6ee in RCE_Condition<RCE_Thread_Mutex>::wait ()
#16 0x0a5051f0 in ?? ()
#17 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#18 0xffff2348 in ?? ()
#19 0xf7df6020 in MMClientProxy::waitWithId ()
#20 0x1ce00488 in ?? ()
#21 0x1e35bf00 in ?? ()
#22 0xf7e0d1f0 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#23 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#24 0x0a0973d4 in ?? ()
#25 0xffff266c in ?? ()
#26 0xffff3080 in ?? ()
#27 0xf65c4714 in std::string::_S_construct<char const*> ()
------------------------------------------------------------
It's a userspace mutex (futex), used by pthreads to synchronize activity among threads. Threads 1, 2, and 3 are waiting on it. Most likely, one thread has acquired the mutex and not released it.
ASKER
This process runs thousands of times a day without issues; only in this particular case did it get stuck. Would it be possible to explain the behavior now? I had seen a similar issue reported in Bugzilla, but it was claimed to have been fixed in the kernel version where our failure occurred.
It's tough to say exactly where it's hung, since it could be an application problem. Thread 5 is in a short-time sleep state, likely a loop waiting for work. Threads 4 and 6 are waiting on multiple events. And threads 1, 2, and 3 are waiting on the futex.
It's easy to get into a deadlock with threads waiting on resources. Thread 1 holds resource A and is waiting for B; thread 2 holds B and is waiting for A. The process status may reflect a deadlock situation, but note that the kernel only detects such deadlocks for certain mutex types (error-checking or priority-inheritance mutexes, where it returns EDEADLK); ordinary futex waits just block.
So unfortunately, it's tough to tell much more from a process snapshot, except that there's nothing obviously wrong. Are there event logs from the process indicating what's happening?
ASKER
What the process does is generate a large number of requests (say 1000) and submit them to a server, from which it expects a computed result in return for each of the 1000 requests. Once it receives all 1000 results, the process is supposed to aggregate and publish them.
From the application logs, it is not clear whether it received all 1000 results, but the server that sends the results says it returned all 1000.
ASKER
Thanks hfraser for the suggestion. Will try to implement as you've suggested.
We are also following up with RH because a fix for a similar issue was added to the version we use.