Solved

Process hung - Futex wait

Posted on 2010-09-22
6
2,142 Views
Last Modified: 2012-05-10
One of our application process went in to hung state for a few hours and we had to terminate it. Though this was a one off issue, would like to know the cause for it.
 OS: RHEL 4.6
Kernel: 2.6.9-67.0.1.ELsmp

~]$ strace -p 10366
Process 10366 attached - interrupt to quit
[ Process PID=10366 runs in 32 bit mode. ]
clock_gettime(CLOCK_REALTIME, {1285087816, 337203000}) = 0
futex(0xa5051f4, FUTEX_WAIT, 1, {1604, 133451000}
--------------------------------------------------------------------------------------------------------------------
Name:   app
State:  T (stopped)
SleepAVG:       88%
Tgid:   10366
Pid:    10366
PPid:   8407
TracerPid:      0
Uid:    17247   17247   17247   17247
Gid:    17247   17247   17247   17247
FDSize: 64
Groups: 17247 17248 30266 30267 172248
VmSize:   743096 kB
VmLck:         0 kB
VmRSS:    400300 kB
VmData:   417344 kB
VmStk:       140 kB
VmExe:     20210 kB
VmLib:    375306 kB
StaBrk: 0947f000 kB
Brk:    1e7c3000 kB
StaStk: ffff4f20 kB
Threads:        6
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000091020
SigCgt: 00000001c18024cf
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
--------------------------------------------------------------------------------------------------------------
Thread 6 (Thread 3912346528 (LWP 10367)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0062fbf1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3  0xf7b357d3 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events_i () from <app path>/libRCE-5.6.so
#4  0xf7b358c9 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events () from <app path>/libRCE-5.6.so
#5  0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#6  0xf7de7305 in ResultCollectFunc ()
#7  0x00000000 in ?? ()
Thread 5 (Thread 3826920352 (LWP 10378)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x005f9456 in __nanosleep_nocancel () from /lib/tls/libc.so.6
#2  0x006304ba in usleep () from /lib/tls/libc.so.6
#3  0x0855dee6 in timerNS::Timer::threadEnrty ()
#4  0x0855e8db in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, timerNS::Timer>, boost::_bi::list1<boost::reference_wrapper<timerNS::Timer> > >, void>::invoke ()
#5  0xf775d749 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
#6  0xf775cc8d in boost::thread_group::join_all ()
#7  0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#8  0x006371ae in clone () from /lib/tls/libc.so.6
Thread 4 (Thread 3816430496 (LWP 10679)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0062fbf1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3  0xf7bde8a7 in RCE_TP_Reactor::get_event_for_dispatching ()
#4  0xf7bdf5e5 in RCE_TP_Reactor::dispatch_i ()
#5  0xf7bdf6fb in RCE_TP_Reactor::handle_events ()
#6  0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#7  0xf7e1b178 in RceTransport::process ()
#8  0x00000000 in ?? ()
Thread 3 (Thread 3805940640 (LWP 10683)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0xf5d2b386 in MLA2::TaskExecutor::threadFunction ()
#3  0xf5d2cb42 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::TaskExecutor>, boost::_bi::list1<boost::_bi::value<MLA2::TaskExecutor*> > > >::run () from <MLAlibspath>/libmla2rt.so
#4  0xf5d87b79 in thread_proxy ()
#5  0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#6  0x006371ae in clone () from /lib/tls/libc.so.6
Thread 2 (Thread 3795450784 (LWP 11536)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0xf5d17218 in MLA2::ObserverNotificationManager::getNextMessage ()
#3  0xf5d17fe5 in MLA2::ObserverNotificationManager::threadFunction ()
#4  0xf5d1b382 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::ObserverNotificationManager>, boost::_bi::list1<boost::_bi::value<MLA2::ObserverNotificationManager*> > > >::run ()
#5  0xf5d87b79 in thread_proxy ()
#6  0x006a13cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x006371ae in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 4130949344 (LWP 10366)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3f7c in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#2  0xf7eef62e in RCE_Condition<RCE_Thread_Mutex>::wait ()
#3  0x0a5051f0 in ?? ()
#4  0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#5  0xffff22c8 in ?? ()
#6  0xf65c230e in std::string::assign ()
#7  0x1c0d9c30 in ?? ()
#8  0x4c98e88c in ?? ()
#9  0x1c0d9c30 in ?? ()
#10 0xffff22f8 in ?? ()
#11 0x0a5051f0 in ?? ()
#12 0xf7e0d220 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#13 0xf7e0d210 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#14 0xf7e0d1c0 in guard variable for MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#15 0xf7eef6ee in RCE_Condition<RCE_Thread_Mutex>::wait ()
#16 0x0a5051f0 in ?? ()
#17 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#18 0xffff2348 in ?? ()
#19 0xf7df6020 in MMClientProxy::waitWithId ()
#20 0x1ce00488 in ?? ()
#21 0x1e35bf00 in ?? ()
#22 0xf7e0d1f0 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#23 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#24 0x0a0973d4 in ?? ()
#25 0xffff266c in ?? ()
#26 0xffff3080 in ?? ()
#27 0xf65c4714 in std::string::_S_construct<char const*> ()
--------------------------------------------------------------------------------------------------------------


0
Comment
Question by:rejeeth
  • 3
  • 3
6 Comments
 
LVL 12

Expert Comment

by:hfraser
ID: 33748260
It's the userspace mutex, used by pthreads to synchronize activity among threads. Threads 1,2 and 3 are in a wait state for the mutex. Undoubtedly, one thread has obtained the mutex and not released it.
0
 

Author Comment

by:rejeeth
ID: 33750771
This process runs thousands of times a day without issues. Only in this particular case it got stuck. Would it be impossible to explain the behavior now? I had seen similar issue reported in bugzilla but was claimed to have been fixed in the kernel version where it failed.
0
 
LVL 12

Expert Comment

by:hfraser
ID: 33753794
It's tough to say exactly where it's hung, since it could be an application problem. Thread 5 is in a short-time sleep state, likely a loop waiting for work. Threads 4 and 6 are waiting on multiple events. And threads 1,2, and 3 are waiting on the futex.

It's easy to get into a deadlock with threads waiting on resources. Thread 1 holds resource A, and is waiting for B. Thread 2 holds B, and is waiting for A. But I believe the process status will reflect if it's in a deadlock situation, and the kernel will detect the deadlock and return the appropriate status to the threads waiting.

So unfortunately, it's tough to tell much more from a process snapshot, except that there's nothing obviously wrong. Are there event logs from the process indicating what's happening?
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 

Author Comment

by:rejeeth
ID: 33754247
What the process does is - generate a large number of requests (say 1000) and submit them to a server from where it expetcs a set of computed results in return for each of the 1000 requests. Once it receives all 1000 results, the process is supposed to do an aggregation and publish the results.

From the application logs, it is not clear whether it received all 1000 results. But the server which send the results says it had returned all 1000 results.
0
 
LVL 12

Accepted Solution

by:
hfraser earned 500 total points
ID: 33757935
I'd suggest three things:
1. Log when a message is sent and received (I'm assuming it's in the threads that are in the wait_for_multiple_events calls) with timestamps to line up with the sender process.
2. Make sure the code waiting on the futex is checking for deadlocks.
3. Put some log messages in the working threads that do the processing, again to check the results as they appear.

If there's a counter that waits on a number of messages, like 1000, and they don't appear, perhaps the process is simply waiting for dropped messages that never arrive.
0
 

Author Closing Comment

by:rejeeth
ID: 33757982
Thanks hfraser for the suggestion. Will try to implement as you've suggested.

We are also following up with RH becase a fix for similar issue was added to the version we use.
0

Featured Post

Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

Join & Write a Comment

Linux users are sometimes dumbfounded by the severe lack of documentation on a topic. Sometimes, the documentation is copious, but other times, you end up with some obscure "it varies depending on your distribution" over and over when searching for …
The purpose of this article is to demonstrate how we can upgrade Python from version 2.7.6 to Python 2.7.10 on the Linux Mint operating system. I am using an Oracle Virtual Box where I have installed Linux Mint operating system version 17.2. Once yo…
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.
The goal of this video is to provide viewers with basic examples to understand opening and reading files in the C programming language.

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now