Process hung - Futex wait

Posted on 2010-09-22
Last Modified: 2012-05-10
One of our application processes went into a hung state for a few hours and we had to terminate it. Though this was a one-off issue, we would like to know the cause.
OS: RHEL 4.6
Kernel: 2.6.9-67.0.1.ELsmp

~]$ strace -p 10366
Process 10366 attached - interrupt to quit
[ Process PID=10366 runs in 32 bit mode. ]
clock_gettime(CLOCK_REALTIME, {1285087816, 337203000}) = 0
futex(0xa5051f4, FUTEX_WAIT, 1, {1604, 133451000}
Name:   app
State:  T (stopped)
SleepAVG:       88%
Tgid:   10366
Pid:    10366
PPid:   8407
TracerPid:      0
Uid:    17247   17247   17247   17247
Gid:    17247   17247   17247   17247
FDSize: 64
Groups: 17247 17248 30266 30267 172248
VmSize:   743096 kB
VmLck:         0 kB
VmRSS:    400300 kB
VmData:   417344 kB
VmStk:       140 kB
VmExe:     20210 kB
VmLib:    375306 kB
StaBrk: 0947f000 kB
Brk:    1e7c3000 kB
StaStk: ffff4f20 kB
Threads:        6
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000091020
SigCgt: 00000001c18024cf
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Thread 6 (Thread 3912346528 (LWP 10367)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0062fbf1 in ___newselect_nocancel () from /lib/tls/
#2  0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3  0xf7b357d3 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events_i () from <app path>/
#4  0xf7b358c9 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::handle_events () from <app path>/
#5  0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#6  0xf7de7305 in ResultCollectFunc ()
#7  0x00000000 in ?? ()
Thread 5 (Thread 3826920352 (LWP 10378)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x005f9456 in __nanosleep_nocancel () from /lib/tls/
#2  0x006304ba in usleep () from /lib/tls/
#3  0x0855dee6 in timerNS::Timer::threadEnrty ()
#4  0x0855e8db in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, timerNS::Timer>, boost::_bi::list1<boost::reference_wrapper<timerNS::Timer> > >, void>::invoke ()
#5  0xf775d749 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
#6  0xf775cc8d in boost::thread_group::join_all ()
#7  0x006a13cc in start_thread () from /lib/tls/
#8  0x006371ae in clone () from /lib/tls/
Thread 4 (Thread 3816430496 (LWP 10679)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0062fbf1 in ___newselect_nocancel () from /lib/tls/
#2  0xf7b38cb0 in RCE_Select_Reactor_T<RCE_Reactor_Token_T<RCE_Token> >::wait_for_multiple_events ()
#3  0xf7bde8a7 in RCE_TP_Reactor::get_event_for_dispatching ()
#4  0xf7bdf5e5 in RCE_TP_Reactor::dispatch_i ()
#5  0xf7bdf6fb in RCE_TP_Reactor::handle_events ()
#6  0xf7bb6f8f in RCE_Reactor::run_reactor_event_loop ()
#7  0xf7e1b178 in RceTransport::process ()
#8  0x00000000 in ?? ()
Thread 3 (Thread 3805940640 (LWP 10683)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0xf5d2b386 in MLA2::TaskExecutor::threadFunction ()
#3  0xf5d2cb42 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::TaskExecutor>, boost::_bi::list1<boost::_bi::value<MLA2::TaskExecutor*> > > >::run () from <MLAlibspath>/
#4  0xf5d87b79 in thread_proxy ()
#5  0x006a13cc in start_thread () from /lib/tls/
#6  0x006371ae in clone () from /lib/tls/
Thread 2 (Thread 3795450784 (LWP 11536)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3cf6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0xf5d17218 in MLA2::ObserverNotificationManager::getNextMessage ()
#3  0xf5d17fe5 in MLA2::ObserverNotificationManager::threadFunction ()
#4  0xf5d1b382 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, MLA2::ObserverNotificationManager>, boost::_bi::list1<boost::_bi::value<MLA2::ObserverNotificationManager*> > > >::run ()
#5  0xf5d87b79 in thread_proxy ()
#6  0x006a13cc in start_thread () from /lib/tls/
#7  0x006371ae in clone () from /lib/tls/
Thread 1 (Thread 4130949344 (LWP 10366)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x006a3f7c in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#2  0xf7eef62e in RCE_Condition<RCE_Thread_Mutex>::wait ()
#3  0x0a5051f0 in ?? ()
#4  0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#5  0xffff22c8 in ?? ()
#6  0xf65c230e in std::string::assign ()
#7  0x1c0d9c30 in ?? ()
#8  0x4c98e88c in ?? ()
#9  0x1c0d9c30 in ?? ()
#10 0xffff22f8 in ?? ()
#11 0x0a5051f0 in ?? ()
#12 0xf7e0d220 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#13 0xf7e0d210 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#14 0xf7e0d1c0 in guard variable for MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#15 0xf7eef6ee in RCE_Condition<RCE_Thread_Mutex>::wait ()
#16 0x0a5051f0 in ?? ()
#17 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#18 0xffff2348 in ?? ()
#19 0xf7df6020 in MMClientProxy::waitWithId ()
#20 0x1ce00488 in ?? ()
#21 0x1e35bf00 in ?? ()
#22 0xf7e0d1f0 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#23 0xf7e0d274 in MMClientProxy::instance(char const*, std::basic_ofstream<char, std::char_traits<char> > const*)::the_instance ()
#24 0x0a0973d4 in ?? ()
#25 0xffff266c in ?? ()
#26 0xffff3080 in ?? ()
#27 0xf65c4714 in std::string::_S_construct<char const*> ()

Question by:rejeeth
Expert Comment

ID: 33748260
It's a userspace mutex (futex = "fast userspace mutex"), used by pthreads to synchronize activity among threads. Threads 1, 2 and 3 are in a wait state on it. Most likely one thread has acquired the mutex and never released it.

Author Comment

ID: 33750771
This process runs thousands of times a day without issues; only in this particular case did it get stuck. Is it impossible to explain the behavior from this snapshot? I had seen a similar issue reported in Bugzilla, but the fix was claimed to be in the kernel version on which we saw the failure.

Expert Comment

ID: 33753794
It's tough to say exactly where it's hung, since it could be an application problem. Thread 5 is in a short sleep, likely a polling loop waiting for work. Threads 4 and 6 are waiting on multiple events. And threads 1, 2 and 3 are waiting on the futex.

It's easy to get into a deadlock with threads waiting on resources: thread 1 holds resource A and is waiting for B, while thread 2 holds B and is waiting for A. Note, though, that default pthread mutexes are not deadlock-checked; with them the threads simply block forever, which is consistent with what you see. Only error-checking mutexes (PTHREAD_MUTEX_ERRORCHECK) return EDEADLK to a locker that would deadlock on itself.

So unfortunately, it's tough to tell much more from a process snapshot, except that there's nothing obviously wrong. Are there event logs from the process indicating what's happening?
Author Comment

ID: 33754247
What the process does is generate a large number of requests (say 1000) and submit them to a server, from which it expects a computed result in return for each of the 1000 requests. Once it receives all 1000 results, the process is supposed to aggregate and publish them.

From the application logs it is not clear whether it received all 1000 results, but the server which sends the results says it returned all 1000.

Accepted Solution

hfraser earned 500 total points
ID: 33757935
I'd suggest three things:
1. Log when a message is sent and received (I'm assuming it's in the threads that are in the wait_for_multiple_events calls) with timestamps to line up with the sender process.
2. Make sure the code waiting on the futex is checking for deadlocks.
3. Put some log messages in the working threads that do the processing, again to check the results as they appear.

If there's a counter that waits on a number of messages, like 1000, and they don't all appear, perhaps the process is simply waiting forever for dropped messages that never arrive.

Author Closing Comment

ID: 33757982
Thanks hfraser for the suggestions. We will try to implement them.

We are also following up with Red Hat, because a fix for a similar issue was supposedly included in the kernel version we use.
