Getting lwp_park() ETIME Error 62 and process is hung

Hello,
I am running a multi-threaded process on the Solaris 10 platform, compiled using Sun Studio 12. Sometimes the process hangs. I looked at "truss -p <PID>", and it shows strange errors like this:
truss -p 1739
/6:     nanosleep(0xFAC7BBE0, 0xFAC7BBD8) (sleeping...)
/7:     pollsys(0xFAB7BC30, 1, 0x00000000, 0x00000000) (sleeping...)
/9:     lwp_park(0xFA97BD78, 0)         (sleeping...)
/5:     lwp_park(0xFAD7BD80, 0)         (sleeping...)
/11:    lwp_park(0xFA77BD78, 0)         (sleeping...)
/10:    lwp_park(0xFA87BD78, 0)         (sleeping...)
/2:     nanosleep(0xFB07BE58, 0xFB07BE50) (sleeping...)
/3:     nanosleep(0xFAF7BE58, 0xFAF7BE50) (sleeping...)
/4:     lwp_park(0x00000000, 0)         (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/8:     lwp_park(0xFAA7BD78, 0)         (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/5:     lwp_park(0xFAD7BD80, 0)                         Err#62 ETIME
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/8:     lwp_park(0xFAA7BD78, 0)                         Err#62 ETIME
/5:     lwp_park(0xFAD7BD80, 0)         (sleeping...)
/11:    lwp_park(0xFA77BD78, 0)                         Err#62 ETIME
/8:     lwp_park(0xFAA7BD78, 0)         (sleeping...)
/11:    lwp_park(0xFA77BD78, 0)         (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/9:     lwp_park(0xFA97BD78, 0)                         Err#62 ETIME
/10:    lwp_park(0xFA87BD78, 0)                         Err#62 ETIME
/9:     lwp_park(0xFA97BD78, 0)         (sleeping...)
/10:    lwp_park(0xFA87BD78, 0)         (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000)  = 0
/5:     lwp_park(0xFAD7BD80, 0)                         Err#62 ETIME
/1:     pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/8:     lwp_park(0xFAA7BD78, 0)                         Err#62 ETIME
/5:     lwp_park(0xFAD7BD80, 0)         (sleeping...)
/11:    lwp_park(0xFA77BD78, 0)                         Err#62 ETIME

sriramAsked:
 
RowleyCommented:
In conjunction with pstack and dbx, you might also want to look at dtrace. Good starting point which has info and examples that could be pertinent to your particular issue here. Another good reference source here.

Agree with cup - there is not enough info or detail in the truss to tell you exactly what is happening here.
 
cupCommented:
Haven't used Solaris for a long time, so please forgive any inaccuracies.  truss just tells you what system calls are being used (similar to procmon in Windows).  Try pstack.  That will tell you where every thread is.

The example in http://publib.boulder.ibm.com/httpserv/ihsdiag/get_backtrace.html shows you how to use pstack with a core dump, including how to find the bad guy.  Using pstack on a pid is similar.
 
Brian UtterbackPrinciple Software EngineerCommented:
When you say the process is hung, what does it not respond to that you expect it to? From the truss it appears that you have 11 threads in your process, all of which are pretty actively engaged in sleeping.

There are three types of sleeping in evidence in the truss. The first is simply a call to nanosleep, which blocks until a certain period of time has passed.

Threads 2, 3 and 6 are sleeping in nanosleep and have not returned during the truss.

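The second type of sleeping is via lwp_park. This is what a thread uses when it has no more work to do: the threading library parks the thread in lwp_park while it waits on a synchronization object such as a condition variable, so that other threads can be scheduled. When the wait has a timeout and nothing wakes the thread first, the call returns ETIME (Err#62), which by itself is not a failure. The timeout is typically short so that the thread wakes up periodically to see if there is any work to do.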

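Threads 4, 5, 8, 9, 10 and 11 are all in lwp_park, and all but thread 4 are seen waking up and then going back to sleep. Since I can't tell how long the truss is for, I don't know if thread 4's not waking up is suspicious. Note that thread 4 passes a null first argument, which appears to mean it is parked with no timeout, so it would only wake up when another thread explicitly wakes it.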

The third method is to sleep waiting for some data to become available via the pollsys call. These are threads that, like the previous threads, are sleeping while waiting for work to do. But in this case the wait is for data to become available on a file descriptor. The pollsys call will sleep for a fixed length of time, and will wake up either when the time is up or when there is an I/O event on one of the file descriptors it is interested in.
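
To make the two wake-up cases concrete, here is a minimal sketch built on the poll() library call, which is what shows up as pollsys in truss. A return of 0 from poll() means the timeout expired with no descriptor ready, which matches the "= 0" lines for thread 1. The helper name read_with_timeout and the READ_TIMED_OUT sentinel are hypothetical and only for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

#define READ_TIMED_OUT (-2)  /* hypothetical sentinel for "nothing arrived" */

/* Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read from it. Returns the byte count from read() (0 means EOF),
 * -1 on error with errno set, or READ_TIMED_OUT if the timeout expired. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    assert(buf != NULL);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;              /* poll itself failed */
    if (n == 0)
        return READ_TIMED_OUT;  /* time is up, no I/O event */

    /* An event is pending: read() returns data, EOF, or the error. */
    return read(fd, buf, len);
}
```

Passing a negative timeout to poll() means wait indefinitely, which would look like a pollsys call that stays in "sleeping..." until data arrives.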

Threads 7 and 1 are waiting in pollsys. Thread 1 is waking up quite often and then going back to sleep. Thread 7 is only seen sleeping, but since the length of time in pollsys is often very long, that may not be suspicious. Thread 7 is interested in only one file descriptor, while thread 1 is interested in 6 different ones.

I don't see any threads missing (unless there are more than 11 threads), and six of them (threads 1, 5, 8, 9, 10 and 11) are seen waking up and going back to sleep during the truss. So, again, when you say it is hung, what interaction are you expecting? I presume that some data is expected to either be written or read that is not. Do you know on which file descriptors you expected interaction?

Is this your own application? As mentioned earlier in the thread, you can use pstack to determine where each thread is in your application. Also, you can use pfiles to match up file descriptors to files. And finally, you could re-run truss with the following arguments "-faled -vall" to tell you which file descriptors the threads in pollsys are waiting on and how long the timeouts are.

You can also attach to the running process with dbx and examine its current state that way.
 
sriramAuthor Commented:
Thank you all for the wonderful suggestions. I tried using <pstack pid>; it clearly says our application is failing while sending a packet out using the TCP/IP ::send(...) system call. The application that is spitting errors is the TCP/IP server; clients connect to it for service and then disconnect.

The server runs for a while and then starts spitting ETIME errors (pstack says it is in TCP/IP send()). Any help finding the issue would be appreciated.
Any examples for writing an effective TCP/IP send() on Solaris 10 would be greatly appreciated.

I never used dtrace; let me try to use it.

Thanks again.
 
cupCommented:
That depends on your setup.  send and recv are just the end bits.  It is normally the initial bits that may be dodgy.  

1) What does the TCP initialization look like?  What setsockopt and ioctl/fcntl calls are you using?
2) Is the thread that is failing using a blocking send or a non-blocking send?  Basically, are you sitting there waiting until it has finished or are you polling?
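
For reference, a minimal sketch of a send loop that copes with both cases. It assumes a connected stream socket; the helper name send_all is hypothetical and not taken from the poster's code. It retries short writes and EINTR, and for a non-blocking socket it waits in poll() for the socket to become writable rather than spinning.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: send exactly len bytes on sock. Handles short
 * writes, EINTR, and (for non-blocking sockets) EAGAIN/EWOULDBLOCK.
 * Returns 0 on success, or -1 with errno left as set by send() or poll(). */
int send_all(int sock, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n >= 0) {
            /* send() may accept fewer bytes than requested. */
            p += n;
            len -= (size_t)n;
            continue;
        }
        if (errno == EINTR)
            continue;                /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Non-blocking socket with a full send buffer: wait until
             * it is writable again. */
            struct pollfd pfd;
            pfd.fd = sock;
            pfd.events = POLLOUT;
            pfd.revents = 0;
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;                   /* real error, e.g. EPIPE or ECONNRESET */
    }
    return 0;
}
```

Since clients connect and disconnect, a server using this would normally also ignore SIGPIPE so that writing to a closed connection surfaces as an EPIPE return instead of a signal. On Solaris, socket code is typically linked with -lsocket -lnsl.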
 
Brian UtterbackPrinciple Software EngineerCommented:
The send library call never returns ETIME, so I think you are misinterpreting something. The method of obtaining the error is a little convoluted, and if you are not careful you can get the wrong value. If, for instance, your code treats a return of 0 as an error, then you are likely to report the error code from the last failing system call, which from the truss appears to be one of the lwp_park calls. Perhaps you could post the actual output from pstack?
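
A minimal sketch of the careful way to pick up the error, assuming the usual convention that send() signals failure only by returning -1. The wrapper name send_checked and its err_out parameter are hypothetical, just to show the pattern.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical wrapper: one send() call with careful error capture.
 * Returns the byte count (>= 0) on success. On failure returns -1 and
 * stores the errno value from send() itself in *err_out. */
ssize_t send_checked(int sock, const void *buf, size_t len, int *err_out)
{
    ssize_t n;

    assert(err_out != NULL);
    *err_out = 0;

    n = send(sock, buf, len, 0);
    if (n == -1) {
        /* Copy errno right away: any later library call (logging,
         * locking, and so on) may overwrite it with an unrelated value. */
        *err_out = errno;
        return -1;
    }

    /* n >= 0 is success. errno is stale here and must not be read. */
    return n;
}
```

In a multi-threaded Sun Studio build it may also be worth checking that everything is compiled with -mt, since without it errno may refer to the single global variable rather than the per-thread one, and one thread could then see an ETIME left behind by another thread's timed wait.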
 
sriramAuthor Commented:
Thank You...