sriram
asked on
Getting lwp_park() ETIME Error 62 and process is hung
Hello,
I am running a multi-threaded process on Solaris 10, compiled with Sun Studio 12. Sometimes the process hangs. I ran "truss -p <PID>" on it, and it shows strange errors like this:
truss -p 1739
/6: nanosleep(0xFAC7BBE0, 0xFAC7BBD8) (sleeping...)
/7: pollsys(0xFAB7BC30, 1, 0x00000000, 0x00000000) (sleeping...)
/9: lwp_park(0xFA97BD78, 0) (sleeping...)
/5: lwp_park(0xFAD7BD80, 0) (sleeping...)
/11: lwp_park(0xFA77BD78, 0) (sleeping...)
/10: lwp_park(0xFA87BD78, 0) (sleeping...)
/2: nanosleep(0xFB07BE58, 0xFB07BE50) (sleeping...)
/3: nanosleep(0xFAF7BE58, 0xFAF7BE50) (sleeping...)
/4: lwp_park(0x00000000, 0) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/8: lwp_park(0xFAA7BD78, 0) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/5: lwp_park(0xFAD7BD80, 0) Err#62 ETIME
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/8: lwp_park(0xFAA7BD78, 0) Err#62 ETIME
/5: lwp_park(0xFAD7BD80, 0) (sleeping...)
/11: lwp_park(0xFA77BD78, 0) Err#62 ETIME
/8: lwp_park(0xFAA7BD78, 0) (sleeping...)
/11: lwp_park(0xFA77BD78, 0) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/9: lwp_park(0xFA97BD78, 0) Err#62 ETIME
/10: lwp_park(0xFA87BD78, 0) Err#62 ETIME
/9: lwp_park(0xFA97BD78, 0) (sleeping...)
/10: lwp_park(0xFA87BD78, 0) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) = 0
/5: lwp_park(0xFAD7BD80, 0) Err#62 ETIME
/1: pollsys(0xFFBFEF48, 6, 0xFFBFF0D0, 0x00000000) (sleeping...)
/8: lwp_park(0xFAA7BD78, 0) Err#62 ETIME
/5: lwp_park(0xFAD7BD80, 0) (sleeping...)
/11: lwp_park(0xFA77BD78, 0) Err#62 ETIME
ASKER
Thank you all for the wonderful suggestions. I tried "pstack <pid>"; it clearly shows that our application is failing while sending a packet out using the TCP/IP ::send(...) call. The application producing the errors is the TCP/IP server; clients connect to it for service and then disconnect.
The server runs for a while and then starts producing the ETIME error (pstack shows it in TCP/IP send()). Any help finding the issue would be appreciated.
Any examples for writing an effective TCP/IP send() on Solaris 10 would be greatly appreciated.
I have never used dtrace; let me try it.
Thanks again.
That depends on your setup. send and recv are just the end bits. It is normally the initial bits that may be dodgy.
1) What does the TCP initialization look like? What setsockopt and ioctl/fcntl calls (ioctlsocket on Windows) are you using?
2) Is the thread that is failing using a blocking send or a non-blocking send? Basically, are you sitting there waiting until it has finished, or are you polling?
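If it is not obvious from the code which mode the socket is in, one way to answer question 2 is to ask the descriptor itself. The following is only a sketch: is_nonblocking and set_nonblocking are hypothetical helper names, not anything from the original application, and they rely only on the standard fcntl() call with F_GETFL/F_SETFL and the O_NONBLOCK flag.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: report the blocking mode of a descriptor.
 * Returns 1 if O_NONBLOCK is set, 0 if the descriptor is blocking,
 * and -1 if fcntl() itself failed (errno is set by fcntl).
 */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}

/*
 * Hypothetical helper: turn O_NONBLOCK on (on != 0) or off (on == 0)
 * while leaving the other file status flags untouched.
 * Returns 0 on success, -1 on failure.
 */
int set_nonblocking(int fd, int on)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (on)
        flags |= O_NONBLOCK;
    else
        flags &= ~O_NONBLOCK;
    return (fcntl(fd, F_SETFL, flags) == -1) ? -1 : 0;
}
```

Calling is_nonblocking() on the connected socket just before the failing send() would show whether that thread can sit inside send() indefinitely (blocking) or should expect EAGAIN/EWOULDBLOCK when the peer is slow (non-blocking).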
The send library call never returns ETIME, so I think you are misinterpreting something. The method of obtaining the error is a little convoluted, and if you are not careful you can get the wrong value. If, for instance, your code treats a return of 0 as an error, then you are likely to report the error code from the last failing system call, which from the truss output appears to be the lwp_park calls. Perhaps you could post the actual output from pstack?
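To illustrate the point about picking up a stale error code, here is a minimal sketch of a send loop that reads errno only when send() returns -1 and saves it immediately, before any other call can overwrite it. send_all and its err_out parameter are hypothetical names for illustration, not the asker's code. The sketch assumes a stream (TCP) socket; on a non-blocking socket it hands EAGAIN/EWOULDBLOCK back to the caller rather than waiting.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: send exactly len bytes on a stream socket.
 * Returns 0 when everything was sent. Returns -1 on failure, with
 * *err_out holding the errno captured right after the failing send().
 */
int send_all(int fd, const void *buf, size_t len, int *err_out)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n == -1) {
            int saved = errno;  /* capture before anything else runs */
            if (saved == EINTR)
                continue;       /* interrupted by a signal: retry */
            *err_out = saved;
            return -1;
        }
        /* n >= 0 is not an error, so errno is not consulted here. */
        /* A short count only means the rest still has to be sent. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would log the value left in err_out (for example with strerror()) rather than reading the global errno later, since by then another thread-local call may have failed and replaced it. Because clients in this setup disconnect at will, the server may also want to ignore SIGPIPE so that a send() to a closed connection fails with EPIPE instead of raising the signal.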
ASKER
Thank You...
The example in http://publib.boulder.ibm.com/httpserv/ihsdiag/get_backtrace.html shows you how to use pstack with a core dump, including how to find the bad guy. Using pstack on a pid is similar.