SharonLaw asked:

How to clear CLOSE_WAIT state of a TCP connection?

When I run netstat -a, I see that the connections are in the CLOSE_WAIT state. This causes my program using these connections to sleep (confirmed with truss -p <process_pid>). Only after I terminate and restart my program do the connections return to the ESTABLISHED state.

Is there a timer I can set so that, say, after 120 seconds the CLOSE_WAIT connections are broken and my program can reconnect? For example, via the "ndd" command?
Otetelisanu replied:

Have a look at
http://www.sean.de/Solaris/soltune.html


tcp_close_wait_interval
default 240000 (according to RFC 1122, 2 MSL), recommended 60000, possibly lower
Since Solaris 7: obsolete parameter, use tcp_time_wait_interval instead
Since Solaris 8: no longer accessible, use tcp_time_wait_interval

SharonLaw (Asker) replied:
The tcp_time_wait_interval value is 240,000 ms (4 minutes), but the connections stay in the CLOSE_WAIT state even after hours.

Please advise. Thanks.
       CLIENT                         SERVER            

  1.    ESTABLISHED                    ESTABLISHED
  2.    (Close)
        FIN-WAIT-1  --> <FIN,ACK>  --> CLOSE-WAIT
  3.    FIN-WAIT-2  <-- <ACK>      <-- CLOSE-WAIT
  4.                                   (Close)
        TIME-WAIT   <-- <FIN,ACK>  <-- LAST-ACK
  5.    TIME-WAIT   --> <ACK>      --> CLOSED
        (2 MSL)

As I understand it, tcp_time_wait_interval doesn't kick in until after CLOSE_WAIT. There is no parameter that directly controls how long a connection may stay in CLOSE_WAIT. In the scenario above, the client sends a close; the server acknowledges it and sends whatever data is still in its buffers, remaining in the CLOSE_WAIT state. The server only finishes closing the connection once it has sent its own FIN to the client and received an ACK for it.

The CLOSE_WAIT state means the other end of the connection has been closed while the local end is still waiting for the application to close its socket.

Similarly, if the server receives a SYN + FIN from the client, it does not know what to do and leaves connections stuck in the CLOSE_WAIT state.

It is best to "truss" the application and "snoop" the TCP session to narrow down the problem.

# truss -o truss.out -laef -vall -p <the pid of the server process>
# snoop -o snoop.out port <tcp port number>

SD

Add the following lines to /etc/init.d/inetinit:

/usr/sbin/ndd -set /dev/tcp tcp_close_wait_interval 1500
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 1500

and reboot

Brian Utterback replied:
Do not set tcp_close_wait_interval, tcp_time_wait_interval, or tcp_keepalive_interval. None of them have anything to do with your problem.

The problem is that your application is not closing the socket now that the other host has closed its socket. That's what CLOSE_WAIT means, namely that the OS is waiting for the application to close the socket.

There are numerous reasons an application might not close the socket; almost all of them are application bugs. There are a few ways the application can be informed that the other end has closed the socket. The most common is to try to read the socket and get back the EOF indicator, which is a successful read of zero bytes.
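
As an illustration only (not code from this thread), here is a minimal Python sketch of that EOF check; the socket variable name and buffer size are assumptions:

def drain(sock):
    # A successful read of zero bytes is the EOF indicator:
    # the peer has closed its end of the connection.
    data = sock.recv(4096)
    if data == b"":
        # Close our end too; until the application does this,
        # the OS keeps the socket in CLOSE_WAIT.
        sock.close()
        return None
    return data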

Writing to the socket may tell you, but not always. TCP allows writing on a socket that has been closed on the other end, because in TCP closing a socket says that you will do no more writes, but says nothing about whether or not you will still read. If the socket was closed abortively and no longer exists at all, then the write will return an EPIPE error.

If the protocol that you are using does not allow for reading any data, then use non-blocking sockets and the poll call to tell when the socket is readable without actually blocking in the read call.
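
Again as an illustration, a Python sketch of that poll approach; the zero-millisecond timeout and the MSG_PEEK read are assumptions about how one might wire it up, not the poster's code:

import select, socket

def peer_has_closed(sock):
    # Poll for readability with a zero-millisecond timeout,
    # so this call never blocks.
    p = select.poll()
    p.register(sock.fileno(), select.POLLIN)
    for fd, event in p.poll(0):
        if event & select.POLLIN:
            # Peek one byte without consuming it; zero bytes
            # back means EOF, i.e. the other end has closed.
            return sock.recv(1, socket.MSG_PEEK) == b""
    return False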

The only real possibility that is not a bug in the application is if there is a bug in the OS that prevented it from informing the application that the EOF was available. This is of course unlikely, but not impossible.
truss -p results:
smsmgr@ws01-1a:admin/bin% psc smmgr
  smsmgr  1581  1563  0   Feb 28 pts/7   10:53 smmgr
smsmgr@ws01-1a:admin/bin% truss -p 1581
lwp_sema_wait(0xFEE0BE78)       (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xFEC07E78)       (sleeping...)
lwp_sema_wait(0xFF12DF08)       (sleeping...)
lwp_sema_wait(0xFEB05E78)       (sleeping...)
lwp_sema_wait(0xFEA03E78)       (sleeping...)
lwp_sema_wait(0xFE901E78)       (sleeping...)
lwp_sema_wait(0xFE40FE78)       (sleeping...)
lwp_sema_wait(0xFE30DE78)       (sleeping...)
lwp_sema_wait(0xFE20BE78)       (sleeping...)
lwp_sema_wait(0xFE109E78)       (sleeping...)
semop(5, 0x00032124, 1)         (sleeping...)
semop(5, 0x00032124, 1)         (sleeping...)
door_return(0x00000000, 0, 0x00000000, 0) (sleeping...)


The program has 10 connection threads. It sleeps and is unable to perform further read or write actions while the connections are in the CLOSE_WAIT state.
This behavior is well studied and documented by the IBM WebSphere (Solaris) performance team. When high connection rates occur, a large backlog of TCP connections builds up and can slow the server down.
The server has been seen stalling during certain peak periods. netstat showed many sockets open to port 80 in the CLOSE_WAIT or FIN_WAIT_2 state. Visible delays of up to 4 minutes, during which the server sends no responses, have occurred, yet CPU utilization stays high, with all of the activity in system processes.

It is recommended to keep the
tcp_close_wait_interval, tcp_time_wait_interval, and tcp_keepalive_interval values at or below one minute (60000 ms).
A socket remains in CLOSE_WAIT until the server performs the passive close and sends a FIN packet back to the client. Under heavy thread activity, the server thread might not get enough CPU cycles to do so. tcp_close_wait_interval tells the Solaris kernel when to give up on orphaned CLOSE_WAIT sockets.
ASKER CERTIFIED SOLUTION
soupdragon
If the application seems to behave OK and the server isn't heavily loaded (referring to the comments above),
look for dropped packets/errors on the network.
Study the IP counters with netstat.
Make sure the NIC and switch ports are set to 100 Mbit/s full duplex with no autonegotiation(!)
Is there a firewall in the path that "forgets" about sessions after a certain amount of idle time?
Can you see a pattern in when/how frequently this happens? (long periods of idle time, or always after, say, 15 minutes)

/b


I have faced a similar problem while using iPlanet 4.1 SP9. The connections appeared to stay in the CLOSE_WAIT state forever. I tried working with the tcp_* parameters, to no avail.

You may try sending a HUP signal to the server process that binds to the port. I wrote a simple script that counts the number of CLOSE_WAITs on a particular port and, if it exceeds 4 (my application would hang at 4 CLOSE_WAITs), does a "kill -1 PID-OF-PROCESS". This immediately closes all the connections and quickly refreshes the application without any downtime. Let us know if this works for you too.
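
A rough Python sketch of such a watchdog, for illustration; the port, threshold, check interval, and PID are placeholder assumptions (the original script is not shown in the thread):

import os, signal, subprocess, time

PORT = 80     # port the server binds to (placeholder)
LIMIT = 4     # CLOSE_WAIT count that should trigger the HUP
PID = 12345   # PID of the server process (placeholder)

while True:
    out = subprocess.run(["netstat", "-an"],
                         capture_output=True, text=True).stdout
    # On Solaris, netstat prints the port after a trailing dot,
    # e.g. 192.168.1.10.80 for port 80.
    stuck = [line for line in out.splitlines()
             if "CLOSE_WAIT" in line and ".%d " % PORT in line]
    if len(stuck) >= LIMIT:
        os.kill(PID, signal.SIGHUP)   # the equivalent of kill -1
    time.sleep(60)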