Unexpected 40 ms delayed TCP ACK over local loopback with 16K segments on Red Hat
Posted on 2006-05-22
Sorry it's long - it's a hard problem :-)
Throughput on a TCP connection over local loopback drops two orders of magnitude very occasionally - 1-2 times/week. Typical traffic is ~120 Mbits/sec. At problem time, lots of data is backed up and the sender is sending 16396-byte segments. One gets an immediate ACK, the next gets a 40 ms delayed ACK; the cycle repeats. At non-problem times with lots of data backed up, every 16396-byte segment gets an immediate ACK. We believe the app on the receiving side is not a bottleneck.

Once the problem starts, it persists for at least minutes (then the client gets mad and restarts everything). There are ~20 boxes receiving more or less the same data, but only one breaks at a time. Maybe 5 boxes have had the problem, some several times.

The advertised window is 32K (a raw value of 256 with a window scale of 7, i.e. 256 << 7 = 32768) in both problem and non-problem cases, having started out at 48000. The MTU on lo is the default 16436. The problem is not (yet) reproducible in the lab - only at the (very annoyed) client site.
The 16396-byte segment is big enough that the 32K window will not allow the sender to send a second full segment. We suspect that if it could, the receiver would be obliged to ACK that second segment immediately, and the problem would disappear. We've recommended reducing the MTU to 8000, hoping the sender will then be able to send at least two segments and avoid the delayed ACK, but we can't verify that it works. We're concerned the window will shrink and we'll be stuck again. (But if it shrinks to 16K, the sender can still get 2 segments out, so maybe...)
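To make the arithmetic concrete, here is a throwaway sketch (assuming plain IPv4 + TCP headers and no TCP options, so MSS = MTU - 40) of how many full segments fit in the window at the current and proposed MTUs:

    #include <stdio.h>

    /* Payload per segment = MTU - 20 (IP header) - 20 (TCP header),
     * assuming no TCP options on data segments (timestamps are off). */
    static int mss_for_mtu(int mtu) { return mtu - 40; }

    int main(void)
    {
        int window = 256 << 7;        /* advertised window: 32768 */
        int mtus[] = { 16436, 8000 }; /* current lo default, proposed */
        int i;

        for (i = 0; i < 2; i++) {
            int mss = mss_for_mtu(mtus[i]);
            printf("MTU %5d -> MSS %5d -> %d full segment(s) in window\n",
                   mtus[i], mss, window / mss);
        }
        return 0;
    }

That gives 1 segment at MTU 16436 (MSS 16396) but 4 at MTU 8000 (MSS 7960) - and even a window shrunk to 16K still holds 2 full segments at the smaller MTU, which is the "so maybe..." above.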
We've seen it with AS2.1 and 3.0 (they have no 4.0). /proc/sys/net/ipv4/tcp_timestamps is 0. We think that's what frees up the 12 bytes the timestamp option would otherwise use, making segments 16396 bytes instead of 16384. (At 16384, the sender should be able to fit 2 segments in the 32K window.) The client is unwilling to turn timestamps on (yet).
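For what it's worth, the timestamp arithmetic works out like this (assuming the standard 12 bytes for the RFC 1323 timestamp option, alignment padding included):

    MSS on lo (no options):   16436 - 20 (IP) - 20 (TCP) = 16396
    payload with timestamps:  16396 - 12 (timestamp option) = 16384
    segments per 32K window:  32768 / 16396 = 1, but 32768 / 16384 = 2

So with timestamps on, two full segments would fit the window exactly, which is why we keep asking the client about it.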
Hardware is multiprocessor Xeon and Opteron; the problem has occurred on both.
The loopback TCP sender (Tibco Rendezvous) gets its data via multicast from a gigE network. At least on AS2.1, IRQs are not bound to a particular processor.
TCP_NODELAY is set on both ends, but that should not affect receiver behavior. PSH is set sometimes, with no obvious pattern. Might that affect the receiver's ACK-delay behavior?
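In case anyone wants to experiment along with us: the receiver-side knob we're eyeing is TCP_QUICKACK, which tcp(7) says is Linux-specific and present since 2.4.4 (so it should exist on AS2.1's 2.4 kernel, though older glibc headers may not define the constant). A sketch, not yet tested against this problem:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle and ask the kernel to ACK immediately.  Note that
     * TCP_QUICKACK is not sticky: the kernel may clear it, so it must
     * be re-set after each recv() to keep quickack mode alive. */
    static int force_quick_acks(int fd)
    {
        int one = 1;

        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                       &one, sizeof(one)) < 0)
            return -1;
        return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK,
                          &one, sizeof(one));
    }

Whether we can get this into the receiving app is another question.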
Red Hat has told us about loopback problems with segment sizes that are not a multiple of 8 bytes, and about problems with segments > 16K, but those symptoms do not match our delayed-ACK observations.
Loopback is assumed lossless. Could a glitch that drops a segment send the stack down an unintended code path?
Has anybody seen this? Does anybody know under what high-traffic circumstances a TCP receiver will decide to delay ACKs?
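To partially answer my own question: here is my paraphrase of the decision point, __tcp_ack_snd_check() in net/ipv4/tcp_input.c (from reading 2.4/2.6 sources - treat it as a sketch of the logic, not the literal code, and check your own tree):

    /* Rough paraphrase of __tcp_ack_snd_check().  ACK immediately when:
     *   - more than one full-sized segment (rcv_mss) is unacked and
     *     the window we would advertise is not shrinking, or
     *   - we're in quickack mode (e.g. just after connection setup), or
     *   - there is out-of-order data queued.
     * Otherwise arm the delayed-ACK timer (minimum ~40 ms). */
    if ((bytes_unacked > rcv_mss && advertised_window_ok) ||
        in_quickack_mode ||
        out_of_order_queue_nonempty)
            tcp_send_ack(sk);
    else
            tcp_send_delayed_ack(sk);

If I'm reading that right, a single 16396-byte segment gives bytes_unacked == rcv_mss, never greater, so the first clause can't fire while only one segment fits the window - once quickack mode lapses, the delayed-ACK timer is all that's left, which would match our 40 ms observations.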
Thanks for any insights!