Solved

Unexpected 40 ms delayed tcp ack over local loopback with 16K segments on RedHat

Posted on 2006-05-22
2,026 Views
Last Modified: 2009-09-01
Sorry it's long - it's a hard problem :-)

Throughput on a tcp connection over local loopback drops two orders of magnitude very occasionally - 1-2 times/week.  Typical traffic is ~120 Mbits/sec.  At problem time, lots of data is backed up and sender is sending 16396 byte segments.  One gets immediate ack, next gets 40 ms delayed ack; cycle repeats.  At non-problem times with lots of data backed up, every 16396 byte segment gets an immediate ack.  We believe app on receiving side is not a bottleneck.

Once problem starts, it remains for at least minutes (then client gets mad and restarts everything).  There are ~20 boxes receiving more or less the same data, but only one breaks at a time.  Maybe 5 boxes have had the problem, some several times.

Window is 32K (scaled 256 << 7) in both problem and non-problem cases, having started out at 48000.  MTU on lo is default 16436.  Problem is not (yet) reproducible in lab - only at (very annoyed) client site.

The 16396 byte segment is big enough that the 32K window will not allow sender to send a second full segment.  We suspect if it could, the receiver would be obligated to send an immediate ack to that second segment and the problem would disappear.  We've recommended reducing MTU to 8000, hoping sender will be able to send at least two segments and avoid the delayed ack, but can't verify that it works.  We're concerned the window will shrink and we'll be stuck again.  (But if it shrinks to 16K, sender can still get 2 segments out, so maybe...)

We've seen it with AS2.1 and 3.0 (they have no 4.0).  /proc/sys/net/ipv4/tcp_timestamps is 0.  We think that's what frees up the 12 bytes that make segments 16396 instead of 16384.  (At 16384, the sender should be able to send 2 segments in the 32K window.)  Client is unwilling to turn timestamps on (yet).
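
To make the arithmetic concrete, here's a throwaway sketch - the only inputs are the numbers we see on the wire and the standard 12-byte timestamp option; nothing else is assumed:

#include <stdio.h>

int main(void)
{
    int mtu    = 16436;            /* default MTU on lo */
    int mss    = mtu - 20 - 20;    /* minus IP and TCP headers = 16396 */
    int window = 256 << 7;         /* advertised 256 at window scale 7 = 32768 */

    int data_no_ts = mss;          /* timestamps off: 16396 data bytes per segment */
    int data_ts    = mss - 12;     /* timestamps on: the option eats 12 bytes -> 16384 */

    printf("window=%d  2 segs w/o ts=%d  2 segs w/ ts=%d\n",
           window, 2 * data_no_ts, 2 * data_ts);
    /* 2*16396 = 32792 > 32768: only one full segment fits in the window, so the
     * receiver never sees the second full-sized segment that would force an
     * immediate ack.  2*16384 = 32768 fits exactly. */
    return 0;
}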

Hardware is multiproc Xeon and Opteron - problem has occurred on both.  

The loopback tcp sender (Tibco Rendezvous) gets its data via multicast from a gigE network.  At least in AS2.1, irqs are not bound to a processor.

TCP_NODELAY is set on both ends, but that should not affect receiver behavior.  PSH is set sometimes, with no obvious pattern.  Might that affect the receiver's ack delay behavior?
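
For reference, setting that option is just the standard setsockopt call (sketch below; 'fd' is a placeholder), and as far as we know it only affects the sending side of each endpoint:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* TCP_NODELAY only disables Nagle on the sending side of the socket;
 * as far as we know it has no effect on the peer's delayed-ack logic. */
void set_nodelay(int fd)
{
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}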

Red Hat has told us about loopback problems with segments not a multiple of 8 bytes and problems with segments > 16K, but symptoms do not match our delayed ack observations.

Loopback is assumed lossless.  Might a glitch that causes a segment loss branch into unintended code?

Has anybody seen this?  Does anybody know under what high-traffic circumstances a tcp receiver will decide to delay acks?

Thanks for any insights!

Jim
Question by:Jim Williams
 
LVL 27

Expert Comment

by:Nopius
ID: 16740143
- Hardware is multiproc Xeon and Opteron - problem has occurred on both.  
Does the problem ever occur on a single-proc system?
My suspicion is the SMP kernel 2.4.x. Possibly you have a race condition (access to the loopback network buffer from two different CPUs). One CPU is copying the buffer to your application and another is unable to get the lock on that buffer when a network packet is ready to be placed from the NIC into the buffer. The loopback driver will probably wait until the lock is freed. It's also possible you have a buffer overrun, when data is not processed by the application at that speed. Then your loopback driver has more data to process but nowhere to place it. A second segment sent to the loopback interface will then be delayed. This may be a problem under such heavy load.

- Loopback is assumed lossless.  Might a glitch that causes a segment loss branch into unintended code?
Yes, loopback is lossless, unless you have intentionally configured netfilter to do random packet drops. Probably there is no such unintended code.

- Does anybody know under what high-traffic circumstances a tcp receiver will decide to delay acks?
When the network driver is unable to process the packet immediately.

My recommendation is to try a Linux kernel newer than 2.6.11; 2.4.x kernels are not very good with concurrency at the kernel (driver) level.
Reducing the MTU size will probably help, but the overall transfer rate will be degraded as well.
 
LVL 27

Expert Comment

by:Nopius
ID: 16740147
of course here the NIC is a virtual interface card :-)
 

Author Comment

by:Jim Williams
ID: 16740463
Thanks for the reply, Nopius.  Some answers below.

>>Does the problem ever occur on a single-proc system?
*Client isn't running any single-proc machines for this, so it hasn't ever happened, and we don't know if it would.

>>My suspicion is the SMP kernel 2.4.x. Possibly you have a race condition...
*That sounds reasonable.

 >>...The loopback driver will probably wait until the lock is freed.
 *Yes, I would expect that.  But I'd expect the wait on the lock to be microseconds - not exactly 40ms, exactly every other frame.
 
 >>It's also possible you have a buffer overrun, when data is not processed by the application at that speed. Then your loopback driver has more data to process but nowhere to place it. A second segment sent to the loopback interface will then be delayed.
*From other probes we've put in the receiving code, we don't think that code is having trouble keeping up.  It keeps up at hundreds of times the 'slow' rate 99.9% of the time.  And again, even if it were delayed, the clockwork 40ms delay isn't what I'd expect to see.

>>- Does anybody know under what high-traffic circumstances a tcp receiver will decide to delay acks?
>>When the network driver is unable to process the packet immediately.
*If the problem were just that the receiver didn't have room, I'd expect to see tcp's usual response of closing the window.  But that doesn't happen.  The window stays at 32K.  But *something* makes it go into delayed ack mode.  That's really the immediate question:  What triggers delayed ack mode?  If we understood that well enough to design something that would let us reproduce the problem in the lab, we'd be miles ahead.
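
One more data point: the suspiciously round 40ms looks like it could simply be the kernel's minimum delayed-ack timeout.  If I'm reading the headers right (quoted from memory, so treat as approximate):

/* From include/net/tcp.h on 2.4.x, as best I recall - not verified here: */
#define TCP_DELACK_MAX  ((unsigned)(HZ/5))   /* maximal time to delay before sending an ACK */
#define TCP_DELACK_MIN  ((unsigned)(HZ/25))  /* minimal time to delay before sending an ACK */
/* With HZ = 100 on these 2.4 boxes, HZ/25 = 4 ticks = 40ms, which would explain
 * why the delay is always exactly 40ms once we're in delayed-ack mode. */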

>>My recommendation is to try a Linux kernel newer than 2.6.11; 2.4.x kernels are not very good with concurrency at the kernel (driver) level.
*We'd be happy for the client to upgrade to 2.6, but unfortunately we can't dictate what O/S release they use.  We had a struggle just to get them to upgrade to 2.4.24 (AS3.0).

>>Reducing the MTU size will probably help, but the overall transfer rate will be degraded as well.
*Yes, there would be a slight decrease.  In tests trying to reproduce the symptoms using iperf to blast traffic through loopback, we saw throughput of around 5Gbit/sec with MTU 8000 (down from 6.4 Gbit/sec with the default MTU).  Since the throughput we need is 20 times less than that, we can afford that slight degradation.  (And it's a lot better than 32Kbytes every 40ms!)
 
LVL 27

Expert Comment

by:Nopius
ID: 16748814
>*From other probes we've put in the receiving code, we don't think that code is having trouble keeping up.  It keeps up at hundreds of times the 'slow' rate 99.9% of the time.  And again, even if it were delayed, the clockwork 40ms delay isn't what I'd expect to see.
The 40ms delay may come from some other part of your system. Suppose your application is flushing all received data to disk. It then waits for the HDD until the data is physically written (not just buffered).
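
For example, something like this in the receive path would stall it (just an illustration; the descriptor names are invented):

#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical receive loop that blocks on a physical disk write: */
void receive_and_flush(int sock_fd, int file_fd)
{
    char buf[65536];
    ssize_t n = recv(sock_fd, buf, sizeof(buf), 0);
    if (n > 0) {
        write(file_fd, buf, n);
        fsync(file_fd);   /* waits until the data is on the platters - can cost tens of ms */
    }
}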

>*If the problem were just that the receiver didn't have room, I'd expect to see tcp's usual response of closing the window.  But that doesn't happen.  The window stays at 32K.
It's not that case. The TCP window cannot be shorter than the MTU size, and the MTU size is 32K.

It's very bad that the problem is not reproducible in a lab environment, so you can only tell the client to try one solution or another. Definitely reducing the MTU size is your next step :-)

You may try to reduce the feed speed of your data to the loopback interface using iptables, but I'm not sure whether it will work at such a huge speed (6Gbit/s) or not. BTW, what is the source of that data with such a high bitrate, and what media type is used to transfer it to the server?
 

Author Comment

by:Jim Williams
ID: 16749161
>The 40ms delay may come from some other part of your system. Suppose your application is flushing all received data to disk. It then waits for the HDD until the data is physically written (not just buffered).

*It's certainly possible it comes from something else - though not from disk write.  The app on the receive side of the loopback takes the data, maybe does a little magic with it, and sends it on to other processes across another network path (either udp or tcp) - nothing goes to disk.  And the other network connections can't exert back pressure, either:  The udp version is unthrottled, and the tcp version doesn't block, just cuts channel to slow consumers.  And again, at the slow time we're running at a tiny fraction of the throughput it handles continuously.


>>*If the problem were just that the receiver didn't have room, I'd expect to see tcp's usual response of closing the window.  But that doesn't happen.  The window stays at 32K.
It's not that case. The TCP window cannot be shorter than the MTU size, and the MTU size is 32K.

*I think tcp window can certainly be zero.  And MTU here is 16436, so I think window could be reduced from 32K.  Window scaling is 7 here, so there's still granularity to reduce it below 32K and still above MTU.  Based on that, if the receiver was having trouble, I'd still expect the window to reflect that by getting smaller (or going to zero).  Does that make sense?
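
Just to spell out the granularity point (nothing assumed beyond the scale factor we see negotiated):

#include <stdio.h>

/* With a window scale of 7, the 16-bit window field counts in units of
 * 128 bytes, so the receiver has plenty of room to advertise something
 * between 0 and 32K: */
int main(void)
{
    int scale = 7;
    int fields[] = { 256, 192, 129, 1, 0 };
    for (int i = 0; i < 5; i++)
        printf("win field %3d -> %6d bytes\n", fields[i], fields[i] << scale);
    /* 256 -> 32768, 192 -> 24576, 129 -> 16512 (just above the 16436 MTU),
     * 1 -> 128, 0 -> 0 (window closed) */
    return 0;
}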


>It's very bad that the problem is not reproducible in a lab environment, so you can only tell the client to try one solution or another. Definitely reducing the MTU size is your next step :-)

*Right.  And we all hate going to a client recommending a change with nothing better than "This might help."

BTW - I misstated what RedHat told us about the known problem:  It's for packets > 8033 bytes when the MTU is not a multiple of 8 bytes (not when the packet is not a multiple of 8 bytes).  Not directly relevant, but I wanted to fix my error.  They also recommended 8000 for the MTU.


>You may try to reduce the feed speed of your data to the loopback interface using iptables, but I'm not sure whether it will work at such a huge speed (6Gbit/s) or not. BTW, what is the source of that data with such a high bitrate, and what media type is used to transfer it to the server?

*The data is "real time" financial market data.  Very bursty in nature.  Slowing it down isn't really an option, since delivering it fast is what the system is for.  It's not "real time" as hardware guys use the term, but the client starts yelling at us if latency through the several components and network paths that comprise the whole system is more than a millisecond or two.

Typical traffic rate at this client site is maybe 120Mbits/sec, and comes into the loopback sender from the network via a negative-ack reliable protocol layer on top of udp multicast from a gigabit nic.  The 6Gbit/sec was running iperf in the lab across loopback trying (unsuccessfully) to break it.


Going back to the delayed acks which I still believe are the root of the slowdown:  I can find all kinds of references to why a receiver should use delayed acks.  (It's a SHOULD in the RFCs, after all.)  But other than the obligation to ack at least every second packet (give or take discussion of every other packet vs every 2*MSS bytes, etc), I can't find why a receiver would promptly ack every packet.  It makes sense, and I want it to happen, and in my case everything is cool when every packet gets a quick ack and goes south when acks are delayed - but I can't find anything justifying that a receiver should provide prompt acks to every packet.  Can anyone point me to what I've missed?  

 
LVL 27

Expert Comment

by:Nopius
ID: 16757331
Everything you said makes sense. Probably it's a Linux bug, or a feature.
You may look inside the kernel (2.4.24) sources:
/usr/src/linux/net/ipv4/tcp_input.c
or
/usr/src/linux/net/ipv4/tcp.c
and search for the string 'delay'.
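
From memory, the receiver-side decision you're interested in looks roughly like this (heavily paraphrased, so check the real 2.4.24 source - names and details may differ slightly):

/* Paraphrase of the ack decision in net/ipv4/tcp_input.c - NOT the exact code: */
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
    struct tcp_opt *tp = &sk->tp_pinfo.af_tcp;

    if (/* more than one full-sized segment is still unacked ... */
        (tp->rcv_nxt - tp->rcv_wup) > tp->ack.rcv_mss
        /* ... or we are in "quickack" mode ... */
        || tcp_in_quickack_mode(tp)
        /* ... or there is out-of-order data */
        || (ofo_possible && skb_peek(&tp->out_of_order_queue) != NULL)) {
            tcp_send_ack(sk);            /* ack right now */
    } else {
            tcp_send_delayed_ack(sk);    /* otherwise arm the delayed-ack timer */
    }
}

If a single full-sized segment never satisfies the first condition, and quickack mode has worn off, you fall into the delayed-ack branch - which would fit what you're seeing.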

Also, could you post some lines from a tcpdump with sequence numbers and timestamps from the beginning of a TCP session? I mean the problematic sessions.
 

Author Comment

by:Jim Williams
ID: 16758245
I'm leaning toward a bug :-)

Unfortunately, we don't have a capture at the beginning of a session that got into trouble - or even one covering the transition from OK to not OK.  But here are a couple of cycles of what it looked like while it was not OK:

09:46:53.569413 IP 50524 > 7500: . ack 32792 win 256 (DF)
09:46:53.569436 IP 7500 > 50524: P 32792:49188(16396) ack 1 win 8191 (DF)
09:46:53.569502 IP 50524 > 7500: . ack 49188 win 256 (DF)
09:46:53.569509 IP 7500 > 50524: . 49188:65584(16396) ack 1 win 8191 (DF)
09:46:53.609415 IP 50524 > 7500: . ack 65584 win 256 (DF)
09:46:53.609441 IP 7500 > 50524: P 65584:81980(16396) ack 1 win 8191 (DF)
09:46:53.609502 IP 50524 > 7500: . ack 81980 win 256 (DF)
09:46:53.609510 IP 7500 > 50524: . 81980:98376(16396) ack 1 win 8191 (DF)
09:46:53.649410 IP 50524 > 7500: . ack 98376 win 256 (DF)
09:46:53.649423 IP 7500 > 50524: P 98376:114772(16396) ack 1 win 8191 (DF)
09:46:53.649465 IP 50524 > 7500: . ack 114772 win 256 (DF)
09:46:53.649471 IP 7500 > 50524: . 114772:131168(16396) ack 1 win 8191 (DF)
09:46:53.689418 IP 50524 > 7500: . ack 131168 win 256 (DF)

Window scale is 7.  I might have the beginning of a session after it restarted, which I would expect to be the same as for this session, but the capture file is a couple of hundred MB and I don't have good access to it from here.  If I have it, I'll post a bit from work tomorrow.

I have kernel sources on a box at work.  I was hoping our premium developer support from Red Hat would let me avoid resorting to that, but it's not looking good.  Thanks for the pointers.  Looks like tcp_send_delayed_ack() is in tcp_output.c. *sigh*

If/when I learn more, I'll certainly post here.

Thanks!

Jim

 
LVL 27

Expert Comment

by:Nopius
ID: 16758574
If it's a bug, it might already be fixed in the latest 2.4.x kernel. Here is a list of changes from 2.4.24 to 2.4.32 related to TCP and packet scheduling.
As I suspected, the relevant changes were made in 2.4.27.

Maybe your customer will agree to update the kernel after looking at the list of TCP-related changes?

2.4.25
o [TCP]: Put Alexey's -EAGAIN change back in with Linus's fix on top

2.4.26
o [TCP]: Use tcp_tw_put on time-wait sockets

2.4.27
o [TCP]: Bic tcp congestion calculation timestamp
o [PKT_SCHED]: netem limit not returned correctly
o [TCP]: Fix build in 2.4.x with SCTP disabled
o [PKT_SCHED]: Missing rta_len init in sch_delay
o [TCP]: Kill distance enforcement between tcp_mem[] elements
o [TCP]: Abstract out all settings of tcp_opt->ca_state into a function
o [TCP]: Backport Vegas support from 2.6.x
o [TCP]: Backport BIC TCP from 2.6.x
o [TCP]: Add tcp_default_win_scale sysctl
o [TCP]: Add receiver side RTT estimation
o [TCP]: Grow socket receive buffer based upon estimated sender window
o [TCP]: More sysctl tweakings for rcvbuf stuff
o [TCP]: Add sysctl to turn off metrics caching
o [TCP]: Add vegas sysctl docs

2.4.28
o [TCP]: Store congestion algorithm per socket
o [TCP]: Add vegas style bandwidth info to 2.4.x tcp diag
o [TCP]: Backport 2.6.x cleanup of westwood code
o [TCP]: When fetching srtt from metrics, do not forget to set rtt_seq

2.4.29
o [TCP]: Receive buffer moderation fixes
o [NETLINK]: sed 's/->sk_/->//' in af_netlink.c

2.4.30
o [TCP]: BIC not binary searching correctly
o [TCP]: Fix BIC max_cwnd calculation error
o [TCP]: Fix calculation for collapsed skb size

2.4.32
o [TCP]: Don't over-clamp window in tcp_clamp_window()

 

Author Comment

by:Jim Williams
ID: 17173650
Apologies to admins and community for letting this sit so long.

For the record, the issue is understood and has been successfully worked around.  Here are the critical bits - all of which needed to be in the wrong state for the problem to occur:

a) tcp_timestamps had to be turned off (default is on).  This caused the MSS to be 16396 instead of 16384 bytes.

b) Window scaling had to be on (that's the default).  With scaling off, the correctly computed minimum window size of 2*16396=32792 bytes was offered, and two segments could be sent successfully, avoiding the delayed acks.  The power-of-two mechanism of window scaling doesn't play nice with MSS slightly greater than 1/2 the window size.

c) The receiving app had set SO_RCVBUF to 32K.  If it had set it larger (which it does in later versions), the window would not have shrunk to 32K.

d) MTU for loopback had to be the default of 16436 - allowing for 16396 bytes of payload (plus headers).  The ultimate workaround was to reduce it to 8000 bytes.  Probably reducing it to 16424 - limiting payload to 16384 bytes - would have been enough.

Anyway, the client is working now, later versions of our stuff set the socket receive buffer to 128K, and probably very few people turn off tcp_timestamps anyway, so nobody is likely to ever see this again.
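
For anyone hitting something similar, the receive-buffer part of the later fix is just the standard setsockopt, done before the connection is established so the kernel sizes the window (and picks the window scale) from the larger buffer.  A sketch (the function name is made up):

#include <sys/socket.h>

/* Roughly what later versions do on the receiving socket, before listen()/connect(): */
void bump_rcvbuf(int fd)
{
    int size = 128 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
}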

As for closing the question, I'm afraid I had to answer my own question, through a combination of RedHat support and going through the tcp source code.  I'll post in Community Support to close and refund.

Thanks to nopius, though - for helping me think through some of the possibilities.

- wheelthru
 
LVL 5

Accepted Solution

by:
Netminder earned 0 total points
ID: 17220171
Closed, 250 points refunded.
Netminder
Site Admin