Avatar of omers
omers

asked on

TCP/IP operations consume lots of CPU.

Dear fellows.

The client in our application reads a lot of data (images, each ~200 KB) over a 1 GBit line from the server.

The client's bottleneck is its CPU. It's a Dell 670 dual-processor Xeon at 3.4 GHz running Windows 2000.
I found that the TCP recv() call consumes a lot of CPU. When all the machine does is recv() calls and buffer allocations, it reads at ~60 MByte/sec. This seems a bit slow to me; by rule-of-thumb measures of TCP/IP CPU consumption it is roughly three times slower than expected.

I tried rewriting the recv call: using ACE, calling WSARecv directly, and using overlapped I/O. None of it seems to matter very much.

We are working in blocking mode, and TCP_NODELAY is set to true (Nagle algorithm disabled). SO_RCVBUF is 128 KB; changing it didn't matter very much. The server is either a Unix or a Windows machine.

My question is: what else can I do to try to reduce TCP/IP CPU consumption?

Thanx very much in advance, Omer Shibolet.



Avatar of BigRat
BigRat
Flag of France image

>>it reads at ~60MByte/sec
>>tcp recv() command takes a lot of cpu.

Sounds as if it is doing more admin than work.

Check the packet size on the wire with a packet sniffer.
Avatar of omers
omers

ASKER

Thnx.
I checked this with Ethereal on the receiving side and it's 1460 (the Ethernet MSS), so this seems OK, no?
Omer.
Hmmm, that was my first guess, but 1460 plus headers makes the 1500-byte Ethernet MTU, so that sounds about right.

>>I found out that the tcp recv() command takes a lot of cpu

How did you measure this exactly? It *would* appear to take a lot of real time, since it blocks, but the actual movement of data is performed in the Winsock2 DLL and NOT in the client-side recv() call.
Avatar of omers

ASKER

Thx again Bigrat.

The sentence
>>I found out that the tcp recv() command takes a lot of cpu
is a conclusion from the following: I wrote a simple TCP client that does nothing but recv() buffers. My computer was otherwise idle, and when the TCP loop runs it is 100% busy, part (most, I seem to remember) system time and part my process.
Omer.
OK, then it is in the TCP/IP stack that the time is being consumed.

There have been problems on Win2K with programs installing their own wsock32.dll, so try to locate it and check the version number (it should be 5.0.2195.6603).
Avatar of omers

ASKER

This is the wsock32.dll version on my PC   :(
Yes, start Windows Explorer, right click on the file under WinNt/system32 and bring up the menu. Choose properties and the Version tab.
Avatar of omers

ASKER

What I meant was that the version of my wsock32.dll is the one you indicated above...
Oh, sorry, so it is not corrupted.

So what kind of Ethernet card do you have, and what driver is it running with (asks the Rat, who is running out of alternatives)?
Avatar of omers

ASKER

Dear Rat,

It's not only on my PC but also on other (non-Dell) machines, so I assume it's not hardware related...

Do you have an estimate of expected tcp/ip performance on similar machines ?

FWIW, I use ACE for the hardware interface (including the I/O routines), although at some stage I also tried the Winsock WSARecv routines directly.

Thanx yet once more.
Omer.
No, because we don't have such a LAN; ours is 100 MBit and we don't see anything like that. A 1 GBit line should max out at about 100 MByte/sec because of frame overhead and ACKs, so 60 MB/sec may be a bit low but is not wide of the mark.

What is strange is the CPU time. Since the async socket call is effectively the same as ACE's, it might be a lack of DMA or the like, forcing the processor to do all the work. That is why I asked about the card and driver.
Have you tried other flags like SO_KEEPALIVE, SO_LINGER, SO_REUSEADDR and MSG_DONTROUTE?
Avatar of omers

ASKER

Sorry for delay.

BigRat,
Do you have a figure for the expected CPU consumption of TCP? Are the numbers I am getting really slow?
I hope I am not mistaken here; the adapter is an Intel PRO/1000 MT Server Connection. I went over the driver options, and it has TCP/IP Offloading Options enabled. Is there anything specific I should look for?

Thnx again, Omer.
Avatar of omers

ASKER

DineshJolania,

Thanx.

Can SO_KEEPALIVE, SO_LINGER and SO_REUSEADDR change CPU consumption? The socket is already connected and working OK...

MSG_DONTROUTE, if I am not mistaken, applies only to outgoing traffic and only for very specific purposes?

Thnx, Omer.
Those options would, if anything, slow down reception and release CPU power. SO_KEEPALIVE sends periodic probes to keep an otherwise idle connection alive, SO_LINGER makes close() wait until unsent data is delivered (SO_DONTLINGER returns immediately), and SO_REUSEADDR lets you issue a bind/listen immediately after the socket has been closed and reopened. None of these options affects recv().

If the CPU is busy it is doing work. The throughput will be directly related to the CPU speed.

What I don't understand are your measurements w.r.t. CPU time. The statement "I wrote a simple tcp client that does nothing but recv buffers" seems to suggest that recv() was called to get the data and that the CPU time (as seen, for example, in the Windows performance monitor) went to 100%. It shouldn't, since the line only delivers 1 GBit/sec. Are you certain that the code actually blocks (i.e. the socket was not set to non-blocking mode)?
Avatar of omers

ASKER

BigRat,

Thanx. Your analysis is exact. All the performance-analysis tools I have written are useless here, because it's basically the simplest kind of problem: CPU consumption.

Basically the application receives consecutive large buffers of ~200 KB each. The data is sent from the server in a single sendv(iovec*, ...) call.

The client works in blocking mode with large receive and (server-side) send buffer sizes, and TCP_NODELAY is on.
When I print the sizes returned by WSARecv, they are around 5K (I "ask" for 200K). No zero-size buffers are returned, so I guess blocking mode works? I loop the recv until I have accumulated the required buffer; no buffer copying on my part is involved, of course. I also tried waiting for an input event before calling recv; that didn't change anything.

I tried to tell the recv call to wait for all the data before it returns, but failed. There is a related flag (MSG_WAITALL? I'm not sure) that only works on Unix, not Win32.

Omer.
Avatar of omers

ASKER

All,

If the issue is solved I will be allocating another 500 points , this is important...
Thanx again, humans and vermin...

Omer.
>> vermin

Rodents would be a better term.

>>No buffers of size 0 are returned, so I guess the blocking mode works?

The integer status from recv() is positive when data was received, negative for errors and warnings, and zero when the connection has been closed. If blocking mode is off and recv() is called when there is no data, it returns -1 and the last error is "operation would block" (WSAEWOULDBLOCK).


If you do something like this:

    int status;
    do {
        status = recv(sock, buf, sizeof buf, 0);
        if (status > 0) {
            /* move data somewhere */
        }
    } while (status != 0);

you'll get 100% CPU if there is any reason why the recv() call would not block (e.g. the socket is in non-blocking mode, the connection has been closed, the socket is invalid, etc.), because the -1 error return is never handled and the loop spins.

I would be interested to hear what the Windows Performance monitor says.
Avatar of omers

ASKER

Thanx Bigrat.

I check for any non-positive return value from recv(), record it and handle it, so it's not a simple CPU spin loop.
Also, the fact that I get the same results calling ACE::recv_n(...), which does the loop by itself, hints that the problem lies somewhere more basic.

I have run this under VTune, with sampling, and it says that by far the busiest routine is exi386InterlockedExchangeUlong in ntoskrnl.exe. This is some system routine; any idea what it is?

I also see in the performance monitor that the system takes much more CPU than the process itself.

Regards, Omer.
>>exi386InterlockedExchangeUlong

aka ExInterlockedExchangeUlong
aka InterlockedExchange

Yes, this is a routine which atomically sets/clears a flag used to protect system tables against a second CPU. When a CPU enters kernel mode it can access its tables as it wants, but when you have a second CPU, that CPU could also be in kernel mode and accessing the same tables. So a location in memory is set/cleared with a special locked memory access that no other CPU can interleave with, and this is what exi386InterlockedExchangeUlong controls.

BUT

according to some reports you can get invalid results: the OS symbols must be installed, otherwise VTune attributes time to the wrong routines (see: http://discuss.microsoft.com/SCRIPTS/WA-MSD.EXE?A2=ind0106a&L=directxdev&D=1&P=6929) (and also http://groups.google.com/group/comp.os.ms-windows.programmer.nt.kernel-mode/browse_thread/thread/611a1b9b21a8913a?hl=en&lr=&safe=off&ic=1)

The recv() call will cross into and out of kernel mode roughly every 1500 bytes, and that must lock out the second CPU, which is probably idle anyway.

I'd make sure the symbols are correctly installed and that VTune gives correct results before taking out the second CPU.

Avatar of omers

ASKER

Thanx again.

I have downloaded and installed the retail symbols; VTune still hasn't changed, but I will reboot just in case.
According to the links you posted it could be the processor's idle loop, and I would bet on that because it seems to be the idle process that takes the time.
Although it is really strange how it can take >90% of CPU when I see in the performance monitor my process taking around 60% CPU (in dual-CPU mode).

I will reboot now and also do the test with a single CPU later. Bless Dell, CPUs can be disabled easily from the BIOS.

Will report, Omer.

PS: I should start giving you points for all this side-help (I feel like I'm in a private lesson). Is this still done using an empty question addressed at someone specific? Cheers again.

ASKER CERTIFIED SOLUTION
Avatar of BigRat
BigRat
Flag of France image

Avatar of omers

ASKER

Hi,
The VTune results I have are not of the TCP I/O alone; they include some thread handling that adds around 30% overhead over the clean TCP code. I will take VTune samples of the clean TCP and report, but my server is down now, so this will have to wait a day.

[As a side note: in our full application (around three times the CPU load of the TCP alone), which is complex image viewing with many threads, decompression happens while reading TCP, so given a strong enough server both CPUs are 100% busy; but still, this yields only about 1.4x the throughput of the same single-CPU machine.]

Cheers, Omer.
I think in the short term you'd be better off investing in faster CPUs.

In the longer run, Kernel mode seems the only way forward to get the sort of performance you're looking for. Have a look at :-

http://blogs.msdn.com/wndp/

and

http://www.pcausa.com/tdisamp/default.htm
Avatar of omers

ASKER

Bigrat

Back from some debugging etc for version release, and back to a performance session...

In the meantime, until it's solved, I have published a question with points for you for your help on this issue, here:
https://www.experts-exchange.com/questions/21783781/points-for-BigRat-re-TCP-CPU.html

I will continue to do the tests with vtune and will publish my results here hoping to resolve the issue.

Thanx, Omer.
Avatar of omers

ASKER

OK, after we got reproached for those extra points...

I have done the VTune test on the standalone TCP tester, and most of the work was in the system, in vsdatant.sys.

That's right, ZoneLabs' module, even though ZoneAlarm was not active and was only installed for a minute of debugging half a year ago...
I tried to make sure it was not active; that didn't help.
Uninstalled ZA, and the loop that took 3-4 seconds of pure CPU now takes a fraction of a second and is a non-factor.

BigRat, I have more performance issues (also this firewall penalty sounds severe...) , I will hopefully post a question sometime next week, I will post a remark here so people and rodents can follow-up.
Thanx again, Omer.



It obviously hooked itself into Winsock to monitor the packets, and it remains hooked even though inactive!

>>also this firewall penalty sounds severe

Same sort of thing. If the firewall just blocks, then OK. But if it sniffs packets, it's a real performance killer! The same can happen if you block IP addresses.