Gigabit and parallelism

I am working on a performance issue involving backup software (HP Data Protector, mentioned for information only; it is not relevant to the question). During an analysis with the iperf tool I noticed these values:

Using 1 thread: 760Mbits/s (iperf -P 1)
Using 3 threads: about 900Mbits/s (iperf -P 3)
Using 4 threads and beyond: 946Mbits/s (iperf -P 4)

The question is: Why can't I reach the full gig speed by using just one thread (one connection)?

Using a fast ethernet card, I can achieve 94Mb/s with 1 thread. No problem at all. The problem seems to show just with a gig connection.
Renato Montenegro Rustici, IT Specialist, asked:

bbao, IT Consultant, commented:
Basically, you cannot actually reach the 1 Gbit/s limit, because that figure is the physical bandwidth of the link under ideal conditions.

Besides the application data itself, every packet carries extra bytes for the protocols that wrap it: TCP segments travel as the payload of IP packets, just as HTTP travels as the payload of TCP.

Also be aware of the unit here: Mbps means megabits per second, a measure of the bit stream, not bytes per second or KB/s, which are what application payload benchmarks usually report.
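
To put a number on that protocol overhead, here is a rough Python sketch. It assumes standard Ethernet framing, IPv4 and TCP headers without options, and the default 1500-byte MTU. The function name and constants are only illustrative and are not part of iperf.

```python
# Per-frame byte counts for TCP over IPv4 over Ethernet (standard sizes).
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
ETH_FCS = 4           # frame check sequence
INTERFRAME_GAP = 12   # mandatory idle time between frames, counted in bytes
IP_HEADER = 20        # IPv4 header without options
TCP_HEADER = 20       # TCP header without options


def tcp_goodput_mbps(link_mbps, mtu=1500, tcp_option_bytes=0):
    """Best-case TCP payload rate for a given link rate and MTU."""
    payload = mtu - IP_HEADER - TCP_HEADER - tcp_option_bytes
    wire_bytes = mtu + PREAMBLE_SFD + ETH_HEADER + ETH_FCS + INTERFRAME_GAP
    return link_mbps * payload / wire_bytes


if __name__ == "__main__":
    for link in (100, 1000):
        best = tcp_goodput_mbps(link)
        print(f"{link} Mbit/s link: at most {best:.1f} Mbit/s of TCP payload")
    # Same figure in bytes, since Mbit/s and MByte/s are easy to mix up.
    print(f"1000 Mbit/s link: about {tcp_goodput_mbps(1000) / 8:.1f} MByte/s")
```

Under these assumptions the ceiling is roughly 95% of the nominal link rate, which is in the same range as the 94Mb/s and 946Mbits/s figures quoted in the question.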
Renato Montenegro Rustici, IT Specialist (author), commented:
Actually I can get near 1Gbit/s when I start 4 simultaneous threads (iperf -c -t 60 -P 4). The network interface utilization in Windows shows 99%. I can get 960Mbits/s. I think those remaining 40Mbits/s are related to some overhead, that's ok.

What I can't do is reach anything beyond 760Mbits/s when using just one thread (iperf -c -t 60 -P 1). In that case, network interface utilization in Windows shows about 70%. I was wondering why I can't go beyond that. Maybe it's a limitation in iperf. That's what I want to discuss with you guys: why can't a single data stream get to 1 gig when four parallel streams can?

When using a fast ethernet card, at 100Mbits/s, I can get to 94.6Mbits/s (almost full bandwidth) using a single stream of data.

You may try increasing the TCP window size (iperf has an option for that) and/or increasing the interface MTU (an operating system setting; all hosts in the same segment should use the same MTU).

noci, Software Engineer, commented:
10 times the transfer rate also means 10 times the OS calls, and so 10 times the extra overhead.
So the CPU and system-call overhead can sustain 760Mbps when the work is done in a linear fashion (one call after another).
That is clearly more than 100Mbps, which is why you can saturate a 100Mbps connection.
Adding more threads helps the front end of the processing, but the limit will be the overhead on the network adapter.

You may reach 1Gbps with one thread if you use jumbo frames (frames of 8K-9K bytes, depending on hardware).
That also presumes that the switch and the other system can handle jumbo frames, and that the switch can handle this bandwidth.
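
A quick way to see why larger frames reduce per-packet work is to count how many frames per second a saturated link carries. This Python sketch assumes standard Ethernet framing overhead (38 bytes per frame, including preamble and inter-frame gap) and a 9000-byte jumbo MTU. The 9000 value is a common choice, not something measured in this thread.

```python
# Ethernet bytes around each IP packet: preamble 8 + header 14 + FCS 4 + gap 12.
ETH_OVERHEAD = 38


def frames_per_second(link_bps, mtu):
    """Full-size frames per second needed to keep the link saturated."""
    return link_bps / ((mtu + ETH_OVERHEAD) * 8)


if __name__ == "__main__":
    cases = [
        ("100 Mbit/s, MTU 1500", 100e6, 1500),
        ("1 Gbit/s, MTU 1500", 1e9, 1500),
        ("1 Gbit/s, MTU 9000", 1e9, 9000),
    ]
    for label, bps, mtu in cases:
        print(f"{label}: about {frames_per_second(bps, mtu):,.0f} frames/s")
```

With the default MTU, gigabit needs ten times as many frames per second as fast ethernet, which matches the "10 times the OS calls" argument. With jumbo frames the packet count drops to roughly a sixth.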
Steve Jennings, IT Manager, commented:
noci is on to something: it's likely the CPU that is "limiting" your output speed. Gigabit cards require a lot of CPU on a standard machine.

Good luck,
noci, Software Engineer, commented:
If you have a multicore CPU, having several processes (threads) helps pump out more data.
Renato Montenegro Rustici, IT Specialist (author), commented:
This is the hardware I am using in the test (2 identical servers):

Dell PowerEdge R610
2 x Intel Xeon E5630 2.53GHz Quad Core
2 x 136GB SAS (RAID 1)
2 x Broadcom BCM5709C NetXtreme II GigE (Dual Port)
Windows 2008 R2 (fully updated)

The network interfaces are connected with a crossover cable (no switch).

When I issue the iperf command, the CPU time (on all cores) barely moves, so I don't think the CPU is an issue. I think the bus speed is quite good, since this is some of the best hardware from Dell.

I tried increasing the frame size on the network interface. There was no improvement. When I set the largest frame size, I noticed errors and the speed dropped. It's now back at 1500 bytes, the default. I also tried the iperf -M option, which sets the TCP maximum segment size. There was no difference: 760Mbits/s with 1 data stream, 940Mbits/s with 4 data streams.

Any ideas? Or even other tools?
Steve Jennings, IT Manager, commented:
You are correct, it is not the CPU. I mentioned that without knowing the type of machine. What happens when you run it through a switch?
Have you tried the iperf "-w" option? And possibly "-N"?

noci, Software Engineer, commented:
One process does a synchronous write:

- write(xxx)
  (inside the write() system call)
     - copy to system buffers
     - queue to driver
     - start driver
     - wait for driver

     - driver - create task on card
     - start transfer
     - wait for end of xfer

     - get xfer status
     - post to process
     - resume process

So you can see that, although the process in question appears BUSY (it is mostly waiting), it will not start another write until the first one has completed. The result is no high CPU load, but the one task spends its time waiting.
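
The blocking-write sequence above can be turned into a toy model. This Python sketch is a deliberate simplification: the chunk size and the busy and wait times are made-up illustrative numbers, not measurements from these servers, and the function names are hypothetical.

```python
def single_stream_mbps(chunk_bytes, busy_s, wait_s):
    """Rate of one writer that sends a chunk, then idles until it completes."""
    return chunk_bytes * 8 / (busy_s + wait_s) / 1e6


def aggregate_mbps(streams, per_stream_mbps, link_limit_mbps):
    """Independent writers overlap their waits until the link itself is full."""
    return min(link_limit_mbps, streams * per_stream_mbps)


if __name__ == "__main__":
    # Illustrative assumption: 8 KiB per write, 70 us of work, 16 us of waiting.
    one = single_stream_mbps(8192, busy_s=70e-6, wait_s=16e-6)
    print(f"one blocking writer: {one:.0f} Mbit/s")
    for n in (1, 2, 4):
        total = aggregate_mbps(n, one, link_limit_mbps=946)
        print(f"{n} writer(s): {total:.0f} Mbit/s")
```

The model only shows the shape of the effect: a single writer is capped by its own wait time, while several writers fill each other's gaps. It is cruder than reality, since the measured 3-thread run reached only about 900Mbits/s.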

With multiple cores, some of these processes can overlap, which helps push even more data.

You will also see that on architectures without DMA the CPU is busier (pushing data to the adapters) than on systems with DMA.
(Non-DMA architecture = IDE disks in PIO mode, for example.)
noci, Software Engineer, commented:
Jumbo frames are not just a large MTU; they also need to be enabled and supported on the switches. If you just declare a large MTU, you will get either no communication at all (if "don't fragment" is set) or heavy fragmentation otherwise.

Packet fragmentation produces a large overhead on the systems.
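
As a rough illustration of that fragmentation cost, this Python sketch counts how many IPv4 fragments one oversized packet turns into. It assumes a 20-byte IPv4 header without options. The 9000 and 1500 byte sizes are example values, and the function name is hypothetical.

```python
import math

IP_HEADER = 20  # IPv4 header without options


def fragment_count(packet_bytes, path_mtu):
    """Number of IPv4 fragments needed to carry one packet over a smaller MTU."""
    payload = packet_bytes - IP_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (path_mtu - IP_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


if __name__ == "__main__":
    print(fragment_count(9000, 1500))  # a jumbo-sized packet over a 1500-byte path
    print(fragment_count(1500, 1500))  # a packet that already fits
```

Each fragment carries its own IP header and has to be reassembled on the receiving side, which is where the extra load comes from.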
Renato Montenegro Rustici, IT Specialist (author), commented:
I will answer by the end of the day.
Renato Montenegro Rustici, IT Specialist (author), commented:
I managed to achieve the full bandwidth with only one thread by increasing the TCP window size to at least 64MB:

iperf -c <server ip address> -t 60 -w 64000
Renato Montenegro Rustici, IT Specialist (author), commented:
Just a correction: 64KB, not MB.
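
The reason the window size matters is that a single TCP connection can have at most one window of unacknowledged data in flight per round trip, so its rate is bounded by window / RTT. The Python sketch below illustrates that bound. The 0.5 ms round-trip time is purely an assumption for illustration, since the thread never states the RTT, and the function names are hypothetical.

```python
def max_throughput_mbps(window_bytes, rtt_s):
    """Upper bound on one TCP stream: one full window per round trip."""
    return window_bytes * 8 / rtt_s / 1e6


def window_needed_bytes(target_mbps, rtt_s):
    """Bandwidth-delay product: window required to sustain the target rate."""
    return target_mbps * 1e6 * rtt_s / 8


if __name__ == "__main__":
    rtt = 0.0005  # assumed 0.5 ms effective round trip, not a measured value
    for window in (16000, 32000, 64000):
        bound = max_throughput_mbps(window, rtt)
        print(f"window {window} bytes: at most {bound:.0f} Mbit/s")
    needed = window_needed_bytes(946, rtt)
    print(f"window needed for 946 Mbit/s: {needed:.0f} bytes")
```

This also fits the earlier observation that several parallel streams fill the link: each connection has its own window, so the limits add up until the wire itself is the bottleneck.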

Question has a verified solution.
