Avatar of websss
websss (Kenya)

asked on

Tools to diagnose performance issues (TCP connections)

I have IoT devices reporting to a bunch of Windows 2016 servers.
Different sensors report to different servers.

Our highest-throughput server handles about 70,000 sensor data records per minute.
One of our other servers is doing about 16,000, yet it grinds to a halt and stops replying to TCP packets (I think).
The number of established connections shown by netstat skyrockets to hundreds of thousands, when in fact there are only around 7,000 devices.
The server becomes very slow; even dragging a window is difficult.
As soon as I stop the TCP importer process the server responds fine again, so we think we might not be acknowledging the TCP packets, or some devices are misbehaving.

We are trying to figure out where it's going wrong, and were wondering whether there are any tools to see what is happening on the network.
I've used Wireshark and frankly I'm overwhelmed.

I would like to see the number of requests/responses per second for each port as a summary, as this may help us diagnose where it's going wrong.

Can anyone suggest some tools / things to try to help narrow in on the issue?
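
For illustration, here is a minimal sketch of that kind of per-port summary, built on top of `netstat -ano` output on Windows. The column layout is assumed from the stock netstat output, so treat it as a starting point rather than a finished tool:

```python
# Summarize TCP connections per local port and state.
# Sketch only: assumes the standard "netstat -ano" column layout on Windows.
import subprocess
from collections import Counter

def port_summary():
    out = subprocess.run(["netstat", "-ano", "-p", "TCP"],
                         capture_output=True, text=True).stdout
    counts = Counter()
    for line in out.splitlines():
        parts = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State, PID
        if len(parts) == 5 and parts[0] == "TCP":
            local_port = parts[1].rsplit(":", 1)[-1]
            state = parts[3]
            counts[(local_port, state)] += 1
    return counts

if __name__ == "__main__":
    for (port, state), n in sorted(port_summary().items(),
                                   key=lambda kv: -kv[1])[:20]:
        print(f"port {port:>6}  {state:<12} {n}")
```

Sampling this in a loop and diffing successive counts gives an approximate per-second rate per port. TCPView and PowerShell's Get-NetTCPConnection offer similar views interactively.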
ASKER CERTIFIED SOLUTION
Avatar of Dr. Klahn
Dr. Klahn

(Solution text available to Experts Exchange members only.)
To add to Dr. Klahn's insight, please provide the spec of the server (physical/VM, CPU, RAM, ...); the bottleneck might be in the processing of the data once received, i.e. the backend DB.

What type of transactions do the IoT devices report: TCP or UDP packets, and what range of data?

The insertion of the data into wherever it is stored could begin the domino effect of runaway load.

(Think of restroom utilization at an event: fixed resources, a burst of demand, and the queue explodes.)

What is the difference between the systems to which the devices report, such that 16k overwhelms one of them?

Resource exhaustion?
Please paint a picture:
IoT device event notifications go to which TCP/UDP port?
What happens with the data?
Is the DB server/service local to the system, or is it a separate server?
How is the performance on the SQL/backend server, if separate? What is expected to happen on that server?
Is this server also queried by a reporting process that displays data for a NOC or other observers, or is the data for a retro-lookback function?

The data inserts and display refreshes could slow down the backend DB and cascade if the IoT devices use TCP connections that take longer to service: unloaded, it is "here is my data, thanks" in hundredths of a second; once loaded, it is "here is my data, a few seconds of delay, thanks".

What is the reporting frequency for each IoT device?
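
To make the cascade concrete, a back-of-the-envelope sketch (the per-record timings here are hypothetical, not measured on the asker's systems):

```python
# Hypothetical back-of-the-envelope: how little per-record time is available
# before a serial importer falls behind (timings are illustrative only).
records_per_min = 16_000
budget_ms = 60_000 / records_per_min          # ~3.75 ms per record
for per_record_ms in (1, 3.75, 5, 10):
    work_s = records_per_min * per_record_ms / 1000
    print(f"{per_record_ms:>5} ms/record -> {work_s:6.1f} s of work "
          f"per 60 s of data ({'keeps up' if work_s <= 60 else 'falls behind'})")
```

Once the importer falls behind, devices time out and reconnect; that is one way the established-connection count can balloon far past the actual device count.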
Avatar of websss

ASKER

There are only 64,000 TCP port numbers maximum in the non-static range and the Server default is to use 1000 - 60000 which is only 59,000 ports.
While you are correct about the TCP ports, I think you are assuming that each device connects on its own port.
This is incorrect;
for example, 50,000 devices could all connect on TCP port 9008.

TCPView looks good though.
Avatar of Dr. Klahn
Dr. Klahn

for example, 50,000 devices could all connect on TCP port 9008

This is correct as far as it goes.  50,000 devices can all connect on a single TCP port.

But each accepted connection is a separate socket that the stack must track, distinguished from the others by the remote IP and port.  So 50,000 devices connecting on port 9008 does result in 50,000 connections the server has to service.
One can try to attribute the issue to the quantity of connections.
If that is the concern, you can set up monitoring that will consume one more connection, but will tell you whether, at your checking interval, there is ever a time when the TCP stack reaches a limit and denies the connection.
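
As an illustration of that kind of monitor (a sketch; the host, port and interval are placeholders), a periodic TCP connect attempt that logs refusals or timeouts will show whether and when the listener stops accepting:

```python
# Periodically probe the importer's listening port and log failures.
# Sketch: host, port and interval are hypothetical placeholders.
import socket, time, datetime

HOST, PORT, INTERVAL_S = "127.0.0.1", 9008, 5

while True:
    t0 = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=3):
            ms = (time.monotonic() - t0) * 1000
            status = f"ok ({ms:.1f} ms)"
    except OSError as e:  # refused, timed out, unreachable, ...
        status = f"FAILED: {e}"
    print(datetime.datetime.now().isoformat(timespec="seconds"), status)
    time.sleep(INTERVAL_S)
```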

I think the issue is not the quantity of connections, but the length of time for which the connections remain active.
Without knowing what the data exchange is, TCP versus UDP...

The goal is to determine whether the processing is optimal or can be improved.
Note the difference between the two servers: roughly a third of the connection volume grinds the second server to a halt.

The main point: you likely cannot limit the quantity of devices, but you might be able to improve the processing side.

If processing is optimal and scaling up is the only remedy, you would need a hardware load balancer that terminates the connections while routing them to multiple backend servers.
If going down this path, make sure to spec the load balancer above current demand.
This is an application-level issue, most likely not or only loosely related to the network.
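
To illustrate the terminate-and-forward idea (a sketch only; the listen port and backend addresses are hypothetical, and a real deployment would use dedicated load-balancing gear), a minimal round-robin TCP proxy looks like this:

```python
# Minimal round-robin TCP proxy: terminates each device connection and
# forwards the byte stream to one of several backends. Sketch only;
# the listen port and backend addresses are hypothetical.
import asyncio
import itertools

BACKENDS = itertools.cycle([("10.0.0.11", 9008), ("10.0.0.12", 9008)])

async def pump(reader, writer):
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_r, client_w):
    host, port = next(BACKENDS)
    backend_r, backend_w = await asyncio.open_connection(host, port)
    # Shuttle bytes in both directions until either side closes.
    await asyncio.gather(pump(client_r, backend_w),
                         pump(backend_r, client_w),
                         return_exceptions=True)

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 9008)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```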
Avatar of websss

ASKER

Thanks all
There was a lot of good information in the above

Here is where this led me

Multiple microservices, apps and the DB showed different symptoms and errors; however, the root problem seems to be network saturation.
I have a simple ping -t running and I can see <1 ms when things are fine; as soon as there is a hiccup, these response times skyrocket.

We are now re-architecting to do bulk inserts and trying to reduce network saturation.

However, one question was raised: is network saturation related to going "external",
or will we still get saturation from IO-bound operations if most things lived on the local box?
I have the option of moving a bunch of lookup APIs locally, so I am not sure if this will help?
It is hard to answer without knowing the volume of the incoming data stream,
the available bandwidth,
the resulting processing,
etc.
Much depends on the types and methods involved. You could use a load balancer to distribute the incoming flow while the data on the backend is replicated/merged, etc.

To comment more intelligently, the scope and range of everything involved has to be known/analyzed.
TIME_WAIT and similar statuses will harm client-side connections.

If you are using lots of microservices, this is to be expected, but they can be made to work efficiently.

In most cases, the following should help:

Start by leaving connections open as much as possible (some services might be more efficient working sequentially; others will use connection pools), or use UDP queries, or TCP connections with SO_REUSEADDR wherever the former cannot easily be achieved. (See the sketch after this comment.)

If you want to minimise code rewrites and act quickly, run the web services on all machines and use localhost by default when calling microservices, with a possible failover to neighbour hosts. If possible, which it fairly often is, better to use Unix sockets.

haproxy might prove helpful for handling the failover. It is also capable of closing connections using RST packets, which helps in cases where a client tends to disconnect uncleanly.
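
As a sketch of the keep-connections-open technique (the endpoint and the newline-delimited request/reply protocol here are hypothetical), a small client-side pool reuses sockets instead of paying a TCP connect/disconnect per request:

```python
# Tiny client-side connection pool: reuse open sockets to a service instead
# of reconnecting per request. Sketch only; the target host/port and the
# newline-delimited protocol are hypothetical stand-ins.
import socket, queue

class Pool:
    def __init__(self, host, port, size=4):
        self.addr, self.free = (host, port), queue.Queue()
        for _ in range(size):
            self.free.put(socket.create_connection((host, port)))

    def request(self, payload: bytes) -> bytes:
        s = self.free.get()
        try:
            s.sendall(payload + b"\n")
            reply = s.recv(65536)        # assumes one recv per reply
        except OSError:                   # dead socket: replace and retry once
            s.close()
            s = socket.create_connection(self.addr)
            s.sendall(payload + b"\n")
            reply = s.recv(65536)
        self.free.put(s)
        return reply
```

This is the same effect that HTTP keep-alive and database connection pools provide out of the box.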
Avatar of websss

ASKER

Thanks guys

It certainly was network saturation; the data coming in is way out of my control,
but the data going out (saving to the DB, websocket pushes to the web server, etc.) was happening one record at a time.
I implemented batching in these processes at different intervals (200 ms, 30 s, etc.) for different processes and it made a massive difference.
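
For readers who want the shape of that change, here is a sketch of the usual size-or-interval flush pattern (the flush callback, e.g. a multi-row INSERT, is a hypothetical stand-in; this is not the asker's actual code):

```python
# Size-or-interval batcher: collect records and flush them in bulk either
# when the batch is full or when the interval elapses. Sketch only; the
# flush callback is a hypothetical stand-in for e.g. a multi-row INSERT.
import threading, time

class Batcher:
    def __init__(self, flush, max_items=500, interval_s=0.2):
        self.flush, self.max_items, self.interval_s = flush, max_items, interval_s
        self.buf, self.lock = [], threading.Lock()
        threading.Thread(target=self._timer, daemon=True).start()

    def add(self, record):
        with self.lock:
            self.buf.append(record)
            full = len(self.buf) >= self.max_items
        if full:
            self._drain()

    def _timer(self):
        while True:
            time.sleep(self.interval_s)
            self._drain()

    def _drain(self):
        with self.lock:
            batch, self.buf = self.buf, []
        if batch:
            self.flush(batch)  # one bulk write instead of N round trips

batcher = Batcher(flush=lambda batch: print(f"flushing {len(batch)} records"))
```

Flushing on whichever trigger fires first bounds both the per-record overhead and the worst-case latency.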
Yes. Note that the number of TCP connects/disconnects between microservices is the main issue here, rather than the throughput.
See my post above for techniques to minimise this while committing to upstream services before acknowledging downstream.
Batch commits minimise the number of TCP connects/disconnects, but batches need to be committed to disk unless you can afford losing a batch's worth of data on power failure and other crashes.