syinfra

Time_Wait error on netstat output

Hello,

We are seeing transactions declined on our application server. Running netstat -ano shows many entries in the "Time_Wait" state. This server has Apache, Java Middleware, Switch & ActiveMQ components installed on it. Transaction volume is high, approximately 2.5 million per day.

We would like to know which parameters should be checked in the registry or in the application to troubleshoot this issue.

Regards
Gurunath
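
Before tuning anything, it may help to put a number on "many entries". The following is a minimal sketch, not a diagnostic tool: it counts TCP rows per state in text shaped like Windows `netstat -ano` output. The function name `count_tcp_states` and the sample addresses and PIDs are made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_text):
    """Count TCP rows per connection state in `netstat -ano` style text."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have five columns: Proto, Local, Foreign, State, PID.
        # Header lines and UDP rows (which have no State column) are skipped.
        if len(fields) == 5 and fields[0] == "TCP":
            states[fields[3]] += 1
    return states


# Hypothetical sample; real output comes from running the command itself.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING       4
  TCP    10.0.0.5:49712         10.0.0.9:61616         TIME_WAIT       0
  TCP    10.0.0.5:49713         10.0.0.9:61616         TIME_WAIT       0
  TCP    10.0.0.5:49714         10.0.0.9:61616         ESTABLISHED     2312
  UDP    0.0.0.0:123            *:*                                    1044
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

On the server itself, the text could come from `subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout`. Sampling the counts every few seconds shows whether the TIME_WAIT number is stable or growing.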
David Favor

This is a common problem for older Kernels.

The Kernel 3.18 refactoring of TCP (roughly 30% of the entire networking code changed) greatly reduced this problem.

The Kernel 4.X (recent) TCP changes have fixed this problem so well, I haven't seen the problem occur... since somewhere in the 4.X Kernel series.

There are many tuning suggestions to escape this problem. They all bandage over the problem.

Example: I just checked a site running Kernel 4.8.0.30 with 50K-100K requests/minute, and it had a TIME_WAIT count of 54.

If this Kernel were pre-3.18, the LAMP Stack would lock up in a few seconds, because TIME_WAIT amplification would cause all new connections to be deferred, then time out + die.

So... your first consideration is what Kernel you're running.

I suggest running Ubuntu Bionic + SNAP version of LXD at machine level. Then run your App inside an LXD container. Even if you have to install an old OS Distro in your LXD container, you'll still be using the Bionic 4.15.0.33 Kernel, as all containers share the machine Kernel.

You can use this trick to... in essence... upgrade an old Distro, to use newest Kernel code, without making any App changes.

If you're running Apps which require old OS Distros, this little trick can save you massive amounts of dev dollars for upgrading.
BTW, you read the throughput I mentioned correctly. This is not a typo.

I run many sites running many millions of requests each hour.

These sites simply can't survive, even for a short time, if a stampeding herd of TIME_WAIT states begins to pile up + amplify into zero new connections.
syinfra

ASKER

Hi David,

We are using Windows 2012 Standard edition & not Linux.

Roughly 50 transactions per second are coming in, for a total of 2.5 million transactions per day.

Thx
Guru
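
The figures above allow a back-of-the-envelope estimate of how many sockets should sit in TIME_WAIT at steady state: roughly the close rate multiplied by how long each socket stays in that state. This is a rough sketch under stated assumptions. It assumes one TCP connection per transaction, a 120-second TIME_WAIT period, and a dynamic port range of 16384 ports. None of these values come from the thread, and each should be checked on the actual server.

```python
def time_wait_estimate(closes_per_second, time_wait_seconds):
    """Steady-state TIME_WAIT count: arrival rate times holding time."""
    return closes_per_second * time_wait_seconds


TRANSACTIONS_PER_DAY = 2_500_000   # from the post
PEAK_PER_SECOND = 50               # from the post
SECONDS_PER_DAY = 86_400

# Assumptions, NOT taken from the thread; verify on the real machine.
ASSUMED_TIME_WAIT_SECONDS = 120    # how long a socket stays in TIME_WAIT
ASSUMED_DYNAMIC_PORTS = 16_384     # size of the dynamic (ephemeral) port range

average_per_second = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY

if __name__ == "__main__":
    avg = time_wait_estimate(average_per_second, ASSUMED_TIME_WAIT_SECONDS)
    peak = time_wait_estimate(PEAK_PER_SECOND, ASSUMED_TIME_WAIT_SECONDS)
    print(f"average rate: {average_per_second:.1f} connections/s")
    print(f"TIME_WAIT at average rate: {avg:.0f}")
    print(f"TIME_WAIT at peak rate: {peak:.0f}")
    print(f"share of assumed port range at peak: {peak / ASSUMED_DYNAMIC_PORTS:.0%}")
```

Note that 2.5 million per day averages out to about 29 per second, so the 50 per second figure is presumably a peak. If the netstat count is close to this estimate, the TIME_WAIT entries may simply reflect normal connection churn rather than being the cause of the declines. A count far above it would point at something else, such as each transaction opening several connections.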
Windows 2012 sounds old.

How the TCP states are managed + reaped over time is a function of the Kernel, so in your case the Windows 2012 Kernel code.

As an experiment, try installing the latest version of Windows + your App on a test machine.

My guess is you'll find this problem has been fixed in more recent versions of Windows.
syinfra

ASKER

Hi David,

We are using Windows 2012 R2 Standard Edition, which is the latest.

Regards
Gurunath
Then you're pretty much stuck with no solution.

Many years ago (5-10 or so), I used to run software which ran through all the internal TCP tables every 1 second + tore down any TIME_WAIT status connections by brute force.

I have no clue how you'd do this on Windows + TCP... is TCP anywhere... so the only way to... accelerate your TCP Stack implementation reaping TIME_WAIT state connections is to... hijack the process... so you'll write code to walk your internal TCP connection Kernel tables + reap/teardown/destroy all TIME_WAIT state connections before TCP does this for you.

The easy way to fix this, which I'm sure you won't like hearing, is to switch to running a LAMP server, as all the code you mention running is normal packaged software for most Linux Distros.

If you start down the migration Rabbit Hole, start with a useful Distro like Ubuntu Bionic, which has a very recent Kernel + 5 years of updates.
syinfra

ASKER

Hi David,

Thank you very much for the update. How can we resolve the issue then? Installing Linux on a production environment that has been running for the past 2 years is difficult. Can any Load Balancer technique help? I have seen 2 event IDs repeated over the past 1-2 years, listed below. Please let me know if anything in them can help.

1)  The Open Procedure for service BITS in DLL C:\Windows\System32\bitsperf.dll failed. Performance data for this service will not be available
Source Perflib - Event ID 1008 from 26-Dec-2016 same error on DC from 28-Dec-18

Unable to read performance data for the server service.
Source PerfNet - Event ID 2005 from 23-Aug-2017


2)  Server Event Logs - System

A fatal alert was generated and sent to the remote endpoint. This may result in termination of the connection. The TLS protocol defined fatal error code is 40. The Windows SChannel error state is 1205.

Event ID 36888

Thanx
Guru