Time_Wait error on netstat output

Hello,

We are seeing transaction declines on our application server. Running netstat -ano shows many entries in the TIME_WAIT state. The server has Apache, Java middleware, Switch, and ActiveMQ components installed. Transaction volume is high, approximately 2.5 million per day.
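To put a number on "many entries", the netstat output can be tallied by connection state. The following is a minimal Python sketch and not part of the original post: `count_tcp_states` and `read_netstat` are hypothetical helper names, and the parser assumes the English state names that `netstat -ano` prints.

```python
import subprocess
from collections import Counter

# TCP state names as netstat prints them on Windows and Linux.
TCP_STATES = {
    "LISTENING", "LISTEN", "ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT",
    "SYN_SENT", "SYN_RECEIVED", "SYN_RECV", "FIN_WAIT_1", "FIN_WAIT_2",
    "FIN_WAIT1", "FIN_WAIT2", "LAST_ACK", "CLOSING", "CLOSED",
}


def count_tcp_states(netstat_text):
    """Count TCP connections per state in `netstat -ano` style output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Connection rows start with the protocol; headers and UDP rows are skipped.
        if not fields or not fields[0].upper().startswith("TCP"):
            continue
        for field in fields[1:]:
            if field.upper() in TCP_STATES:
                counts[field.upper()] += 1
                break
    return counts


def read_netstat():
    """Run `netstat -ano` and return its output as text (hypothetical helper)."""
    result = subprocess.run(
        ["netstat", "-ano"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `count_tcp_states(read_netstat())` on the server at intervals would show whether the TIME_WAIT count is stable or growing during the periods when transactions are declined.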

We would like to know which parameters should be checked, in the registry or in the application, to troubleshoot this issue.

Regards
Gurunath

David Favor

This is a common problem for older Kernels.

The Kernel 3.18 refactoring of TCP (roughly 30% of the entire networking code changed) greatly reduced this problem.

The Kernel 4.X (recent) TCP changes have fixed this problem so well, I haven't seen the problem occur... since somewhere in the 4.X Kernel series.

There are many tuning suggestions to escape this problem. They all bandage over the problem.

Example: I just checked a site running Kernel 4.8.0.30 with 50K-100K requests/minute, and it had a TIME_WAIT count of 54.
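For anyone wanting to reproduce this kind of check on a Linux host, here is a minimal Python sketch that reads the Kernel's TCP table from /proc/net/tcp and counts the entries in TIME_WAIT. The function names are hypothetical; the sketch relies on state code 06 in the `st` column meaning TIME_WAIT.

```python
TIME_WAIT_CODE = "06"  # state code Linux uses for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT rows in text read from /proc/net/tcp or /proc/net/tcp6."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT_CODE:
            count += 1
    return count


def count_time_wait_on_host(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum TIME_WAIT entries across the IPv4 and IPv6 tables of this host."""
    total = 0
    for path in paths:
        try:
            with open(path) as handle:
                total += count_time_wait(handle.read())
        except FileNotFoundError:
            pass  # table not present, for example when IPv6 is disabled
    return total
```

Run inside a container, this only sees the container's own network namespace, so the count may differ from the one seen at machine level.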

If this Kernel were pre-3.18, the LAMP Stack would lock up in a few seconds, because of TIME_WAIT amplification causing all new connections to be deferred + time out + die.

So... your first consideration is what Kernel you're running.

I suggest running Ubuntu Bionic + SNAP version of LXD at machine level. Then run your App inside an LXD container. Even if you have to install an old OS Distro in your LXD container, you'll still be using the Bionic 4.15.0.33 Kernel, as all containers share the machine Kernel.

You can use this trick to... in essence... upgrade an old Distro, to use newest Kernel code, without making any App changes.

If you're running Apps which require old OS Distros, this little trick can save you massive amounts of dev dollars for upgrading.

David Favor

BTW, you read the throughput I mentioned correctly. This is not a typo.

I run many sites running many millions of requests each hour.

These sites simply can't survive, even for a short time, if a stampeding herd of TIME_WAIT states begins to pile up + amplify into zero new connections.

syinfra

ASKER

Hi David,

We are using Windows 2012 Standard edition & not Linux.

We receive roughly 50 transactions per second, for a total of 2.5 million transactions per day.
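As a rough sanity check on these figures, the expected number of sockets sitting in TIME_WAIT can be estimated as the connection rate multiplied by how long each socket lingers in that state. The sketch below is only an illustration with hypothetical function names. It assumes every transaction opens a new TCP connection that this server closes first, and the 120-second TIME_WAIT duration is a placeholder to be replaced with the value actually configured on the server.

```python
SECONDS_PER_DAY = 86_400


def average_rate_per_second(transactions_per_day):
    """Average transaction rate implied by a daily total."""
    return transactions_per_day / SECONDS_PER_DAY


def estimated_time_wait_sockets(connections_per_second, time_wait_seconds):
    """Steady-state estimate: arrival rate times time spent in TIME_WAIT."""
    return connections_per_second * time_wait_seconds


# 2.5 million per day averages out to about 29 per second,
# so the quoted 50 per second would be a busy-period figure.
daily_average = average_rate_per_second(2_500_000)

# Assumed TIME_WAIT duration of 120 seconds (placeholder, check the server).
busy_estimate = estimated_time_wait_sockets(50, 120)
```

Under those assumptions, a busy period would keep a few thousand sockets in TIME_WAIT at any moment. Comparing that estimate with the actual netstat count would help show whether the observed number is unusual for this load.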

Thx
Guru

David Favor

Windows 2012 sounds old.

How the TCP states are managed + reaped over time is a function of the Kernel, so in your case the Windows 2012 Kernel code.

As an experiment, try installing the latest version of Windows + your App on a test machine.

My guess is you'll find this problem has been fixed in more recent versions of Windows.

syinfra

ASKER

Hi David,

We are using Windows 2012 R2 Standard Edition, which is the latest.

Regards
Gurunath

David Favor

Then you're pretty much stuck with no solution.

Many years ago (5-10 or so), I used to run software which ran through all the internal TCP tables every 1 second + tore down any TIME_WAIT status connections by brute force.

I have no clue how you'd do this on Windows, but TCP... is TCP anywhere... so the only way to... accelerate your TCP Stack's reaping of TIME_WAIT state connections is to... hijack the process... so you'd write code to walk your internal Kernel TCP connection tables + reap/teardown/destroy all TIME_WAIT state connections before TCP does this for you.

The easy way to fix this, which I'm sure you won't like hearing, is to switch to running a LAMP server, as all the software you mention is normal packaged software for most Linux Distros.

If you start down the migration Rabbit Hole, start with a useful Distro like Ubuntu Bionic, which has a very recent Kernel + 5 years of updates.

syinfra

ASKER

Hi David,

Thank you very much for the update. How can we resolve the issue then? Installing Linux on a production environment that has been running for the past 2 years is difficult. Could a load balancer technique help? I have also seen 2 event IDs repeated over the past 1-2 years, listed below. Please let me know if anything in them can help.

1) The Open Procedure for service BITS in DLL C:\Windows\System32\bitsperf.dll failed. Performance data for this service will not be available
Source Perflib - Event ID 1008 from 26-Dec-2016 same error on DC from 28-Dec-18

Unable to read performance data for the server service.
Source PerfNet - Event ID 2005 from 23-Aug-2017

2) Server Event Logs - System

A fatal alert was generated and sent to the remote endpoint. This may result in termination of the connection. The TLS protocol defined fatal error code is 40. The Windows SChannel error state is 1205.

Event ID 36888

Thanx
Guru
