Solved

RHEL: Tomcat is getting into a hung state.

Posted on 2006-07-17
Last Modified: 2008-05-03
Hi,
I am running Tomcat 5.0.18 (using JDK 1.4.2_07) on RHEL.
-bash-3.00$ uname -a
Linux localhost.com 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 i686 i386 GNU/Linux

After doing certain load runs, Tomcat gets into a hung state. netstat shows that new requests are not reaching the ESTABLISHED state.
If I send a new wget localhost:port (from the same host where Tomcat is running), then in netstat I see the entry sit in the SYN_RECV state for quite some time. After 2-3 minutes, the wget command gives up saying there is no answer from the server.

From netstat -s, I noticed overflow of the socket queue:
    7 packets pruned from receive queue because of socket buffer overrun
    421277 times the listen queue of a socket overflowed
    421277 SYNs to LISTEN sockets ignored

Should I fine-tune something in Linux?

Question by:bhaskarna
 
Expert Comment by:pjedmond
>After doing certain load runs
>From netstat -s, I noticed overflow of the socket queue

The whole point of load runs is to find where the performance limitations and bottlenecks are. You want these to be at a point where the system 'fails gracefully', and you need to decide what is acceptable. In particular, load runs may also highlight vulnerability to flood or DoS attacks.

Tuning:

You can increase the transmit queue length of your Ethernet interface with:

/sbin/ifconfig eth0 txqueuelen 4000

You can also improve kernel network performance by tweaking sysctl settings:

http://datatag.web.cern.ch/datatag/howto/tcp.html
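
For example, the kernel settings most directly related to the listen-queue overflows you quoted are the backlog limits. A minimal sketch (the values shown are illustrative only, not recommendations for your workload):

# Inspect the current backlog limits
/sbin/sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.core.netdev_max_backlog

# Raise them temporarily for a test run (illustrative values only)
/sbin/sysctl -w net.core.somaxconn=1024
/sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=2048
/sbin/sysctl -w net.core.netdev_max_backlog=2500

# To make them permanent, add the same keys to /etc/sysctl.conf and run sysctl -p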

Unfortunately, neither of these is really the solution in this case; all they do is increase the system's ability to cope with peak loads for a slightly longer period of time (whilst the 'buffers/queues' fill up).

You need to have a look at your overall application setup, and perhaps have your application monitor load in some way and reject new connections or fail gracefully once a load threshold is exceeded.

(   (()
(`-' _\
 ''  ''

Expert Comment by:pjedmond
..or buy a faster processor and more memory, until the Ethernet input connection becomes the bottleneck. (You always want the input to be the bottleneck of a process when you are designing it - that way the 'system' has a little bit of leeway to cope with anything 'unusual', which helps guarantee that the 'system', or in this case your application, isn't forced to deal with anything unexpected.)

(   (()
(`-' _\
 ''  ''

Author Comment by:bhaskarna
Even after stopping all the load clients, why does the port remain in a frozen state? Also, I don't see too many active connections.

Expert Comment by:pjedmond
>Even after stopping all the load clients, why does the port remain in a frozen state? Also, I don't see too many active connections.

Correct - you've just managed a successful DoS attack on your own system. Look at it from this point of view:

As root, you have permission to turn off an Ethernet connection. Once you've done that, it can't be used until it is reset.

Similarly, your Tomcat has permission to create sockets, exceed the number of sockets available and fill up the queue - so it does, and then it won't work until you reset it. Ideally, the system wouldn't let you do this in the first place, but because socket creation is such a key part of so many processes on a Linux system, the code is designed to be as fast and efficient as possible. As a result, some checks and limitations that might otherwise be there are not; instead, the user is expected to make the necessary checks.

It may even be that an error is being produced somewhere, and your system is ignoring it and carrying on to do something it shouldn't, thus freezing the setup. It is probably worth checking your error logs.
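
(As a rough starting point, these are the logs worth checking first; the paths assume a standard RHEL and Tomcat layout and may differ on your install:)

# Kernel and system messages around the time of the hang
dmesg | tail -50
tail -200 /var/log/messages

# Tomcat's own output (assuming a default CATALINA_HOME layout)
tail -200 $CATALINA_HOME/logs/catalina.out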

(   (()
(`-' _\
 ''  ''

Author Comment by:bhaskarna
Please excuse me if I am asking very basic questions.

Deliberately, after hitting the hung situation, I left the Tomcat process alone for 10 hours (without killing it). Even after 10 hours, I was seeing the same problem (new requests were not getting into the ESTABLISHED state). They would stay in SYN_RECV for 2 minutes and then come out with no answer.

What does this mean? If Tomcat were holding a lot of socket connections, I should be able to see a huge list of ESTABLISHED connections in netstat, but I was not seeing many established connections. Does that mean the connections were released? If yes, then why am I not getting new connections? Could there be a leak? If yes, how do I find it?

Does RHEL block a particular port after it has been bombarded? Do we need to fine-tune message_burst, message_cost, etc.?

Expert Comment by:pjedmond
Imagine a stick. If you bend it a little bit, it will spring back (this is equivalent to using the socket queue). If you bend it a lot, the stick snaps and you need to get a new one (this is equivalent to filling the socket queue and then trying to keep going... which is what your load testing is doing!).

>Deliberately, after hitting the hung situation, I left the Tomcat process alone for 10 hours (without killing it). Even after 10 hours, I was seeing the same problem.
You've effectively 'broken' the system, so it won't recover without a restart.

>If Tomcat were holding a lot of socket connections, I should be able to see a huge list of ESTABLISHED connections in netstat, but I was not seeing many established connections. Does that mean the connections were released?
All connections in the queue were probably released, but the pointer for adding new connections is now pointing in the wrong place (possibly well beyond where the queue should be), and as a result new connections cannot be placed in the queue.

>Does RHEL block a particular port after it has been bombarded?
No - not unless you have a firewall rule of some sort to do this.

>Do we need to fine-tune message_burst, message_cost, etc.?
Without studying your configuration, there is no way to say. The key point is that your application is exceeding the capabilities of the setup that you have. Ideally, RHEL should prevent you from doing this, but in this case it obviously does not - probably due to optimisations that improve performance. You need to change your Tomcat code to prevent it exceeding these limitations. For example, if there are more than, say, 100 connections, then your Tomcat servlets need to reject the connection and produce a page that says "Please retry later". A better alternative would be to configure the necessary restriction at the Tomcat level:

http://tomcat.apache.org/tomcat-4.0-doc/config/http11.html

In particular, in the server.xml file, check and adjust the 'Connector' configuration to 'tune' your setup; the exact 'tuning' required will depend on your application.
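
(As a rough illustration only - the attribute values below are made-up examples rather than recommendations, and the exact attribute set depends on the connector and Tomcat version in use:)

# Locate the HTTP connector definition (path assumes a default install layout)
grep -n "Connector" $CATALINA_HOME/conf/server.xml

# The attributes usually worth reviewing on the HTTP/1.1 connector look like:
#   <Connector port="8080"
#              maxThreads="150"          (maximum worker threads)
#              acceptCount="100"         (listen backlog used once all threads are busy)
#              connectionTimeout="20000" (milliseconds to wait for the request line)
#              ... />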

(   (()
(`-' _\
 ''  ''

Expert Comment by:pjedmond
Sorry - I just noticed that you are running Tomcat 5. It is worth having a look at logging to see if that can help locate at what point the setup fails. Start here:

http://tomcat.apache.org/tomcat-5.0-doc/config/

Note that Tomcat 5.0 and 5.5 have different logging configurations.

(   (()
(`-' _\
 ''  ''

Author Comment by:bhaskarna
Until now I suspected that the socket queue overflow problem was caused by my load runs. But today, in the netstat -s output, I noticed the following entry on a box that the load runner does not hit:
"    391808 times the listen queue of a socket overflowed"

To confirm that LoadRunner didn't hit this box: in the Tomcat access logs (for the last 15 days) I see at most 200-300 requests per day. Out of those 300 requests, 200+ are heartbeat checks and around 60-80 are real requests.

What are the various reasons for seeing 'listen queue of a socket overflowed'?
Is there an easy way to debug and find out what is causing this overflow?

Expert Comment by:pjedmond
Interesting, as it implies that tomcat may be the cause:

http://www.linuxquestions.org/questions/showthread.php?t=457517

>What are the various reasons for seeing 'listen queue of a socket overflowed'?
Only one reason, really - too many connection attempts in too short a period of time.

>Is there an easy way to debug and find out what is causing this overflow?
Not really, unless someone is running a 'standard' attack on your PC (is it isolated from the rest of the network?). Your best bet is probably to use "tcpdump > filename.txt" and check what is going on after the freeze. These are raw packets, so this approach is not for the faint-hearted and will be time-consuming to check/analyse!
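
(A slightly more targeted capture is usually easier to work with; the interface name and port below are assumptions and should be adjusted to your setup:)

# Capture only traffic on the Tomcat port, full packets, no name resolution
tcpdump -i eth0 -n -s 0 -w tomcat-freeze.pcap port 8080

# Read the capture back later for analysis
tcpdump -n -r tomcat-freeze.pcap | less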

(   (()
(`-' _\
 ''  ''

Author Comment by:bhaskarna
I tried taking a tcpdump for 5 seconds and noticed:

listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
589 packets captured
595 packets received by filter
6 packets dropped by kernel

Is that too much? Normal? Should I check specific entries in the tcpdump output to better understand the symptom?

Accepted Solution by:pjedmond (earned 500 total points)
That could be normal. I notice that you have a 10MB card - worth upgrading for 2 reasons:

1.    It can only cope with so much traffic, and due to the Ethernet protocol, 2 packets sent at the same time will 'collide', giving an invalid packet... which might cause a problem.
2.    Older drivers potentially have more errors in them, and may not make full use of the functionality in modern kernels for dealing with errors or other problems with packets or the network.

The key here is that you really need to get a tcpdump during the failure process in order to see what is happening. The log files will be huge...! But I suspect that once the problem occurs, the packet behaviour will change.
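
(One way to keep a long capture manageable is to let tcpdump roll over to a new file as each one fills up; a sketch, with an illustrative size and an assumed interface/port:)

# Roll to a new ~100 MB file whenever the current one fills up (files get numbered suffixes)
tcpdump -i eth0 -n -s 0 -C 100 -w tomcat-hang.pcap port 8080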

      
Author Comment by:bhaskarna
I noticed that the overflow count is growing steadily (even though Tomcat didn't get into the hung situation), because yesterday when I checked, the count was:
    22288 times the listen queue of a socket overflowed
Today it has grown to:
    22330 times the listen queue of a socket overflowed

The sar logs don't show any spikes..
Are there any other logs that would confirm 'too many connection attempts in too short a period of time'? Or should I explicitly turn on some specific logs to get more data on what type of requests caused this small growth?

Expert Comment by:pjedmond
Well found......

Can you try leaving the server disconnected from the network and see if these overflows continue to grow? If so, then we know that the problem is part of the configuration rather than anything external. It may be time to start looking at the Tomcat logs.
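
(A simple way to watch that counter over time while the box is disconnected; the grep pattern follows the netstat -s wording quoted above:)

# Print the listen-queue overflow counter once a minute, with a timestamp
while true; do date; netstat -s | grep -i "listen queue"; sleep 60; done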

(   (()
(`-' _\
 ''  ''



Author Comment by:bhaskarna
The thread below was started by me for the same issue:
http://www.linuxquestions.org/questions/showthread.php?t=457517

The proc status provided in the above thread was taken when Tomcat got into the frozen state.

Does the proc status of jsvc (Tomcat) hint at anything that would suggest Tomcat is the cause of this problem?

Author Comment by:bhaskarna
I tried bombarding the Tomcat server with 100 threads again and noticed that the overflow count was growing steadily.
I still haven't got the right answer on why the port didn't hang this time, but I didn't hit the 'no answer from the port' issue. In between, we did some tuning in Tomcat, such as maxThreads and acceptCount, and JVM parameters such as -Xmx and -Xms.
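
(For reference, a minimal sketch of how such JVM options are commonly passed to Tomcat when it is started via catalina.sh; the values and path are illustrative only, not the ones used here, and a jsvc-based start-up takes the same -Xms/-Xmx flags as direct arguments:)

# Illustrative heap settings only
export JAVA_OPTS="-Xms256m -Xmx512m"
$CATALINA_HOME/bin/catalina.sh start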


Expert Comment by:rindi
Any reason not to close this Q? It seems there is nothing new...

Expert Comment by:rindi
bhaskarna, still there?

Expert Comment by:rindi
Thanks, but please don't accept a post which only reminds you to close a question and has nothing to do with the Q at all. Select one of the other posts, or follow the instructions in my link in the administrative comment above to get the Q closed. I will reopen this Q so that you can close it properly.

Thanks, rindi
PE Storage

Expert Comment by:rindi
Please close the Q according to the EE rules, thanks.