?
Solved

WebSphere CPU Starvation and hung threads

Posted on 2009-12-28
18
Medium Priority
?
5,435 Views
Last Modified: 2013-12-11
I have attached a snapshot of the error from the SystemOut.log. The website was slow and became fine once the errors stopped

Appreciate if someone can explain if its a serious error. I had done some research on CPU Starvation and read articles mentioning that if the CPU thread scheduling is less than 20 seconds and does not repeat , there is no need to worry.

  The only recent change which was done was the datasource was made unshareable from shareable to resolve db connection leakage.

My log shows delay of 219 seconds.

The hardware is 32 CPU and 32 GB RAM with websphere 6.0.2.21 . WebSphere connects to 20 Oracle databases.

JVM Settings are as below

/opt/IBM/WebSphere/AppServer/java/bin/java -server -Dwas.status.socket=54964 -XX:MaxPermSiz
e=256m -Xbootclasspath/p:/opt/IBM/WebSphere/AppServer/java/jre/lib/ext/ibmorb.jar:/opt/IBM/WebSphere/AppServer/java/jre/lib/ext/ibmext.jar -classpath /opt/IB
M/WebSphere/AppServer/profiles/AppSrv02/properties:/opt/IBM/WebSphere/AppServer/properties:/opt/IBM/WebSphere/AppServer/lib/bootstrap.jar:/opt/IBM/WebSphere/
AppServer/lib/j2ee.jar:/opt/IBM/WebSphere/AppServer/lib/lmproxy.jar:/opt/IBM/WebSphere/AppServer/lib/urlprotocols.jar -verbose:gc -Xms512m -Xmx2048m -Dws.ext
.dirs=/opt/IBM/WebSphere/AppServer/java/lib:/opt/IBM/WebSphere/AppServer/profiles/AppSrv02/classes:/opt/IBM/WebSphere/AppServer/classes:/opt/IBM/WebSphere/Ap
pServer/lib:/opt/IBM/WebSphere/AppServer/installedChannels:/opt/IBM/WebSphere/AppServer/lib/ext:/opt/IBM/WebSphere/AppServer/web/help:/opt/IBM/WebSphere/AppS
erver/deploytool/itp/plugins/com.ibm.etools.ejbdeploy/runtime -Dderby.system.home=/opt/IBM/WebSphere/AppServer/derby -Dcom.ibm.itp.location=/opt/IBM/WebSpher
e/AppServer/bin -Djava.util.logging.configureByServer=true -Dibm.websphere.preload.classes=true -Duser.install.root=/opt/IBM/WebSphere/AppServer/profiles/App
Srv02 -Dwas.install.root=/opt/IBM/WebSphere/AppServer -Djava.util.logging.manager=com.ibm.ws.bootstrap.WsLogManager -Ddb2j.system.home=/opt/IBM/WebSphere/App
Server/cloudscape -Dserver.root=/opt/IBM/WebSphere/AppServer/profiles/AppSrv02 -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCAppl
icationStoppedTime -Djava.awt.headless=true -Dclient.encoding.override=UTF-8 -Xmn256m -XX:+AggressiveHeap -XX:-UseAdaptiveSizePolicy -XX:MaxTenuringThreshold
=4 -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+DisableExplicitGC -XX:PermSize=96m -XX:+HeapDumpOnOutOfMemoryError -XX:+HeapDumpOnCtrlBreak -Djava.securit
y.auth.login.config=/opt/IBM/WebSphere/AppServer/profiles/AppSrv02/properties/wsjaas.conf -Djava.security.policy=/opt/IBM/WebSphere/AppServer/profiles/AppSrv
02/properties/server.policy com.ibm.ws.bootstrap.WSLauncher com.ibm.ws.runtime.WsServer /opt/IBM/WebSphere/AppServer/profiles/AppSrv02/config redCellX1 XYZ Node01 ABC1


Thanks in Advance !

Merry Christmas and a Happy New Year :-)



HAMgrerrorDec28thnight.txt
0
Comment
Question by:wasadmin11
  • 9
  • 5
  • 4
18 Comments
 
LVL 41

Expert Comment

by:HonorGod
ID: 26140165
It's very hard to judge, especially from that small snippet of the SystemOut.log, what is actually happening.

I suggest that you take a look at the "IBM Thread and Monitor Dump Analyzer for Java" program from AlphaWorks:

http://www.alphaworks.ibm.com/tech/jca

Do you have a stand-alone application server, or a federated or clustered environment?

What version of the WebSphere Application Server product is being used?

- Open a command prompt
- "cd" to the AppServer bin directory
  Windows:
  > cd /d D:\IBM\WebSphere\AppServer\bin
  Unix type systems:
  # cd /opt/IBM/WebSphere/AppServer/bin
- Execute the verisonInfo command script
  Windows:
  > versionInfo
  Unix type systems:
  # ./versionInfo.sh
- Look for the information in the "Installed product" section
  & provide the complete version

Thanks

0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26142937
Clustered environment with 10 members (5 on each box)
Version is 6.0.2.21

From Java dumps , i see mostly blocked threads.

vmstat shows no paging . CPU is 96% idle .
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 26148228
How many physical machines?

Is the database local to a machine with 1+ AppServers, or remote from all?

How "far" is the database server from the AppServers?

Are there bridges, switches, routers, firewalls between the AppServers and the database server?
0
Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

 
LVL 2

Author Comment

by:wasadmin11
ID: 26151501
2 physical macihnes

all databases are remote

there are firewalls between appservers and db servers.

There are very few articles/presentations on HA Manager /Core group . I guess i may have read all of them. I have noticed when Full GC occurs CPU starvation message pops up but the thread scheduling delays is less than 20 seconds. What worries me is when the delay is so huge like 200+ seconds.

Wondering if -XX:ParallelGCThreads=4 is too low for 32 CPU .
0
 
LVL 41

Accepted Solution

by:
HonorGod earned 2000 total points
ID: 26156600
If you're spending a lot of time during garbage collection, it may be due to having max heap set too big.

After the GC, how much heap is available?

It it is "too small", then you have a situation where the AppServer goes for a long time, and then pays the price of this uninterrupted execution time by the "GC tax"...

Does that make sense?

What kind of responsiveness are your applications seeing from the DB?

You might want to take a look at using the Performance Monitoring Infrastructure (PMI) to help you better understand what is happening, and when.

Performance Monitoring Infrastructure (PMI)

http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp?topic=/com.ibm.websphere.nd.doc/info/ae/ae/cprf_pmidata.html

Q: Is ParallelGCThreads=4 is too low for 32 CPU
A: That is a distinct possibility, yes.
0
 

Expert Comment

by:bepot
ID: 26182833

Can you please provide GC log. It should be very detailed according to the java options you use:
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCAppl
icationStoppedTime
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26197212
Attached is GC log.
native-stdout.log.txt
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26291521
any update on the gc log analysis
thanks in advance
0
 

Expert Comment

by:bepot
ID: 26294370
The slowest GC took about 25 seconds. This is way below 230 sec. Nevertheless, according to GC log there probably was something in the system which consumed significant amount of CPU. Loot at the collection below. It took 25 seconds and what collector did is not different from other collection which took only a fraction of a second.


 {Heap before GC invocations=1268:
Heap
 PSYoungGen      total 229376K, used 207872K [0x69000000, 0x79000000, 0x79000000)
  eden space 196608K, 100% used [0x69000000,0x75000000,0x75000000)
  from space 32768K, 34% used [0x75000000,0x75b00000,0x77000000)
  to   space 32768K, 0% used [0x77000000,0x77000000,0x79000000)
 PSOldGen        total 1835008K, used 300363K [0x79000000, 0xe9000000, 0xe9000000)
  object space 1835008K, 16% used [0x79000000,0x8b552f30,0xe9000000)
 PSPermGen       total 192512K, used 96338K [0xe9000000, 0xf4c00000, 0xf9000000)
  object space 192512K, 50% used [0xe9000000,0xeee14ad8,0xf4c00000)
127971.965: [GC 508235K->316209K(2064384K), 0.0572443 secs]
 Heap after GC invocations=1268:
Heap
 PSYoungGen      total 229376K, used 15360K [0x69000000, 0x79000000, 0x79000000)
  eden space 196608K, 0% used [0x69000000,0x69000000,0x75000000)
  from space 32768K, 46% used [0x77000000,0x77f00000,0x79000000)
  to   space 32768K, 0% used [0x75000000,0x75000000,0x77000000)
 PSOldGen        total 1835008K, used 300849K [0x79000000, 0xe9000000, 0xe9000000)
  object space 1835008K, 16% used [0x79000000,0x8b5cc7c8,0xe9000000)
 PSPermGen       total 192512K, used 96338K [0xe9000000, 0xf4c00000, 0xf9000000)
  object space 192512K, 50% used [0xe9000000,0xeee14ad8,0xf4c00000)
} Total time for which application threads were stopped: 25.5499036 seconds

This "something" could be the Java app or other application running and causing a whole process swap or heavy paging.
Do you have sar or vmstat logs preferably time correlated with GC log? We may be able to find this something.

What is your goal for this investigation?
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26303262
everytime i have checked vmstat ..everything looked normal. No paging , CPU idle 97% of the time.

I do have about 15 other java processes (adapters etc) running. I have seen when there is a full GC , i do see CPU starvation errors sometimes which goes away

My goal is to find out why CPU starvation occurs  and why HA Manager (core group ) messages occur when everything seems to be normal (vmstat, no threads to be run , not much load in the system)

I can only think of some network fluctuations as netstat -s output does show dropped packets, duplicate packets
0
 
LVL 41

Assisted Solution

by:HonorGod
HonorGod earned 2000 total points
ID: 26303333
> I do have about 15 other java processes

  Each JVM has its own heap, so you aren't going to see a correlation between heap utilization between JVM's unless applications on each are communicating between/amongst themselves.

> why HA Manager (core group ) messages occur

  Those things happen all of the time.  The HA Manager uses messages to verify the state and accessibility of the servers so that it can minimize the time required to reroute requests in the event of a server outage.

> My goal is to find out why CPU starvation occurs
  This may help:

http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.nd.doc/info/ae/ae/rtrb_ha_env_trbl.html
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26305259
The slowest GC took about 25 seconds. This is way below 230 sec

*  I did not understand how the 230 sec was determined ? can u please explain

thanks


0
 

Expert Comment

by:bepot
ID: 26305951
Sorry, the post indicated  219 seconds.
I had similar issue with one of the customers. In that case the issue was caused by poor clustering configuration. Can yo provide your J2EE app deployment topology. I can send a standard questionnaire I use to collect customers data to your email
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26310740
I can provide the J2EE app topology.

I still did not understand the relationship between the 219 seconds (CPU thread scheduling delay) and longest GC (25 seconds). Is there any correlation

thanks
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 26380807
Today again i had lots of CPU starvation errors , after a code change .After the old code was reverted back everythign looked normal. CPU utilization of one WAS instance was about 15% and was hung. Load average was about 17 (w) . Native_stdout.log showed Perm Gen becoming 100% quickly. I am still investigating.

I do have another doubt. My core group in another set of servers includes the default servers (server1). Can this be deleted from the core group . Would it cause any potential issues.

0
 

Expert Comment

by:bepot
ID: 26382183
Perm Gen becoming full quickly is not good. Probably CPU starvation was caused by GC trying to free memory ( which it probably can't do for Perm Gen )  It is hard to say anything without looking at the system and all available logs.
0
 
LVL 2

Author Comment

by:wasadmin11
ID: 27280275
closing this . It was found that there were some queries which were not tuned. After tuning the problem was resolved
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 27303405
Thanks for the grade & points.

I'm glad to have been able to help.

Good luck & have a great day.
0

Featured Post

[Webinar] Kill tickets & tabs using PowerShell

Are you tired of cycling through the same browser tabs everyday to close the same repetitive tickets? In this webinar JumpCloud will show how you can leverage RESTful APIs to build your own PowerShell modules to kill tickets & tabs using the PowerShell command Invoke-RestMethod.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Configure Web Service (server application) I. Configure security for Web Services methods First, we need to protect Session bean which implements the service: 1. Open EJB deployment descriptor (ejb-jar.xml) in the EJB project that contains you…
Most of the developers using Tomcat find it easy to configure the datasource in Server.xml and use the JNDI name in the code to get the connection.  So the default connection pool using DBCP (or any other framework) is made available and the life go…
Planning to migrate your EDB file(s) to a new or an existing Outlook PST file? This video will guide you how to convert EDB file(s) to PST. Besides this, it also describes, how one can easily search any item(s) from multiple folders or mailboxes…
Free Data Recovery software is an advanced solution from Kernel Tools to recover data and files such as documents, emails, database, media and pictures, etc. It supports recovery from physical & logical drive after a hard disk crash, accidental/inte…
Suggested Courses

601 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question