asked on

WebSphere CPU Starvation and hung threads

I have attached a snapshot of the error from the SystemOut.log. The website was slow and became fine once the errors stopped

Appreciate if someone can explain if its a serious error. I had done some research on CPU Starvation and read articles mentioning that if the CPU thread scheduling is less than 20 seconds and does not repeat , there is no need to worry.

The only recent change which was done was the datasource was made unshareable from shareable to resolve db connection leakage.

My log shows delay of 219 seconds.

The hardware is 32 CPU and 32 GB RAM with websphere 6.0.2.21 . WebSphere connects to 20 Oracle databases.

JVM Settings are as below

/opt/IBM/WebSphere/AppServer/java/bin/java -server -Dwas.status.socket=54964 -XX:MaxPermSiz
e=256m -Xbootclasspath/p:/opt/IBM/WebSphere/AppServer/java/jre/lib/ext/ibmorb.jar:/opt/IBM/WebSphere/AppServer/java/jre/lib/ext/ibmext.jar -classpath /opt/IB
M/WebSphere/AppServer/profiles/AppSrv02/properties:/opt/IBM/WebSphere/AppServer/properties:/opt/IBM/WebSphere/AppServer/lib/bootstrap.jar:/opt/IBM/WebSphere/
AppServer/lib/j2ee.jar:/opt/IBM/WebSphere/AppServer/lib/lmproxy.jar:/opt/IBM/WebSphere/AppServer/lib/urlprotocols.jar -verbose:gc -Xms512m -Xmx2048m -Dws.ext
.dirs=/opt/IBM/WebSphere/AppServer/java/lib:/opt/IBM/WebSphere/AppServer/profiles/AppSrv02/classes:/opt/IBM/WebSphere/AppServer/classes:/opt/IBM/WebSphere/Ap
pServer/lib:/opt/IBM/WebSphere/AppServer/installedChannels:/opt/IBM/WebSphere/AppServer/lib/ext:/opt/IBM/WebSphere/AppServer/web/help:/opt/IBM/WebSphere/AppS
erver/deploytool/itp/plugins/com.ibm.etools.ejbdeploy/runtime -Dderby.system.home=/opt/IBM/WebSphere/AppServer/derby -Dcom.ibm.itp.location=/opt/IBM/WebSpher
e/AppServer/bin -Djava.util.logging.configureByServer=true -Dibm.websphere.preload.classes=true -Duser.install.root=/opt/IBM/WebSphere/AppServer/profiles/App
Srv02 -Dwas.install.root=/opt/IBM/WebSphere/AppServer -Djava.util.logging.manager=com.ibm.ws.bootstrap.WsLogManager -Ddb2j.system.home=/opt/IBM/WebSphere/App
Server/cloudscape -Dserver.root=/opt/IBM/WebSphere/AppServer/profiles/AppSrv02 -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCAppl
icationStoppedTime -Djava.awt.headless=true -Dclient.encoding.override=UTF-8 -Xmn256m -XX:+AggressiveHeap -XX:-UseAdaptiveSizePolicy -XX:MaxTenuringThreshold
=4 -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+DisableExplicitGC -XX:PermSize=96m -XX:+HeapDumpOnOutOfMemoryError -XX:+HeapDumpOnCtrlBreak -Djava.securit
y.auth.login.config=/opt/IBM/WebSphere/AppServer/profiles/AppSrv02/properties/wsjaas.conf -Djava.security.policy=/opt/IBM/WebSphere/AppServer/profiles/AppSrv
02/properties/server.policy com.ibm.ws.bootstrap.WSLauncher com.ibm.ws.runtime.WsServer /opt/IBM/WebSphere/AppServer/profiles/AppSrv02/config redCellX1 XYZ Node01 ABC1

Thanks in Advance !

Merry Christmas and a Happy New Year :-)

HAMgrerrorDec28thnight.txt

HonorGod

It's very hard to judge, especially from that small snippet of the SystemOut.log, what is actually happening.

I suggest that you take a look at the "IBM Thread and Monitor Dump Analyzer for Java" program from AlphaWorks:

http://www.alphaworks.ibm.com/tech/jca

Do you have a stand-alone application server, or a federated or clustered environment?

What version of the WebSphere Application Server product is being used?

- Open a command prompt
- "cd" to the AppServer bin directory
Windows:
> cd /d D:\IBM\WebSphere\AppServer\bin
Unix type systems:
# cd /opt/IBM/WebSphere/AppServer/bin
- Execute the verisonInfo command script
Windows:
> versionInfo
Unix type systems:
# ./versionInfo.sh
- Look for the information in the "Installed product" section
& provide the complete version

Thanks

wasadmin11

ASKER

Clustered environment with 10 members (5 on each box)
Version is 6.0.2.21

From Java dumps , i see mostly blocked threads.

vmstat shows no paging . CPU is 96% idle .

HonorGod

How many physical machines?

Is the database local to a machine with 1+ AppServers, or remote from all?

How "far" is the database server from the AppServers?

Are there bridges, switches, routers, firewalls between the AppServers and the database server?

wasadmin11

ASKER

2 physical macihnes

all databases are remote

there are firewalls between appservers and db servers.

There are very few articles/presentations on HA Manager /Core group . I guess i may have read all of them. I have noticed when Full GC occurs CPU starvation message pops up but the thread scheduling delays is less than 20 seconds. What worries me is when the delay is so huge like 200+ seconds.

Wondering if -XX:ParallelGCThreads=4 is too low for 32 CPU .

ASKER CERTIFIED SOLUTION

HonorGod

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

bepot

Can you please provide GC log. It should be very detailed according to the java options you use:
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCAppl
icationStoppedTime

wasadmin11

ASKER

Attached is GC log.
native-stdout.log.txt

wasadmin11

ASKER

any update on the gc log analysis
thanks in advance

bepot

The slowest GC took about 25 seconds. This is way below 230 sec. Nevertheless, according to GC log there probably was something in the system which consumed significant amount of CPU. Loot at the collection below. It took 25 seconds and what collector did is not different from other collection which took only a fraction of a second.

{Heap before GC invocations=1268:
Heap
PSYoungGen total 229376K, used 207872K [0x69000000, 0x79000000, 0x79000000)
eden space 196608K, 100% used [0x69000000,0x75000000,0x75000000)
from space 32768K, 34% used [0x75000000,0x75b00000,0x77000000)
to space 32768K, 0% used [0x77000000,0x77000000,0x79000000)
PSOldGen total 1835008K, used 300363K [0x79000000, 0xe9000000, 0xe9000000)
object space 1835008K, 16% used [0x79000000,0x8b552f30,0xe9000000)
PSPermGen total 192512K, used 96338K [0xe9000000, 0xf4c00000, 0xf9000000)
object space 192512K, 50% used [0xe9000000,0xeee14ad8,0xf4c00000)
127971.965: [GC 508235K->316209K(2064384K), 0.0572443 secs]
Heap after GC invocations=1268:
Heap
PSYoungGen total 229376K, used 15360K [0x69000000, 0x79000000, 0x79000000)
eden space 196608K, 0% used [0x69000000,0x69000000,0x75000000)
from space 32768K, 46% used [0x77000000,0x77f00000,0x79000000)
to space 32768K, 0% used [0x75000000,0x75000000,0x77000000)
PSOldGen total 1835008K, used 300849K [0x79000000, 0xe9000000, 0xe9000000)
object space 1835008K, 16% used [0x79000000,0x8b5cc7c8,0xe9000000)
PSPermGen total 192512K, used 96338K [0xe9000000, 0xf4c00000, 0xf9000000)
object space 192512K, 50% used [0xe9000000,0xeee14ad8,0xf4c00000)
} Total time for which application threads were stopped: 25.5499036 seconds

This "something" could be the Java app or other application running and causing a whole process swap or heavy paging.
Do you have sar or vmstat logs preferably time correlated with GC log? We may be able to find this something.

What is your goal for this investigation?

wasadmin11

ASKER

everytime i have checked vmstat ..everything looked normal. No paging , CPU idle 97% of the time.

I do have about 15 other java processes (adapters etc) running. I have seen when there is a full GC , i do see CPU starvation errors sometimes which goes away

My goal is to find out why CPU starvation occurs and why HA Manager (core group ) messages occur when everything seems to be normal (vmstat, no threads to be run , not much load in the system)

I can only think of some network fluctuations as netstat -s output does show dropped packets, duplicate packets

SOLUTION

HonorGod

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

wasadmin11

ASKER

The slowest GC took about 25 seconds. This is way below 230 sec

* I did not understand how the 230 sec was determined ? can u please explain

thanks

bepot

Sorry, the post indicated 219 seconds.
I had similar issue with one of the customers. In that case the issue was caused by poor clustering configuration. Can yo provide your J2EE app deployment topology. I can send a standard questionnaire I use to collect customers data to your email

wasadmin11

ASKER

I can provide the J2EE app topology.

I still did not understand the relationship between the 219 seconds (CPU thread scheduling delay) and longest GC (25 seconds). Is there any correlation

thanks

wasadmin11

ASKER

Today again i had lots of CPU starvation errors , after a code change .After the old code was reverted back everythign looked normal. CPU utilization of one WAS instance was about 15% and was hung. Load average was about 17 (w) . Native_stdout.log showed Perm Gen becoming 100% quickly. I am still investigating.

I do have another doubt. My core group in another set of servers includes the default servers (server1). Can this be deleted from the core group . Would it cause any potential issues.

bepot

Perm Gen becoming full quickly is not good. Probably CPU starvation was caused by GC trying to free memory ( which it probably can't do for Perm Gen ) It is hard to say anything without looking at the system and all available logs.

wasadmin11

ASKER

closing this . It was found that there were some queries which were not tuned. After tuning the problem was resolved

HonorGod

Thanks for the grade & points.

I'm glad to have been able to help.

Good luck & have a great day.