dipinci


CPU Starvation Detected in SystemOut.log

We have noticed the message HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 84 seconds.
This happens every night between 11 PM and 2 AM. The monitoring application reports heavy memory usage on the servers during that window, and I suspect this could be the cause of the problem. The message is logged by the WebSphere Application Server cluster members, running Version 6.0.2.31 with JDK 1.4.2 SR11.

Below are the log details from SystemOut.log:
[1/30/09 23:27:46:888 AST] 0000167c CoordinatorCo W   HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 6 seconds.



[1/31/09 0:52:22:349 AST] 000000e9 WorkQueueMana W   TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 0:52:39:788 AST] 000036b6 WorkQueueMana W   TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 0:58:04:781 AST] 0000512e SystemOut     O DefaultValidationEventHandler: [ERROR]: Unexpected element {}:DOCUMENT_NUMBER
     Location:
[1/31/09 0:59:55:649 AST] 00005225 SibMessage    E   [:] CWSIC2023E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exception.SIConnectionDroppedException: CWSIJ0044E: An operation was attempted on a connection that is already closed..
[1/31/09 0:59:55:656 AST] 00005234 SibMessage    E   [:] CWSIC2023E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exception.SIConnectionDroppedException: CWSIJ0044E: An operation was attempted on a connection that is already closed..
[1/31/09 0:59:55:830 AST] 00005233 SibMessage    E   [:] CWSIC2023E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exception.SIConnectionDroppedException: CWSIJ0044E: An operation was attempted on a connection that is already closed..
[1/31/09 0:59:55:829 AST] 00005227 SibMessage    E   [:] CWSIC2021E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exception.SIConnectionDroppedException: CWSIJ0044E: An operation was attempted on a connection that is already closed..
[1/31/09 1:02:40:017 AST] 000036b6 WorkQueueMana W   TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 1:02:40:164 AST] 000000e9 WorkQueueMana W   TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 1:02:42:818 AST] 0000368a ThreadMonitor W   WSVR0605W: Thread "HAManager.thread.pool : 1" (0000002b) has been active for 692865 milliseconds and may be hung.  There is/are 2 thread(s) in total in the server that may be hung.
[1/31/09 1:02:44:614 AST] 0000368a ThreadMonitor W   WSVR0605W: Thread "WebContainer : 6957" (00005115) has been active for 723163 milliseconds and may be hung.  There is/are 3 thread(s) in total in the server that may be hung.
[1/31/09 1:02:44:933 AST] 0000368a ThreadMonitor W   WSVR0605W: Thread "WebContainer : 6978" (0000519e) has been active for 736477 milliseconds and may be hung.  There is/are 4 thread(s) in total in the server that may be hung.
[1/31/09 1:04:26:546 AST] 0000167c CoordinatorCo W   HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 84 seconds.
[1/31/09 1:04:26:834 AST] 00005265 ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Did not receive adequate CPU time slice. Last known CPU usage time at 01:01:23:231 AST. Inactivity duration was 183 seconds.
[1/31/09 1:04:30:386 AST] 0000001f MbuRmmAdapter W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_1\nodeagent. DCS logical channel is View|Gossip.
[1/31/09 1:04:33:818 AST] 0000001f MbuRmmAdapter W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\nodeagent. DCS logical channel is View|Gossip.
[1/31/09 1:04:35:285 AST] 00005268 ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Did not receive adequate CPU time slice. Last known CPU usage time at 01:01:35:047 AST. Inactivity duration was 180 seconds.
[1/31/09 1:04:36:318 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_1\AS_PG_PROD_1. DCS logical channel is View|Ptp.
[1/31/09 1:04:37:188 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ DM_Prod_1\dmgr. DCS logical channel is View|Ptp.
[1/31/09 1:04:37:383 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_1\AS_ST_Prod_1. DCS logical channel is View|Ptp.
[1/31/09 1:04:37:779 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_1\nodeagent. DCS logical channel is View|Ptp.
[1/31/09 1:04:38:171 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_2\nodeagent. DCS logical channel is View|Ptp.
[1/31/09 1:04:38:342 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_1\nodeagent. DCS logical channel is View|Ptp.
[1/31/09 1:04:38:720 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_2\AS_ST_Prod_2. DCS logical channel is View|Ptp.
[1/31/09 1:04:39:617 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_2\AS_PG_PROD_2. DCS logical channel is View|Ptp.
[1/31/09 1:04:39:757 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_2\nodeagent. DCS logical channel is View|Ptp.
[1/31/09 1:04:39:983 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is View|Ptp.
[1/31/09 1:04:40:126 AST] 0000001f DiscoveryRmmP W   DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:40:931 AST] 0000001f RmmPtpGroup   W   DCSV1111W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is View|Ptp.
[1/31/09 1:04:41:276 AST] 0000001f DiscoveryRmmP W   DCSV1111W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:41:653 AST] 0000001f DiscoveryRmmP W   DCSV1113W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection to the other member was closed. Suspected member is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:41:788 AST] 0000001f DiscoveryRmmP W   DCSV1113W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection to the other member was closed. Suspected member is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:42:535 AST] 00000017 RoleViewLeade I   DCSV8053I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: View change in process. Excluded members are [Cell_ _Prod_1\Node_ _PROD_1\nodeagent Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2 Cell_ _Prod_1\Node_ _PROD_2\nodeagent Cell_ _Prod_1\Node_PG_PROD_1\AS_PG_PROD_1 Cell_ _Prod_1\Node_PG_PROD_1\nodeagent Cell_ _Prod_1\Node_PG_PROD_2\AS_PG_PROD_2 Cell_ _Prod_1\Node_PG_PROD_2\nodeagent Cell_ _Prod_1\Node_ DM_Prod_1\dmgr Cell_ _Prod_1\Node_ST_Prod_1\AS_ST_Prod_1 Cell_ _Prod_1\Node_ST_Prod_1\nodeagent Cell_ _Prod_1\Node_ST_Prod_2\AS_ST_Prod_2 Cell_ _Prod_1\Node_ST_Prod_2\nodeagent].
[1/31/09 1:04:42:537 AST] 0000002a RoleViewLeade I   DCSV8053I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: View change in process. Excluded members are [Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2].
[1/31/09 1:04:43:309 AST] 0000001f MbuRmmAdapter I   DCSV1032I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Connected a defined member Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2.
[1/31/09 1:04:44:565 AST] 0000001f VSync         I   DCSV2004I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: The synchronization procedure completed successfully. The View Identifier is (803:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). The internal details are [0 0 0 0 0 0 0 0 0 0 0 0 0].
[1/31/09 1:04:44:715 AST] 0000001f VSync         I   DCSV2004I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: The synchronization procedure completed successfully. The View Identifier is (21:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). The internal details are [0 0].
[1/31/09 1:04:45:007 AST] 00000017 ViewReceiver  I   DCSV1033I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Confirmed all new view members in view identifier (804:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). View channel type is View|Ptp.
[1/31/09 1:04:45:226 AST] 0000002a ViewReceiver  I   DCSV1033I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Confirmed all new view members in view identifier (22:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). View channel type is View|Ptp.
[1/31/09 1:04:45:770 AST] 0000002a DataStackMemb I   DCSV8050I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: New view installed, identifier (22:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1), view size is 1 (AV=1, CD=1, CN=1, DF=2)
[1/31/09 1:04:47:225 AST] 000051de DRSGroup      I   CWWDR0010E: Replication instance HttpSessionCache caught exception when sending/receiving messages : com.ibm.wsspi.hamanager.datastack.DataStackMembershipChangingException: The target member is not currently in view.
        at com.ibm.ws.hamanager.datastack.DataStackImpl.sendMessage(DataStackImpl.java(Compiled Code))
        at com.ibm.ws.hamanager.agent.AgentClassImpl.sendMuxedMessage(AgentClassImpl.java(Inlined Compiled Code))
        at com.ibm.ws.hamanager.agent.AgentImpl.sendMessage(AgentImpl.java(Compiled Code))
        at com.ibm.ws.drs.model.DRSGroup.send(DRSGroup.java(Compiled Code))
        at com.ibm.ws.drs.model.DRSGroup.send(DRSGroup.java(Compiled Code))
        at com.ibm.ws.drs.stack.DRSAsyncSend.processSendMessage(DRSAsyncSend.java(Compiled Code))
        at com.ibm.ws.drs.stack.DRSStack.processSendMessage(DRSStack.java(Compiled Code))

AIX-output.zip
gheist

-Xms1024m -Xmx1024m
and wait for limit to be hit.
dipinci (ASKER)

What are you actually trying to achieve with this?
The JVM was configured earlier with a 1 GB heap size; I increased it after a heap dump.
http://www-01.ibm.com/support/docview.wss?uid=swg21236327

I'd add that you have to fit the JVM into available physical memory by limiting the heap or adding RAM, and once it fits into RAM, pin it to physical memory or at least prevent page aging. Make sure you "bindprocessor" JVM threads, based on accumulated CPU use, to later processors than cpu0 / partition 0, which is the most loaded by hardware interrupts.
You can follow network-serving threads using lsof and bind them to non-virtual CPUs so that hardware IRQs come in first.
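A rough sketch of the binding step described above, assuming the WebSphere java process has PID 12345 (a placeholder; use ps to find the real one):

```shell
# Show per-thread CPU accumulation for the JVM so you can see which
# kernel threads are busiest (12345 is a placeholder PID).
ps -mo THREAD -p 12345

# Bind the process's kernel threads to logical CPU 1, away from cpu0,
# which services most hardware interrupts.
bindprocessor 12345 1
```

This binds the whole process; finer thread-level binding is possible on AIX but the exact invocation depends on the release.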

vmtune's -f/-F (minfree/maxfree) can be set to 1024/1280 to prevent starvation caused by paging.
smitty wlm is of help too: guarantee 10% of every resource to the system, and 10% to the other applications (Oracle, user sessions, backup sessions, the database, the app server, etc.).
That way you can run a system that routinely pages out; but the error clearly indicates that you need more RAM.
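As a sketch, the vmtune settings suggested above might be applied like this (values are the ones quoted; on AIX 5.3 and later these tunables moved to vmo):

```shell
# Raise the VMM free-list thresholds so page replacement starts earlier
# and running processes are less likely to stall waiting for free frames.
# -f = minfree, -F = maxfree (values from the suggestion above).
/usr/samples/kernel/vmtune -f 1024 -F 1280

# Equivalent on AIX 5.3+, where vmtune was replaced by vmo:
# vmo -o minfree=1024 -o maxfree=1280
```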

Here you can compact better server's ram:
http://download.boulder.ibm.com/ibmdl/pub/software/dw/jdk/diagnosis/GCandMemory-042005.pdf
Throughput GC gives a feeling of better interactive response.
You should examine your JDBC drivers for nodelay/Nagle options, or disable Nagle's algorithm using the "no" command; that works magic on interactive DB performance.
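For reference, a hedged sketch of the "no" step: as far as I know there is no global Nagle switch among the "no" tunables; Nagle itself is disabled per-socket via TCP_NODELAY in the JDBC driver, while the closest system-wide tunable is tcp_nodelayack (delayed ACKs), which often has a similar effect on chatty round-trips:

```shell
# Disable delayed TCP ACKs system-wide on AIX:
no -o tcp_nodelayack=1

# Persist the setting across reboots (AIX 5.3 and later):
no -p -o tcp_nodelayack=1
```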
HonorGod
I've read, and re-read your "question".  Unfortunately, I have yet to find a question in it.  What do you hope to achieve?
I assume the asker wants to understand the cause of the error message and to get rid of it. He even posted an illustrative log to assist.
dipinci (ASKER)

Hello,

I would like to find out why the error message appears in SystemOut.log. From the server monitoring team I happened to learn that while the error message is logged, memory utilization is almost 95-99%, which leads to the error in SystemOut.log, and heavy paging occurs during this time.

After some time the memory utilization comes back down to 76%. Only one cluster member and node agent are running. We are trying to find out what is actually causing this problem, since a mission-critical application runs here; any outage of this application leads to big problems.
You really should disable Nagle's algorithm and enable throughput GC.
After that you can be confident that, e.g., a memory leak occurred at the very moment you got the log record, and not 10 minutes earlier.

The article referenced by gheist in
https://www.experts-exchange.com/questions/24102496/CPU-Starvation-Detected-in-SystemOut-log.html?anchorAnswerId=23557501#a23557501
(i.e., http://www.IBM.com/support/docview.wss?rs=180&uid=swg21236327) is quite good.

The fact that you are seeing these messages at a particular time of day leads me to ask:

What "background jobs" are started during this window of time (11pm - 2am)?
Do you have some kind of virus scan, or backup, or update that occurs during this time every day?

What you may need to do is to run some checks to see what else is consuming resources on this system at that time.  I'm guessing that the Operating System in use is AIX, since you included that in one of your Zones.
So, I would expect that using vmstat at regular intervals during this time might give you a better idea of what processes are consuming significant portions of the CPU, and possibly memory.

You may also want to use something like lsof to figure out what things are in memory, and what libraries are being used.
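For example, assuming the application server JVM runs as PID 12345 (a placeholder), lsof can show its open files, mapped libraries, and sockets:

```shell
# All files open by the JVM, including shared libraries mapped in:
lsof -p 12345

# Only its network endpoints (-a ANDs the -i and -p conditions):
lsof -i -a -p 12345
```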

Some tools exist on the alphaworks website (http://www.alphaworks.ibm.com/) that can prove quite useful in understanding what is actually happening in an WebSphere environment.
For example:
- Heap Analyzer
  http://www.alphaworks.ibm.com/tech/heapanalyzer
- Thread Dump Analyzer
  http://www.alphaworks.ibm.com/tech/jca

It sounds like your Application Server is competing with something else for scarce system resources...
... and that is where WLM comes in, to guarantee the app server its fair share of resources...
dipinci (ASKER)

I tried to verify the other processes on the system and did not find any scheduled jobs or other processes.
There are some robots running at midnight; other than that there is no activity.
The hardware is a pSeries with 6 GB of memory. Is this memory sufficient? This application has been running for the last 3 years with the same hardware resources.
Given the error message, your current amount of memory is not sufficient.
You should enable process accounting (see the Redbooks) to be sure nothing else is running.

I will repeat it again and again until you get it:
1) Enable throughput GC.
2) Disable Nagle's algorithm using the "no" command.
3) Employ WLM and guarantee 10% of CPU to the system, 10% to WebSphere, 10% to the database, and 1% to non-root users.
4) Tune memory allocation using MALLOCTYPE, MALLOCOPTIONS, and vmtune.

I have explained the reasons why you should do so. What is stopping you from fixing _your_ system?
How did you "verify" that other processes aren't running?
I would suggest that you have a cron job start a script just before the normal outage time (10:55 PM?), and have it run vmstat every two minutes for the duration of the normal window (until 2 AM?). Have the output written to a file, and see what is actually occurring on the system during those times.
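A minimal sketch of such a script (the log path and the 10:55 PM start are assumptions; 95 samples at 120-second intervals covers roughly 10:55 PM to 2:05 AM):

```shell
#!/bin/sh
# capture-vmstat.sh - record CPU, memory, and paging activity during the
# nightly problem window. Schedule from cron, e.g.:
#   55 22 * * * /usr/local/bin/capture-vmstat.sh
LOGDIR=/var/tmp/perf                        # assumed location; adjust
LOG="$LOGDIR/vmstat.$(date +%Y%m%d).log"
mkdir -p "$LOGDIR"
date >> "$LOG"                              # timestamp each run
vmstat 120 95 >> "$LOG" 2>&1                # one sample every 2 minutes
```

Compare the paging (pi/po) and run-queue columns against the timestamps of the HMGR0152W messages.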
dipinci (ASKER)

Already enabled garbage collection logging on the JVM and analysed the output using the IBM garbage collection analyser. Also captured some memory output and details obtained from the monitoring team; the output is attached.
I will try to run vmstat and get the output.

Below is the output from the GC analysis.
Memory output: [chart not reproduced here; see the attached Memory-Util-on-System.doc]
Report from Garbage Collector.
*************************************
Summary
Allocation failure count 8767
Concurrent collection count 0
Forced collection count 0
Full collections 0
GC Mode optthruput
Largest memory request (bytes) 3116400
Mean garbage collection pause (ms) 401
Mean heap unusable due to fragmentation (MB) 8.37
Mean interval between collections (minutes) 0.27
Number of collections 8767
Proportion of time spent in garbage collection pauses (%) 2.49
Proportion of time spent unpaused (%) 97.5
Rate of garbage collection 967.96 MB/minutes

Used heap (after collection)
Mean heap (MB): 225
Minimum heap (MB): 74.0
Maximum heap (MB): 279
Total heap (MB): 1974943

Tuning recommendation
At one point 3257 objects were queued for finalization. Using finalizers is not recommended as it can slow garbage collection and cause wasted space in the heap. Consider reviewing your application for occurrences of the finalize() method.

The application seems to be using some quite large objects. The largest request which triggered an allocation failure (and was recorded in the verbose gc log) was for 3116400 bytes.

The garbage collector tried to allocate from the large object area and failed 181 times. Consider setting or increasing -Xloratio. You may need to also increase the heap size with -mx. Be cautious of increasing -Xloratio to anything greater than 0.2.

The memory usage of the application does not indicate any obvious leaks.
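Acting on that recommendation might look like the following in the server's generic JVM arguments; the 0.1 ratio and the 1280 MB heap are illustrative values, not ones the tool produced:

```shell
# Illustrative generic JVM arguments for an IBM 1.4.2 SDK:
# reserve 10% of the heap for the large object area, keep throughput GC,
# pin the heap size, and keep verbose GC logging for the analyser.
-Xloratio0.1 -Xgcpolicy:optthruput -Xms1280m -Xmx1280m -verbose:gc
```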
Memory-Util-on-System.doc
ASKER CERTIFIED SOLUTION
gheist
dipinci (ASKER)

The problem was not identified and no changes were made to the system, but now no memory or CPU starvation messages are appearing in the logs. Thanks for the valuable suggestions and recommendations.
Then it is a memory leak. I'd suggest restarting weekly or something like that.
IBM's tuning recommendations are worth looking at.
The cause is determined: the JVM gets paged out. Read the tips above on how to avoid that.
I am getting the same issue on my system: WebSphere 6 on Solaris 10. Can someone recommend a solution for this issue?

Thanks in Advance
Reduce the memory consumption of the running applications. For exact Solaris tips, post in the Solaris zone, not the AIX one.