Actually What you are trying to achive by this.
JVM was configured earlier with 1GB heap size, due to heapump I have increased.
Main Topics
Browse All TopicsWe have noticed a message HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 84 seconds.
This happens everyday night between 11 and 2. The monitorig application gives heavy memory usage on the servers and I guess this could be the cause of the Problem. The messsage is logged in the WebSphere Application Server Cluster members running on Version 6.0.2.31 and JDk is 1.4.2 SR11.
Below are the log details from the SystemOut
[1/30/09 23:27:46:888 AST] 0000167c CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 6 seconds.
[1/31/09 0:52:22:349 AST] 000000e9 WorkQueueMana W TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 0:52:39:788 AST] 000036b6 WorkQueueMana W TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 0:58:04:781 AST] 0000512e SystemOut O DefaultValidationEventHand
Location:
[1/31/09 0:59:55:649 AST] 00005225 SibMessage E [:] CWSIC2023E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exc
[1/31/09 0:59:55:656 AST] 00005234 SibMessage E [:] CWSIC2023E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exc
[1/31/09 0:59:55:830 AST] 00005233 SibMessage E [:] CWSIC2023E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exc
[1/31/09 0:59:55:829 AST] 00005227 SibMessage E [:] CWSIC2021E: A communication error occurred when sending to or receiving from a remote client: exception com.ibm.wsspi.sib.core.exc
[1/31/09 1:02:40:017 AST] 000036b6 WorkQueueMana W TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 1:02:40:164 AST] 000000e9 WorkQueueMana W TCPC0005W: TCP Channel SIB_TCP_JFAP could not obtain thread from thread pool <null>.
[1/31/09 1:02:42:818 AST] 0000368a ThreadMonitor W WSVR0605W: Thread "HAManager.thread.pool : 1" (0000002b) has been active for 692865 milliseconds and may be hung. There is/are 2 thread(s) in total in the server that may be hung.
[1/31/09 1:02:44:614 AST] 0000368a ThreadMonitor W WSVR0605W: Thread "WebContainer : 6957" (00005115) has been active for 723163 milliseconds and may be hung. There is/are 3 thread(s) in total in the server that may be hung.
[1/31/09 1:02:44:933 AST] 0000368a ThreadMonitor W WSVR0605W: Thread "WebContainer : 6978" (0000519e) has been active for 736477 milliseconds and may be hung. There is/are 4 thread(s) in total in the server that may be hung.
[1/31/09 1:04:26:546 AST] 0000167c CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 84 seconds.
[1/31/09 1:04:26:834 AST] 00005265 ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Did not receive adequate CPU time slice. Last known CPU usage time at 01:01:23:231 AST. Inactivity duration was 183 seconds.
[1/31/09 1:04:30:386 AST] 0000001f MbuRmmAdapter W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_1\nod
[1/31/09 1:04:33:818 AST] 0000001f MbuRmmAdapter W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\nodeagent. DCS logical channel is View|Gossip.
[1/31/09 1:04:35:285 AST] 00005268 ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Did not receive adequate CPU time slice. Last known CPU usage time at 01:01:35:047 AST. Inactivity duration was 180 seconds.
[1/31/09 1:04:36:318 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_1\AS_
[1/31/09 1:04:37:188 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ DM_Prod_1\dmgr. DCS logical channel is View|Ptp.
[1/31/09 1:04:37:383 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_1\AS_
[1/31/09 1:04:37:779 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_1\nod
[1/31/09 1:04:38:171 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_2\nod
[1/31/09 1:04:38:342 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_1\nodeagent. DCS logical channel is View|Ptp.
[1/31/09 1:04:38:720 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_2\AS_
[1/31/09 1:04:39:617 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_PG_PROD_2\AS_
[1/31/09 1:04:39:757 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ST_Prod_2\nod
[1/31/09 1:04:39:983 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is View|Ptp.
[1/31/09 1:04:40:126 AST] 0000001f DiscoveryRmmP W DCSV1111W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:40:931 AST] 0000001f RmmPtpGroup W DCSV1111W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is View|Ptp.
[1/31/09 1:04:41:276 AST] 0000001f DiscoveryRmmP W DCSV1111W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection from the other member was closed. Suspected members is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:41:653 AST] 0000001f DiscoveryRmmP W DCSV1113W: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection to the other member was closed. Suspected member is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:41:788 AST] 0000001f DiscoveryRmmP W DCSV1113W: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Suspected another member because the outgoing connection to the other member was closed. Suspected member is Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2. DCS logical channel is Connected|Ptp.
[1/31/09 1:04:42:535 AST] 00000017 RoleViewLeade I DCSV8053I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: View change in process. Excluded members are [Cell_ _Prod_1\Node_ _PROD_1\nodeagent Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2 Cell_ _Prod_1\Node_ _PROD_2\nodeagent Cell_ _Prod_1\Node_PG_PROD_1\AS_
[1/31/09 1:04:42:537 AST] 0000002a RoleViewLeade I DCSV8053I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: View change in process. Excluded members are [Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2].
[1/31/09 1:04:43:309 AST] 0000001f MbuRmmAdapter I DCSV1032I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Connected a defined member Cell_ _Prod_1\Node_ _PROD_2\AS_ _PROD_2.
[1/31/09 1:04:44:565 AST] 0000001f VSync I DCSV2004I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: The synchronization procedure completed successfully. The View Identifier is (803:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). The internal details are [0 0 0 0 0 0 0 0 0 0 0 0 0].
[1/31/09 1:04:44:715 AST] 0000001f VSync I DCSV2004I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: The synchronization procedure completed successfully. The View Identifier is (21:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). The internal details are [0 0].
[1/31/09 1:04:45:007 AST] 00000017 ViewReceiver I DCSV1033I: DCS Stack DefaultCoreGroup at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Confirmed all new view members in view identifier (804:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). View channel type is View|Ptp.
[1/31/09 1:04:45:226 AST] 0000002a ViewReceiver I DCSV1033I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: Confirmed all new view members in view identifier (22:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1). View channel type is View|Ptp.
[1/31/09 1:04:45:770 AST] 0000002a DataStackMemb I DCSV8050I: DCS Stack DefaultCoreGroup.Cluster_ _Prod_1 at Member Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1: New view installed, identifier (22:0.Cell_ _Prod_1\Node_ _PROD_1\AS_ _PROD_1), view size is 1 (AV=1, CD=1, CN=1, DF=2)
[1/31/09 1:04:47:225 AST] 000051de DRSGroup I CWWDR0010E: Replication instance HttpSessionCache caught exception when sending/receiving messages : com.ibm.wsspi.hamanager.da
at com.ibm.ws.hamanager.datas
at com.ibm.ws.hamanager.agent
at com.ibm.ws.hamanager.agent
at com.ibm.ws.drs.model.DRSGr
at com.ibm.ws.drs.model.DRSGr
at com.ibm.ws.drs.stack.DRSAs
at com.ibm.ws.drs.stack.DRSSt
This Question has been solved and asker verified All Experts Exchange premium technology solutions are available to subscription members.
Experts Exchange has been collecting answers to technology questions since 1996…3 million and counting! If you have a question, chances are we already have your answer.
If you can't find the exact answer you're looking for, ask our exclusive community of 50,000 experts. You’ll get a personalized answer from a trusted professional.
Thousands of free tech tips, tricks, how-to’s and tutorials are available in our peer reviewed articles section. See for yourself how smart our experts are, no login required.
Access the answers to your technology questions today.
30-day free trial. Register in 60 seconds.
Members of the expert community talk about why the experience at Experts Exchange is different than what you will find anywhere else.

Try it out and discover for yourself.
30-day free trial. Register in 60 seconds.
Join the community of experts here and help other tech pros by answering question in your area of expertise. You can earn FREE access to all Experts Exchange's premium features and resources.
http://www-01.ibm.com/supp
I'd add that you have to fit JVM in available physical memory by limiting heap, adding ram, and if it fits into RAM then pinning it to physical memory or at least preventing aging. Mak sure you "bindprocessor" JVM threads based on accumulated CPU use to later processors than cpu0 partition 0 which is most loaded by hw interrupts.
You can follow network serving threads using LSOF and bind them to non-virtual CPUs so that HW irqs come in first.
vmtune -fF can be set to 1024 1280 to prevent starvation from paging.
smitty wlm is of help too - guarantee 10% to system of every resource and 10% to other apps like oracle, user sessions, backup sessions, database, app server etc.
This way you can run system that pages out routinely this way, but error clearly indicates that you need more ram.
Here you can compact better server's ram:
http://download.boulder.ib
Thruput gives feeling of better interactive response
You should examine JDBC drivers for nodelay/nagle options or disable nagle's using "no" command - that does magic with DB interactive performance.
Hello,
I would like to find out why the error message is coming in the SystemOut.log and from the server monitoring team I happend to find that during the error message the Memory utilization was almost 95-99% which leads to the error in the SystemOut.log and during this time heavy paging is occuring.
After sometime the memory utilization is coming down to 76%. Only one Cluster member and node agent is running. We are trying to find out what actually casusing this problem since it is mission critical application running on this. If any outage to this application leads to Big Problems.
The article referenced by gheist in
http://www.experts-exchang
(i.e., http://www.IBM.com/support
The fact that you are seeing these messages at a particular time of day leads me to ask:
What "background jobs" are started during this window of time (11pm - 2am)?
Do you have some kind of virus scan, or backup, or update that occurs during this time every day?
What you may need to do is to run some checks to see what else is consuming resources on this system at that time. I'm guessing that the Operating System in use is AIX, since you included that in one of your Zones.
So, I would expect that using vmstat at regular intervals during this time might give you a better idea of what processes are consuming significant portions of the CPU, and possibly memory.
You may also want to use something like lsof to figure out what things are in memory, and what libraries are being used.
Some tools exist on the alphaworks website (http://www.alphaworks.ibm
For example:
- Heap Analyzer
http://www.alphaworks.ibm.
- Thread Dump Analyzer
http://www.alphaworks.ibm.
It sounds like your Application Server is competing with something else for scare system resources...
Tried to verify the other process on the System and not find any scheduled job and other process.
There are some robos running in midnight. Other than that There is no activities.
The harware is pSeries with 6GB of Memory, is this memory is sufficient? This application was running for last 3 years with the same h/w resources.
Given error message you have current amount of memory is not sufficient.
You should enable porcess accounting (ask redbooks) to be sure nothing else is running.
I will repeat again and again until you get it:
1) Enable thruput GC
2) disable Nagle's using "no" command
3) Employ WLM and guarantee 10% of CPU to system and 10% to websphere 10% to database 1% to non-root users.
4) Tune memory allocation using MALLOC_TYPE and MALLOC FLAGS, and vmtune
I have explained reasons why you should do so. What is the problem you cannot fix _your_ system?
How did you "verify" that other processes aren't running?
I would suggest that you have a cron job start at script just before the normal outage time (? 10:55pm), and have it run vmstats every two minutes for the duration of the normal time (? 2am?). Have the output written to a file, and see what is actually occurring on the system during those times.
Already Enabled Garbage Collection on the JVM and analysed the output using the IBM Garbage Analyser. Also caputred some memory output and details got from the monitoring team. Attached the output.
I will try to enable the vmstat and get the out.
Below are the Output from GC Analysis
Memory Output
*********************
Report from Garbage Collector.
**************************
Summary
Allocation failure count 8767
Concurrent collection count 0
Forced collection count 0
Full collections 0
GC Mode optthruput
Largest memory request (bytes) 3116400
Mean garbage collection pause (ms) 401
Mean heap unusable due to fragmentation (MB) 8.37
Mean interval between collections (minutes) 0.27
Number of collections 8767
Proportion of time spent in garbage collection pauses (%) 2.49
Proportion of time spent unpaused (%) 97.5
Rate of garbage collection 967.96 MB/minutes
Used heap (after collection)
Mean Minimum Maximum Total
heap (MB) heap (MB) heap (MB) heap (MB)
225 74.0 279 1974943
Tuning recommendation
At one point 3257 objects were queued for finalization. Using finalizers is not recommended as it can slow garbage collection and cause wasted space in the heap. Consider reviewing your application for occurrences of the finalize() method.
The application seems to be using some quite large objects. The largest request which triggered an allocation failure (and was recorded in the verbose gc log) was for 3116400 bytes.
The garbage collector tried to allocate from the large object area and failed 181 times. Consider setting or increasing -Xloratio. You may need to also increase the heap size with -mx. Be cautious of increasing -Xloratio to anything greater than 0.2.
The memory usage of the application does not indicate any obvious leaks.
Business Accounts
Answer for Membership
by: gheistPosted on 2009-02-01 at 22:09:42ID: 23524668
-Xms1024m -Xmx1024m
and wait for limit to be hit.