Solved

WebSphere DMGR console intermittently not responding

Posted on 2010-09-14
Last Modified: 2013-12-10
Gents,
I have a strange problem here.
One of my DMGR consoles for WebSphere 6.x ND is not responding, but it comes back to life occasionally, roughly once every 3-4 hours.
The log doesn't show any fatal or critical errors, but there are some warnings like the ones below:

[4/25/11 10:35:10:429 BST] 0000002a ThreadMonitor W   WSVR0605W: Thread "WebContainer : 0" (00000030) has been active for 764028 milliseconds and may be hung.  There is/are 1 thread(s) in total in the server that may be hung.

[4/24/11 17:14:14:530 BST] 0000000a ServiceLogger I com.ibm.ws.ffdc.IncidentStreamImpl initialize FFDC0009I: FFDC opened incident stream file /opt/IBM/WebSphere/AppServer/profiles/Profile01/dmgr/logs/ffdc/dmgr_0000000a_11.04.24_17.14.14_0.txt
[4/24/11 17:14:14:536 BST] 0000000a ServiceLogger I com.ibm.ws.ffdc.IncidentStreamImpl resetIncidentStream FFDC0010I: FFDC closed incident stream file /opt/IBM/WebSphere/AppServer/profiles/Profile01/dmgr/logs/ffdc/dmgr_0000000a_11.04.24_17.14.14_0.txt
Question by:crazywolf2010
26 Comments
 
LVL 41

Expert Comment

by:HonorGod
ID: 33678594
What else is going on on the system?
 
LVL 41

Expert Comment

by:HonorGod
ID: 33678618
Do you have the Portal Server installed, or some search engine?

http://www.IBM.com/support/docview.wss?uid=swg21212442

It seems strange that the Dmgr thread is the one that appears hung.
Might a synchronization be in progress trying to interact with very busy node agents and application servers?

 

Author Comment

by:crazywolf2010
ID: 33681123
Hi mate,
I don't have portal server installed.

The issue is fairly intermittent. This morning it started responding again, and now it has gone off again. Do you think the FFDC files/messages would be helpful? I can upload them for you.

Thanks.

 

Author Comment

by:crazywolf2010
ID: 33681172
What else is going on on the system?
-- Nothing as such. This is a test system with a couple of WebSphere application users. We have not made any application or configuration change on the WebSphere DMGR side that would need to be synced with the nodeagent and cause this kind of problem. The application works fine with no impact. It's purely a DMGR issue.

But this does raise another issue: this system is time traveled, which means we sometimes move the system date forward and backward to test application functionality in the future.

Do you think the DMGR uses file timestamps to sync with the node agent, or does it use some hash algorithm? I am asking because if it uses file timestamps, then moving the date back would trigger a sync, and the same thing would happen when we move forward.

I can see some files under /opt/IBM/WebSphere/AppServer/profiles/Profile01/dmgr/temp & wstemp.

How can I find out whether the DMGR is trying to sync something with the nodeagent and the nodeagent is not responding?

 

Author Comment

by:crazywolf2010
ID: 33683995
Hi,
I am monitoring activity today. I am convinced there is something wrong with the DMGR and nodeagent sync. No config was changed in the WebSphere console.

The node log is spitting out errors ...

[6/13/11 16:54:56:034 BST] 00000010 ServiceLogger I com.ibm.ws.ffdc.IncidentStreamImpl resetIncidentStream FFDC0010I: FFDC closed incident stream file /opt/IBM/WebSphere/AppServer/profiles/Profile01/Node/logs/ffdc/nodeagent_00000010_11.06.13_16.54.56_0.txt
[6/13/11 16:55:22:139 BST] 0000003d NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 16:56:22:139 BST] 0000003e NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 16:57:22:155 BST] 00000040 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 16:58:22:273 BST] 00000041 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 16:59:22:183 BST] 00000042 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 17:00:22:179 BST] 00000043 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 17:01:22:186 BST] 00000044 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.

Is there easy way to clean nodeagent/dmgr directories ?
 

Author Comment

by:crazywolf2010
ID: 33732853
Hi,
Any thoughts on this please?
 
LVL 41

Expert Comment

by:HonorGod
ID: 33734318
Sorry for forgetting this... I took some time off to celebrate our anniversary with my bride...

When I reread this, I'm confused.

The initial question asks about the DM not responding, but the question in the most recent update says:

> Is there easy way to clean nodeagent/dmgr directories ?

If the DM is not responding, I would investigate what the system is doing (i.e., what processes are busiest, and what they are doing) at the time the DM is not being responsive.

If the question is "clean nodeagent/dmgr directories", I need to know which directories we're talking about.
 

Author Comment

by:crazywolf2010
ID: 33735403
Hi HonorGod,
I took some time off to celebrate our anniversary with my bride...
-- OH, Happy anniversary mate. Hope you had a good time.

The initial question asks about the DM not responding
-- Yeap this is still an issue. The DMGR responds intermittently. Despite my best efforts I am not able to see any errors at dmgr/SystemOut.log & don't know what is wrong.

If the DM is not responding, I would investigate what the system is doing (i.e., what processes are busiest, and what they are doing) at the time the DM is not being responsive.
-- I could see DMGR logs sending errors like one below
[4/25/11 10:35:10:429 BST] 0000002a ThreadMonitor W   WSVR0605W: Thread "WebContainer : 0" (00000030) has been active for 764028 milliseconds and may be hung.  There is/are 1 thread(s) in total in the server that may be hung.

When I looked at the nodeagent log I could see the following messages every single minute, which wasn't the case in the past when it was working. I guess the nodeagent pulls config data from the DMGR, so in theory the DMGR should be absolutely fine, but it isn't.

[6/13/11 16:55:22:139 BST] 0000003d NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[6/13/11 16:56:22:139 BST] 0000003e NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.

How can I trace what is going wrong? DMGR is a lifeline for us. If it's not available I am unable to monitor threads/session, deploy application changes so it's as good as entire system is down.

Regards
 
LVL 41

Expert Comment

by:HonorGod
ID: 33736010
> -- OH, Happy anniversary mate. Hope you had a good time.
  Thanks, yes, we did... but now, I have to pay for it... ;-)

> -- I could see DMGR logs sending errors like one below

- Yes, but what else is going on (on that system) at the same time?
- What % of the CPU is being used?
- What process(es) are using the most CPU, and memory?
- If the DM JVM is busy, it may be doing some garbage collection, or it might be "busy" trying to synchronize with some "non-responsive" nodeagent...

The time between the synchronizations appears to be 60 seconds.

- How dynamic is your environment?
- Does the DM really need to have a resynchronization interval that short?
  For a production environment, a 60 second synchronization interval is very, very short.
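
To confirm the interval from the node agent log rather than eyeballing it, something like the following could be used. It is only a sketch: it assumes the log lines start with the "[M/D/YY H:MM:SS:mmm TZ]" stamp shown in this thread, it ignores the time zone label, and the names STAMP and sync_intervals are made up for illustration.

```python
import re
from datetime import datetime

# Matches the leading "[M/D/YY H:MM:SS:mmm TZ]" stamp on a log line.
# Assumption: every line of interest starts with this stamp.
STAMP = re.compile(r"^\[(\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2}:\d{3}) \w+\]")


def sync_intervals(lines, message_id="ADMS0003I"):
    """Return the seconds between consecutive lines that carry message_id."""
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match and message_id in line:
            # %f reads the three millisecond digits as a fraction of a second.
            stamps.append(datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S:%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Three of the node agent lines quoted earlier in this thread.
sample = [
    "[6/13/11 16:55:22:139 BST] 0000003d NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.",
    "[6/13/11 16:56:22:139 BST] 0000003e NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.",
    "[6/13/11 16:57:22:155 BST] 00000040 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.",
]
print(sync_intervals(sample))
```

On the sample lines the gaps come out at roughly 60 seconds each. On a system whose clock is moved around, a negative or very large gap would show where the date was changed.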

> How can I trace what is going wrong?

What Operating System are you using?  From the file system (i.e., "/opt/IBM/WebSphere/AppServer"), I can tell it is some kind of *ix environment.

 

Author Comment

by:crazywolf2010
ID: 33741866
Hi,
I have added my comments

- Yes, but what else is going on (on that system) at the same time?
--- Absolutely nothing. There are hardly 3-4 users on the system. I have waited hours and days for the situation to improve but it doesn't. It seems the DMGR has a mind of its own about when to respond.

- What % of the CPU is being used?
-- The CPU is ticking at 98% idle most of the time. I just got following reading and DMGR is still not responding.
Cpu(s):  1.6%us,  0.0%sy,  0.0%ni, 98.2%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st

- What process(es) are using the most CPU, and memory?
---  CPU - no one,

Memory- as below
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30356 was61     18   0 3022m 1.7g  53m S  1.7 22.2  27:30.23 java
30144 was61     16   0 3024m 2.8g  53m S  1.0 36.4 116:02.84 java
 1541 root      10  -5     0    0    0 S  0.3  0.0   1:33.62 kjournald

- If the DM JVM is busy, it may be doing some garbage collection, or it might be "busy" trying to synchronize with some "non-responsive" nodeagent...
--- I have bounced (stopserver.sh-startserver.sh and init 6) many times to resolve this issue but no gain. I suspect it is trying to sync but how can I debug that?

- How dynamic is your environment?
--- This is a test system and I can't remember making any configuration change for last 3 months.

- Does the DM really need to have a resynchronization interval that short?
--- I noticed this and I need to alter that.

For a production environment, a 60 second synchronization interval is very, very short.
--- I did suggest it in past but since this is how it is setup people are reluctant to change it. They have not seen any issues so far with it. U may laugh but this is so called "best practice" put together by IBM consultant.

What Operating System are you using?  From the file system (i.e., "/opt/IBM/WebSphere/AppServer"), I can tell it is some kind of *ix environment.
-- Yeap RHEL 5.3

 
LVL 41

Expert Comment

by:HonorGod
ID: 33742809
> U may laugh but this is so called "best practice" put together by IBM consultant.
  Well, for a development, or test environment, that is a reasonable setting.  I feared that you had a huge, stable, production environment and for that this 1 minute interval seems to be too short.

> I suspect it is trying to sync but how can I debug that?
 Investigating.
 

Author Comment

by:crazywolf2010
ID: 33742909
Hi,
The error below is unusual for a DMGR, isn't it? I just want to know where I can update the interval to monitor threads for a longer duration.

Following error was raised after 12 mins.
[4/25/11 10:35:10:429 BST] 0000002a ThreadMonitor W   WSVR0605W: Thread "WebContainer : 0" (00000030) has been active for 764028 milliseconds and may be hung.  There is/are 1 thread(s) in total in the server that may be hung.
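
For reference, 764028 milliseconds is about 12.7 minutes. A small sketch for pulling those figures out of SystemOut.log follows. The regular expression is based only on the WSVR0605W line quoted here, and HUNG and hung_threads are illustrative names, not anything WebSphere provides.

```python
import re

# Pattern derived from the single WSVR0605W line quoted in this thread.
HUNG = re.compile(
    r'WSVR0605W: Thread "(?P<name>[^"]+)" \((?P<id>[0-9a-fA-F]+)\) '
    r"has been active for (?P<ms>\d+) milliseconds"
)


def hung_threads(lines):
    """Yield (thread name, thread id, minutes active) for each WSVR0605W line."""
    for line in lines:
        match = HUNG.search(line)
        if match:
            minutes = int(match.group("ms")) / 60000.0
            yield match.group("name"), match.group("id"), minutes


sample_line = (
    '[4/25/11 10:35:10:429 BST] 0000002a ThreadMonitor W   WSVR0605W: Thread '
    '"WebContainer : 0" (00000030) has been active for 764028 milliseconds '
    "and may be hung."
)
for name, thread_id, minutes in hung_threads([sample_line]):
    print("%s (%s) active for %.1f minutes" % (name, thread_id, minutes))
```

The thread name and id are the useful part: they are what to look for in a thread dump taken while the console is unresponsive.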
 
LVL 41

Expert Comment

by:HonorGod
ID: 33742914
TroubleShooting: Synchronization problems
http://www.IBM.com/support/docview.wss?rs=180&uid=swg21199305
 

Author Comment

by:crazywolf2010
ID: 33743204
Hi ,
Just read the URL above. Fortunately/unfortunately, none of those messages appear in any of the log files.

I can see the following success messages in the node log, so I am inclined to believe the sync is all right.
[10/26/10 11:19:49:421 BST] 0000003f NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[10/26/10 11:20:49:468 BST] 00000040 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[10/26/10 11:21:49:449 BST] 00000042 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.
[10/26/10 11:22:49:493 BST] 00000043 NodeSyncTask  A   ADMS0003I: The configuration synchronization completed successfully.

Regards
 
LVL 41

Accepted Solution

by:
HonorGod earned 500 total points
ID: 33743254
Drat.

We're now getting into the realm of trying to figure out exactly what is happening during the DM "hang".

Unfortunately, the best way to understand and resolve this kind of complex issue involves the generation of multiple "thread dumps" on both the DM and the node agents, and working with IBM technical support to have them analyze these (since they have access to the source code).

Do you have a support contract for your WebSphere Application Server environment?

Is calling IBM technical support an option?
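
On a *ix system, thread dumps are usually requested by sending SIGQUIT to the JVM (the same as running kill -3 against its process id); the process keeps running and, with the IBM JDK, normally writes a javacore file each time. A minimal sketch of taking several dumps a fixed time apart is below. The function name, the default of three dumps, and the 30 second spacing are assumptions for illustration, and the process id has to be the one belonging to the DM or node agent JVM on your own system.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_s=30, kill=os.kill, sleep=time.sleep):
    """Send SIGQUIT to the process `count` times, `interval_s` seconds apart.

    `kill` and `sleep` can be replaced with stand-ins so the function can be
    tried out without signalling a real process.
    """
    for attempt in range(count):
        kill(pid, signal.SIGQUIT)
        # No need to wait after the last dump has been requested.
        if attempt < count - 1:
            sleep(interval_s)
    return count


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 2:
        # Usage: python dumps.py <pid of the DM JVM>
        request_thread_dumps(int(sys.argv[1]))
```

Several dumps taken while the console is hung are more useful than one, because a thread that sits in the same stack across all of them is the likely culprit.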
 

Author Comment

by:crazywolf2010
ID: 33743833
Hi,
This is an issue for me. We do have support contract but that is thru head office which is in a different European country. They don't understand English so I will have to learn another language & I don't want to go down that route.

Do you have any notes on how to debug /clean or restore DMGR part only, if it is screwed up? Note that my applications are working fine, with no issues there.

Regards
 
LVL 41

Expert Comment

by:HonorGod
ID: 33748325
What, exactly, do you mean when you ask how to "debug /clean or restore DMGR part"?
 

Author Comment

by:crazywolf2010
ID: 33752129
"debug /clean"
-- I mean to say, is it possible to enable tracing and debug bit more in detail.

or restore DMGR part"?
-- I have previous backup(say month earlier). I was thinking to just restore "DMGR" directory under profile. Not sure if that will resolve the issue. What do you think?


 
LVL 41

Expert Comment

by:HonorGod
ID: 33752682
> I have previous backup(say month earlier). I was thinking to just restore "DMGR" directory under profile. Not sure if that will resolve the issue. What do you think?

That is a possibility, but there is no guarantee.  It all depends upon the underlying issue, which we have yet to identify.

If the issue is that recent, it is possible that it has something to do with changes that may have occurred. However, are these changes local to the DM, or might they exist elsewhere in the configuration?  Might there be changes that have occurred in the network that have affected the synchronization process (e.g., bridge, router, firewall settings)?  I don't know.

> ... is it possible to enable tracing and debug bit more in detail.
Well, we can increase the verbosity of the tracing, but it isn't clear what we will learn from it.
At the highest level of tracing, there is information about what function/method calls are occurring, and that is most useful if you have access to the source code. So, this information is most useful to the developers of the product.
 

Author Comment

by:crazywolf2010
ID: 33752959
Hi HonorGod,
I agree with all of your points, but I can't give up on this issue. If the problem persists, it means rebuilding the entire environment, as I don't have a wsadmin Jython script that will do the job of the DMGR console.

Do you have a note on tracing purely non responsive DMGR issues?

Regards

 
LVL 41

Expert Comment

by:HonorGod
ID: 33755734
Most of the stuff that I have seen suggests that you try to rule any possible network issue out of the equation, if at all possible.

Are you familiar with network tracing tools such as WireShark (http://www.wireshark.org/)?
 

Author Comment

by:crazywolf2010
ID: 33787491
Hi HonorGod,
I agree with you . The network is managed by a remote team and there are chances they have altered some settings thinking it won't affect us. Unfortunately it is not in my control.

Hence what I am trying to say here is, there must be a way to trace DMGR process which will indicate what it is waiting for. Once equipped with those details, I can then go after someone with a stick.

To rule out network/firewall issues, I just tried the DMGR URL from the server itself, and it stops responding after the initial connection.
--2011-01-05 11:20:28--  https://prod:9043/ibm/console/logon.jsp
Resolving prod... 172.30.19.128
Connecting to prod|172.30.19.128|:9043... connected.
While all other boxes respond like below
[was61@TEST_BOX ~]$ wget https://TEST_BOX:9043/ibm/console/logon.jsp
--2010-11-29 13:22:30--  https://TEST_BOX:9043/ibm/console/logon.jsp
Resolving TEST_BOX... 172.30.19.129
Connecting to TEST_BOX|172.30.19.129|:9043... connected.
ERROR: cannot verify TEST_BOX's certificate, issued by `/C=US/O=IBM/CN=localhost.localdomain':
  Self-signed certificate encountered.
ERROR: certificate common name `localhost.localdomain' doesn't match requested host name `TEST_BOX'.
To connect to TEST_BOX insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.
Best Regards
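
The difference between the two boxes above is that one gets an answer at the SSL layer (even if only a certificate complaint) and the other goes quiet after the TCP connect. A rough sketch of a probe that reports which stage is reached is below. The function name, the result labels, and the 10 second timeout are illustrative assumptions; certificate checking is switched off only because the consoles here use self-signed certificates.

```python
import socket
import ssl


def probe_https(host, port=9043, timeout_s=10.0):
    """Report how far a connection to an HTTPS port gets.

    Returns "tcp-failed", "tls-timeout", "tls-error" or "tls-ok".
    This checks reachability only; it does not validate the certificate.
    """
    try:
        raw = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "tcp-failed"
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        # The handshake runs inside wrap_socket and honours the socket timeout.
        with context.wrap_socket(raw, server_hostname=host):
            return "tls-ok"
    except socket.timeout:
        # Connected, but the server never answered the handshake.
        return "tls-timeout"
    except (ssl.SSLError, OSError):
        # The server answered, but the handshake was rejected or broken off.
        return "tls-error"
    finally:
        raw.close()


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 2:
        print(probe_https(sys.argv[1]))
```

A "tls-timeout" result matches what wget shows for the problem box: the port accepts the connection but nothing behind it responds, which points at the DMGR process rather than the network. A "tls-error" result can simply mean the client and an older server could not agree on a protocol version, so it still shows the server is answering.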
 
LVL 41

Expert Comment

by:HonorGod
ID: 33794605
You could try using tracertp to see the route used (through your network) to see where the TCP/IP packets flow.

You might also take a look at the "-r" ping option to "Record route".
 

Author Comment

by:crazywolf2010
ID: 33820928
Hi There,
The DMGR was responding last week but it is off again.

I don't have the privileges to use tracertp, and ping -R $HOSTNAME returns the right IP and the same route:

$ ping -R prod_server
PING prod_server.dmz.com (172.30.9.128) 56(124) bytes of data.
64 bytes from prod_server.dmz.com (172.30.9.128): icmp_seq=1 ttl=64 time=0.032 ms
RR:     prod_server.dmz.com (172.30.9.128)
        prod_server.dmz.com (172.30.9.128)
        prod_server.dmz.com (172.30.9.128)
        prod_server.dmz.com (172.30.9.128)

64 bytes from prod_server.dmz.com (172.30.9.128): icmp_seq=2 ttl=64 time=0.061 ms  (same route)
64 bytes from prod_server.dmz.com (172.30.9.128): icmp_seq=3 ttl=64 time=0.061 ms  (same route)
64 bytes from prod_server.dmz.com (172.30.9.128): icmp_seq=4 ttl=64 time=0.040 ms  (same route)
64 bytes from prod_server.dmz.com (172.30.9.128): icmp_seq=5 ttl=64 time=0.043 ms  (same route)
64 bytes from prod_server.dmz.com (172.30.9.128): icmp_seq=6 ttl=64 time=0.061 ms  (same route)
 

Author Closing Comment

by:crazywolf2010
ID: 33958377
Issue resolved
 
LVL 41

Expert Comment

by:HonorGod
ID: 33958907
Thanks for the grade & points.

Good luck & have a great day.