Solved

oc4j_socket_recvfull generating http 500

Posted on 2004-03-23
21
20,783 Views
Last Modified: 2013-12-01
After some use (usually a couple of hours), the OC4J_Portal can no longer service requests and an HTTP 500 is generated. We've tried increasing the number of processes and this does not help. We've tried changing the threading and memory models as well.

Config is:
subnet 12 (internet):
- lbprd02/4 (aix): ibm websphere edge server load balancers
- - passes www.internetdomain.com:80 to webcachecluster1:7778
- - passes login.internetdomain.com:7777 to sso:7777
- fecprd01 (rh): mid-tier 10g (webcachecluster1), sso 10g
- fecprd02 (rh): mid-tier 10g (webcachecluster1), sso 10g
subnet 11 (intranet):
- lbprd01/3 (aix): ibm websphere edge server load balancers
- - passes www.intranetdomain.com:80 to webcachecluster2:7777
- feaprd01 (rh): mid-tier 10g (webcachecluster2) (uses SSO on subnet 12)
- feaprd02 (rh): mid-tier 10g (webcachecluster2) (uses SSO on subnet 12)
subnet 42 (intranet):
- cslprd01 (aix): infra 9.2.0.3 w/ portal schemas portal+portala,
- - DB Server 9.2.0.4 with customer dbs

We're getting the following errors in the apache error_log just before recieving HTTP 500 errors:

 oc4j_socket_recvfull timed out
(4)Interrupted system call: MOD_OC4J_0038: Receiving data from oc4j exceeded the configured "Timeout" value and the error code is 4.
MOD_OC4J_0054: Failed to call network routine to receive an ajp13 message from oc4j.

I suspect an error in ajp13 is consuming connections or some other resource and never releasing them. However, changing the OC4J to connectionless (using Oc4jCacheSize 0) does not seem to help. This error resembles the mod_jk error http://nagoya.apache.org/bugzilla/show_bug.cgi?id=10383, so I suspect something like a malformed URL slowly blocks us.

If needed I have an example of the errors when Apache is in mode debug.
0
Comment
Question by:rkogelhe
  • 9
  • 6
  • 3
  • +1
21 Comments
 
LVL 9

Expert Comment

by:vzilka
ID: 10670417
Can you decrease the number of connections to the AJP13 to 1, and then check the URL that creates the problem?
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 10675152
We don't know exactly what URL causes the problem and I'm not sure how to limit the connections. The OC4J is multithreaded. We suspect that its related to struts...
0
 
LVL 9

Expert Comment

by:vzilka
ID: 10675163
Is it your own application that you check?
0
 
LVL 9

Expert Comment

by:vzilka
ID: 10675165
Sorry.
I mean do you suspect an application you deployed on the server?
Try to access a page and check the apache logs for the error all the time.
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 10685925
The app is a jsp / bc4j app deployed in BI_Forms providing for Portal. We can already see the error above, but here is the error in detail:

[Mon Mar 22 14:45:57 2004] [warn] [client 123.45.678.30] oc4j_socket_recvfull timed out
[Mon Mar 22 14:45:57 2004] [warn] [client 123.45.678.30] oc4j_socket_recvfull timed out
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] (4)Interrupted system call: MOD_OC4J_0038: Receiving data from oc4j exceeded the configured "Timeout" value and the error code is 4.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] (4)Interrupted system call: MOD_OC4J_0038: Receiving data from oc4j exceeded the configured "Timeout" value and the error code is 4.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0054: Failed to call network routine to receive an ajp13 message from oc4j.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0054: Failed to call network routine to receive an ajp13 message from oc4j.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0033: Failed to receive an ajp13 message from oc4j.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0033: Failed to receive an ajp13 message from oc4j.
[Mon Mar 22 14:45:57 2004] [warn] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0078: Network connection errors happened to host: fecprd01.internetaccess.mydomain.com and port: 3304 while receiving the first response from oc4j.  This request is recoverable.
[Mon Mar 22 14:45:57 2004] [warn] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0078: Network connection errors happened to host: fecprd01.internetaccess.mydomain.com and port: 3304 while receiving the first response from oc4j.  This request is recoverable.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_proc_tbl.c(1840): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0228: Process (opmnid=219349677) removed from index list of worker OC4J_Portal.  Global index:3, local index:0, remaining entries in local index list: 0
[Mon Mar 22 14:45:57 2004] [debug] oc4j_proc_tbl.c(775): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0226: Freed process (opmnid=219349677) from mod_oc4j proc table.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_proc_tbl.c(1746): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0175: After removing entry (index = 3 with (opmnid=219349677) from mod_oc4j proc table, first_free_slot: 2 and nelts: 4.
[Mon Mar 22 14:45:57 2004] [info] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0123: Mark network worker: OC4J_Portal_15 to be in_error_state because it failed to service the request and it is recoverable.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_proc_tbl.c(1647): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0161: Failed to find the process with (opmnid=219349677) in internal process table to remove.
[Mon Mar 22 14:45:57 2004] [info] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0123: Mark network worker: OC4J_Portal_15 to be in_error_state because it failed to service the request and it is recoverable.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_util.c(591): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0133: Request cookie's jsessionid is 8123a41ece8452b49abd681444292efa2f3ede91f7c.nAjxq79ya34InljMpR1zmkTMplbNmQ5IpAzInMTvmQTDnkLNnkSImQXHcxaNa3eUePWLch8IahmS-xuIaxyIn3uKmAiN-Anvc3iLaxySaAmI-huKa30xq6XMr65GlR9zq6XMr7aL-AnzmR1Mn30L-ATMqR0ImkbIokjGqQjI-AbJpgaMahyNb3yQbNqxf2bdeNf+lP1JqBfvp2bynknvrkLOlQzNp65In0__.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_util.c(607): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0134: Decoded oc4j jvm route is fecprd01.internetaccess.mydomain.com:3304:C_198.765.8.29.d80be3.fa8512982f.-8000#portal_reports1.fecprd01.myapp.mydomain.com#219349677#D#OC4J_Portal#default_island.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_util.c(591): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0133: Request cookie's jsessionid is 8123a41ece8452b49abd681444292efa2f3ede91f7c.nAjxq79ya34InljMpR1zmkTMplbNmQ5IpAzInMTvmQTDnkLNnkSImQXHcxaNa3eUePWLch8IahmS-xuIaxyIn3uKmAiN-Anvc3iLaxySaAmI-huKa30xq6XMr65GlR9zq6XMr7aL-AnzmR1Mn30L-ATMqR0ImkbIokjGqQjI-AbJpgaMahyNb3yQbNqxf2bdeNf+lP1JqBfvp2bynknvrkLOlQzNp65In0__.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_ac_worker.c(703): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0139: Found a matched network worker: OC4J_Portal_15 with host: fecprd01.internetaccess.mydomain.com, port: 3304, opmnid: 219349677 and ias-instance: portal_reports1.fecprd01.myapp.mydomain.com in the child worker list.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_util.c(607): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0134: Decoded oc4j jvm route is fecprd01.internetaccess.mydomain.com:3304:C_198.765.8.29.d80be3.fa8512982f.-8000#portal_reports1.fecprd01.myapp.mydomain.com#219349677#D#OC4J_Portal#default_island.
[Mon Mar 22 14:45:57 2004] [info] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0151: Network worker: OC4J_Portal_15 with host: fecprd01.internetaccess.mydomain.com and port: 3304 stored in cookie/jsessionid's jvm route is in_error_state. Is about to do failover.
[Mon Mar 22 14:45:57 2004] [debug] oc4j_ac_worker.c(703): [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0139: Found a matched network worker: OC4J_Portal_15 with host: fecprd01.internetaccess.mydomain.com, port: 3304, opmnid: 219349677 and ias-instance: portal_reports1.fecprd01.myapp.mydomain.com in the child worker list.
[Mon Mar 22 14:45:57 2004] [info] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0151: Network worker: OC4J_Portal_15 with host: fecprd01.internetaccess.mydomain.com and port: 3304 stored in cookie/jsessionid's jvm route is in_error_state. Is about to do failover.
[Mon Mar 22 14:45:57 2004] [warn] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0187: Failed to find an oc4j process within destination:[OC4J_Portal] and island id: default_island for session failover.
[Mon Mar 22 14:45:57 2004] [warn] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0187: Failed to find an oc4j process within destination:[OC4J_Portal] and island id: default_island for session failover.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0150: Failed to deterministicly find a failover oc4j process for session request for island: default_island for destination: OC4J_Portal.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0150: Failed to deterministicly find a failover oc4j process for session request for island: default_island for destination: OC4J_Portal.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0119: Failed to get an oc4j process for destination: OC4J_Portal
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0119: Failed to get an oc4j process for destination: OC4J_Portal
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0013: Failed to call destination: OC4J_Portal's service() to service the request.
[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] MOD_OC4J_0013: Failed to call destination: OC4J_Portal's service() to service the request.
[Mon Mar 22 14:45:57 2004] [debug] mod_osso.c(3846): [client 123.45.678.30] [ecid: 195348896995,3]
[OSSO] ===== REQUEST =====
[OSSO] POST /app/classes HTTP/1.1

[Mon Mar 22 14:45:57 2004] [debug] mod_osso.c(3846): [client 123.45.678.30] [ecid: 195348896995,3]
[OSSO] ===== REQUEST =====
[OSSO] POST /app/classes HTTP/1.1

[Mon Mar 22 14:45:57 2004] [debug] mod_osso.c(337): [client 123.45.678.30] [ecid: 195348896995,3]
[OSSO] D26: process_idle_time()

[Mon Mar 22 14:45:57 2004] [debug] mod_osso.c(718): [client 123.45.678.30] [ecid: 195348896995,3]
[OSSO] D03: update_request()

[Mon Mar 22 14:45:57 2004] [error] [client 123.45.678.30] [ecid: 195348896995,3] File does not exist: /9IAS_MIDDLE/app/oracle/product/oas10g/Apache/Apache/htdocs/error_docs/500
0
 
LVL 9

Expert Comment

by:vzilka
ID: 10690291
Do you have any info in the OC4J logs?
What ias version are you using?
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 10700666
Version info depends on component (see the original note). There's nothing strange in the OC4J logs. Currently doing a Linux OS patch and that may help. Oracle guys suspect a network error...
0
 
LVL 9

Expert Comment

by:vzilka
ID: 10702096
From my experience with AJP13 in IIS and apache, if the AJP protocol recieves reserved words in cookie names or header parameters (post) then connections are thrown into a locked situation.
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 10704858
Ahh... this sounds a lot like this error on bugzilla. Can you tell me more about how this would happen. We see that it happens with Struts mostly. As I don't understand exactly how either AJP or Struts work, I can't really say if this error is a likely candidate. Do you know anything about Struts?
0
Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

 
LVL 3

Author Comment

by:rkogelhe
ID: 10704871
Also, what is the behaviour of the OC4J and Apache when this happens? Presumably, we'll end up with a lot of TCP connections between the two and eventually run out of ports in the standard port-range? I do see a lot of close-waits sometimes.

Is it usually the httpd that messes up or is it the OC4J?
0
 
LVL 9

Accepted Solution

by:
vzilka earned 500 total points
ID: 10704884
I think it is the OC4J part.
The place I saw it in also used Struts, but the problem was a request parameter they defined, which had a reserved name. The problem is that in Tomcat it was very easy to set up the logging, and I am not sure about your loging abilities in OC4J
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 10982755
Hi, sorry I haven't updated this question. I haven't forgotten about it, but I believe the issue has been resolved and am attempting to find out what was done (I've left the project). I thought it would be more interesting to find out the exact solution used.

If I haven't a solution in four days I have no issue with the points going to vzilka and thank him as well.
0
 

Expert Comment

by:mashwort
ID: 12462592
I'm hitting the exact same problem and oracle support aren't too successful so far (its been a week at priority 1).
It looks like OC4J hangs. It appears to stop listening on the default port 3301.
We've been going thread dumps on the OC4J process and all the threads are locked waiting for various things)
Question is why doesn't it use more threads and why isnt it listening on the port.

So anyone know the answer to this question?
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 12466183
I saw the guys on the team in August and asked them how they resolved the problem and they said that they applied a patch associated provided by Oracle and also increased some resource associated. I can't remember which and I'll ask the engineer to look at this post. The metalink iTar that we opened should be visible if you search for this, but I'm not sure.
0
 

Expert Comment

by:mashwort
ID: 12865076
this isnt really a solution is it?
They applied 'a patch' and increased 'a resource'
the iTAR 'should' be visible.
I've already fixed this - no thanks to this website.
I joined experts exhange as I it promised a solution which it didnt have.
I cancelled my subscription immediately and am still being charged by this website.

THIS SITE IS A CON.

0
 
LVL 3

Author Comment

by:rkogelhe
ID: 12865251
I have to agree that I rarely get answers to the questions I post here. I think it has to do with the the EE mechanisms. Unanswered questions just disappear off the lists and nobody sees them.

Anyway. I was at another clients site last week and saw the same errors in 10g. There is a metalink document with this error now. Basically they say that the problem is often related to the OHS setting KeepAlive and a bug in IE5 and IE6 with POSTs. Basically it says to patch IE or turn KeepAlive off.

How did you solve the problem?
0
 

Expert Comment

by:mashwort
ID: 12865463

Oh ta, I thought it was just me!
there are a few problems:
Yes you are right, you get occasional errors chucked up from some IE problem, so you have to switch keepalive off.
there was also a problem with a firewall (the Application servers are in a DMZ) timing out JDBC connections in the pool, OC4J didnt know they were timed out, when load increased and it tried to use them again, ALL the threads hung - locked on the thread trying to use a dead JDBC connection.
we also had lots of application errors (max open cursors, java errors) which are seen as timeouts by apache. It causes the OC4J process to by dynamically removed from the routing list, so you've got no process to send your requests to (until its added back into the list again).

So the Actual solutions are:
get your application right
switch off firewall timeout between App server and DB, or use keepalive on the db listener, or use a connection pool of 0.
We are currently using static OC4J routes to stop the dynamic removal.

regards,
     Mark
0
 
LVL 3

Author Comment

by:rkogelhe
ID: 12865779
Thanks Mark. Personally, I think there's something wrong with the whole AJP / worker implementation in the Oracle container. It should not be so vulnerable to external communication issues.
0
 

Expert Comment

by:jason_weiss
ID: 20631465
I was having the same problem.

Metalink doc ID: <b> Note:333862.1 </b> was particularly helpful in helping to determine the cause (in our case a performance problem in the data layer of our application).

0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

Most of the developers using Tomcat find it easy to configure the datasource in Server.xml and use the JNDI name in the code to get the connection.  So the default connection pool using DBCP (or any other framework) is made available and the life go…
This article is about some of the basic and important steps to be used to improve the performance in web-sphere commerce application development. 1) Always leverage the Dyna-caching facility provided by the product 2) Remove the unwanted code …
This video discusses moving either the default database or any database to a new volume.
This video shows how to remove a single email address from the Outlook 2010 Auto Suggestion memory. NOTE: For Outlook 2016 and 2013 perform the exact same steps. Open a new email: Click the New email button in Outlook. Start typing the address: …

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now