After some use (usually a couple of hours), the OC4J_Portal can no longer service requests and an HTTP 500 is generated. We've tried increasing the number of processes and this does not help. We've tried changing the threading and memory models as well.
subnet 12 (internet):
- lbprd02/4 (aix): ibm websphere edge server load balancers
- - passes www.internetdomain.com:80
- - passes login.internetdomain.com:7777 to sso:7777
- fecprd01 (rh): mid-tier 10g (webcachecluster1), sso 10g
- fecprd02 (rh): mid-tier 10g (webcachecluster1), sso 10g
subnet 11 (intranet):
- lbprd01/3 (aix): ibm websphere edge server load balancers
- - passes www.intranetdomain.com:80
- feaprd01 (rh): mid-tier 10g (webcachecluster2) (uses SSO on subnet 12)
- feaprd02 (rh): mid-tier 10g (webcachecluster2) (uses SSO on subnet 12)
subnet 42 (intranet):
- cslprd01 (aix): infra 126.96.36.199 w/ portal schemas portal+portala,
- - DB Server 188.8.131.52 with customer dbs
We're getting the following errors in the Apache error_log just before receiving HTTP 500 errors:
oc4j_socket_recvfull timed out
(4)Interrupted system call: MOD_OC4J_0038: Receiving data from oc4j exceeded the configured "Timeout" value and the error code is 4.
MOD_OC4J_0054: Failed to call network routine to receive an ajp13 message from oc4j.
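To check whether these messages build up gradually before the 500s start, a rough tally of the error_log might help. This is only a sketch: the `tally` function and both regular expressions are my own, and it assumes the usual Apache-style `[Wed Oct 11 14:32:52 2000]` timestamp prefix. Lines without that prefix are counted under "unknown".

```python
import re
import sys
from collections import Counter

# Matches mod_oc4j message codes such as MOD_OC4J_0038.
CODE_RE = re.compile(r"MOD_OC4J_\d{4}")
# Assumed timestamp prefix, e.g. [Wed Oct 11 14:32:52 2000]; captures date+hour and year.
STAMP_RE = re.compile(r"^\[\w{3} (\w{3} +\d+ \d{2}):\d{2}:\d{2} (\d{4})\]")


def tally(lines):
    """Count MOD_OC4J codes overall and per hour (hypothetical helper)."""
    by_code = Counter()
    by_hour = Counter()
    for line in lines:
        codes = CODE_RE.findall(line)
        if not codes:
            continue  # not a mod_oc4j message
        by_code.update(codes)
        m = STAMP_RE.match(line)
        hour = f"{m.group(2)} {m.group(1)}h" if m else "unknown"
        by_hour[hour] += len(codes)
    return by_code, by_hour


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally_oc4j.py /path/to/error_log
    with open(sys.argv[1], errors="replace") as fh:
        codes, hours = tally(fh)
    for code, n in codes.most_common():
        print(code, n)
    for hour, n in hours.items():
        print(hour, n)
```

If the per-hour counts climb steadily until the instance stops answering, that would point to a slow leak rather than a single bad request.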
I suspect an error in ajp13 is consuming connections or some other resource and never releasing them. However, changing the OC4J to connectionless (using Oc4jCacheSize 0) does not seem to help. This error resembles the mod_jk bug at http://nagoya.apache.org/bugzilla/show_bug.cgi?id=10383, so I suspect that something like a malformed URL is gradually blocking us.
If needed, I have an example of the errors logged when Apache is in debug mode.