We have a batch job called "C". It is a J2EE application client that runs third in our batch sequence. The first two jobs, "A" and "B", are also J2EE application clients. "A" starts at 3:30 AM, followed by "B". Both jobs complete successfully in a little under 3 minutes, after which "C" kicks off at around 3:33 AM. Normally "C" takes less than 15 minutes to finish, and it ran fine for the past 1.5 years. For the past two months, however, we have been seeing strange behaviour: "C" hangs for almost two hours and nothing is written to the logs. After about 2 hours, at around 5:33 AM, it errors out with the following message in the logs:
COM.ibm.db2.jdbc.DB2Exception: [IBM][CLI Driver] SQL30081N A communication error has been detected. Communication protocol being used: "TCP/IP". Communication API being used: "SOCKETS". Location where the error was detected: "<IP Address>". Communication function detecting the error: "recv". Protocol specific error code(s): "73", "*", "0". SQLSTATE=08001
This takes more than 2 hours away from the jobs that are supposed to run after "C" and delays them. So nowadays we monitor "C" for about an hour. If it has not completed successfully by then, we cancel it and rerun it. The second attempt always succeeds.
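The manual monitor, cancel and rerun routine described above could be automated with a small wrapper along the following lines. This is only a sketch under several assumptions: it needs Java 8 or later, the class name BatchWatchdog and the method runWithRetry are made up for illustration, and the job is assumed to be callable as a java.util.concurrent.Callable. It also does not explain the hang itself, and interrupting a thread may not unblock a call that is stuck inside a native database driver, so the abandoned attempt runs on a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: gives each attempt a time limit and reruns the job
// if an attempt hangs or fails, instead of waiting for a manual cancel.
public class BatchWatchdog {

    static <T> T runWithRetry(Callable<T> job, long timeout, TimeUnit unit,
                              int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Fresh single-thread executor per attempt, so a stuck attempt
            // cannot block the next one.
            ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "batch-attempt");
                t.setDaemon(true); // a stuck attempt must not keep the JVM alive
                return t;
            });
            Future<T> future = executor.submit(job);
            try {
                // Wait at most the given time for this attempt to finish.
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // best effort: interrupts the worker thread
                last = e;
            } catch (ExecutionException e) {
                // The job itself threw; remember the underlying cause.
                Throwable cause = e.getCause();
                last = (cause instanceof Exception) ? (Exception) cause : e;
            } finally {
                executor.shutdownNow();
            }
        }
        throw last; // every attempt timed out or failed
    }
}
```

A call such as BatchWatchdog.runWithRetry(job, 30, TimeUnit.MINUTES, 2) would abandon a first attempt that has not finished after 30 minutes and start a second one. The 30 minute limit is an arbitrary illustrative value, chosen only because "C" normally finishes in under 15 minutes.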
We have not been able to figure out why it fails the first time. Any help or pointers would be appreciated. Thanks.