Java Uncaught Exception

I'd like to have experts' quick assessment of an argument between the infrastructure support team and the Java development team.
We have a Java web application hosted on WAS 7 ND servers. It is a typical 3-tier setup: IHS, WAS, and an Oracle database server.
The servers were recently reinstalled and there were also several code changes. After these changes, the system became unstable, especially when transaction volumes are large.

Usually end users report that the system is extremely slow, and after a while more and more users find they cannot even log in. At that time, we see the messages below in SystemOut, so it looks like the web containers hang. We have to restart the WAS servers. After the restart the system recovers, but the same problem soon starts again.
[7/31/15 10:50:37:813 CST] 00000003 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 66" (00000235) has been active for 905322 milliseconds and may be hung.  There is/are 5 thread(s) in total in the server that may be hung.

[7/31/15 10:50:37:817 CST] 00000003 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 82" (00000248) has been active for 919984 milliseconds and may be hung.  There is/are 6 thread(s) in total in the server that may be hung.

[7/31/15 10:53:37:903 CST] 00000029 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 78" (00000244) has been active for 1064060 milliseconds and may be hung.  There is/are 7 thread(s) in total in the server that may be hung.

[7/31/15 10:53:37:907 CST] 00000029 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 1" (00000026) has been active for 1047950 milliseconds and may be hung.  There is/are 8 thread(s) in total in the server that may be hung.

[7/31/15 10:53:37:910 CST] 00000029 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 2" (00000027) has been active for 1013530 milliseconds and may be hung.  There is/are 9 thread(s) in total in the server that may be hung.

The application development team thinks it is due to inappropriate configuration of the WAS servers, or that WAS cannot handle large result sets (the transactional data increases steadily and no data archiving has been done for 1 year):

[7/31/15 8:48:07:988 CST] 00000278 SystemErr     R java.lang.RuntimeException: the sql result is too large

The infrastructure support team, meanwhile, thinks it is due to bad code quality, especially in the recent changes. One piece of evidence is that there are quite a lot of uncaught exceptions, including java.lang.NullPointerException. For example,

[7/31/15 9:06:29:259 CST] 00000129 ServletWrappe E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E: Uncaught exception created in one of the service methods of the servlet /ZmclMaintain/MtOrderV2/zmclsoSummary.jsp in application oval_war. Exception created : java.lang.NullPointerException

and it might be caused by a data error:
[7/31/15 9:06:06:338 CST] 000004ec SystemErr     R java.text.ParseException: Unparseable date: "2015/7/29 13:56:20"

So, which explanation is more likely? Any suggestions for further investigation?
BTW, infrastructure and application are outsourced to 2 different companies, so we need clear and strong proof to push either side to move :-(
MatthewLiu Asked:

dpearson Commented:
First up - this looks like a performance problem.  As the traffic scales, the site slows down until it effectively becomes unusable and crashes.  So the errors about the threads that are hung are likely effects (of this slowdown) rather than causes.

NullPointer and Unparseable date exceptions are very unlikely to cause a performance problem.  Exceptions like those will generally be due to something failing to work at all.  You look for a date, it's in the unexpected format - so you get an error.  You don't typically get a slowdown.  It's not like it spends 20 mins trying to figure out what the date might mean.  So I would doubt those exceptions are closely tied to the problems (they may however indicate lots of other issues).  It's not impossible they're related - but you can't infer it from the presence of the exceptions themselves.
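To illustrate the point about the date error, here is a minimal sketch of how that kind of failure behaves. The value is the one from the log line in the question; the patterns are assumptions, since the application's actual format string isn't shown. A pattern that doesn't match throws ParseException straight away rather than grinding for a long time.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class DateParseDemo {
    // The value reported in the "Unparseable date" log line from the question.
    static final String LOGGED_VALUE = "2015/7/29 13:56:20";

    /** Returns true if the text parses under the given SimpleDateFormat pattern. */
    static boolean parses(String pattern, String text) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject out-of-range fields instead of rolling them over
        try {
            Date parsed = format.parse(text);
            return parsed != null;
        } catch (ParseException e) {
            // The failure surfaces immediately as an exception; nothing is retried.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical pattern with '-' separators (the real one in the app is unknown).
        System.out.println(parses("yyyy-MM-dd HH:mm:ss", LOGGED_VALUE));
        // A pattern that uses '/' separators and allows single-digit months and days.
        System.out.println(parses("yyyy/M/d HH:mm:ss", LOGGED_VALUE));
    }
}
```

If the real cause is a format mismatch like this, the fix is in the code or the data, but it would show up as a broken page rather than a slow one.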

The question then is what is causing the performance problem?

In my experience the first place you should *always* look for performance issues is the data layer.  If you think about most web applications, what they do is generally very simple.  Something like:

"retrieve some data"
"modify it into some useful form"
"display to user"

or

"accept input from a user"
"encode it into expected form"
"store in database"

The point being that it's actually quite rare for any major processing to happen in the web layer itself.  It's possible there are some big complex calculations happening there - but usually it's just assembling data together, checking some rules and sending it on its way.

So it's usually the place where you collect up or store the data that gets slow - which is the data layer.

Now the cause of that can be poorly written queries (within the Java application) or poor indexing (within the database) or poor configuration of the database itself.

To figure that out, the easiest way is to review the queries currently being sent to the database.  Are there a lot of poorly indexed queries being sent?  Are the queries scanning or returning a really large number of rows?  (The answer to that last one appears to be yes, given the error "the sql result is too large".)
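If it does turn out that some queries return far more rows than a page ever displays, one common way to bound them on Oracle is to wrap the query with ROWNUM so only one page of rows comes back. This is only a sketch under assumptions: the class and method names are made up for illustration, and I don't know which Oracle version or data access layer the application uses.

```java
class PagedQuery {
    /**
     * Wraps an already ordered inner query so that Oracle returns a single page.
     * The result has two bind placeholders: the upper row bound first,
     * then the lower row bound.
     */
    static String wrap(String orderedInnerSql) {
        return "SELECT * FROM ("
                + "SELECT q.*, ROWNUM rn FROM (" + orderedInnerSql + ") q WHERE ROWNUM <= ?"
                + ") WHERE rn > ?";
    }

    /** Value for the first placeholder: last row number of the page (pages are 0-based). */
    static int upperBound(int pageIndex, int pageSize) {
        return (pageIndex + 1) * pageSize;
    }

    /** Value for the second placeholder: number of rows to skip before the page starts. */
    static int lowerBound(int pageIndex, int pageSize) {
        return pageIndex * pageSize;
    }

    public static void main(String[] args) {
        // Hypothetical table and column names, purely for illustration.
        String inner = "SELECT order_id FROM mt_order ORDER BY order_id";
        System.out.println(wrap(inner));
        System.out.println(upperBound(2, 50) + " / " + lowerBound(2, 50));
    }
}
```

The inner query needs a stable ORDER BY, otherwise the pages are not well defined. Whether paging is appropriate at all depends on what the page really needs to show, which is a question for the developers.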

So my suggestion is to at least start at the database and work up.  If you have a DBA working on the project - I'd call them first.

Hope that helps,

Doug

mccarl (IT Business Systems Analyst / Software Developer) Commented:
I would also get the app developers to sort their stuff out. I agree with Doug that it is unlikely that those errors are a direct cause of the problems you are seeing, but the app developers can't really expect anyone to take their argument (that it is an infrastructure problem) seriously if they can't sort out basic stuff like the uncaught exceptions.

Now there are plenty of ways to write code that performs badly against databases, etc., no matter how good the DB optimization is. And an app developer that can't get the above stuff right doesn't fill me with confidence that they can get the more complex stuff right.
MatthewLiu (Author) Commented:
We did check the Oracle database performance. According to the DBA, the load is low, including the disk I/O.
Actually we have 6 WAS instances in the cluster. I am not sure it is really a performance issue, or web containers died one by one.
dpearson Commented:
That's good if disk I/O in particular is low in the database.  Suggests that may not be the root problem.
Was the DBA able to identify what was causing this specific error: "the sql result is too large"?

If the max result set that the database can return is smaller than required to solve the problem, then you might see this and the database would look very healthy (since it's chopping off the results returned?).

"I am not sure it is really a performance issue, or web containers died one by one."
You should be able to tell the difference between these two.  Either the request rate (within a given application server) is getting slower and slower as the load increases (i.e. performance problem) or the request rate remains constant and then the server suddenly falls over (i.e. some ugly coding problem or hitting a capacity limit - e.g. we've had Tomcat servers die because they ran out of Linux file descriptors, which default to a surprisingly low number on some distros).
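One rough way to see the progression from what you already have is to pull the numbers out of the WSVR0605W lines: subtracting the "active for" time from the log timestamp gives the moment each thread started the request it is stuck on. Here's a sketch of that; the class and method names are mine, and the regex is only fitted to the lines pasted in the question, so treat it as a starting point.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HungThreadLog {
    // Fitted to the WSVR0605W lines shown in the question; other layouts may differ.
    private static final Pattern LINE = Pattern.compile(
            "\\[(\\d+/\\d+/\\d+ \\d+:\\d+:\\d+:\\d+) \\w+\\].*WSVR0605W: Thread \"([^\"]+)\".*"
            + "active for (\\d+) milliseconds.*There is/are (\\d+) thread");

    // Timestamp without the zone label, e.g. "7/31/15 10:50:37:813".
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("M/d/yy H:mm:ss:SSS");

    static final class Entry {
        final String thread;
        final LocalDateTime loggedAt;
        final long activeMillis;
        final int hungTotal;

        Entry(String thread, LocalDateTime loggedAt, long activeMillis, int hungTotal) {
            this.thread = thread;
            this.loggedAt = loggedAt;
            this.activeMillis = activeMillis;
            this.hungTotal = hungTotal;
        }

        /** Approximate time the thread started the request it is still working on. */
        LocalDateTime busySince() {
            return loggedAt.minus(Duration.ofMillis(activeMillis));
        }
    }

    /** Parses one SystemOut line; returns null if it is not a WSVR0605W warning. */
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // The zone label (CST) is ignored; all times are treated as server-local.
        LocalDateTime loggedAt = LocalDateTime.parse(m.group(1), STAMP);
        return new Entry(m.group(2), loggedAt,
                Long.parseLong(m.group(3)), Integer.parseInt(m.group(4)));
    }

    public static void main(String[] args) {
        String sample = "[7/31/15 10:50:37:813 CST] 00000003 ThreadMonitor W   WSVR0605W: "
                + "Thread \"WebContainer : 66\" (00000235) has been active for 905322 "
                + "milliseconds and may be hung.  There is/are 5 thread(s) in total in "
                + "the server that may be hung.";
        Entry e = parse(sample);
        System.out.println(e.thread + " busy since " + e.busySince()
                + ", hung threads so far: " + e.hungTotal);
    }
}
```

If the "busy since" times are spread out and the hung count creeps up, that looks like a gradual pile-up under load. If they all cluster around one moment, something specific probably happened at that time.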

You should also check the memory usage on the app servers (relative to the max RAM allocated when the process starts) to make sure it's not just a case of increasing memory usage.  That will cause a slow down (getting close to running out of memory) and then a crash (out of memory).
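As a sketch of what "memory usage relative to the max" means in practice, the standard java.lang.management API exposes both numbers. This only reports on the JVM it runs in, so it would have to run inside the application server (or you'd read the same figures from whatever monitoring the WAS admins already have); the class name is just for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapCheck {
    /** Returns used/max as a fraction, or -1 if the JVM reports no defined maximum. */
    static double usedFraction(MemoryUsage heap) {
        long max = heap.getMax();
        if (max <= 0) {
            return -1.0;
        }
        return (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        // Heap usage of the JVM this code is running in.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d bytes, max=%d bytes, fraction=%.2f%n",
                heap.getUsed(), heap.getMax(), usedFraction(heap));
    }
}
```

A fraction that keeps climbing toward 1.0 between restarts and never drops back after garbage collection would point at memory pressure rather than the database.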

Doug
MatthewLiu (Author) Commented:
We collected the Oracle AWR reports and are analyzing the queries that took a long elapsed time. Some questions will be posted to the Oracle forum. Thanks for all the suggestions.