Solved

Memory Sizing

Posted on 2014-11-14
4
297 Views
Last Modified: 2014-11-20
We're running the Blackboard LMS on Oracle 11gR2 and have always run into problems of some sort.  Problems have increased greatly since their latest upgrade
where they are essentially building a data warehouse with an ETL process kicking off every 20 minutes on top of what had been an exclusively transaction type system.
Their recommendation is now to increase SGA from 12 to 16 and PGA to a minimum of 50 which to my unlearned ears sounds more like pure data warehouse but I thought first i'd ask if that sounds the same to you experts.
0
Comment
Question by:xoxomos
  • 2
  • 2
4 Comments
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40443879
Sorry but we'll need more information.

>>SGA from 12 to 16 and PGA to a minimum of 50

Bytes, Megs, Gigs, Terabytes, ???

My database is under 1TB and I have more than 16G for an SGA.  It is the only database on the server so I gave it as MUCH RAM as I could.  Only because I can.

>>and have always run into problems of some sort

Before suggesting any change, we need more information.  If you are receiving errors, what are they?

If the errors are disk/space related, memory parameters don't help.
0
 

Author Comment

by:xoxomos
ID: 40443963
Gigabytes in each case.  This particular DB is about half a terrabyte.  With this particular set of problems the only ORA- errors we ever see are:

TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
opiodr aborting process unknown ospid (31738) as a result of ORA-609
  Tns error struct:
    ns main err code: 12537


TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
opiodr aborting process unknown ospid (31753) as a result of ORA-609
opiodr aborting process unknown ospid (31793) as a result of ORA-609
opiodr aborting process unknown ospid (31749) as a result of ORA-609


WARNING: inbound connection timed out (ORA-3136)
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.22.119.89)(PORT=49206))
WARNING: inbound connection timed out (ORA-3136)
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.22.119.92)(PORT=58498))
WARNING: inbound connection timed out (ORA-3136)
Thu Nov 13 11:17:55 2014

Increasing that timeout parameter would only drag the inevitable out over a longer period.

 
Fatal NI connect error 12170.



No disk , space type errors in production.

For some unknown reason, the number of connections just go far too high for what is happening.  Since this ETL process runs every twenty minutes invariably it's in the mix when outages occur.  Earlier Blackboard got on the system and found frequently a second ETL process would kick off before the first had completed.
They then gave us an 'improved' :-) ETL process and we had another outage i believe less than twenty four hours later.

>>  >>and have always run into problems of some sort

For instance at the beginning of the year we were running 11.2.0.2.  There was some sort of mutex X bug and the patch did not work for 11.2.0.2 so we needed to upgrade to .04. We upgraded in March.  From March until the end of September we did not experience that kind of problem.  In September Blackboard upgraded and that's when this type of problem started occurring again.  Looking at AWRs, we're again seeing mutex X at the top of background and foreground waits.  Blackboard sent us one the improved procedures.  The next outage, mutex x was about where it should be but scheduler had taken it's place:

Top 10 Foreground Events by Total Wait Time

Event      Waits      Total Wait Time (sec)      Wait Avg(ms)      % DB time      Wait Class
resmgr:cpu quantum      418,149      386.1K      923      86.4      Scheduler
buffer busy waits      1,220      9124      7479      2.0      Concurrency

Wait Classes by Total Wait Time

Wait Class      Waits      Total Wait Time (sec)      Avg Wait (ms)      % DB time      Avg Active Sessions
Scheduler      418,600      386,163      923      86.4      217.1
Concurrency      8,115      18,949      2335      4.2      10.7
Other      7,083      15,600      2202      3.5      8.8

I believe Blackboard again changed some procedures because on the next outage mutex X had regained top spot :-)
Next they had us deleting a bunch of data.  Yet another outage.  Now the cure is  AT LEAST 50G PGA and at least 16G sga.
I don't know very much, but it sounds to me like they've been throwing it against the wall and waiting to see what sticks.  For instance, they gave us configurations which they divided into standard, some other class and advanced.  They put us in advanced. Advanced is for 8000 concurrent sessions.  Currently we exceed all the standard requirements except cpu and we meet the requirement for cpu.  According to their paper should handle up to 2000 concurrent sessions.  Our range normally varies between 600 and 900 concurrent sessions.  Other schools that have not upgraded to the version with these ETL type processes are not having this type of problem.
0
 
LVL 76

Accepted Solution

by:
slightwv (䄆 Netminder) earned 500 total points
ID: 40444001
I've only worked on one system in my career that had over 100 concurrent connections (it had around 8000) so my experience with this type of system is limited.

There are a few errors that you posted.

Unfortunately they can all mean different things.

I would start with the following doc on Oracle Support:
Alert.log shows ORA-609 with TNS-12537: TNS:connection closed (Doc ID 1538717.1)

The "TNS-12535: TNS:operation timed out" by itself is 99.99% of the time a firewall issue but combined with the others is likely an error caused by the other problems.

>>For some unknown reason, the number of connections just go far too high for what is happening.

Most of the time this is due to the application not properly closing/disposing of connections and relying in SMON to clean up dead connections.  I've had issues in the past with older .Net apps with connection pooling that SMON still saw them as 'active' and never closed them.

When that happened I would run into the error "ORA-00020: maximum number of processes exceeded".  If you aren't seeing that error, probably not the same thing but I would look at what is causing the increase in processes.  Something isn't working as 'designed'.

>>but it sounds to me like they've been throwing it against the wall and waiting to see what sticks.  

Probably.  But when trying to track down a issue that has never been seen before, it is a common practice.  Even Doctors do it:  Try solution A, observe the results, adjust plan, repeat until patient is healed (or dies).

All that said:
In looking on Oracle Support for the various errors(not all he exact messages) you posted, I saw one thread that talked about the machine having a virus.  Have you ruled that out 90% as the cause (no one can be 100% sure they don't have a virus)?

Some point to a login credential issue:
'Fatal NI Connect 12560' And 'ORA-609 Opiodr Aborting Process' Errors In The Alert Log (Doc ID 987162.1)

Some of the errors seem to point to networking.  Have you ruled out a faulty NIC or piece of COM gear in the network somewhere?  No, I don't know how to help diagnose that...

>>Other schools that have not upgraded to the version with these ETL type processes are not having this type of problem.

If the database versions are the same and the network gear/hardware is the same, it points to an app problem but with all the pieces that are likely between the clients and database server, not sure you can say that for sure.

It goes back to "seeing what sticks" troubleshooting.
0
 

Author Comment

by:xoxomos
ID: 40455810
"Most of the time this is due to the application not properly closing/disposing of connections and relying in SMON to clean up dead connections.  I've had issues in the past with older .Net apps with connection pooling that SMON still saw them as 'active' and never closed them."

Yes, the SA(s) at several of the other schools using Blackboard have made the same observation, complained and asked that something be done about that.
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Join & Write a Comment

This post first appeared at Oracleinaction  (http://oracleinaction.com/undo-and-redo-in-oracle/)by Anju Garg (Myself). I  will demonstrate that undo for DML’s is stored both in undo tablespace and online redo logs. Then, we will analyze the reaso…
Using SQL Scripts we can save all the SQL queries as files that we use very frequently on our database later point of time. This is one of the feature present under SQL Workshop in Oracle Application Express.
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now