Solved

Memory Sizing

Posted on 2014-11-14
4
331 Views
Last Modified: 2014-11-20
We're running the Blackboard LMS on Oracle 11gR2 and have always run into problems of some sort.  Problems have increased greatly since their latest upgrade
where they are essentially building a data warehouse with an ETL process kicking off every 20 minutes on top of what had been an exclusively transaction type system.
Their recommendation is now to increase SGA from 12 to 16 and PGA to a minimum of 50 which to my unlearned ears sounds more like pure data warehouse but I thought first i'd ask if that sounds the same to you experts.
0
Comment
Question by:xoxomos
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
4 Comments
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40443879
Sorry but we'll need more information.

>>SGA from 12 to 16 and PGA to a minimum of 50

Bytes, Megs, Gigs, Terabytes, ???

My database is under 1TB and I have more than 16G for an SGA.  It is the only database on the server so I gave it as MUCH RAM as I could.  Only because I can.

>>and have always run into problems of some sort

Before suggesting any change, we need more information.  If you are receiving errors, what are they?

If the errors are disk/space related, memory parameters don't help.
0
 

Author Comment

by:xoxomos
ID: 40443963
Gigabytes in each case.  This particular DB is about half a terrabyte.  With this particular set of problems the only ORA- errors we ever see are:

TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
opiodr aborting process unknown ospid (31738) as a result of ORA-609
  Tns error struct:
    ns main err code: 12537


TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
opiodr aborting process unknown ospid (31753) as a result of ORA-609
opiodr aborting process unknown ospid (31793) as a result of ORA-609
opiodr aborting process unknown ospid (31749) as a result of ORA-609


WARNING: inbound connection timed out (ORA-3136)
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.22.119.89)(PORT=49206))
WARNING: inbound connection timed out (ORA-3136)
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.22.119.92)(PORT=58498))
WARNING: inbound connection timed out (ORA-3136)
Thu Nov 13 11:17:55 2014

Increasing that timeout parameter would only drag the inevitable out over a longer period.

 
Fatal NI connect error 12170.



No disk , space type errors in production.

For some unknown reason, the number of connections just go far too high for what is happening.  Since this ETL process runs every twenty minutes invariably it's in the mix when outages occur.  Earlier Blackboard got on the system and found frequently a second ETL process would kick off before the first had completed.
They then gave us an 'improved' :-) ETL process and we had another outage i believe less than twenty four hours later.

>>  >>and have always run into problems of some sort

For instance at the beginning of the year we were running 11.2.0.2.  There was some sort of mutex X bug and the patch did not work for 11.2.0.2 so we needed to upgrade to .04. We upgraded in March.  From March until the end of September we did not experience that kind of problem.  In September Blackboard upgraded and that's when this type of problem started occurring again.  Looking at AWRs, we're again seeing mutex X at the top of background and foreground waits.  Blackboard sent us one the improved procedures.  The next outage, mutex x was about where it should be but scheduler had taken it's place:

Top 10 Foreground Events by Total Wait Time

Event      Waits      Total Wait Time (sec)      Wait Avg(ms)      % DB time      Wait Class
resmgr:cpu quantum      418,149      386.1K      923      86.4      Scheduler
buffer busy waits      1,220      9124      7479      2.0      Concurrency

Wait Classes by Total Wait Time

Wait Class      Waits      Total Wait Time (sec)      Avg Wait (ms)      % DB time      Avg Active Sessions
Scheduler      418,600      386,163      923      86.4      217.1
Concurrency      8,115      18,949      2335      4.2      10.7
Other      7,083      15,600      2202      3.5      8.8

I believe Blackboard again changed some procedures because on the next outage mutex X had regained top spot :-)
Next they had us deleting a bunch of data.  Yet another outage.  Now the cure is  AT LEAST 50G PGA and at least 16G sga.
I don't know very much, but it sounds to me like they've been throwing it against the wall and waiting to see what sticks.  For instance, they gave us configurations which they divided into standard, some other class and advanced.  They put us in advanced. Advanced is for 8000 concurrent sessions.  Currently we exceed all the standard requirements except cpu and we meet the requirement for cpu.  According to their paper should handle up to 2000 concurrent sessions.  Our range normally varies between 600 and 900 concurrent sessions.  Other schools that have not upgraded to the version with these ETL type processes are not having this type of problem.
0
 
LVL 77

Accepted Solution

by:
slightwv (䄆 Netminder) earned 500 total points
ID: 40444001
I've only worked on one system in my career that had over 100 concurrent connections (it had around 8000) so my experience with this type of system is limited.

There are a few errors that you posted.

Unfortunately they can all mean different things.

I would start with the following doc on Oracle Support:
Alert.log shows ORA-609 with TNS-12537: TNS:connection closed (Doc ID 1538717.1)

The "TNS-12535: TNS:operation timed out" by itself is 99.99% of the time a firewall issue but combined with the others is likely an error caused by the other problems.

>>For some unknown reason, the number of connections just go far too high for what is happening.

Most of the time this is due to the application not properly closing/disposing of connections and relying in SMON to clean up dead connections.  I've had issues in the past with older .Net apps with connection pooling that SMON still saw them as 'active' and never closed them.

When that happened I would run into the error "ORA-00020: maximum number of processes exceeded".  If you aren't seeing that error, probably not the same thing but I would look at what is causing the increase in processes.  Something isn't working as 'designed'.

>>but it sounds to me like they've been throwing it against the wall and waiting to see what sticks.  

Probably.  But when trying to track down a issue that has never been seen before, it is a common practice.  Even Doctors do it:  Try solution A, observe the results, adjust plan, repeat until patient is healed (or dies).

All that said:
In looking on Oracle Support for the various errors(not all he exact messages) you posted, I saw one thread that talked about the machine having a virus.  Have you ruled that out 90% as the cause (no one can be 100% sure they don't have a virus)?

Some point to a login credential issue:
'Fatal NI Connect 12560' And 'ORA-609 Opiodr Aborting Process' Errors In The Alert Log (Doc ID 987162.1)

Some of the errors seem to point to networking.  Have you ruled out a faulty NIC or piece of COM gear in the network somewhere?  No, I don't know how to help diagnose that...

>>Other schools that have not upgraded to the version with these ETL type processes are not having this type of problem.

If the database versions are the same and the network gear/hardware is the same, it points to an app problem but with all the pieces that are likely between the clients and database server, not sure you can say that for sure.

It goes back to "seeing what sticks" troubleshooting.
0
 

Author Comment

by:xoxomos
ID: 40455810
"Most of the time this is due to the application not properly closing/disposing of connections and relying in SMON to clean up dead connections.  I've had issues in the past with older .Net apps with connection pooling that SMON still saw them as 'active' and never closed them."

Yes, the SA(s) at several of the other schools using Blackboard have made the same observation, complained and asked that something be done about that.
0

Featured Post

[Webinar] Code, Load, and Grow

Managing multiple websites, servers, applications, and security on a daily basis? Join us for a webinar on May 25th to learn how to simplify administration and management of virtual hosts for IT admins, create a secure environment, and deploy code more effectively and frequently.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
migration MS SQL database to Oracle 30 80
DB migration from Mssql to 12c oracle , data not loading. 3 52
Shared Service Environment 2 51
construct a query sql 11 37
Why doesn't the Oracle optimizer use my index? Querying too much data Most Oracle developers know that an index is useful when you can use it to restrict your result set to a small number of the total rows in a table. So, the obvious side…
Background In several of the companies I have worked for, I noticed that corporate reporting is off loaded from the production database and done mainly on a clone database which needs to be kept up to date daily by various means, be it a logical…
Via a live example show how to connect to RMAN, make basic configuration settings changes and then take a backup of a demo database
This video explains what a user managed backup is and shows how to take one, providing a couple of simple example scripts.

738 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question