asked on

DB Hang/Slow Performance-11202 RHEL 5.5

Hi Gurus,

I need help in getting AWR understand and finding what could be the reason for DB hanging intermittently.

The top wait event in the AWR is "latch: row cache objects" and "latch: shared pool" and taking most of the DB TIME.

The most of the gets and misses are for child latches dc_objects,dc_users.

Please let me know is it application that is an issue here or the DB.

here are the sections of AWR during the time when db had performance issues.

 Snap Id Snap Time Sessions Cursors/Session 
Begin Snap: 32650 06-Jun-13 14:30:06 272 2.9 
End Snap: 32652 06-Jun-13 15:30:03 275 4.2 
Elapsed:   59.95 (mins)     
DB Time:   1,392.85 (mins)     



Report Summary
Cache Sizes 

 Begin End   
Buffer Cache: 7,872M 7,872M Std Block Size: 8K 
Shared Pool Size: 768M 768M Log Buffer: 23,700K 


Load Profile

 Per Second Per Transaction Per Exec Per Call 
DB Time(s): 23.2 9.2 0.15 0.10 
DB CPU(s): 0.4 0.2 0.00 0.00 
Redo size: 14,334.1 5,644.4     
Logical reads: 15,423.1 6,073.2     
Block changes: 604.2 237.9     
Physical reads: 166.6 65.6     
Physical writes: 4.3 1.7     
User calls: 240.5 94.7     
Parses: 78.6 30.9     
Hard parses: 0.5 0.2     
W/A MB processed: 3.4 1.3     
Logons: 1.5 0.6     
Executes: 152.1 59.9     
Rollbacks: 0.0 0.0     
Transactions: 2.5     

Instance Efficiency Percentages (Target 100%) 

Buffer Nowait %: 99.98 Redo NoWait %: 99.75 
Buffer Hit %: 99.75 In-memory Sort %: 100.00 
Library Hit %: 99.12 Soft Parse %: 99.37 
Execute to Parse %: 48.33 Latch Hit %: 99.36 
Parse CPU to Parse Elapsd %: 0.65 % Non-Parse CPU: 86.06 


Shared Pool Statistics 

 Begin End 
Memory Usage %: 92.87 93.29 
% SQL with executions>1: 88.61 81.69 
% Memory for SQL w/exec>1: 82.17 82.29 



Top 5 Timed Foreground Events


Event Waits Time(s) Avg wait (ms) % DB time Wait Class 
latch: row cache objects 19,235 39,763 2067 47.58 Concurrency 
latch: shared pool 17,970 13,373 744 16.00 Concurrency 
library cache: mutex X 4,462 2,220 498 2.66 Concurrency 
read by other session 5,768 1,490 258 1.78 User I/O 
DB CPU   1,381   1.65   

Host CPU (CPUs: 16 Cores: 8 Sockets: 2) 

Load Average Begin Load Average End %User %System %WIO %Idle 
12.41 17.86 3.1 3.4 19.5 93.2 

Time Model Statistics
Total time in database user-calls (DB Time): 83571s 
Statistics including the word "background" measure background process time, and so do not contribute to the DB time statistic 
Ordered by % or DB time desc, Statistic name 
Statistic Name Time (s) % of DB Time 
sql execute elapsed time 36,139.91 43.24 
parse time elapsed 30,169.29 36.10 
hard parse elapsed time 7,196.63 8.61 
connection management call elapsed time 5,431.36 6.50 
PL/SQL execution elapsed time 2,586.68 3.10 
DB CPU 1,381.35 1.65 
failed parse elapsed time 312.11 0.37 
hard parse (sharing criteria) elapsed time 218.60 0.26 
PL/SQL compilation elapsed time 99.72 0.12 
sequence load elapsed time 37.82 0.05 
hard parse (bind mismatch) elapsed time 36.91 0.04 
RMAN cpu time (backup/restore) 0.33 0.00 
inbound PL/SQL rpc elapsed time 0.11 0.00 
repeated bind elapsed time 0.07 0.00 
DB time 83,571.00   
background elapsed time 3,380.38   
background cpu time 18.84 

SQL ordered by Parse Calls
Total Parse Calls: 282,606 
Captured SQL account for 75.4% of Total 
Parse Calls Executions  % Total Parses SQL Id SQL Module SQL Text 
69,608 69,609 24.63 52q4nbsfm4560 JDBC Thin Client  select count(*) from pgm 
22,840 22,843 8.08 ftj9uawt4wwzb    select condition from cdef$ wh... 
22,429 22,429 7.94 4c9r2ub0g5ru4 JDBC Thin Client  /* load one-to-many org.ets.ib... 
9,602 9,602 3.40 cm5vu20fhtnq1    select /*+ connect_by_filterin... 
5,127 5,127 1.81 bbmh9kxspx088 JDBC Thin Client  /* named HQL query ItmTypCnvsn... 
4,749 4,749 1.68 79s0wvywf0kvk JDBC Thin Client  /* named HQL query Spbq.CHECK_... 
4,748 4,748 1.68 btu407nruw5ua JDBC Thin Client  /* named HQL query ItmClctnAuI... 
4,739 4,739 1.68 6q7mp9bvzvcpy JDBC Thin Client  DECLARE v_ip1 VARCHAR2(20); v_... 
4,739 4,739 1.68 f7hmwpb5n30nk JDBC Thin Client  SELECT INSTANCE_NAME FROM V$IN... 
4,739 4,739 1.68 fnjmch4hyqp3u JDBC Thin Client  SELECT DISTINCT SID, SERIAL#, ... 
4,726 4,727 1.67 0k8522rmdzg4k    select privilege# from sysauth... 
4,726 4,726 1.67 0ws7ahf1d78qa    select SYS_CONTEXT('USERENV', ... 
4,725 4,726 1.67 459f3z9u4fb3u    select value$ from props$ wher... 
4,720 4,719 1.67 apz201tvmcc2d JDBC Thin Client  SELECT PROFILE FROM DBA_USERS ... 
4,717 4,717 1.67 5ur69atw3vfhj    select decode(failover_method,... 
3,895 3,895 1.38 135sp2csy25xa JDBC Thin Client  select * from lgn_adt_ip WHERE... 
3,068 3,068 1.09 bq3hw9rfmdagh JDBC Thin Client  /* named HQL query getAppConfi... 
3,067 3,067 1.09 8f0zcb9dat13j JDBC Thin Client  SELECT DISTINCT a.rsrch_usg_t... 
2,992 2,992 1.06 2jfqzrxhrm93b    select /*+ rule */ c.name, u.n... 

SQL ordered by Executions
%CPU - CPU Time as a percentage of Elapsed Time 
%IO - User I/O Time as a percentage of Elapsed Time 
Total Executions: 546,913 
Captured SQL account for 84.6% of Total 
Executions  Rows Processed Rows per Exec Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text 
246,002 245,549 1.00 47.15 24.41 6.42 3gnr85n6k0zmy JDBC Thin Client  SELECT CLSFN_NAM FROM CLSFN_HI... 
69,609 69,609 1.00 95.26 5.91 0.00 52q4nbsfm4560 JDBC Thin Client  select count(*) from pgm 
22,843 22,843 1.00 2.05 55.42 0.00 ftj9uawt4wwzb    select condition from cdef$ wh... 
22,429 17,595 0.78 302.24 2.70 0.00 4c9r2ub0g5ru4 JDBC Thin Client  /* load one-to-many org.ets.ib... 
9,602 461 0.05 221.84 0.87 0.00 cm5vu20fhtnq1    select /*+ connect_by_filterin... 
5,127 0 0.00 3,892.88 0.03 0.00 bbmh9kxspx088 JDBC Thin Client  /* named HQL query ItmTypCnvsn... 
4,749 3,958 0.83 999.63 0.08 0.00 79s0wvywf0kvk JDBC Thin Client  /* named HQL query Spbq.CHECK_... 
4,748 4,748 1.00 547.09 0.12 0.00 btu407nruw5ua JDBC Thin Client  /* named HQL query ItmClctnAuI... 
4,739 4,739 1.00 824.33 4.78 0.00 6q7mp9bvzvcpy JDBC Thin Client  DECLARE v_ip1 VARCHAR2(20); v_... 
4,739 4,739 1.00 263.89 0.44 0.00 f7hmwpb5n30nk JDBC Thin Client  SELECT INSTANCE_NAME FROM V$IN... 

Dictionary Cache Stats
"Pct Misses" should be very low (< 2% in most cases) 
"Final Usage" is the number of cache entries being used 
Cache Get Requests Pct Miss Scan Reqs Pct Miss Mod Reqs Final Usage 
dc_awr_control 71 0.00 0   4 1 
dc_global_oids 49,980 0.00 0   0 147 
dc_histogram_data 8,061 0.26 0   0 1,229 
dc_histogram_defs 436,718 0.17 0   0 5,181 
dc_object_grants 9,123 0.05 0   0 372 
dc_objects 5,637,398 0.01 0   5 3,752 
dc_profiles 4,842 0.00 0   0 5 
dc_rollback_segments 1,520 0.00 0   6 36 
dc_segments 99,554 0.28 0   5 2,267 
dc_sequences 119 6.72 0   119 20 
dc_tablespaces 397,156 0.00 0   0 20 
dc_users 616,828 0.00 0   0 167 
global database name 7,135 0.00 0   0 1 
outstanding_alerts 128 5.47 0   12 21 
sch_lj_oids 5 0.00 0

Open in new window

Thanks

arnold

Wait on IO 20%

how much memory is in the system? 8GB
Your system is now swapping.
Memory Usage %: 92.87 93.29
% SQL with executions>1: 88.61 81.69
% Memory for SQL w/exec>1: 82.17 82.29
adding memory should help improve the situation

i.e. when it began it was fine, now the amount of data has grown.

monto1

ASKER

32GB of RAM.

Franck Pachot

Hi,

Can you attach the whole report, it's not easy to read here.
It seems that you do lot of parse calls, probably not using bind variables.
The 'select condition from cdef$' is a symptom of lot of parses on tables that have a lot of constraints. 'condition' is a long that is not in row cache.

Other sections from the report should help to focus.

Regards,
Franck.

arnold

How much of the total system's memory is consumed?
Your buffer indicates an ~8GB
Not sure whether you can increase it or whether it dynamically adjusts based on demand.

monto1

ASKER

Thanks ,I have attached the AWR report.
awr-report-32650-32652.html

Franck Pachot

Those big select with huge list of literal value is very bad for the shared pool. You should check the fetching strategy in order to do join instead of list fetching.

monto1

ASKER

Thanks.I did let know the app team that application is doing lot of parsing but they raise the question why is it now or why is it happening sometime it has to be consistent isn't it?here are doubts please clear if possible.

1)Are they all hard parses?the awr load profile section doesn't show much with hard parsing is happening.how did you find that its hard parsing(because of not using bind variable).

2)Should i be looking at increasing the Shared pool?

3)Do you see any unusual acitvity with connection pool with connects/disconnects?

Can you please elaborate more on this?

You should check the fetching strategy in order to do join instead of list fetching.

The query with literal value is being generated with dynamic SQL where user builds the query.Is there any thing that can be done to reduce hard parsing in such cases.

Regards

Franck Pachot

Hi,

I checked very quickly the AWR report from my phone for the previous answer.
When you see a query such as sql_id=g2jq3aa2c177k you know that something is very bad.
And 'select condition from cdef$ where rowid=:1' is typical from hard parse: getting table the constraints. And condition is a long - not in the row cache.

But re-reading the AWR report there are a few issues that you seem to have already identified.

>> I did let know the app team that application is doing lot of parsing but they raise the question why is it now or why is it happening sometime it has to be consistent isn't it?
Difficult to say. Sometimes resource intensive behaviour is just fine until you reach the point where everything goes bad. A bad design, not scalable, cannot expect linear reponse time when workload increases.

>> are they all hard parses?
No there are a lot of soft parses as well. Soft parses - even if better than hard parses - are still using a lot of resources: it has to parse the statement text, do semantic analysis and check if a shared cursor can be used. So you need CPU, dictionary (row cache) and library cache access (latches).

>> Should i be looking at increasing the Shared pool?
According to Shared Pool Advisory response is no. I see no trace of statement begin invalidated because of under-sized shared pool.

>>Do you see any unusual acitvity with connection pool with connects/disconnects?
5,468 logons over 1 hour. Yes you're right that's bad. Spending 6% of the time on 'connection management call elapsed time' is something to fix.

The remark about fetching strategy was because I thought that it was an ORM generated query. When user builds the query, where those 'in' list of values comes from ? I suppose they don't enter them. So if they come from data, you should use joins instead of sending them in the SQL statement. Or pass a collection as bind variable.

And now something else. If you sum the timed events you do not reach the 100%. So some waits are hidden here. Maybe the CPU is overloaded (even if OS stats seem to tell us a low usage). Maybe some events are missing. So unfortunately you cannot explain all the behaviour with those stats.

I know it can be difficult to explain to application team. But when you see a query such as sql_id=g2jq3aa2c177k you know that something is very bad in the design. SQL is code, that is compiled once and executed then. You don't put thousands of hardcoded values un the code. You don't compile a progran each time you execute it. Developers should understand if you explain them that SQL queries is just like code. Do they generate a new .java code for each user input, with hardcoded values from that input, and compile it and execute it - each time ? No. Database is not different.

And 'select count(*) from pgm' executed 70000 times over one hour. Who needs to count the rows of a table every second ? RESULT_CACHE may help for that one, as a workaround. But this is a design issue.

Regards,
Franck.

monto1

ASKER

Thanks Sir for the excellent explanation. Do you think the DB hung and those parses could be a result of sudden burst of connections?there is no way of avoid the soft parses,correct?would increasing the session_cached_cursor should help for soft parse issue?

I would appreciate your expert guidance on this.

Regards

Franck Pachot

Hi,
>> would increasing the session_cached_cursor should help for soft parse issue?
Not sure as it seems that session cursor cache is already well used when you compare 'session cursor cache hits' and 'parse count (total)'
But here again, you have a report that covers 1 hour of various activity, so maybe the averages are not representative of all that happens. Maybe you can get a short report (few minutes) when it hangs.

>> there is no way of avoid the soft parses,correct?
Yes session cursor cache helps. Or at client level you can use Statement Caching for example if you are on JDBC.

>>. Do you think the DB hung and those parses could be a result of sudden burst of connections?
Yes. Establishing a connection (timed by 'connection management call elapsed time') uses a lot of latches. So besides the time it takes, it adds CPU and library cache contention.

Franck Pachot

Hi,

I did a quick test with 20 concurrent processes doing 500 connections.

They spend lot of time on CPU and 'library cache: mutex X', and have a lot of 'row cache objects' latches.

I did the same with SHARED SERVERS instead of DEDICATED, CPU usage has lower a lot but 'shared pool latch' is the same and 'row cache objects' has been divided by 2 only.

So you can try SHARED SERVERS as a workaround, but it may not be sufficient to solve your bottleneck.

Regards,
Franck.

monto1

ASKER

How can this be fixed?

Yes you're right that's bad. Spending 6% of the time on 'connection management call elapsed time' is something to fix.

I had the same issue again this morning in Prod where DB was very slow and the load on the server was very high at that time.

top - 11:11:42 up 107 days,  9:47,  2 users,  load average: 30.43, 27.08, 18.62
Tasks: 937 total,   3 running, 934 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.7%us, 10.5%sy,  0.0%ni, 42.3%id, 32.9%wa,  0.6%hi,  0.9%si,  0.0%st
Mem:  32947876k total, 24643800k used,  8304076k free,   393672k buffers
Swap: 33554424k total, 10501052k used, 23053372k free, 14716436k cached

Open in new window

I have uploaded the file.

Thanks
awrrpt-1-32505-32506.html

arnold

32.9% waiting on IO. what is your storage like? See whether as was suggested you can optimize the search, adding index, etc. that will reduce the scans of entire tables.

Franck Pachot

Hi,
That 30 min report is still too large: OS load was 2 at the beginning and 20 at the end. But we still are not sure that the database si responsible for that: Average Active Session on CPU is 2,822.37(60*30.50) =1.6 over 30 minutes. But we know it is a lot higher when the system hangs.
Maybe you can have a more relevant report if you exec dbms_workload_repository.create_snapshot at 5 minutes interval when it hangs.
ASH report may also give a better picture on a short duration around 10-Jun-13 11:00:52

monto1

ASKER

Thanks for the valuable input.I have attached ASH for 5 min .

Thanks
ASH-report-1370893722070.html

Franck Pachot

ASH is interesting. Especially: CPU + Wait for CPU = 16.79 Avg Active Sessions
Now no doubt, your OS load comes from your database.
Same issues that previously analyzed on the AWR report, but now you are sure that they are the real cause for contention.

Maybe you can sent the SQL queries to the developer team and tell them that the Database is not a word processor ;)

The database expects that:
- people connect once
- queries are parsed/optimized once
- and then executed a lot in a very efficient way

monto1

ASKER

Sir,you mean connection pooling and parsing?what can i recommend them for connects/disconnects since they use connection pooling(jboss).

Thanks

ASKER CERTIFIED SOLUTION

Franck Pachot

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

monto1

ASKER

You are awesome sir!!! i have been a regular visitor to your blog.

Regards