Learn how to a build a cloud-first strategyRegister Now

x
?
Solved

Need Urgent Help on Performance issue having Oracle DB on RHEL 5.2

Posted on 2012-08-23
30
Medium Priority
?
2,255 Views
Last Modified: 2012-09-06
The situation is critical today as an application. All the application users are disappointed by the performances and are not able to work with it. The application is not stable:
 
- the performance is very low.
- the system is very slow and freezing all the time
 
- some time application crashing or user not able to login

We have three server:
1. Prod Server(RHEL 5.2)
2. App Prod server(RHEL 5.2)
3. DB Server(RHEL 5.2)

Things tried:

1. Raise up memory on application system
2. Kernel parameters changes
3. Increasing the swap space on Application Server
4. vm.nr_hugepages setting to be changed on prod server
5. Set up the memlock user limit
6. Upgrade omnivision version on all 3 servers to 7.21
7. Clear the Alert Log and diagnostic files on the Preprod server
8. Extend disk space from 3.4 GB to 10 GB on Prod DB server
 
Linux platform with Oracle databases mounted

still not able to diagnose the root cause and the troubleshooting continues. Till then the temporary workaround is to bounce the DB and reboot the server after 10-15 days. Need your help in:
 
1. Server performance needs to be looked into.
2. Network Communication between (Prod DB) and (Prod Apps) needs to be checked.

 
Need your help to diagnose the ROOT CAUSE and resolve this issue.


Pls check the logs attached with this case:
recommendations.txt
0
Comment
Question by:apunkabollywood
  • 10
  • 4
  • 4
  • +5
28 Comments
 
LVL 19

Expert Comment

by:jools
ID: 38325107
What information did you collect to make you change memory/swap/disk etc etc?

Have you been collecting performance stats using ps -eaf/atop/iostat/vmstat/sar etc and going thru them to identify problem areas?

Is the database the bottleneck? I'm not sure what tools oracle has to monitor table performance... it might be you need to look at index performance.

If it's not the database then look at memory/disk/network atop can record the stats over a number of days and you can replay them to find hotspots.

Performance is one of those things that can be quite subjective. You need to be collecting data and acting on the data, not just blindly adding memory/disk.

We will need more information, you can post ps output or attach atop files (if you have them) and we might be able to have a look.
0
 
LVL 14

Expert Comment

by:ThomasMcA2
ID: 38325599
Things to try:

1. Run an AWR report (awrrpt.sql)
2. Some systems run AWR reports on a schedule. If you have an AWR report from a period where performance was good, use awrddrpt.sql to compare the AWR reports to find the differences.
3. A high db file sequential read in the AWR report = poor/missing indexes, and/or outdated db stats
4. Is fragmentation a problem? Any table that has a fragmentation count that is more than 5% of the # of records in the table will impact performance
 select * from all_tables where chain_cnt > 0
5. What is the "Out Of Order Ratio %" for your largest tables? Run this, replacing <your_table_name> with the name of each of your largest tables.

select (round(((select count(*) as cnt
from (select item_id, location_id, sales_date, relative_fno, block_number, row_number, data_row,
(lag(data_row) over(partition by relative_fno, block_number order by row_number)) as prev_data_row
from (select item_id, location_id, sales_date
,relative_fno, block_number, row_number
,(dense_rank() over(partition by relative_fno, block_number order by item_id, location_id, sales_date)) as data_row
from (select item_id, location_id, sales_date
,dbms_rowid.rowid_relative_fno(rowid) relative_fno
,dbms_rowid.rowid_block_number(rowid) as block_number
,dbms_rowid.rowid_row_number(rowid) as row_number
from <your_table_name>
) c
) b
) a
where data_row != prev_data_row
and data_row != prev_data_row + 1) / (select count(*) from sales_data)),3)*100) as "out of order ratio %"
from dual;

Open in new window


If the value of "out of order ratio %" > 20 % you need to follow the document 1085012.1 Reording Columns in Oracle For Demantra Performance
0
 
LVL 25

Expert Comment

by:madunix
ID: 38326420
Oracle Tuning Guide can be found at the following location:
https://access.redhat.com/sites/default/files/oracle_tuning_guide_1-1_11-28-2007.pdf
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:apunkabollywood
ID: 38328346
Please check the below facts from DB server and suggest are these right or wrong:

Page cache :

pagecahe 100 (Current)


Default

1  10   15



DB Server setting

 ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (10000)
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: uag
        Wake-on: d
        Link detected: yes

Prod Server settings

ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

On DB Server

less /proc/meminfo
MemTotal:     32960172 kB
MemFree:       9254140 kB
Buffers:       1471736 kB
Cached:         399360 kB
SwapCached:          0 kB
Active:        1626732 kB
Inactive:       468776 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     32960172 kB
LowFree:       9254140 kB
SwapTotal:    25165816 kB
SwapFree:     25165816 kB
Dirty:              48 kB
Writeback:           0 kB
AnonPages:      244324 kB
Mapped:          55356 kB
Slab:           135700 kB
PageTables:      13804 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  36400972 kB
Committed_AS:   648460 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    276452 kB
VmallocChunk: 34359461623 kB
HugePages_Total:  5122
HugePages_Free:     49
HugePages_Rsvd:     48
Hugepagesize:     2048 kB


oracle : PPRVDB1 ]> ulimit -l
30000000


less /proc/sys/vm/swappiness
60


Request you all to plss suggest kernel paramneters or oracle tuning  how to check and increase performance of DB server


Thanks
0
 
LVL 74

Assisted Solution

by:sdstuber
sdstuber earned 1336 total points
ID: 38334648
If bouncing the database resolves the problem then that implies that at start up the database and applications are running fine.  i.e. acceptable performance.

Is that correct?

If it is, then it's unlikely that examining or changing the operating system will lead to much.  The problem is in the database.

Oracle runs fine on linux, so it's not likely that the database itself is the problem, so it's the application.  That means sql and pl/sql and/or connection activity.

What is running when the problem occurs?  One of the previous posts mentions AWR reports.  If you have a license to use them, that's a good idea.  If you don't then I suggest installing statspack and collecting 10 minute snapshots.  When the problem occurs, run snapshot reports before and during the problem and post the results.
0
 
LVL 18

Assisted Solution

by:BigSchmuh
BigSchmuh earned 664 total points
ID: 38335089
IMHO, most bottlenecks comes from:
-Poor application development (stats shows that this is 80% of all case a real problem)
==> Too late for that kind of actions ?

-Application servers being short in CPU
==> Confirm and add more application servers in the farm

-Servers being short in Memory (or bad conf consuming too much of it) resulting in too high swapping io
==> Confirm and reconfigure or add memory

-DB server using not enough IOPS capability (1x SSD delivers 50000 IOPS where 1x SAS 15k drive delivers 200 IOPS only)
==> Confirm and enlarge the cache or add some spindles/SSD to the storage subsys

-Too much logging defined implies too many IO on some servers
==> Lower useless logs ( Your attached "recommendations.txt" shows as "Blockers" that you are gathering Histograms on nc_objects, nc_params, nc_referencess )

-Network adapters not able to negotiate the expected throughput (which means a 10Mbps one)
==> Force adapters throughput at both server and switch sides
0
 

Author Comment

by:apunkabollywood
ID: 38335627
Thank you so much Sdstuber and Bigschmuh for you valuable advice - I have collected AWR report from DB server - but not much good to analyse that kindly look and give your suggestion accordingly.attached with this comment
report.doc
0
 
LVL 18

Expert Comment

by:BigSchmuh
ID: 38335640
Appart from the 3 SQL starting with "SELECT oo.object_id FROM nc_ob..." shown in "SQL ordered by gets" which query plan should be reviewed as they are issuing about 16300 gets per execution, I don't see any real problem in the report.
Although, the "Operating system statistics" clearly shows that :
-Too much CPU on this db server is available (IDLE_TIME is huge against others)
-IO wait time looks acceptable

On the application server, did you check the CPU and swap levels ?
0
 

Author Comment

by:apunkabollywood
ID: 38335727
Here is the output: App Server


var/log/syslog/message.log | grep error
 snmpd[18889]: error on subcontainer '' insert (-1)
 snmpd[32587]: error on subcontainer '' insert (-1)

Jul 22 06:19:43 appserver snmpd[3431]: warning: /etc/hosts.allow, line 8: can't verify hostname: getaddrinfo(appserver, AF_INET) failed
Jul 22 06:20:08 appserver snmpd[3431]: warning: /etc/hosts.allow, line 8: can't verify hostname: getaddrinfo(appserver) failed
Jul 22 06:20:10 appserver snmpd[3431]: warning: /etc/hosts.allow, line 8: can't verify hostname: getaddrinfo(appserver) failed

Kernel Logs:

Jun  2 03:23:27 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jun  5 03:14:17 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jun  8 03:15:33 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jun 21 11:44:27 appserver kernel: omv_mng[11997]: segfault at 0000000000000000 rip 0000000008082478 rsp 00000000ffee93a4 error 4
Jul 29 03:21:50 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jul 29 03:21:52 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84

kernel: Kernel logging (proc) stopped.
kernel: Kernel logging (proc) stopped.
nel: Kernel logging (proc) stopped.
nel: Kernel logging (proc) stopped.
nel:  [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
sce kernel: Kernel logging (proc) stopped.
sce kernel:  [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
sce kernel: Kernel logging (proc) stopped.
sce kernel: Kernel logging (proc) stopped.
 
acpiphp: Slot [260] registered
acpiphp: Slot [261] registered
acpiphp: Slot [262] registered
acpiphp: Slot [263] registered
VMware memory control driver initialized
e1000: eth0: e1000_set_tso: TSO is Enabled
e1000: eth1: e1000_set_tso: TSO is Enabled
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
  Vendor: VMware    Model: Virtual disk      Rev: 1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
 target0:0:2: Beginning Domain Validation
 target0:0:2: Domain Validation skipping write tests
 target0:0:2: Ending Domain Validation
 target0:0:2: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
 sdc: unknown partition table
sd 0:0:2:0: Attached scsi disk sdc
sd 0:0:2:0: Attached scsi generic sg2 type 0
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
 sdc: sdc1
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
 sdc: sdc1

free -g
             total       used       free     shared    buffers     cached
Mem:             7          1          6          0          0          0
-/+ buffers/cache:          0          6
Swap:            3          1          2


Linux 2.6.18-164.2.1.el5 (app-server)    

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.35    0.00    0.69    0.39    0.00   98.57

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               6.49        98.49       120.63  152738948  187075862
sda1              0.00         0.02         0.00      31564        906
sda2              6.49        98.47       120.63  152705504  187074956
sdb               1.04        21.40        13.14   33182340   20371901
dm-0              2.66         9.03        20.34   14011266   31549472
dm-1              0.82         7.54         6.02   11698282    9341416
dm-2              0.00         0.00         0.00       3838       2124
dm-3              0.00         0.00         0.00       6594       6552
dm-4              2.94         9.02        22.61   13988026   35061568
dm-5              0.91         4.62         5.46    7157226    8466880
dm-6              1.86         0.20         3.71     304350    5759042
dm-7              0.00         0.01         0.00      10666       2252
dm-8              4.62        20.24        16.76   31389168   25986920
dm-9              0.11         2.96         0.23    4593410     351992
dm-10             0.00         0.00         0.01       4744       9266
dm-11             0.19         6.55         0.43   10154618     665896
dm-12            10.42        38.29        45.06   59377344   69871576
dm-13             1.97        13.72        11.80   21283098   18293405
dm-14             0.28         7.67         1.34   11889810    2078496

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.95    0.00    1.95    4.88    0.00   91.22

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              24.51      1505.88         0.00       1536          0
sda1              0.00         0.00         0.00          0          0
sda2             24.51      1505.88         0.00       1536          0
sdb               0.00         0.00         0.00          0          0
dm-0              1.96        23.53         0.00         24          0
dm-1              0.00         0.00         0.00          0          0
dm-2              0.00         0.00         0.00          0          0
dm-3              0.00         0.00         0.00          0          0
dm-4              0.00         0.00         0.00          0          0
dm-5             27.45      1443.14         0.00       1472          0
dm-6              0.00         0.00         0.00          0          0
dm-7              0.00         0.00         0.00          0          0
dm-8              0.00         0.00         0.00          0          0
dm-9              0.00         0.00         0.00          0          0
dm-10             0.00         0.00         0.00          0          0
dm-11             0.00         0.00         0.00          0          0
dm-12             0.00         0.00         0.00          0          0
APP.pdf
0
 
LVL 19

Expert Comment

by:jools
ID: 38335969
The server looks like a vm, is the esx host busy?
0
 

Author Comment

by:apunkabollywood
ID: 38335989
yes its a VM not a hardware server..
0
 

Author Comment

by:apunkabollywood
ID: 38336067
One more thing want to add:

Is it possible CPU in the reports and logs shows is diffrent from CPU utilization shows of VM Machines on V Center --it shows they are utilizing 100% cpu ? :(
0
 
LVL 19

Expert Comment

by:jools
ID: 38336132
Do you have VMWare tools installed.
0
 
LVL 18

Expert Comment

by:BigSchmuh
ID: 38336395
You may suffer from a CPU over-provisioning situation.

When a VM need 4 vCPU, it waits for 4 fully available cores before sending control to the VM.
For example, a db server with 4 vCPU using less than 20% cpu charge would get a real latency bonus by running one 1 vCPU only.

I googled this "5 Big Considerations for vCPU Provisioning" to get you some feedback about this.
0
 
LVL 29

Expert Comment

by:MikeOM_DBA
ID: 38336564
Perhaps you should not execute report process' during work hours:

CPU hog.
0
 

Expert Comment

by:kparikh001
ID: 38336813
Few things;

Are you also running eBusiness Suite?
Have you recently upgraded from 10g to 11g?

If yes then;

Verify your hidden parameters (They start with "_". There are five parameters that can affect your performance to a dead stop speed. Out of all of them _sort_elimiation_cost_ratio parameter has biggest impact on performance.

_b_tree_bitmap_plans                 boolean     FALSE
_fast_full_scan_enabled              boolean     FALSE
_index_join_enabled                  boolean     FALSE
_like_with_bind_as_equality          boolean     TRUE
_optimizer_autostats_job             boolean     FALSE
_sort_elimination_cost_ratio         integer     5
_system_trig_enabled                 boolean     TRUE
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 38336978
Start at the top of the SQL lists and work your way down.

Here's a few of the big hitters.

guyt7vgmcqbxh - took over an hour to finish   - can this be optimized? or as suggested above, run at a different time?

c52vystffk6mf - this ran 11,093 times, each one was fast, but cumulatively it consumed about 1/8 of all cpu   - can this be run fewer times? or, can it be tuned to be lighter weight? Knock a tenth of a second off it and you'll get 10 minutes back

9rrat0ynrdpbt - this consumed 70% of all Gets in your 6 hour window and took over 18 minutes.  


With a 6 hour window it's hard to say exactly what was bad when.  The three candidates I listed above might not be a problem at all.  If they ran sequentially and finished in the first couple hour and then people had problems after them then they working on them won't help.

Try generating a report for a much smaller window.  10-15 minutes usually works best.
0
 

Author Comment

by:apunkabollywood
ID: 38340951
Hi All,

Thanks all for your valuable advice:

@bigscumuh - I am planing to increase no of cpu from 4 to 8 but still not sure is this the only problem because vm showing 100% on monitoring tool on vmware but inside cpu utilization okay as u have seen logs.

@ Mike_omDBA - I am checking all status of cpu and memory continous from past 3 weeks but all three cpu utilization same 100% as per vm monitoring tools

@ kparikh001 - no its a new build 6 mnth before installed

@ sdstuber - will pull out and paste it soon

Some more points to look:

1. Some strange error on weblogic :

<Apr 2, 2012 3:05:24 PM CEST> <Error> <Server> <BEA-002608> <The ListenThread because of an error: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space

2. No of active user max by 5
0
 
LVL 18

Expert Comment

by:BigSchmuh
ID: 38340964
I was suggesting to DOWNGRADE vCPU for the dbserver from 4 to 1 to avoid the latency ... not to increase it to 8 !!!
0
 
LVL 19

Expert Comment

by:jools
ID: 38341035
Pls check to make sure VMWare tools are installed, I didn't see your response
0
 

Author Comment

by:apunkabollywood
ID: 38341971
Yes they vmware tools are installed
0
 

Author Comment

by:apunkabollywood
ID: 38350434
Please check the lates hour report for DB servers and advice something to improve performance after upgarding 2 to 8 vcpus cpu utlization decrease to 70% but still not upto the mark
AWR-1-hour.doc
0
 
LVL 74

Accepted Solution

by:
sdstuber earned 1336 total points
ID: 38350615
Start at the top of the SQL lists and work your way down.

delete from PRV_DISCREPANCY_RESULTS where CUSTOMER_NAME = :1

why is that so expensive?
0
 
LVL 29

Expert Comment

by:MikeOM_DBA
ID: 38351701
@sdstuber:
Start at the top of the SQL lists and work your way down.
delete from PRV_DISCREPANCY_RESULTS where CUSTOMER_NAME = :1
why is that so expensive?
Perhaps there is no index on CUSTOMER_NAME and it's doing FTS. :p
The rest all points to un-optimized SQL.
0
 

Expert Comment

by:kparikh001
ID: 38353866
Again, try setting parameter _sort_elimination_cost_ratio         integer     5
This is a dynamic parameter and help your query find better execution plan.
0
 

Author Comment

by:apunkabollywood
ID: 38358314
Hello SDBUSTER , Please explain me it in details and @Mike I didnt understand your point ? please explain
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 38359927
I don't  have any details to explain.

Look at the awr reports you posted.  In them there are lists of the top queries by time, by cpu, by io etc.   I've already identified a few of them.  Pick one and examine it.  Why does take so long?

MikeOM_DBA took a guess at one of them, maybe you don't have an index.  awr reports are only summaries we can't tell you specifically what is a problem or why.  You need to do some digging on your end and post what you find.

what are the explain plans for the expensive operations?  for the ones that execute multiple times, do they need to?  confirm that the expensive ones are actually a problem.  Did they run when people were having problems or not?  Also, if they were slow, are they culprit, or the victim of something else?
0
 

Author Closing Comment

by:apunkabollywood
ID: 38371764
Thnak u all
0

Featured Post

Prep for the ITIL® Foundation Certification Exam

December’s Course of the Month is now available! Enroll to learn ITIL® Foundation best practices for delivering IT services effectively and efficiently.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this article, we’ll look at how to deploy ProxySQL.
Backups and Disaster RecoveryIn this post, we’ll look at strategies for backups and disaster recovery.
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…
Despite its rising prevalence in the business world, "the cloud" is still misunderstood. Some companies still believe common misconceptions about lack of security in cloud solutions and many misuses of cloud storage options still occur every day. …
Suggested Courses

810 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question