Link to home
Start Free TrialLog in
Avatar of apunkabollywood
apunkabollywoodFlag for United States of America

asked on

Need Urgent Help on Performance issue having Oracle DB on RHEL 5.2

The situation is critical today as an application. All the application users are disappointed by the performances and are not able to work with it. The application is not stable:
 
- the performance is very low.
- the system is very slow and freezing all the time
 
- some time application crashing or user not able to login

We have three server:
1. Prod Server(RHEL 5.2)
2. App Prod server(RHEL 5.2)
3. DB Server(RHEL 5.2)

Things tried:

1. Raise up memory on application system
2. Kernel parameters changes
3. Increasing the swap space on Application Server
4. vm.nr_hugepages setting to be changed on prod server
5. Set up the memlock user limit
6. Upgrade omnivision version on all 3 servers to 7.21
7. Clear the Alert Log and diagnostic files on the Preprod server
8. Extend disk space from 3.4 GB to 10 GB on Prod DB server
 
Linux platform with Oracle databases mounted

still not able to diagnose the root cause and the troubleshooting continues. Till then the temporary workaround is to bounce the DB and reboot the server after 10-15 days. Need your help in:
 
1. Server performance needs to be looked into.
2. Network Communication between (Prod DB) and (Prod Apps) needs to be checked.

 
Need your help to diagnose the ROOT CAUSE and resolve this issue.


Pls check the logs attached with this case:
recommendations.txt
Avatar of Julian Parker
Julian Parker
Flag of United Kingdom of Great Britain and Northern Ireland image

What information did you collect to make you change memory/swap/disk etc etc?

Have you been collecting performance stats using ps -eaf/atop/iostat/vmstat/sar etc and going thru them to identify problem areas?

Is the database the bottleneck? I'm not sure what tools oracle has to monitor table performance... it might be you need to look at index performance.

If it's not the database then look at memory/disk/network atop can record the stats over a number of days and you can replay them to find hotspots.

Performance is one of those things that can be quite subjective. You need to be collecting data and acting on the data, not just blindly adding memory/disk.

We will need more information, you can post ps output or attach atop files (if you have them) and we might be able to have a look.
Avatar of ThomasMcA2
ThomasMcA2

Things to try:

1. Run an AWR report (awrrpt.sql)
2. Some systems run AWR reports on a schedule. If you have an AWR report from a period where performance was good, use awrddrpt.sql to compare the AWR reports to find the differences.
3. A high db file sequential read in the AWR report = poor/missing indexes, and/or outdated db stats
4. Is fragmentation a problem? Any table that has a fragmentation count that is more than 5% of the # of records in the table will impact performance
 select * from all_tables where chain_cnt > 0
5. What is the "Out Of Order Ratio %" for your largest tables? Run this, replacing <your_table_name> with the name of each of your largest tables.

select (round(((select count(*) as cnt
from (select item_id, location_id, sales_date, relative_fno, block_number, row_number, data_row,
(lag(data_row) over(partition by relative_fno, block_number order by row_number)) as prev_data_row
from (select item_id, location_id, sales_date
,relative_fno, block_number, row_number
,(dense_rank() over(partition by relative_fno, block_number order by item_id, location_id, sales_date)) as data_row
from (select item_id, location_id, sales_date
,dbms_rowid.rowid_relative_fno(rowid) relative_fno
,dbms_rowid.rowid_block_number(rowid) as block_number
,dbms_rowid.rowid_row_number(rowid) as row_number
from <your_table_name>
) c
) b
) a
where data_row != prev_data_row
and data_row != prev_data_row + 1) / (select count(*) from sales_data)),3)*100) as "out of order ratio %"
from dual;

Open in new window


If the value of "out of order ratio %" > 20 % you need to follow the document 1085012.1 Reording Columns in Oracle For Demantra Performance
Oracle Tuning Guide can be found at the following location:
https://access.redhat.com/sites/default/files/oracle_tuning_guide_1-1_11-28-2007.pdf
Avatar of apunkabollywood

ASKER

Please check the below facts from DB server and suggest are these right or wrong:

Page cache :

pagecahe 100 (Current)


Default

1  10   15



DB Server setting

 ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (10000)
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: uag
        Wake-on: d
        Link detected: yes

Prod Server settings

ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

On DB Server

less /proc/meminfo
MemTotal:     32960172 kB
MemFree:       9254140 kB
Buffers:       1471736 kB
Cached:         399360 kB
SwapCached:          0 kB
Active:        1626732 kB
Inactive:       468776 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     32960172 kB
LowFree:       9254140 kB
SwapTotal:    25165816 kB
SwapFree:     25165816 kB
Dirty:              48 kB
Writeback:           0 kB
AnonPages:      244324 kB
Mapped:          55356 kB
Slab:           135700 kB
PageTables:      13804 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  36400972 kB
Committed_AS:   648460 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    276452 kB
VmallocChunk: 34359461623 kB
HugePages_Total:  5122
HugePages_Free:     49
HugePages_Rsvd:     48
Hugepagesize:     2048 kB


oracle : PPRVDB1 ]> ulimit -l
30000000


less /proc/sys/vm/swappiness
60


Request you all to plss suggest kernel paramneters or oracle tuning  how to check and increase performance of DB server


Thanks
SOLUTION
Avatar of Sean Stuber
Sean Stuber

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you so much Sdstuber and Bigschmuh for you valuable advice - I have collected AWR report from DB server - but not much good to analyse that kindly look and give your suggestion accordingly.attached with this comment
report.doc
Appart from the 3 SQL starting with "SELECT oo.object_id FROM nc_ob..." shown in "SQL ordered by gets" which query plan should be reviewed as they are issuing about 16300 gets per execution, I don't see any real problem in the report.
Although, the "Operating system statistics" clearly shows that :
-Too much CPU on this db server is available (IDLE_TIME is huge against others)
-IO wait time looks acceptable

On the application server, did you check the CPU and swap levels ?
Here is the output: App Server


var/log/syslog/message.log | grep error
 snmpd[18889]: error on subcontainer '' insert (-1)
 snmpd[32587]: error on subcontainer '' insert (-1)

Jul 22 06:19:43 appserver snmpd[3431]: warning: /etc/hosts.allow, line 8: can't verify hostname: getaddrinfo(appserver, AF_INET) failed
Jul 22 06:20:08 appserver snmpd[3431]: warning: /etc/hosts.allow, line 8: can't verify hostname: getaddrinfo(appserver) failed
Jul 22 06:20:10 appserver snmpd[3431]: warning: /etc/hosts.allow, line 8: can't verify hostname: getaddrinfo(appserver) failed

Kernel Logs:

Jun  2 03:23:27 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jun  5 03:14:17 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jun  8 03:15:33 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jun 21 11:44:27 appserver kernel: omv_mng[11997]: segfault at 0000000000000000 rip 0000000008082478 rsp 00000000ffee93a4 error 4
Jul 29 03:21:50 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jul 29 03:21:52 appserver kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84

kernel: Kernel logging (proc) stopped.
kernel: Kernel logging (proc) stopped.
nel: Kernel logging (proc) stopped.
nel: Kernel logging (proc) stopped.
nel:  [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
sce kernel: Kernel logging (proc) stopped.
sce kernel:  [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
sce kernel: Kernel logging (proc) stopped.
sce kernel: Kernel logging (proc) stopped.
 
acpiphp: Slot [260] registered
acpiphp: Slot [261] registered
acpiphp: Slot [262] registered
acpiphp: Slot [263] registered
VMware memory control driver initialized
e1000: eth0: e1000_set_tso: TSO is Enabled
e1000: eth1: e1000_set_tso: TSO is Enabled
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
  Vendor: VMware    Model: Virtual disk      Rev: 1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
 target0:0:2: Beginning Domain Validation
 target0:0:2: Domain Validation skipping write tests
 target0:0:2: Ending Domain Validation
 target0:0:2: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
 sdc: unknown partition table
sd 0:0:2:0: Attached scsi disk sdc
sd 0:0:2:0: Attached scsi generic sg2 type 0
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
 sdc: sdc1
SCSI device sdc: 20971520 512-byte hdwr sectors (10737 MB)
sdc: Write Protect is off
sdc: cache data unavailable
sdc: assuming drive cache: write through
 sdc: sdc1

free -g
             total       used       free     shared    buffers     cached
Mem:             7          1          6          0          0          0
-/+ buffers/cache:          0          6
Swap:            3          1          2


Linux 2.6.18-164.2.1.el5 (app-server)    

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.35    0.00    0.69    0.39    0.00   98.57

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               6.49        98.49       120.63  152738948  187075862
sda1              0.00         0.02         0.00      31564        906
sda2              6.49        98.47       120.63  152705504  187074956
sdb               1.04        21.40        13.14   33182340   20371901
dm-0              2.66         9.03        20.34   14011266   31549472
dm-1              0.82         7.54         6.02   11698282    9341416
dm-2              0.00         0.00         0.00       3838       2124
dm-3              0.00         0.00         0.00       6594       6552
dm-4              2.94         9.02        22.61   13988026   35061568
dm-5              0.91         4.62         5.46    7157226    8466880
dm-6              1.86         0.20         3.71     304350    5759042
dm-7              0.00         0.01         0.00      10666       2252
dm-8              4.62        20.24        16.76   31389168   25986920
dm-9              0.11         2.96         0.23    4593410     351992
dm-10             0.00         0.00         0.01       4744       9266
dm-11             0.19         6.55         0.43   10154618     665896
dm-12            10.42        38.29        45.06   59377344   69871576
dm-13             1.97        13.72        11.80   21283098   18293405
dm-14             0.28         7.67         1.34   11889810    2078496

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.95    0.00    1.95    4.88    0.00   91.22

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              24.51      1505.88         0.00       1536          0
sda1              0.00         0.00         0.00          0          0
sda2             24.51      1505.88         0.00       1536          0
sdb               0.00         0.00         0.00          0          0
dm-0              1.96        23.53         0.00         24          0
dm-1              0.00         0.00         0.00          0          0
dm-2              0.00         0.00         0.00          0          0
dm-3              0.00         0.00         0.00          0          0
dm-4              0.00         0.00         0.00          0          0
dm-5             27.45      1443.14         0.00       1472          0
dm-6              0.00         0.00         0.00          0          0
dm-7              0.00         0.00         0.00          0          0
dm-8              0.00         0.00         0.00          0          0
dm-9              0.00         0.00         0.00          0          0
dm-10             0.00         0.00         0.00          0          0
dm-11             0.00         0.00         0.00          0          0
dm-12             0.00         0.00         0.00          0          0
APP.pdf
The server looks like a vm, is the esx host busy?
yes its a VM not a hardware server..
One more thing want to add:

Is it possible CPU in the reports and logs shows is diffrent from CPU utilization shows of VM Machines on V Center --it shows they are utilizing 100% cpu ? :(
Do you have VMWare tools installed.
You may suffer from a CPU over-provisioning situation.

When a VM need 4 vCPU, it waits for 4 fully available cores before sending control to the VM.
For example, a db server with 4 vCPU using less than 20% cpu charge would get a real latency bonus by running one 1 vCPU only.

I googled this "5 Big Considerations for vCPU Provisioning" to get you some feedback about this.
Perhaps you should not execute report process' during work hours:

User generated image
Few things;

Are you also running eBusiness Suite?
Have you recently upgraded from 10g to 11g?

If yes then;

Verify your hidden parameters (They start with "_". There are five parameters that can affect your performance to a dead stop speed. Out of all of them _sort_elimiation_cost_ratio parameter has biggest impact on performance.

_b_tree_bitmap_plans                 boolean     FALSE
_fast_full_scan_enabled              boolean     FALSE
_index_join_enabled                  boolean     FALSE
_like_with_bind_as_equality          boolean     TRUE
_optimizer_autostats_job             boolean     FALSE
_sort_elimination_cost_ratio         integer     5
_system_trig_enabled                 boolean     TRUE
Start at the top of the SQL lists and work your way down.

Here's a few of the big hitters.

guyt7vgmcqbxh - took over an hour to finish   - can this be optimized? or as suggested above, run at a different time?

c52vystffk6mf - this ran 11,093 times, each one was fast, but cumulatively it consumed about 1/8 of all cpu   - can this be run fewer times? or, can it be tuned to be lighter weight? Knock a tenth of a second off it and you'll get 10 minutes back

9rrat0ynrdpbt - this consumed 70% of all Gets in your 6 hour window and took over 18 minutes.  


With a 6 hour window it's hard to say exactly what was bad when.  The three candidates I listed above might not be a problem at all.  If they ran sequentially and finished in the first couple hour and then people had problems after them then they working on them won't help.

Try generating a report for a much smaller window.  10-15 minutes usually works best.
Hi All,

Thanks all for your valuable advice:

@bigscumuh - I am planing to increase no of cpu from 4 to 8 but still not sure is this the only problem because vm showing 100% on monitoring tool on vmware but inside cpu utilization okay as u have seen logs.

@ Mike_omDBA - I am checking all status of cpu and memory continous from past 3 weeks but all three cpu utilization same 100% as per vm monitoring tools

@ kparikh001 - no its a new build 6 mnth before installed

@ sdstuber - will pull out and paste it soon

Some more points to look:

1. Some strange error on weblogic :

<Apr 2, 2012 3:05:24 PM CEST> <Error> <Server> <BEA-002608> <The ListenThread because of an error: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space

2. No of active user max by 5
I was suggesting to DOWNGRADE vCPU for the dbserver from 4 to 1 to avoid the latency ... not to increase it to 8 !!!
Pls check to make sure VMWare tools are installed, I didn't see your response
Yes they vmware tools are installed
Please check the lates hour report for DB servers and advice something to improve performance after upgarding 2 to 8 vcpus cpu utlization decrease to 70% but still not upto the mark
AWR-1-hour.doc
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@sdstuber:
Start at the top of the SQL lists and work your way down.
delete from PRV_DISCREPANCY_RESULTS where CUSTOMER_NAME = :1
why is that so expensive?
Perhaps there is no index on CUSTOMER_NAME and it's doing FTS. :p
The rest all points to un-optimized SQL.
Again, try setting parameter _sort_elimination_cost_ratio         integer     5
This is a dynamic parameter and help your query find better execution plan.
Hello SDBUSTER , Please explain me it in details and @Mike I didnt understand your point ? please explain
I don't  have any details to explain.

Look at the awr reports you posted.  In them there are lists of the top queries by time, by cpu, by io etc.   I've already identified a few of them.  Pick one and examine it.  Why does take so long?

MikeOM_DBA took a guess at one of them, maybe you don't have an index.  awr reports are only summaries we can't tell you specifically what is a problem or why.  You need to do some digging on your end and post what you find.

what are the explain plans for the expensive operations?  for the ones that execute multiple times, do they need to?  confirm that the expensive ones are actually a problem.  Did they run when people were having problems or not?  Also, if they were slow, are they culprit, or the victim of something else?
Thnak u all