Solved

Oracle RMAN High Paging on AIX - Slows AIX Server Response

Posted on 2010-01-07
Last Modified: 2013-12-18
Some background info first:
We have an IBM RS6000 9131-52A that uses a DS4800 array.
We have 3 diskgroups:  rootdg, datadg, and backdg.
datadg is used to house our Oracle database.
backdg is used to house our Flash Recovery Area.
There are 6 Oracle databases running on this server.  Oracle 10.2.0.4.
One of the 6 databases is downstream from another database (let's call it streamdb) on a Windows server.  The Windows server is constantly streaming data to streamdb.

Now the problem:
The problem is that every time we start an RMAN backup for one of the other databases (called slowdb), the paging spikes (like from 0 to 50,000+) and the OS's response time degrades considerably.  The backup takes between 10 and 15 minutes.

For the other databases the paging spikes as well during RMAN, but there is no server impact.  To be fair, the database causing the degradation is 60GB in size and the other databases (that are supposedly not affecting the OS) are 30GB (one of them) and less than 3GB for the others.  So RMAN runs less than a minute on them and may not have time to degrade the server.

To see if it was just RMAN or Oracle in general, I ran stats on the SLOWDB database.  It ran for 1.5 hours and although paging was high, there was no impact on the OS.   This makes us think it's something to do with IO to the backdg disk group.

Oracle Support has directed us to work with IBM support to review our IO configuration.
But we don't have IBM support and have only novice knowledge of IO configurations.

We are hoping that there's an expert on EE that can assist.
Thank you.
Question by:vocogov
33 Comments
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26203704
1) there is no rootdg in AIX. It's called rootvg

2) paging spikes of 50,000+ are by no means tolerable. How much memory does your box have?
I, personally, would not waste one single thought on I/O performance as long as paging is that high.

3) anyway, how is the DS attached?
If possible, use different adapters for the rootvg, datadg and backdg volumes, at the DS as well as at the AIX machine.
Afaik you can direct a DS4800 LUN to use a specific adapter.
Most important, separate paging I/O (most probably from/to rootvg) from data traffic.

4) To watch I/O activity, use topas during a backup session. Hit "D" for the "disk" panel and look particularly for the %busy rate.

5) A better tool is nmon (get it from here - http://www.ibm.com/developerworks/wikis/display/WikiPtype/nmon )
Hitting "a" will give you the adapter statistics. I bet you will see elevated and imbalanced values for %busy, read or write.

wmp




 
LVL 1

Author Comment

by:vocogov
ID: 26204117
Thanks.  We are looking into your suggestions.  At this time, I can't answer how the DS is attached.  Here is the memory information:

  Maximum number of PROCESSES allowed per user       [2560]                  +#
  Maximum number of pages in block I/O BUFFER CACHE  [20]                    +#
  Maximum Kbytes of real memory allowed for MBUFS    [0]                     +#
  Automatically REBOOT system after a crash           true                   +
  Continuously maintain DISK I/O history              false                  +
  HIGH water mark for pending write I/Os per file    [0]                     +#
  LOW water mark for pending write I/Os per file     [0]                     +#
  Amount of usable physical memory in Kbytes          16318464
  State of system keylock at boot time                normal
  Enable full CORE dump                               false                  +
  Use pre-430 style CORE dump                         false                  +
  Pre-520 tuning compatibility mode                   disable                +
  Maximum login name length at boot time             [9]                     +#
  Stack Execution Disable (SED) Mode                  select                 +
  NFS4 ACL Compatibility Mode                         secure                 +
  ARG/ENV list size in 4K byte blocks                [6]                     +#
  CPU Guard                                           enable                 +
  Processor capacity increment                        1.00
  Partition is capped                                 true
  Partition is dedicated                              true
  Entitled processor capacity                         2.00
  Minimum potential processor capacity                1.00
  Maximum potential processor capacity                2.00
  Variable processor capacity weight                  0
 
LVL 25

Expert Comment

by:madunix
ID: 26207721
Does it occur all the time? Or does free memory go up and down? What kind of applications are running on the server besides RMAN?
In case of high paging space usage you need to tune system parameters. For example, 15% paging space usage is not nothing, but nearly nothing; the alerting watermark is around 65-70%. If the system doesn't actually read/write pages from/to paging space, there is no reason to worry. And rather than tuning system parameters, you should find out which processes are causing the memory leak (if there is one). Sometimes growth of paging space is fairly normal. You should monitor your paging space for some days and see whether it's constantly growing or sometimes also shrinking. Attached is another document about VMM tuning, which includes considerations about lrud, which is responsible for page cleaning. If paging impacts your performance, you could also think about tuning pinned memory, i.e. memory that is never paged out.

http://www.ibm.com/developerworks/wikis/download/attachments/53871915/VMM+Tuning+Tip+-+Proctecting+Comp+Memory.pdf?version=2

https://www.ibm.com/developerworks/mydeveloperworks/blogs/aixpert/entry/initial_aix_tuning_guidelines_for?lang=en#comments

I remember reading a very interesting article in IBM Systems Magazine (Open Systems edition, Aug/Sept 2006) describing what IBM now recommends. The new recommendations are to leave
maxclient and maxperm at their default settings of 80,
but to still set minperm to something like 5. We also don't change the strict settings.
Instead, we alter other parameters as follows:
vmo -p -o minperm%=5
vmo -p -o lru_file_repage=0
vmo -p -o lru_poll_interval=10

I would also check IBM Systems Magazine: http://www.ibmsystemsmag.com/
http://www.ibmsystemsmag.com/aix/octobernovember08/coverstory/21979p1.aspx?ht=


madunix
 
LVL 1

Author Comment

by:vocogov
ID: 26212709
It happens every time I run RMAN.   The OS response time gets so slow it's ridiculous.  Just a simple ls command sits for several seconds.  Forget about refreshing the Oracle Console window.

Everything frees up as soon as RMAN finishes.

BTW, the paging levels I see are being reported through the Oracle Enterprise Console.  I've issued OS command line commands to look at memory and paging but have a hard time interpreting them.

We have installed the nmon and are going to work with it this afternoon.  
 
LVL 25

Accepted Solution

by:
madunix earned 1000 total points
ID: 26213761
You say AIX is freezing? Then you need to add more memory and tune Oracle, if it's really like that. Check the Redbook on Oracle best practices for AIX. I also have multiple questions: Does the errpt contain any errors? What is the exact text? Do you have enough space on the filesystem? Are you running RAC? Are you using ASM? Are the Oracle files on JFS2 filesystems? Is paging space equal to your real memory? Show me vmo -a | egrep "maxperm|minperm|maxclient".

- Record your current vmo parameters ("maxperm|minperm|maxclient") and experiment with them.
Try changing the following vmo settings (don't forget the % sign):
vmo -o  minperm%=10
vmo -o  maxclient%=20
vmo -o  maxperm%=20

- Also check the cio option on the filesystem. Some time ago I saw in nmon and topas that no JFS2 caching happened when the cio option was enabled, but I'm not sure if that is your case.

My recommendations to look at:
http://www.ibm.com/developerworks/wikis/display/WikiPtype/AIXV53AdminBestPractice
http://publib-b.boulder.ibm.com/redbooks.nsf/RedbookAbstracts/sg245511.html?Open
http://www-1.ibm.com/servers/aix/whitepapers/db_perf_aix.pdf
http://www.dba-oracle.com/t_aix_cio.htm
http://www.dba-oracle.com/t_ibm_aix_tuning.htm
http://www.ibm.com/developerworks/aix/library/au-aixoracle/index.html


madunix
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26213978
Check the paging activity of your system:
  • with topas
    • examine pgspin and pgspout (center middle of the screen)
    • should ideally be zero, but not beyond ca. 20/sec. for a longer time period
  • with nmon
    • hit "m"
    • examine  pages/sec In/Out to Paging Space (center middle)
    • same rules as above apply
If the activity is higher, consider installing more memory.
If the activity seems normal, consider increasing Oracle's SGA size.
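
The rule of thumb above (paging-space page-ins and page-outs ideally zero, and not above roughly 20 per second for a longer period) can be checked against a series of samples. A minimal Python sketch, assuming the samples are (pgspin, pgspout) pairs noted by hand from topas or nmon at a fixed interval. The function name, the sample values, and the choice of "five consecutive samples" as the meaning of "a longer time period" are illustrative assumptions, not something topas or nmon provides.

```python
def sustained_paging(samples, limit=20, min_run=5):
    """Return True if pgspin or pgspout exceeds `limit` pages/sec
    for at least `min_run` consecutive samples."""
    run = 0
    for pgspin, pgspout in samples:
        if pgspin > limit or pgspout > limit:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0  # a quiet sample breaks the streak
    return False


# Hypothetical samples, one per interval: (pgspin, pgspout)
idle = [(0, 0), (3, 0), (0, 1), (25, 0), (0, 0), (0, 0)]
backup = [(0, 0), (410, 675), (380, 702), (395, 650), (402, 688), (377, 640)]

print(sustained_paging(idle))    # False: only one isolated spike
print(sustained_paging(backup))  # True: five samples in a row above the limit
```

A single spike is ignored on purpose; only a run of busy samples is treated as a sign that more memory (or VMM tuning) is worth considering.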

wmp




 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26213997
... and don't forget the I/O tuning tips I gave you above!
 
LVL 1

Author Comment

by:vocogov
ID: 26214058
The server isn't frozen.  It just takes 30 seconds+ for simple OS commands to return data while the RMAN is executing.  The Oracle console is fairly useless until RMAN finishes.  As soon as RMAN finishes then everything is normal (by "frees" I meant 'free like a bird').  

We've executed the nmon and are reviewing things now.

Here are the answers to your questions:
1. Does the errpt contain any errors?  NO. No errors were generated during the RMAN run.
2. What is the exact text?
3. Do you have enough space on the filesystem?  YES, PLENTY.
4. Are you running RAC?  NO RAC
5. Are you using ASM?  NO ASM
6. Are the Oracle files on JFS2 filesystems?  YES
7. Is paging space equal to your real memory?  HOW CAN I TELL?
8. show me vmo -a | egrep "maxperm|minperm|maxclient"
              maxclient% = 80
               maxperm = 3096141
              maxperm% = 80
               minperm = 774035
              minperm% = 20
      strict_maxclient = 1
        strict_maxperm = 0

How do I tell if CIO is enabled?
 
LVL 1

Author Comment

by:vocogov
ID: 26214103
This is a snapshot of topas when RMAN is not running.  Does this give you an idea of my current settings?
topas-good.gif
 
LVL 25

Expert Comment

by:madunix
ID: 26214170
you can enable cio in /etc/filesystems: "options = rw,cio"
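
To see which filesystems already have cio configured, the stanzas in /etc/filesystems can be scanned for an options line. A minimal Python sketch, assuming the usual stanza layout (a mount point followed by a colon, then indented `name = value` attributes, with `*` starting a comment line). The mount points and logical volume names in the sample are invented for illustration.

```python
def cio_mounts(text):
    """Map each mount point in /etc/filesystems-style text to True/False
    depending on whether its options attribute includes 'cio'."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):  # blank line or comment
            continue
        if line.endswith(":") and "=" not in line:
            current = line[:-1]               # start of a new stanza
            result[current] = False
        elif current and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "options":
                result[current] = "cio" in value.split(",")
    return result


# Hypothetical stanzas; the mount points and logical volumes are invented.
sample = """
/oradata:
        dev             = /dev/oradatalv
        vfs             = jfs2
        mount           = true
        options         = rw,cio

/orabin:
        dev             = /dev/orabinlv
        vfs             = jfs2
        mount           = true
        options         = rw
"""

print(cio_mounts(sample))
```

On a live system the text would come from reading /etc/filesystems itself. That file only reflects the configured options; the options column of the `mount` command shows what is in effect for filesystems that are currently mounted.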
 
LVL 1

Author Comment

by:vocogov
ID: 26214179
Here's topas during RMAN
topas-rman4.gif
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26214180
Well, there is no paging/CPU/IO problem in this situation.

Only "Wait 5.6" (= %wait for I/O in most cases) is a bit high with such a low I/O activity.

Your memory is 16 GB, paging space is 12 GB, with 45% used, which is also not an actual problem.

Rather little memory is used for processes (% Comp = 29.9).

Watch how all the values mentioned here increase during RMAN!
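
The memory and paging-space figures above can be cross-checked against numbers already posted in this thread: 16318464 KB of usable physical memory from the system settings listing, and a 12 GB paging space at 45% used as read from the topas snapshot. On AIX, `lsps -a` lists the paging spaces and `lsattr -El sys0 -a realmem` reports real memory. A small Python sketch of the arithmetic, assuming the 12 GB figure is exact:

```python
real_mem_kb = 16318464        # "Amount of usable physical memory in Kbytes"
paging_gb = 12                # paging space size read from topas
paging_used_pct = 45          # paging space in use, read from topas

real_mem_gb = real_mem_kb / (1024 * 1024)
paging_used_gb = paging_gb * paging_used_pct / 100

print(f"real memory      : {real_mem_gb:.2f} GB")
print(f"paging space     : {paging_gb} GB ({paging_gb / real_mem_gb:.0%} of real memory)")
print(f"paging space used: {paging_used_gb:.1f} GB")
```

Under that assumption the paging space is smaller than real memory (roughly three quarters of it), with a bit over 5 GB of it in use.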


 
LVL 1

Author Comment

by:vocogov
ID: 26214182
BTW It fluctuates between 90% and 100%.   Mostly it stays at 100%.
 
LVL 68

Assisted Solution

by:woolmilkporc
woolmilkporc earned 1000 total points
ID: 26214219
Well,

we can clearly see two problems

1) I/O wait / disk busy. You obviously have an overload here! Check adapter statistics with nmon!

2) paging. Follow madunix' suggestion and set

vmo -o  minperm%=10
vmo -o  maxclient%=20
vmo -o  maxperm%=20
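
To get a feel for what these percentages mean in memory terms, they can be converted using the vmo output posted earlier in the thread (maxperm% = 80 corresponding to maxperm = 3096141 pages). A rough Python sketch, assuming 4 KB pages and assuming all three percentages are taken against the same base as that reported maxperm value; the helper name is made up for illustration.

```python
PAGE_KB = 4  # assumption: vmo counts 4 KB pages

# From the vmo output in this thread: maxperm% = 80 -> maxperm = 3096141 pages
base_pages = 3096141 / 0.80


def pct_to_gib(pct):
    """Convert a vmo percentage into GiB, using the base derived above."""
    pages = base_pages * pct / 100
    return pages * PAGE_KB / (1024 * 1024)


for name, old, new in [("minperm%", 20, 10),
                       ("maxperm%", 80, 20),
                       ("maxclient%", 80, 20)]:
    print(f"{name:11s} {old:3d}% = {pct_to_gib(old):5.2f} GiB  ->  "
          f"{new:3d}% = {pct_to_gib(new):5.2f} GiB")
```

Under those assumptions the maxperm/maxclient threshold for file pages drops from roughly 11.8 GiB to roughly 3 GiB, which leaves correspondingly more room for process (computational) memory during the backup.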
 
LVL 1

Author Comment

by:vocogov
ID: 26214225
CIO is not enabled.  I don't see it specified for any of the filesystems in the /etc/filesystems file.  
Were you saying that it should NOT be enabled?
 
LVL 25

Expert Comment

by:madunix
ID: 26214238
for your info CIO
  Concurrent I/O
  Only available in JFS2
  Allows performance close to raw devices
  Use for Oracle dbf and control files, and online redo logs,
   not for binaries
  No system buffer caching
  Designed for apps (such as RDBs) that enforce write
   serialization at the app
  Allows non-use of inode locks
  Implies DIO as well
  Benefits heavy update workloads
  Not all apps benefit from CIO and DIO  some are better with filesystem caching and some are safer
   that way

madunix
 
LVL 25

Expert Comment

by:madunix
ID: 26214345
I'm not near my AIX system right now, but I think you mount your filesystems with the cio flag like this:
# mount -o cio /orafilesystemblabla
Before doing that, adjust your vmo parameters as we said; it might help.

madunix
 
LVL 1

Author Comment

by:vocogov
ID: 26214523
Thanks everyone.  
I will do one change at a time to see how it goes starting on Monday.  It's 5pm and I'm not that dedicated. :)
For your viewing pleasure, here are snapshots of the nmon -m and nmon showing asynchronous I/O.
I will update on Monday of our results.

nmon-memory4-01082010.gif
 
LVL 1

Author Comment

by:vocogov
ID: 26214531
nmon with asynchronous I/O
nmon-aioservers4-01082010.gif
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26214606
Er,

I meant "nmon", then "a" (lowercase), not "A" (uppercase)!
 
LVL 1

Author Comment

by:vocogov
ID: 26285340
I see.  Here is the lower case A snapshot taken during RMAN for your viewing enjoyment. :)
We are going to set the vmm settings and see how things go.  We will update.   Thank you!
during-rman-d-a-carot-7.gif
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26285638
Hi,

I can't find your data that joyful (at least not for you).

As you might have seen yourself, you have a huge imbalance on your FC adapters.

I think you should consider moving some write workload from fcs1 to fcs0.

dac5 (hdisk3) is overly busy writing. Try to add an additional LUN, attached via fcs0, to spread the load between adapters.

I assume hdisk0/1 make up your rootvg containing swap space?
If so, our vmo suggestions might help a bit to reduce the load on them and on the sisscsia1 adapter.

Please issue, as nmon suggests, "chdev -l sys0 -a iostat=true" to get some more I/O statistics.

Curious about your update to come.

wmp



 
LVL 1

Author Comment

by:vocogov
ID: 26286302
We changed the vmo settings to the suggested ones and reran our rman backup.
Here is the output from the nmon ad^.  I'll upload the memory one after this one.
during-rman-d-a-carot7.gif
 
LVL 1

Author Comment

by:vocogov
ID: 26286345
Here is the memory snapshot after vmo changes during rman.
memory6.gif
 
LVL 1

Author Comment

by:vocogov
ID: 26286368
I noticed that the fc stats received an error during the rman run (see the uploaded picture of the nmon da^).
What do you suppose that means?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26286551

OK, vmo tuning seems to have helped lower page I/O activity.
Processes are now able to take nearly double the amount of memory they had before (23.4 vs. 12.7 percent),
which leads to only 0.5 pages/sec. instead of 675.4/sec.

But it seems that the last two snapshots aren't showing the same situations as the ones before.
Although still far from being evenly distributed, I/O seems a bit more balanced, which is due to many more read I/O than before.
Obviously you have measured at an earlier point in the lifetime of your job.

Anyway, did the reduced paging activity lead to better performance?

wmp
 
 
LVL 1

Author Comment

by:vocogov
ID: 26286629
Yes it did lead to better performance on the OS.  Commands may have paused here and there but overall it did improve.

Why do you suppose the fcstats returned an error during the rman run?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26286634
fcstat errors - does fcstat fcs0 or fcstat fcs1 work?
If yes, I'd tend to assume that this error is of a transient nature.
Please try the fcstat and also try nmon again!
 
LVL 1

Author Comment

by:vocogov
ID: 26286755
Yes, fcstat fcs0 and fcstat fcs1 seem to work fine individually:

FIBRE CHANNEL STATISTICS REPORT: fcs0

Device Type: FC Adapter (df1000fd)
Serial Number: 1B61304547
Option ROM Version: 02C82135
Firmware Version: B1D2.10X5
World Wide Node Name: 0x20000000C953A1EC
World Wide Port Name: 0x10000000C953A1EC

FC-4 TYPES:
  Supported: 0x0000012000000000000000000000000000000000000000000000000000000000
  Active:    0x0000010000000000000000000000000000000000000000000000000000000000
Class of Service: 3
Port Speed (supported): 4 GBIT
Port Speed (running):   4 GBIT
Port FC ID: 0x000001
Port Type: Private Loop

Seconds Since Last Reset: 339945

        Transmit Statistics     Receive Statistics
        -------------------     ------------------
Frames: 46271575                378810169
Words:  14172847616             184295993856

LIP Count: 1
NOS Count: 0
Error Frames:  0
Dumped Frames: 0
Link Failure Count: 0
Loss of Sync Count: 8
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 2
Invalid CRC Count: 0

IP over FC Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0

FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0
  No Command Resource Count: 0

IP over FC Traffic Statistics
  Input Requests:   0
  Output Requests:  0
  Control Requests: 0
  Input Bytes:  0
  Output Bytes: 0

FC SCSI Traffic Statistics
  Input Requests:   15318425
  Output Requests:  4104669
  Control Requests: 12711
  Input Bytes:  727577116782
  Output Bytes: 54958804992
-------------------------------------------------
cjisersrv:/usr/nmon >fcstat fcs1

FIBRE CHANNEL STATISTICS REPORT: fcs1

Device Type: FC Adapter (df1000fd)
Serial Number: 1B60304EB9
Option ROM Version: 02C82135
Firmware Version: B1D2.10X5
World Wide Node Name: 0x20000000C9514682
World Wide Port Name: 0x10000000C9514682

FC-4 TYPES:
  Supported: 0x0000012000000000000000000000000000000000000000000000000000000000
  Active:    0x0000010000000000000000000000000000000000000000000000000000000000
Class of Service: 3
Port Speed (supported): 4 GBIT
Port Speed (running):   4 GBIT
Port FC ID: 0x000001
Port Type: Private Loop

Seconds Since Last Reset: 339961

        Transmit Statistics     Receive Statistics
        -------------------     ------------------
Frames: 134830654               192794967
Words:  68906417152             98764589056

LIP Count: 1
NOS Count: 0
Error Frames:  0
Dumped Frames: 0
Link Failure Count: 1
Loss of Sync Count: 7
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 34
Invalid CRC Count: 0

IP over FC Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0

FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0
  No Command Resource Count: 0

IP over FC Traffic Statistics
  Input Requests:   0
  Output Requests:  0
  Control Requests: 0
  Input Bytes:  0
  Output Bytes: 0

FC SCSI Traffic Statistics
  Input Requests:   1529498
  Output Requests:  314292
  Control Requests: 12711
  Input Bytes:  390383250431
  Output Bytes: 272319766528
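
The FC SCSI traffic counters in the two reports above can be compared directly to see how the load is split between the adapters. A minimal Python sketch using only the Input Bytes and Output Bytes values copied from this output; the dictionary layout and helper name are illustrative. Note that these counters are cumulative since the last adapter reset (about 340,000 seconds here), so they describe the overall split, not just the RMAN window.

```python
# FC SCSI traffic counters copied from the two fcstat reports above
fcstat = {
    "fcs0": {"input_bytes": 727577116782, "output_bytes": 54958804992},
    "fcs1": {"input_bytes": 390383250431, "output_bytes": 272319766528},
}


def share(counter):
    """Return each adapter's fraction of the total for one counter."""
    total = sum(stats[counter] for stats in fcstat.values())
    return {name: stats[counter] / total for name, stats in fcstat.items()}


for counter in ("input_bytes", "output_bytes"):
    parts = ", ".join(f"{name} {frac:.1%}" for name, frac in share(counter).items())
    print(f"{counter:13s}: {parts}")
```

By these counters fcs1 has carried roughly five sixths of the bytes written, while fcs0 has carried most of the bytes read, which matches the write-heavy load on fcs1 seen in the nmon adapter panel.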
 
LVL 1

Author Comment

by:vocogov
ID: 26286820
Seems the error is transitory since it didn't repeat this time.
during-rman-d-a-carot9.gif
 
LVL 1

Author Closing Comment

by:vocogov
ID: 31674183
Tough to decide how to divvy up points.  Chose to split.
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 26287382
OK,

thx for the points!

But please try to tune your I/O distribution, as I wrote in #26285638!
There is a remarkable imbalance!

wmp
