Oracle RMAN High Paging on AIX - Slows AIX Server Response

Some background first:
We have an IBM RS6000 9131-52A that uses a DS4800 array.
We have 3 diskgroups:  rootdg, datadg, and backdg.
datadg is used to house our Oracle database.
backdg is used to house our Flash Recovery Area.
There are 6 Oracle databases running on this server.  Oracle 10.2.0.4.
One of the 6 databases (let's call it streamdb) is downstream from another database on a Windows server.  The Windows server is constantly streaming data to streamdb.

Now the problem:
The problem is that every time we start an RMAN backup for one of the other databases (called slowdb), paging spikes (from 0 to 50,000+) and the OS's response time degrades considerably.  The backup takes 10-15 minutes.

For the other databases, paging spikes as well during RMAN, but there is no server impact.  To be fair, the database causing the degradation is 60GB in size and the other databases (that are supposedly not affecting the OS) are 30GB (one of them) and less than 3GB for the others.  So RMAN runs less than a minute on them and may not have time to degrade the server.

To see whether it was just RMAN or Oracle in general, I ran a statistics-gathering job on the slowdb database.  It ran for 1.5 hours and, although paging was high, there was no impact on the OS.  This makes us think it has something to do with I/O to the backdg disk group.

Oracle Support has directed us to work with IBM support to review our IO configuration.
But we don't have IBM support and have only novice knowledge of IO configurations.

We are hoping that there's an expert on EE that can assist.
Thank you.
Asked by: vocogov
 
madunix commented:
You say AIX is freezing? If it really is, you need to add more memory and tune Oracle. Check the Redbook on Oracle best practices for AIX. I also have several questions: Does errpt contain any errors? What is the exact text? Do you have enough space on the filesystem? Are you running RAC? Are you using ASM? Are the Oracle files on JFS2 filesystems? Is paging space equal to your real memory? Show me the output of vmo -a | egrep "maxperm|minperm|maxclient".

- Record your current vmo parameters (maxperm, minperm, maxclient) and experiment with them.
Try changing the following vmo settings (don't forget the % sign):
vmo -o  minperm%=10
vmo -o  maxclient%=20
vmo -o  maxperm%=20

- Also check the cio option on the filesystem. Some time ago I saw in nmon and topas that no JFS2 caching happened when the cio option was enabled, but I'm not sure whether that applies in your case.
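
Before experimenting, it is worth capturing the current values so they can be put back. The following is a minimal sketch, not a tested procedure: the function name vmo_restore_cmds and the /tmp path in the usage comment are made up for illustration, and it assumes that `vmo -a` prints lines of the form `name = value`.

```shell
#!/bin/sh
# Turn `vmo -a` style output (lines like "  maxperm% = 80") into the
# commands that would restore the three tunables discussed here.
# Only the percent tunables are emitted; the maxperm and minperm page
# counts in the listing are treated as derived values (an assumption).
vmo_restore_cmds() {
    awk '
        $1 == "maxclient%" || $1 == "maxperm%" || $1 == "minperm%" {
            # fields are: name, "=", value
            printf "vmo -o %s=%s\n", $1, $3
        }'
}

# Hypothetical usage on the AIX host, before changing anything:
#   vmo -a | vmo_restore_cmds > /tmp/vmo_restore.sh
```

Keeping the generated commands in a file means a single experiment can be undone quickly if response time gets worse instead of better.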

I recommend looking at:
http://www.ibm.com/developerworks/wikis/display/WikiPtype/AIXV53AdminBestPractice
http://publib-b.boulder.ibm.com/redbooks.nsf/RedbookAbstracts/sg245511.html?Open
http://www-1.ibm.com/servers/aix/whitepapers/db_perf_aix.pdf
http://www.dba-oracle.com/t_aix_cio.htm
http://www.dba-oracle.com/t_ibm_aix_tuning.htm
http://www.ibm.com/developerworks/aix/library/au-aixoracle/index.html


madunix
 
woolmilkporc commented:
1) There is no rootdg in AIX. It's called rootvg.

2) paging spikes of 50,000+ are by no means tolerable. How much memory does your box have?
I, personally, would not waste one single thought on I/O performance as long as paging is that high.

3) anyway, how is the DS attached?
If possible, use different adapters for the rootvg, datadg and backdg volumes, at the DS as well as at the AIX machine.
AFAIK you can direct a DS4800 LUN to use a specific adapter.
Most important, separate paging I/O (most probably from/to rootvg) from data traffic.

4) To watch I/O activity, use topas during a backup session. Hit "D" for the "disk" panel and look particularly for the %busy rate.

5) A better tool is nmon (get it from here - http://www.ibm.com/developerworks/wikis/display/WikiPtype/nmon )
Hitting "a" will give you the adapter statistics. I bet you will see elevated and imbalanced values for %busy, read or write.

wmp
 
vocogov (author) commented:
Thanks.  We are looking into your suggestions.  At this time, I can't answer how the DS is attached.  Here is the memory information:

  Maximum number of PROCESSES allowed per user       [2560]                  +#
  Maximum number of pages in block I/O BUFFER CACHE  [20]                    +#
  Maximum Kbytes of real memory allowed for MBUFS    [0]                     +#
  Automatically REBOOT system after a crash           true                   +
  Continuously maintain DISK I/O history              false                  +
  HIGH water mark for pending write I/Os per file    [0]                     +#
  LOW water mark for pending write I/Os per file     [0]                     +#
  Amount of usable physical memory in Kbytes          16318464
  State of system keylock at boot time                normal
  Enable full CORE dump                               false                  +
  Use pre-430 style CORE dump                         false                  +
  Pre-520 tuning compatibility mode                   disable                +
  Maximum login name length at boot time             [9]                     +#
  Stack Execution Disable (SED) Mode                  select                 +
  NFS4 ACL Compatibility Mode                         secure                 +
  ARG/ENV list size in 4K byte blocks                [6]                     +#
  CPU Guard                                           enable                 +
  Processor capacity increment                        1.00
  Partition is capped                                 true
  Partition is dedicated                              true
  Entitled processor capacity                         2.00
  Minimum potential processor capacity                1.00
  Maximum potential processor capacity                2.00
  Variable processor capacity weight                  0
 
madunix commented:
Does it occur all the time, or does free memory go up and down? What other applications run on that server besides RMAN?
In case of high paging space usage you need to tune system parameters. For example, 15% paging space usage is not nothing, but nearly nothing; the alerting watermark is around 65-70%. If the system doesn't actually read/write pages from/to paging space, there is no reason to worry. And rather than tuning system parameters, you should find out which processes are causing the memory leak (if there is one). Sometimes growth of paging space is fairly normal. You should monitor your paging space for some days and see whether it's constantly growing or sometimes also shrinking. Attached is another document about VMM tuning, which includes considerations about lrud, which is responsible for page cleaning. If paging impacts your performance, you could also think about tuning pinned memory, i.e. memory that is never paged out.

http://www.ibm.com/developerworks/wikis/download/attachments/53871915/VMM+Tuning+Tip+-+Proctecting+Comp+Memory.pdf?version=2

https://www.ibm.com/developerworks/mydeveloperworks/blogs/aixpert/entry/initial_aix_tuning_guidelines_for?lang=en#comments

I remember reading a very interesting article in IBM Systems Magazine (Open Systems edition, Aug/Sept 2006), which states that IBM now recommends the following: leave
maxclient and maxperm at their default settings of 80,
but still set minperm to something like 5. We also don't change the strict settings.
Instead, we alter other parameters as follows:
vmo -p -o minperm%=5
vmo -p -o lru_file_repage=0
vmo -p -o lru_poll_interval=10

I would also check IBM Systems Magazine: http://www.ibmsystemsmag.com/
http://www.ibmsystemsmag.com/aix/octobernovember08/coverstory/21979p1.aspx?ht=


madunix
 
vocogov (author) commented:
It happens every time I run RMAN.  The OS response time gets so slow it's ridiculous.  Just a simple ls command sits for several seconds.  Forget about refreshing the Oracle Console window.

Everything frees up as soon as RMAN finishes.

BTW, the paging levels I see are being reported through the Oracle Enterprise Console.  I've issued OS command line commands to look at memory and paging but have a hard time interpreting them.

We have installed nmon and are going to work with it this afternoon.
 
woolmilkporc commented:
Check the paging activity of your system:
  • with topas
    • examine pgspin and pgspout (center middle of the screen)
    • should ideally be zero, and should not stay above about 20/sec for an extended period
  • with nmon
    • hit "m"
    • examine  pages/sec In/Out to Paging Space (center middle)
    • same rules as above apply
If the activity is higher, consider installing more memory.
If the activity seems normal, consider increasing Oracle's SGA size.
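
If you prefer a command-line check over watching topas or nmon, the same rule of thumb can be applied to vmstat samples. This is a rough sketch, assuming the classic AIX vmstat layout in which pi and po (paging-space page-ins and page-outs per second) are the 6th and 7th columns. The function name paging_over is hypothetical, and the default of 20 simply mirrors the threshold mentioned above.

```shell
#!/bin/sh
# Read `vmstat <interval> <count>` style lines on stdin and report the
# samples where paging-space page-ins (pi) or page-outs (po) exceed a limit.
# Assumption: column order "r b avm fre re pi po ...", so pi is field 6
# and po is field 7. Header lines are skipped because their first field
# is not a number.
paging_over() {
    limit=${1:-20}
    awk -v limit="$limit" '
        $1 ~ /^[0-9]+$/ && ($6 > limit || $7 > limit) {
            printf "pi=%d po=%d exceeds %d/sec\n", $6, $7, limit
        }'
}

# Hypothetical usage during a backup (12 samples, 5 seconds apart):
#   vmstat 5 12 | paging_over 20
```

A sample or two over the limit is not alarming by itself; what matters is whether the lines keep coming for the whole duration of the backup.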

wmp
 
woolmilkporc commented:
... and don't forget the I/O tuning tips I gave you above!
 
vocogov (author) commented:
The server isn't frozen.  It just takes 30+ seconds for simple OS commands to return data while RMAN is executing.  The Oracle console is fairly useless until RMAN finishes.  As soon as RMAN finishes, everything is back to normal (by "frees" I meant 'free like a bird').

We've run nmon and are reviewing things now.

Here are the answers to your questions:
1. Does errpt contain any errors?  NO. No errors were generated during the RMAN run.
2. What is the exact text?
3. Do you have enough space on the filesystem?  YES, PLENTY.
4. Are you running RAC?  NO RAC
5. Are you using ASM?  NO ASM
6. Are the Oracle files on JFS2 filesystems?  YES
7. Is paging space equal to your real memory?  HOW CAN I TELL?
8. Show me vmo -a | egrep "maxperm|minperm|maxclient":
              maxclient% = 80
               maxperm = 3096141
              maxperm% = 80
               minperm = 774035
              minperm% = 20
      strict_maxclient = 1
        strict_maxperm = 0

How do I tell if CIO is enabled?
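
On question 7: on AIX, `lsps -s` reports the total paging space and `lsattr -El sys0 -a realmem` reports real memory in KB (the "usable physical memory" figure from the smit screen earlier in the thread). Below is a small sketch for comparing the two. paging_vs_ram is a hypothetical helper, and it assumes `lsps -s` prints the total as a figure ending in MB in the first column of its data line.

```shell
#!/bin/sh
# Print paging space as a percentage of real memory.
#   $1 = output of `lsps -s` (data line assumed to start with e.g. "12288MB")
#   $2 = real memory in KB
paging_vs_ram() {
    printf '%s\n' "$1" | awk -v kb="$2" '
        $1 ~ /^[0-9]+MB$/ {
            mb = $1
            sub(/MB$/, "", mb)          # strip the unit, keep the number
            ram_mb = kb / 1024
            printf "paging=%dMB ram=%dMB ratio=%.0f%%\n", mb, ram_mb, 100 * mb / ram_mb
        }'
}

# Hypothetical usage on the AIX host:
#   paging_vs_ram "$(lsps -s)" "$(lsattr -El sys0 -a realmem | awk '{print $2}')"
```

With the 16318464 KB shown in the smit screen, real memory works out to 15936 MB, so any paging space total can be read against that figure.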
 
vocogov (author) commented:
This is a snapshot of topas when RMAN is not running.  Does this give you an idea of my current settings?
topas-good.gif
 
madunix commented:
You can enable cio in /etc/filesystems: "options = rw,cio"
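
To see which filesystems already carry the option, the stanzas in /etc/filesystems can be scanned. This is a minimal sketch, assuming the usual stanza layout (a `/mountpoint:` header followed by indented `key = value` lines). cio_stanzas is a hypothetical helper name.

```shell
#!/bin/sh
# List the stanzas of an /etc/filesystems-format file whose "options" line
# includes cio as one of its comma-separated mount options.
# The file is passed as an argument so the function can be tried on a copy.
cio_stanzas() {
    awk '
        /^\/[^ ]*:/ {                    # stanza header, e.g. "/oradata:"
            name = $1
            sub(/:$/, "", name)
            next
        }
        $1 == "options" {                # e.g. "options = rw,cio"
            n = split($3, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "cio") print name
        }' "$1"
}

# Hypothetical usage on the AIX host:
#   cio_stanzas /etc/filesystems
```

No output means no stanza requests cio at mount time. A filesystem mounted by hand with -o cio would not show up here, since this only reads the configuration file.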
 
vocogov (author) commented:
Here's topas during RMAN
topas-rman4.gif
 
woolmilkporc commented:
Well, there is no paging/CPU/IO problem in this situation.

Only "Wait 5.6" (= %wait for I/O in most cases) is a bit high with such a low I/O activity.

Your memory is 16 GB, paging space is 12 GB, with 45% used, which is also not an actual problem.

Rather little memory is used for processes (% Comp = 29.9).

Watch how all the values mentioned here increase during RMAN!


 
vocogov (author) commented:
BTW, it fluctuates between 90% and 100%.  Mostly it stays at 100%.
 
woolmilkporc commented:
Well,

we can clearly see two problems

1) I/O wait / disk busy. You obviously have an overload here! Check adapter statistics with nmon!

2) Paging. Follow madunix's suggestion and set

vmo -o  minperm%=10
vmo -o  maxclient%=20
vmo -o  maxperm%=20
 
vocogov (author) commented:
CIO is not enabled.  I don't see it specified for any of the filesystems in the /etc/filesystems file.  
Were you saying that it should NOT be enabled?
 
madunix commented:
For your information, about CIO:
  Concurrent I/O
  Only available in JFS2
  Allows performance close to raw devices
  Use for Oracle dbf and control files, and online redo logs, not for binaries
  No system buffer caching
  Designed for apps (such as RDBs) that enforce write serialization at the app
  Allows non-use of inode locks
  Implies DIO as well
  Benefits heavy update workloads
  Not all apps benefit from CIO and DIO; some are better with filesystem caching and some are safer that way

madunix
 
madunix commented:
I'm not near my AIX system right now, but I think you mount your filesystems with the cio flag:
# mount -o cio /orafilesystemblabla
Before doing that, adjust your vmo parameters as we said; it might help.

madunix
 
vocogov (author) commented:
Thanks everyone.
I will make one change at a time to see how it goes, starting on Monday.  It's 5pm and I'm not that dedicated. :)
For your viewing pleasure, here are snapshots of the nmon -m and nmon showing asynchronous I/O.
I will post an update on Monday with our results.

nmon-memory4-01082010.gif
 
vocogov (author) commented:
nmon with asynchronous I/O
nmon-aioservers4-01082010.gif
 
woolmilkporc commented:
Er,

I meant "nmon", then "a" (lowercase), not "A" (uppercase)!
 
vocogov (author) commented:
I see.  Here is the lowercase "a" snapshot taken during RMAN for your viewing enjoyment. :)
We are going to set the vmm settings and see how things go.  We will update.   Thank you!
during-rman-d-a-carot-7.gif
 
woolmilkporc commented:
Hi,

I can't say I find your data all that enjoyable (at least not for you).

As you may have seen for yourself, you have a huge imbalance on your FC adapters.

I think you should consider moving some write workload from fcs1 to fcs0.

dac5 (hdisk3) is overly busy writing. Try to add an additional LUN, attached via fcs0, to spread the load between adapters.

I assume hdisk0/1 make up your rootvg containing swap space?
If so, our vmo suggestions might help a bit to reduce the load on them and on the sisscsia1 adapter.

Please issue, as nmon suggests, "chdev -l sys0 -a iostat=true" to get some more I/O statistics.

Curious about your update to come.

wmp
 
vocogov (author) commented:
We changed the vmo settings to the suggested ones and reran our rman backup.
Here is the output from the nmon ad^.  I'll upload the memory one after this one.
during-rman-d-a-carot7.gif
 
vocogov (author) commented:
Here is the memory snapshot after vmo changes during rman.
memory6.gif
 
vocogov (author) commented:
I noticed that the FC stats received an error during the RMAN run (see the uploaded picture of the nmon da^).
What do you suppose that means?
 
woolmilkporc commented:
OK, the vmo tuning seems to have helped lower page I/O activity.
Processes are now able to take nearly double the amount of memory compared to before (23.4 vs. 12.7 percent),
which leads to only 0.5 pages/sec. instead of 675.4/sec.

But it seems that the last two snapshots aren't showing the same situations as the ones before.
Although still far from being evenly distributed, I/O seems a bit more balanced, which is due to many more read I/O than before.
Obviously you have measured at an earlier point in the lifetime of your job.

Anyway, did the reduced paging activity lead to better performance?

wmp
 
 
vocogov (author) commented:
Yes it did lead to better performance on the OS.  Commands may have paused here and there but overall it did improve.

Why do you suppose the fcstats returned an error during the rman run?
 
woolmilkporc commented:
fcstat errors: does fcstat fcs0 or fcstat fcs1 work?
If yes, I'd tend to assume that this error is of a transient nature.
Please try the fcstat and also try nmon again!
 
vocogov (author) commented:
Yes, fcstat fcs0 and fcstat fcs1 seem to work fine individually:

FIBRE CHANNEL STATISTICS REPORT: fcs0

Device Type: FC Adapter (df1000fd)
Serial Number: 1B61304547
Option ROM Version: 02C82135
Firmware Version: B1D2.10X5
World Wide Node Name: 0x20000000C953A1EC
World Wide Port Name: 0x10000000C953A1EC

FC-4 TYPES:
  Supported: 0x0000012000000000000000000000000000000000000000000000000000000000
  Active:    0x0000010000000000000000000000000000000000000000000000000000000000
Class of Service: 3
Port Speed (supported): 4 GBIT
Port Speed (running):   4 GBIT
Port FC ID: 0x000001
Port Type: Private Loop

Seconds Since Last Reset: 339945

        Transmit Statistics     Receive Statistics
        -------------------     ------------------
Frames: 46271575                378810169
Words:  14172847616             184295993856

LIP Count: 1
NOS Count: 0
Error Frames:  0
Dumped Frames: 0
Link Failure Count: 0
Loss of Sync Count: 8
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 2
Invalid CRC Count: 0

IP over FC Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0

FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0
  No Command Resource Count: 0

IP over FC Traffic Statistics
  Input Requests:   0
  Output Requests:  0
  Control Requests: 0
  Input Bytes:  0
  Output Bytes: 0

FC SCSI Traffic Statistics
  Input Requests:   15318425
  Output Requests:  4104669
  Control Requests: 12711
  Input Bytes:  727577116782
  Output Bytes: 54958804992
-------------------------------------------------
cjisersrv:/usr/nmon >fcstat fcs1

FIBRE CHANNEL STATISTICS REPORT: fcs1

Device Type: FC Adapter (df1000fd)
Serial Number: 1B60304EB9
Option ROM Version: 02C82135
Firmware Version: B1D2.10X5
World Wide Node Name: 0x20000000C9514682
World Wide Port Name: 0x10000000C9514682

FC-4 TYPES:
  Supported: 0x0000012000000000000000000000000000000000000000000000000000000000
  Active:    0x0000010000000000000000000000000000000000000000000000000000000000
Class of Service: 3
Port Speed (supported): 4 GBIT
Port Speed (running):   4 GBIT
Port FC ID: 0x000001
Port Type: Private Loop

Seconds Since Last Reset: 339961

        Transmit Statistics     Receive Statistics
        -------------------     ------------------
Frames: 134830654               192794967
Words:  68906417152             98764589056

LIP Count: 1
NOS Count: 0
Error Frames:  0
Dumped Frames: 0
Link Failure Count: 1
Loss of Sync Count: 7
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 34
Invalid CRC Count: 0

IP over FC Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0

FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0
  No Command Resource Count: 0

IP over FC Traffic Statistics
  Input Requests:   0
  Output Requests:  0
  Control Requests: 0
  Input Bytes:  0
  Output Bytes: 0

FC SCSI Traffic Statistics
  Input Requests:   1529498
  Output Requests:  314292
  Control Requests: 12711
  Input Bytes:  390383250431
  Output Bytes: 272319766528
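
The two reports can be compared directly. As a rough sketch (the helper names fc_scsi_output_bytes and write_share are hypothetical), the following pulls the FC SCSI "Output Bytes" counter from a report and computes how the outbound (write) bytes are split between two adapters. With the figures above, fcs1 has carried about 83% of the output bytes and fcs0 about 17%. These counters are cumulative since the last reset (about 340,000 seconds here), so they reflect all traffic in that period, not only the RMAN window.

```shell
#!/bin/sh
# Print "Output Bytes" from the "FC SCSI Traffic Statistics" section of
# fcstat output read on stdin. The "IP over FC Traffic Statistics" section
# has a line with the same label, so the section heading is tracked.
fc_scsi_output_bytes() {
    awk '
        /^FC SCSI Traffic Statistics/ { in_scsi = 1; next }
        /^[A-Za-z]/                   { in_scsi = 0 }   # another unindented heading ends the section
        in_scsi && /Output Bytes:/    { print $3 }'
}

# Percentage of the combined output bytes carried by the first adapter.
#   $1 = output bytes of the adapter of interest, $2 = output bytes of the other
write_share() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

# Hypothetical usage on the AIX host:
#   b0=$(fcstat fcs0 | fc_scsi_output_bytes)
#   b1=$(fcstat fcs1 | fc_scsi_output_bytes)
#   write_share "$b1" "$b0"
```

Taking one pair of readings before the backup and one after, then subtracting, would isolate the RMAN traffic from the background load.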
 
vocogov (author) commented:
Seems the error is transitory since it didn't repeat this time.
during-rman-d-a-carot9.gif
 
vocogov (author) commented:
Tough to decide how to divvy up points.  Chose to split.
 
woolmilkporc commented:
OK,

thx for the points!

But please try to tune your I/O distribution, as I wrote in #26285638!
There is a remarkable imbalance!

wmp

Question has a verified solution.
