Dual-Core CPU - 100% WAITIO on Primary CPU seems to cause 100% Idle on Secondary CPU

We have a database server that has 4 dual-core CPUs. When the primary cores (CPU 0,2,4,6) are in 100% WAITIO, the secondary cores (CPU 1,3,5,7) are 100% idle. When this occurs, the server acts as if it is 100% busy (it does not respond quickly to commands), yet there are 4 CPU cores sitting 100% idle. Why is this?

See attached for evidence...

- Greg

woolmilkporc Commented:
# only works in an LPAR environment!

You don't have 50% waitio, but near 100% for the concerned threads.

If one single thread has to pass most of its time sitting and waiting for I/O, the other thread could only do some work not involving any I/O at all!

If your system were virtualized, at least the other partitions would be able to continue working (but will most probably also wait for I/O), but if it's physical ...

Will hopefully be back tomorrow to tell you more.

Are your disks mapped to another VG? Have you collected I/O statistics for the disks? It looks like the disks are waiting on I/O operations.
gmarino (Author) Commented:
I know the CPUs are waiting on I/O (SAN).  That's not the question.  That is a totally different issue.

The question at hand is - Why are CPUs 1,3,5,7 idle when CPUs 0,2,4,6 are in 100% WAITIO?

- Greg

Those CPUs 1,3,5,7 are not cores, but SMT threads.

SMT is "Simultaneous Multi-Threading". SMT capable CPUs show up in topas only "as if" they were cores.

SMT capable CPUs (along with the capable OS, AIX 5.3 and later) are able to store the states of two threads and divide their components (ALU etc.) among them.

Here is kind of an introduction to SMT: http://www.ibm.com/developerworks/aix/library/au-aix5_cpu/index.html

So with heavy I/O wait it might well be that the observed phenomenon is due to the specific SMT implementation, and because an SMT thread looks just like a "core" in topas it seems to be idle.

The above paper has some nice explanations on this, as well as this one:

I wrote "topas" - exactly the same is true for "nmon", of course.
In the "c" view of nmon ("CPU by processor", the topmost block in your picture) you can hit "#" to get PURR based values.
PURR is "Processor Utilization Resource Register", and here is some background -
"In a shared-partition environment, you need to understand that there is an unused time slice in each entitled processor capacity. When a virtual processor or SMT thread becomes idle, it is able to cede processor cycles to the Hypervisor, and then the Hypervisor can dispatch unused processor cycles for other work. In order to collect CPU utilization at a processor thread level (in an SMT environment), the POWER5 architecture has implemented a new register -- it's called the Processor Utilization Resource Register (PURR). Each thread has its own PURR. The units are the same as the time base register and the sum of the PURR values for both threads is equal to the time base register. More traditional methods for measuring processor utilization tend to yield incorrect results in an SMT and SPLPAR environment, which is why the PURR registers provide a more accurate, realistic measure of processor utilization."
The above quote is from here (another useful paper) -
Simplified - with TBR you see processor utilization from the LPAR perspective, whereas with PURR you see it from the hypervisor perspective.
Maybe comparing PURR based and TBR based values could shed some more light on your issue.
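
To make the PURR idea from the quote concrete, here is a minimal sketch of the arithmetic only. It assumes, as the quote states, that the PURR increments of the two SMT threads of one core add up to the time base increment over the same interval. The function name and the tick counts are made up for illustration; they are not values read from a real system or output of any AIX tool.

```python
def purr_shares(purr_deltas):
    """Return the fraction of one physical core attributed to each SMT thread.

    purr_deltas: hypothetical PURR increments of the threads of one core
    over a sampling interval (same units as the time base register).
    """
    # Per the quoted paper, the threads' PURR increments sum to the
    # time base increment for the interval.
    timebase_delta = sum(purr_deltas)
    if timebase_delta == 0:
        raise ValueError("no time elapsed in the sampling interval")
    return [delta / timebase_delta for delta in purr_deltas]


# Illustrative numbers only: thread 0 was charged 900,000 ticks,
# thread 1 was charged 100,000 ticks.
shares = purr_shares([900_000, 100_000])
print(shares)
```

With these made-up numbers the first thread is charged 90% of the physical core and the second 10%, even though a per-logical-CPU view would show each thread on its own 0-100% scale.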
gmarino (Author) Commented:
wmp -

Some good info in your posts!  I'm "just a lowly DBA" (yeah right) trying to understand things outside my comfort zone.  This observation on the CPUs intrigued me and I want to understand it, and at the same time be able to explain why a server with Total CPU reporting 50% WaitIO and 50% idle is really 100% busy/non-responsive.  There are many in DB2/AIX Support who think we are crazy when we report that the server is non-responsive to the command line with that CPU utilization.  We are told "There is 50% idle capacity - you must be getting a response."
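
The "50% WaitIO and 50% idle" total can be reproduced with simple arithmetic. The sketch below assumes the total line is a plain average over the eight logical CPUs; that is an assumption for illustration, not a statement about how nmon or topas computes it. The snapshot values are hypothetical and only mirror the pattern described in the question.

```python
def aggregate(per_cpu):
    """Average each CPU state across all logical CPUs."""
    states = ("user", "sys", "wait", "idle")
    count = len(per_cpu)
    return {state: sum(cpu[state] for cpu in per_cpu) / count for state in states}


# Hypothetical snapshot: even-numbered logical CPUs (0,2,4,6) are 100% wait,
# odd-numbered logical CPUs (1,3,5,7) are 100% idle.
snapshot = [
    {"user": 0, "sys": 0, "wait": 100, "idle": 0}
    if cpu % 2 == 0
    else {"user": 0, "sys": 0, "wait": 0, "idle": 100}
    for cpu in range(8)
]

total = aggregate(snapshot)
print(total)  # {'user': 0.0, 'sys': 0.0, 'wait': 50.0, 'idle': 50.0}
```

The average reads as "half the box is free", but it hides that every physical core has one thread stuck in I/O wait.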

My guess is that the SMT cores share IO Pathways with the Primary Core on the chip and thus cannot process anything if the Primary Core is 99% waiting for IO.  (Just a guess - looking for evidence/validation.)

I am trying to use the "#" command on nmon ('c' view) and it does nothing. nmon is running as TOPAS_NMON Version TL10.  I will have to research how to get that to work.  I did see that it works in PHYSICAL CPU mode - if you look at my screenshot, I don't have the word PHYSICAL at the top of my screen.

topas = nmon these days (Nigel must be proud that his "illegitimate child" finally got "officially adopted" by AIX.)

Thanks for the insights...

- Greg

gheist Commented:
Your system is swapping. Paging is the highest-priority I/O in the system, so everything else suffers.
It is normal that the primary core serves I/O, because the secondary cannot receive hardware interrupt signals.
To ease your life - vmtune -f 1024 -F 1280
which will make paging happen in big blocks without saturating the I/O buses.

Consider compacting your datasets, pinning/mlocking/wlm-ing critical apps, and as an ultimate solution - adding more RAM.
gmarino (Author) Commented:
wmp and gheist,

Great information - confirming and further defining what I suspected to be the case.  

Apply my quick fix, spend some time configuring WLM, and write to your management that you need some extra HW for optimal performance.

It looks to me as if SMT is not enabled (the second logical processor stays idle for each CPU).
Confirm it with the following command:
# smtctl

Please ensure (talk to your sysadmin) that SMT is enabled, using the following command:

# smtctl -m on -w boot  ( and run the 'bosboot' command )
With SMT disabled, the secondary SMT threads (appearing as logical processors) will not show up at all in nmon.
gmarino (Author) Commented:
SMT is enabled.  The 100% WAIT IO on the primary core is effectively shutting off the second core.

This is a DB2 Data Warehouse server.  All this server (and 4 other similar servers) is doing is reading data from disk into the DB2 Bufferpools on behalf of the queries being run.  We have long since identified that it's both the queries themselves (tablescans that must read these large mega-GB tables and thus flush the 2GB Bufferpools) and the mix of the various batch jobs and Brio queries that is causing the rampant overutilization of the servers' I/O resources.

The original question was: why, when the primary cores (CPU 0,2,4,6) are in 100% WAITIO, are the secondary cores (CPU 1,3,5,7) 100% idle?  I think that has been answered nicely above.  This gives me some confidence when I challenge my colleagues and DB2/AIX support who claim that the box still has 50% capacity in this situation (it obviously does NOT).

Thanks for the discussion!

gmarino (Author) Commented:
... DB2 Data Warehouse ...

Advice 1: you need to tune AIO with "smitty aio" and allow DB2 to use more than one AIO server.

Advice 2: secondary cores will not serve hardware. They are not full CPUs like in today's "multicore" PC processors; they are more like "hyperthreading", a processor microcode feature, so they are better avoided so as not to lose resources for the database workload.

Advice 3 (to complement 1): also look at the streams parameters in http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds4/no.htm
i.e. how many AIO request timers you need to serve AIO requests...

Question has a verified solution.