RVIT

Performance Tuning 9406 720 206A running V5R3M0 - Ran fine on V5R1M0 but when scratch installed with V5R3M0 running like a lame horse

We recently upgraded our 720 from V5R1 to V5R3, but there were problems with the upgrade and we had to do a scratch install.  Since installing V5R3 its performance is abysmal.  A job that takes about 4 hours on the 820 has been running for 3 weeks on the 720 and still hasn't finished.  I have little knowledge of performance tuning and have been running the system with Performance Adjustment set to 2 (Adjustment at IPL and automatic adjustment).

This afternoon I tried changing the pools from *FIXED to *CALC, but it has not made any difference.

WRKSHRPOOL shows this:                                                                
                Defined     Max  Allocated  Pool   -Paging Option--
Pool           Size (M)  Active   Size (M)    ID   Defined  Current
*MACHINE         104.09   +++++     104.09     1   *FIXED   *FIXED
*BASE            133.32      45     133.32     2   *CALC    *CALC
*INTERACT         12.79       5      12.79     3   *CALC    *CALC
*SPOOL             2.55       5       2.55     4   *FIXED   *FIXED
*SHRPOOL1           .00       0                    *FIXED
*SHRPOOL2          3.23       1       3.23     5   *CALC    *CALC

Our main application (Island Pacific) has multiple job queues and subsystems, all of which run through the *BASE pool.  As this is the test system, I am the only user on it, so it's not interactive work that is eating the performance.  I am at a complete loss to explain why this is happening.

Here is the WRKSYSSTS Screen:

System    Pool    Reserved    Max   -----DB-----  ---Non-DB---
 Pool   Size (M)  Size (M)  Active  Fault  Pages  Fault  Pages
   1      104.41     52.45   +++++     .0     .0   45.5   47.7
   2      133.66       .83      45    1.1    8.5   74.6   94.7
   3       12.79       .00       5     .0     .0     .5    1.3
   4        2.55       .00       5     .0     .0     .3     .9
   5        2.55       .00       1     .0     .0     .0     .0

daveslater

Hi
First of all, is the job a batch job or an interactive job?

How many users do you have?

When running the job, do a WRKSYSACT to check what the system is doing. Look for a task called CFINT01 and check what it is doing.

The *INTERACT pool looks a bit low.

Basically, the more detail you can give us, the more advice you will get back.


dave
RVIT

ASKER

Thanks for the quick response.
There's only 1 user on this machine (me), as it's the test box.

Here's the WRKSYSACT:

Job or                                           CPU   Sync  Async   CPU
Task         User       Number  Thread    Pty   Util    I/O    I/O  Util
CFINT01                                     0    5.2      0      0    .0
QPMHDWRC     QSYS       013548  00000021    1    4.7    302      2    .0
QDBSRVXR2    QSYS       012545  00000001    0    1.1     47     71    .0
SMPOL001                                   99    1.0      0    992    .0
IOSTATSTAS                                  0     .7      3      0    .0
IP200921     SW01       013526  00000005   50     .6    284      0    .0

The job I'm trying to run is IP200921.
RVIT:

First principle -- Don't run jobs in *BASE. Simple.

There is no decent probability that performance adjuster can do useful adjusting if jobs are actively using the *BASE memory pool. This includes server jobs along with everything else.

As far as memory goes, the purpose of the adjuster is to move memory out of *BASE into pools that need it and to move memory out of pools that don't need it back into *BASE. If the memory is being used _while_ it's in *BASE, then all you're doing is increasing CPU utilization by running the adjuster in addition to everything else.

Although memory movement still occurs, the result is pretty much nothing but shifting it back and forth between *BASE and another pool while paging alternative jobs in and out. You'd be better off with no adjusting.

*BASE is supposed to be the leftover memory that isn't needed. By running jobs in that space, you are making it far more difficult to determine whether any given memory is sufficient.

I suspect that this isn't highly publicized by IBM because they can make money by (1) selling you memory and (2) selling you performance tuning services.

In order to do any serious performance tuning, the system must be able to make useful performance measurements for you. To get that started, you'll need to get jobs out of *BASE. You already have a couple additional shared pools created; you might need a couple more.

Run through each subsystem and review the pool assignments. All of them need pools other than *BASE; *BASE can be subsystem pool #1, but don't route anything to it. If *BASE is associated with a subsystem, the subsystem monitor job will run there.

To avoid running other jobs there, review each subsystem's prestart job and routing entries. NONE of those in ANY subsystem should refer to a subsystem pool that points to *BASE.

I generally set one shared pool for TCP/IP server jobs and a second one for the host server jobs. I create one or two others for my batch jobs, giving at least three shared pools. *INTERACT and *SPOOL are already created by default, so those can be looked at much later.

Once the pools exist, I start changing the prestart job and routing entries to start jobs in pools that can be watched (and adjusted!) This can take a couple days, especially if the system is fully active at the time.
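
As a rough sketch of that sequence for one subsystem: the commands below assume the IPTS subsystem and routing entry 9999 that show up later in this thread, and the 40 MB size and activity level of 10 are placeholder guesses rather than recommendations. Routing entry changes should only affect jobs started after the change.

```cl
/* Give the shared pool some memory; SIZE is in KB, so 40960 = 40 MB */
CHGSHRPOOL POOL(*SHRPOOL1) SIZE(40960) ACTLVL(10) PAGING(*CALC)

/* Keep *BASE as subsystem pool 1, add the shared pool as pool 2 */
CHGSBSD SBSD(IPTS) POOLS((1 *BASE) (2 *SHRPOOL1))

/* Route work to pool 2 instead of the pool that points at *BASE */
CHGRTGE SBSD(IPTS) SEQNBR(9999) POOLID(2)
```

Prestart job entries would need the same treatment with CHGPJE and its POOLID parameter.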

Once jobs have settled into their new pools, you can then start watching for any adjustments that occur as well as for hot spots. If a bunch of stuff is in *BASE, it's impossible for you to know where any issues are -- everything is competing for the same memory, even jobs in other pools want to steal that memory.

Once you're in that state, you can do some real tuning and your adjuster might actually help.

However, that doesn't help you today.

For today, first thing you need to do is make sure your group PTFs are up to date, especially DB2 group.  I'm not sure what the current level is for V5R3 but it's at least at level 3.

With Island Pacific (which I _know_ could use some software tuning), I'd also want to be sure that my cume PTF level was current as well as my HI/PER level. Then, I'd go searching for additional performance related PTFs that aren't included in any cume or group package.
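
To see where the system stands before ordering anything, something like the following should do. The product IDs are the usual ones for V5R3 (5722SS1 for OS/400, 5722999 for Licensed Internal Code); treat them as assumptions if your release differs.

```cl
WRKPTFGRP                  /* installed PTF groups and their levels  */
DSPPTF LICPGM(5722SS1)     /* PTFs applied to the operating system   */
DSPPTF LICPGM(5722999)     /* PTFs applied to Licensed Internal Code */
```

As far as I know, the cume level appears in the 5722SS1 list as a marker PTF whose ID starts with TL.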

Once my system was at a premium level and performance was an issue, for Island Pacific, I'd check to see if any exit programs are registered against... hmmm... I think they use either the data queue or distributed program call/remote command host servers to a very high degree. If an exit program is registered against either of those, I'd remove it to see any results.
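
One way to check for registered exit programs, assuming the two host servers guessed at above are the ones in play. The exit point names are the standard ones for the data queue server and the remote command/distributed program call server; whether Island Pacific actually uses either is not confirmed in this thread.

```cl
WRKREGINF EXITPNT(QIBM_QZHQ_DATA_QUEUE)   /* data queue host server            */
WRKREGINF EXITPNT(QIBM_QZRC_RMT)          /* remote command / DPC host server  */
```

From the resulting list, option 8 shows any exit programs registered against the exit point.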

If that makes no difference, then it's time for serious investigation.

Tom
RVIT

ASKER

Thanks Tom for your in-depth reply - I will work through this over the next couple of days and let you know.  It's a DEV box with only me on it, so as far as changing things goes I've got no restrictions...
RVIT:

Just stay aware that there are numerous performance-related PTFs that will never be in a cume PTF package nor in a group PTF package. PTFs might only affect customers with software packages such as Island Pacific -- IBM won't include those in PTF packages that would go to all of IBM's customers.

Oh, also, I wouldn't call my reply "in-depth" yet. So far, it's only been an overview of how to get to a point where it's possible to track performance issues. Actually doing something with the info hasn't even begun yet, heh.

Good luck.

Tom
RVIT

ASKER

I didn't know that - I thought a CUM package was every fix - oh well, off to investigate how I find the missing PTFs then!
RVIT:

There are large numbers of PTFs that aren't in cume or group packages. Some may be specific to particular hardware -- one model IOP might need a PTF that would be disastrous for a different model. Some may be specific to licensed program products -- a SQL PTF might be specific to an interface to the SQL compiler preprocessor and only be valid if the SQL Dev Kit is installed. (I made up those examples; they might not make sense. Just to illustrate.)

IBM has commonly built the cume packages from PTFs that have wide application across the customer base. If you review the PSP report of PTF Summaries, take note of the large number of PTFs that always are listed as being in cume package '1000' -- that indicates it's not in any cume package (yet).

Some of those are later chosen to become part of a package. I think it's partly based on how many customers report problems that match that PTF's symptom string among other things.

Keeping the size of a cumulative package down is one goal. Keeping the complexity of the install down is another. Avoiding unintended consequences of interactions between PTFs is another. Probably other reasons.

Tom
RVIT

ASKER

Hi,

I have loaded all the service packs I can find and created the pools as suggested.

The batch job in question that seems to be running really slowly is only using about 1 - 2 % of the CPU in WRKACTJOB, yet there is nothing else running.  I don't understand why it is not making use of the full CPU power.

Here is my pool status now (the Island Pacific job is running through system pool 4, *SHRPOOL1):

System    Pool    Reserved    Max   -----DB-----  ---Non-DB---
 Pool   Size (M)  Size (M)  Active  Fault  Pages  Fault  Pages
   1       78.65     50.56   +++++     .0     .0   81.6   84.0
   2      121.98       .83      55    1.2    1.6    8.6   25.9
   3       12.79       .00       5     .0     .0     .5     .8
   4       40.00       .00      10     .0     .0   66.3   66.3
   5        2.55       .00       5     .0     .0     .0     .0

As you can see, the Non-DB faults are quite high.
Help!
What is the status of the job under WRKACTJOB,

e.g. IDX-MYdbf?

Dave
PS
Do a WRKSBSD and check:
which storage pools are allocated to it (option 2)
what the routing entries and associated class are (option 7, then option 5)


dave
RVIT

ASKER

IPTS           QSYS        SBS      .0                   DEQW
  IPMSGQ       IPTS        ASJ      .0  PGM-IPMSGQ       MSGW
  IP200921     SW01        BCH     1.0  PGM-IP009CP      RUN

It's the IP200921 job.

Subsystem description:   IPTS    
                                 
Pool        Storage     Activity  
 ID        Size (K)      Level    
  1       *SHRPOOL1              


 Opt    Seq Nbr    Program       Library       Compare Value
         9999      QCMD          QSYS          *ANY          

Routing entry sequence number . . . . . . . :   9999    
Program . . . . . . . . . . . . . . . . . . :   QCMD    
  Library . . . . . . . . . . . . . . . . . :     QSYS  
Class . . . . . . . . . . . . . . . . . . . :   IPTS    
  Library . . . . . . . . . . . . . . . . . :     IPTSPGM
Maximum active routing steps  . . . . . . . :   *NOMAX  
Pool identifier . . . . . . . . . . . . . . :   1        
Compare value . . . . . . . . . . . . . . . :   *ANY    
                                                         
Compare start position  . . . . . . . . . . :            
Thread resources affinity:                              
  Group . . . . . . . . . . . . . . . . . . :   *SYSVAL  
  Level . . . . . . . . . . . . . . . . . . :            
Resources affinity group  . . . . . . . . . :   *NO      


Hope this helps - Thanks Dave!
Hi
The class is a non-standard class,
and you have no memory in *SHRPOOL1.

First, let's get some memory into the subsystem.

Do a
CHGSBSD SBSD(IPTS) POOLS((2 *BASE))  

then see what difference that makes.

Dave
RVIT

ASKER

Done that:

IPTS           QSYS        SBS      .0                   DEQW
  IP200921     SW01        BCH     1.7  PGM-IP009CP      RUN  

Hi
What does the subsystem description say now?
Also do a
DSPCLS IPTSPGM/IPTS    


RVIT

ASKER

Subsystem description:   IPTS      
                                   
Pool        Storage     Activity    
 ID        Size (K)      Level      
  1       *SHRPOOL1                
  2           *BASE                

 Class . . . . . . . . . . . . . . . . . . . . . . :   IPTS                    
   Library . . . . . . . . . . . . . . . . . . . . :     IPTSPGM                
 Run priority  . . . . . . . . . . . . . . . . . . :   50                      
 Time slice in milliseconds  . . . . . . . . . . . :   10000                    
 Eligible for purge  . . . . . . . . . . . . . . . :   *NO                      
 Default wait time in seconds  . . . . . . . . . . :   600                      
 Maximum CPU time in milliseconds  . . . . . . . . :   *NOMAX                  
 Maximum temporary storage in megabytes  . . . . . :   *NOMAX                  
 Maximum threads . . . . . . . . . . . . . . . . . :   *NOMAX                  
 Text  . . . . . . . . . . . . . . . . . . . . . . :   CLS for Island Pacific jo

Cheers Dave!
Hi
Can you end the subsystem, then enter

CHGSBSD SBSD(IPTS) POOLS((1 *RMV))  

then restart it and try to run the job again?

Dave
RVIT

ASKER

Message ID . . . . . . :   CPD1509                                            
Date sent  . . . . . . :   11/04/05      Time sent  . . . . . . :   15:02:18  
                                                                             
Message . . . . :   Pool definition 1 was not removed.                        
                                                                             
Cause . . . . . :   Pool definition 1 cannot be removed because it is        
  specified in one or more subsystem description entries.                    
Recovery  . . . :   Do one of the following and try the request again:        
    -- Remove the subsystem description entries using the Remove Prestart Job
  Entries (RMVPJE) command or the Remove Routing Entry (RMVRTGE) command that
  specifies the pool definition.                                              
    -- Change the pool definition that is specified in the subsystem          
  description entries (POOLID parameter).                                    

However,

I have removed pool 2 and changed pool 1 to *BASE:

Pool        Storage     Activity  
 ID        Size (K)      Level    
  1           *BASE                

Is this what you actually wanted?
(No difference from looking at it - isn't this what we had when we started?)
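
For reference, the recovery text in CPD1509 points at the alternative: repoint the routing entry first, then remove the pool definition. A sketch, assuming pool 2 is still defined as *BASE and that routing entry 9999 is the only entry referring to pool 1 (both taken from the displays above, neither verified beyond that):

```cl
/* Move the routing entry off pool 1 so nothing references it */
CHGRTGE SBSD(IPTS) SEQNBR(9999) POOLID(2)

/* Pool definition 1 should now be removable */
CHGSBSD SBSD(IPTS) POOLS((1 *RMV))
```

Either route ends up with the job running in *BASE, so the choice is cosmetic for this test.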
Hi
I don't think so. I cannot see any reference to changing the subsystem!
Just looking through the thread, I can see that you have some shared pools.
                Defined     Max  Allocated  Pool   -Paging Option--
Pool           Size (M)  Active   Size (M)    ID   Defined  Current
*MACHINE         104.09   +++++     104.09     1   *FIXED   *FIXED
*BASE            133.32      45     133.32     2   *CALC    *CALC
*INTERACT         12.79       5      12.79     3   *CALC    *CALC
*SPOOL             2.55       5       2.55     4   *FIXED   *FIXED
*SHRPOOL1           .00       0                    *FIXED             <<==============
*SHRPOOL2          3.23       1       3.23     5   *CALC    *CALC

If we look at your subsystem description before we made the changes, we have
Subsystem description:   IPTS    
Pool        Storage     Activity  
 ID        Size (K)      Level    
  1       *SHRPOOL1               <===============

As you can see, there is no memory allocated to the shared pool, hence no memory allocated to the subsystem.

I have made some quick and nasty changes to get *BASE memory into the subsystem so at least the OS has some memory to play with. If we start to get some performance improvement, then we can play with memory allocation later.


I would expect to see a bit more CPU utilisation on the new config.

Dave
RVIT

ASKER

OK, thanks Dave - will switch on QPFRADJ and see what happens :)
Note... if memory is needed in a shared pool, use CHGSHRPOOL or WRKSHRPOOL:

 ==>  chgshrpool  *shrpool1  size( 4096 )

...would shift 4MB from the *BASE pool into shared pool 1. (Size is specified in kilobytes.) Also, not only is memory needed, but activity levels may also be needed. E.g.:

 ==>  chgshrpool  *shrpool1  size( 4096 ) +
                actlvl( 3 )

Hard to tell what a decent activity level is yet since we don't know what functions will be running in the pool. Also hard to know how much memory to add.

By turning performance adjuster on, we can check after 15 min have gone by to see what adjustments have been made.

WRKSHRPOOL provides access to tuning parameters by pressing <F11=Display tuning data>.
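
For completeness, the adjuster itself is controlled by the QPFRADJ system value. A minimal example, using value 3 because that is what gets tried later in this thread; values 0 to 3 mean no adjustment, adjustment at IPL, adjustment at IPL plus automatic adjustment, and automatic adjustment only.

```cl
DSPSYSVAL SYSVAL(QPFRADJ)               /* check the current setting          */
CHGSYSVAL SYSVAL(QPFRADJ) VALUE('3')    /* automatic adjustment, none at IPL  */
```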

Don't expect to tune properly immediately. Initial settings are pure guesswork. Only after watching how interactions with other jobs change things will you start getting better.

And careful putting *BASE back in as a subsystem memory pool. That sends you right back where you started.

Tom
Hi Tom
I was just trying a few things out, just to see if there was a memory issue. The CPU was very low.
Once we can get the CPU utilisation up, then we can play with the pools - but since this is a single-user box, running from *BASE should not have any real implications.

Dave
RVIT

ASKER

Hi,

It has not made any difference.  The job is now running in *BASE, which is where it was in the first place.  Not sure if I've missed something here, as I had auto tuning switched on originally.

Basically, the subsystem IPTS is now running in *BASE with QPFRADJ set to 3.

Please help!
RVIT

ASKER

I think it's also something to do with this particular job, which is deleting records over about 6 large files.

Should I be looking at load balancing on the disks as well?
Hi
I do not think this is an AS/400 performance issue - can we look at the following:

1) If you do a WRKSYSSTS, what is the DB utilisation?
2) Can you do a DSPJOB and check whether there are any record locks?
3) Do a STRSRVJOB on the job, then a STRDBG. Then look at the job log.
4) What are the attributes of the program, i.e. RPG, SQLRPG?
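
Steps 2 and 3 could look roughly like this from an interactive session. The qualified job name is the one quoted later in the thread, so substitute the real one. UPDPROD(*YES) is my addition: with the default of *NO, a job in debug mode is not allowed to open production-library files for update, which would likely break a job that is deleting records.

```cl
STRSRVJOB JOB(015501/SW01/IP200921)    /* service the batch job             */
STRDBG UPDPROD(*YES)                   /* debug the job, no program named   */

DSPJOB JOB(015501/SW01/IP200921) OPTION(*JOBLCK)   /* locks held by the job */
DSPJOBLOG JOB(015501/SW01/IP200921)                /* job log of that job   */

ENDDBG                                 /* clean up when finished            */
ENDSRVJOB
```

DSPJOB with OPTION(*OPNF) on the same job shows the open files and their I/O counts, which is a quick way to see whether the deletes are progressing at all.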

Dave

RVIT

ASKER

% CPU used . . . . . . . :       15.7    Auxiliary storage:                    
% DB capability  . . . . :        1.8      System ASP . . . . . . :    132.6 G
Elapsed time . . . . . . :   00:00:04      % system ASP used  . . :    82.4943
Jobs in system . . . . . :        856      Total  . . . . . . . . :    132.6 G
% perm addresses . . . . :       .009      Current unprotect used :     1666 M
% temp addresses . . . . :       .010      Maximum unprotect  . . :     1676 M
                                                                               
Type changes (if allowed), press Enter.                                        
                                                                               
System    Pool    Reserved    Max   -----DB-----  ---Non-DB---                
 Pool   Size (M)  Size (M)  Active  Fault  Pages  Fault  Pages                
   1      130.56     52.00   +++++     .0     .0   31.4   32.1                
   2      110.08       .94      49    7.2    8.1  211.8  822.3                
   3       12.79       .00       5     .0     .0   10.0   20.9                
   5        2.55       .00       5     .0     .0     .0     .0                



There are member locks, but as this is the only job running it should not be a problem.

I have done the STRSRVJOB, but I'm not sure how you want me to do the STRDBG?

Thanks for being patient.
RVIT

ASKER

Unfortunately the CL program calls lots of RPG programs (RPGLE?), so I can't really say what the attributes are.
Hi
Just run it from the interactive session.
This will debug the batch job and give a lot more detail in the job log.

You can then do a DSPJOB to check if anything looks strange.


Dave
PS: we are not debugging a program, but debugging the job.
RVIT

ASKER

So after running the STRSRVJOB JOB(015501/SW01/IP200921) command, I just type STRDBG, right?

Then DSPJOBLOG?

I'm not seeing anything.
Hi
You do a DSPJOB on the batch job.

dave
Minor note... there is no reason to have performance adjuster turned on if *BASE is the memory pool in use.

Also, agreed that as long as this is the only job running, then *BASE is reasonable. But, I'd be surprised if this is the only job, especially if Island Pacific is involved. E.g., the TCP/IP data queue server and/or the distributed program call/remote command host server is also probably running (as well as TCP/IP itself, etc.)

The memory pool for the actual batch job should be a different pool from the various server jobs, and all of them should be out of *BASE -- this implies *BASE plus two shared pools minimum.

We apparently want to know if serving to Island Pacific via the TCP/IP servers is part of the performance issue. There are two basic potential areas to watch: (1) the batch job that seems to be using minimal CPU and (2) the server jobs.

Or maybe I've misunderstood. It's seemed that the basic batch job isn't actually doing much, so it seemed any performance issue had to be somewhere else.

Tom
ASKER CERTIFIED SOLUTION
daveslater

This solution is only available to members.
Yeah, it's hard to tell what <RUN> means as a status without knowing the source statements. And we only know that that was the "status" at the instant it was collected.

I have a program that talks to the EDRS server, either local or remote. While it's 'waiting' for a response, the status is <RUN> even though I know it's waiting, apparently because I've called one of the Qxda... APIs and that translates to 'running'. No CPU is used during that time.

Hmmm... I ought to try the same with SQL CLI.

Anyway, the status of <RUN> only means that no technical WAIT has been requested, AFAIK.

Since there may be many external CALLs to RPG programs and deletes are going on over large files, I'm not too surprised at the higher non-DB faulting/paging rates. Lots of programs starting/stopping and lots of files opening/closing is gonna equal lots of faulting/paging.

We also have access to WRKSYSACT. But it didn't show any real surprises -- except that there was nothing chewing up CPU. The top task was CFINT01 and it wasn't doing much. Nor were any tasks below it. CPU doesn't seem constrained in the slightest, so working on CPU seems not indicated.

Maybe there's no CPU available that _can_ be used because thrashing is keeping processes from running effectively.

So far, I don't see that we have a solid base from which we can make educated guesses. I'd still say that jobs need to be separated into appropriate pools.

Obviously details such as assigning some memory to the pools is a pretty good idea, heh. Once memory is assigned and jobs are using the pools and performance adjuster has 5-10 minutes to run, it's time to take a snapshot from DSPSYSSTS. (Use <F21=Select assistance level> to set 3=Advanced.) Then wait 10-15 minutes and take a second DSPSYSSTS snapshot.

The first snapshot should be 5-10 minutes after DSPSYSSTS is first displayed so it's had some running time to gather statistics. Get the suspect job running before using DSPSYSSTS.

If DSPSYSSTS is already running, then press <F10=Restart> after the suspect job starts. Then wait the 5-10 minutes for the first snapshot.

What we'll look for between the two snapshots is a trend. Maybe memory will be shifted; if so, where from and where to? Maybe activity levels will be changed. Maybe faulting/paging will change significantly.

Or maybe there will be no bumps in any stats at all.
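
If sitting at the display is inconvenient, the same two snapshots could be captured as spooled output. This is only a sketch of the timing described above: the delays are the 10 and 15 minute figures from this post, and I have not checked how the reset behaves with printed output on V5R3.

```cl
DSPSYSSTS OUTPUT(*PRINT) RESET(*YES)   /* start a fresh measurement interval */
DLYJOB DLY(600)                        /* let statistics build for 10 min    */
DSPSYSSTS OUTPUT(*PRINT)               /* first snapshot                     */
DLYJOB DLY(900)                        /* wait another 15 min                */
DSPSYSSTS OUTPUT(*PRINT)               /* second snapshot                    */
```

Start the suspect job first, then compare pool sizes, activity levels and fault/page rates between the two spooled files.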

Tom
RVIT

ASKER

Hi,

Sorry, I got caught up with something else. I will look at this today.
RVIT

ASKER

Hi,

I've closed the question as I've given up on this machine.  As the original install was problematic and it is my dev box, I'm going to completely wipe it and reinstall the OS.

Hopefully that and the latest IBM patches will sort out my problem.