bhagatali

CPYF vs Bulk INSERT

Hi,

I have a requirement to subset various tables in my database. We have a working program right now, but it takes forever to run through all the tables we use. I am considering an overhaul and am down to two approaches. Can someone advise which one would be fastest, most efficient, and take up the least resources?

Setup:
-------

--> I have a source schema (SOURCEFILE) from which I retrieve the data. This schema is a full DB.
--> I have a target schema (TARGETFILE), which is an empty replica of the source schema.
--> I have a list file (LISTFILE) that holds the list of unique IDs I am going to use to subset.
--> The biggest table we have in our schema is close to 200 million rows.
--> The number of unique IDs in LISTFILE can vary from a few thousand records to 10-15% of the actual production size.

Approaches:
----------------
1) Create an LF using the JFILE keyword to join SOURCEFILE and LISTFILE. Once the LF is created, use CPYF to copy the data from the LF to TARGETFILE (a rough sketch follows the second approach).

2) Use the following INSERT statement to INSERT data into the TARGETFILE:
INSERT INTO TARGETFILE (SELECT * FROM SOURCEFILE WHERE UNIQUEID IN (SELECT UNIQUEID FROM LISTFILE))
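
A rough per-table sketch of approach 1 follows (object, member, and field names are placeholders, and the DDS column positions are approximate):

DDS for the join logical file, where FIELD1/FIELD2 stand in for the remaining SOURCEFILE columns:

     A          R SUBSETR                   JFILE(SOURCEFILE LISTFILE)
     A          J                           JOIN(SOURCEFILE LISTFILE)
     A                                      JFLD(UNIQUEID UNIQUEID)
     A            UNIQUEID                  JREF(SOURCEFILE)
     A            FIELD1
     A            FIELD2

CL to build the LF and copy the selected rows:

CRTLF FILE(MYLIB/SUBSETLF) SRCFILE(MYLIB/QDDSSRC) SRCMBR(SUBSETLF)
CPYF  FROMFILE(MYLIB/SUBSETLF) TOFILE(TGTLIB/SOURCEFILE) MBROPT(*ADD) FMTOPT(*MAP *DROP)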

Thanks for advising.

Regards
Ali.
IBM System i, COBOL

ASKER CERTIFIED SOLUTION
Gary Patterson, CISSP

Member_2_276102

I'd probably also write a program, though I'd say it's likely that SQL will do the task faster (assuming this is at least V5R4). I would use blocking for both the FETCH and the INSERT, that is, a multi-row FETCH and INSERT n ROWS rather than a straight single-row INSERT, with the number of rows chosen to fit optimum blocking for whatever VRM this will run on.
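
In embedded SQL, the blocked statement forms would look roughly like this (cursor name, host structure array, and block size are placeholders, not taken from the question):

EXEC SQL DECLARE C1 CURSOR FOR
         SELECT * FROM SOURCEFILE
          WHERE UNIQUEID IN (SELECT UNIQUEID FROM LISTFILE);
EXEC SQL OPEN C1;

/* Fetch up to 1000 rows at a time into a host structure array (ROWBUF).   */
EXEC SQL FETCH NEXT FROM C1 FOR 1000 ROWS INTO :ROWBUF;

/* SQLERRD(3) in the SQLCA reports how many rows were actually fetched;    */
/* set ROWSFETCHED from it and write them back with one blocked INSERT.    */
/* Repeat the FETCH/INSERT pair until end of data.                         */
EXEC SQL INSERT INTO TARGETFILE :ROWSFETCHED ROWS VALUES(:ROWBUF);

EXEC SQL CLOSE C1;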

Nowadays, even repeated non-compiled SQL statements are generally parsed only once and cached in jobs by DB2, and their OPENs are also cached (i.e., pseudo-CLOSE) so that running the same INSERT (or other statement) only does a pseudo-OPEN. But compiled SQL does even better.

Tom
Gary Patterson, CISSP

@Tom:  Remember that we are talking about an ETL process here (or at least I am assuming we are).  This means that we will be running a long string of "one-shot" non-repeated queries, so statement caching, pseudo-close/open, and other similar "repeated query" refinements produce no benefit to this kind of workload whatsoever.

The general advice for writing better-performing SQL queries is right on track, though.

If you get rid of the access paths and constraints on the target tables, then SQL write performance will certainly improve (but in most cases still won't beat a hand-tuned traditional IO program, again, because you have access plan evaluation and possible build time in the SQL query that you don't have in the traditional IO program).  There is just no getting around access plan evaluation and creation time.  

That is not to say a traditional IO program is always better. In this particular case, however, the developer knows that typically 15% or fewer of the source records will be selected on any given run, and as a result can hard-code the appropriate READ access method (keyed access through a specific preselected index). There is no "decision time" for each query, unlike with SQL, where the SQL Query Engine has to make access plan decisions every time and sometimes needs to build a new access plan.

I've benchmarked big ETLs like this time after time, and I've -never- had a case where SQL was faster than hand-tuned code, up to and including V6R1 (haven't done any benchmarks under V7R1 yet).  I HAVE had cases where the difference was small, however, and in those cases went with the SQL method.

Now, where SQL excels is where the % of the database selected is highly volatile from run to run.  For example, if one time 1% of source records are selected, the next time it is 20%, the next 70%, and the next 90%.  In that case, SQL is superior, specifically -because- it can make the access plan decisions "on the fly", and switch to a table scan when it estimates a high percentage of records will be selected, for example.  

In that case I would absolutely use SQL: too complicated to hand-code for the average benefit you might get.

Anyway, for me, the answer is simple.  Take a small sample of a few tables (one big, one medium, one small), and benchmark each method.  For the RPG program, make sure that you have the appropriate index over the source tables, and make sure that you are doing blocked inserts into the target tables.  Do the same with the SQL program.  Then run them under database monitor and compare the relative runtimes and CPU utilization.  You'll see that performance on the larger tables is very, very similar, since a proportionally smaller percentage of the SQE time is spent in planning work, and you'll find that on the smaller tables, the proportional difference in performance is greater.
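
For the database monitor runs, the wrapper might look roughly like this (object names are placeholders; the index is plain SQL, the rest is CL):

CREATE INDEX MYLIB/SOURCEFILE_IX ON MYLIB/SOURCEFILE (UNIQUEID)

STRDBMON OUTFILE(MYLIB/DBMONOUT) JOB(*) TYPE(*DETAIL)
CALL PGM(MYLIB/ETLTEST)   /* the RPG version; repeat the monitored run for the SQL version */
ENDDBMON JOB(*)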

One other piece of performance advice: regardless of the method, stop journaling on the target schema during ETL.
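
For example, with placeholder names, for each target table (an SQL-created schema's journal is typically QSQJRN, but check your environment):

ENDJRNPF FILE(TGTLIB/SOMETABLE)                                      /* before the load */
STRJRNPF FILE(TGTLIB/SOMETABLE) JRN(TGTLIB/QSQJRN) IMAGES(*AFTER)    /* after the load  */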

- Gary Patterson

bhagatali

ASKER
Thanks Gary and Tom. Your comments and inputs on this are helping a lot.

@Tom: As Gary mentioned, what we are trying to achieve is in fact an ETL process.

@Gary: In your older post you mentioned "I often write an RPG program using traditional record-level IO for jobs like this". I am trying to understand if what you refer to as traditional record-level IO is any different from what we use. Would you happen to have a sample test program (or a template) that you have available to share?
Member_2_276102

Native I/O simply refers to READs/WRITEs in a HLL such as COBOL or RPG rather than using SQL.

To demo a basic difference between SQL and native I/O, I put a few very trivial tables and programs together.

I have two of the simplest possible ILE RPG programs -- one using full procedural, the other using Primary input. They simply read from the input file and write each record to the output file. The input file is a view, ETLDATA, that selects from the IDATA table where the ID matches an ID in the MDATA table. The IDATA table would be like your SOURCEFILE table, and my MDATA table fills the function of your proposed LISTFILE table. My ODATA table would be like your TARGETFILE.

The tables and view are like this:
IDATA:

CREATE TABLE mylib/IDATA ( 
	ID INTEGER NOT NULL , 
	MYNAME CHAR(10) CCSID 37 NOT NULL DEFAULT '' , 
	MYDATE DATE NOT NULL DEFAULT CURRENT_DATE , 
	MYTIME TIME NOT NULL DEFAULT CURRENT_TIME , 
	MYTIMESTMP TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ) ; 
  
MDATA:

CREATE TABLE mylib/MDATA ( 
	ID INTEGER NOT NULL , 
	MYNAME CHAR(10) CCSID 37 NOT NULL DEFAULT '' ) ; 
  
ODATA:

CREATE TABLE mylib/ODATA ( 
	ID INTEGER NOT NULL , 
	MYNAME CHAR(10) CCSID 37 NOT NULL DEFAULT '' , 
	MYDATA VARCHAR(10) CCSID 37 NOT NULL DEFAULT '' , 
	MYDATE DATE NOT NULL DEFAULT CURRENT_DATE , 
	MYTIME TIME NOT NULL DEFAULT CURRENT_TIME , 
	MYTIMESTMP TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ) ; 
  
ETLDATA:

CREATE VIEW mylib/ETLDATA ( 
	ID , 
	MYNAME , 
	MYDATA , 
	MYDATE , 
	MYTIME , 
	MYTIMESTMP ) 
	AS 
	SELECT IDATA.ID, IDATA.MYNAME, 'Abc Defghi', IDATA.MYDATE, IDATA.MYTIME, IDATA.MYTIMESTMP FROM mylib/IDATA IDATA WHERE IDATA.ID IN (SELECT ID FROM mylib/MDATA); 

The two RPG programs:
RPGGETLF:

     Fetldata   if   e             disk
     f                                     rename(etldata:etlr)
     Fodata     o    e             disk
     f                                     rename(odata:odr)
     c                   read      etldata
     c                   dow       not( %eof(etldata) )
     c                   write     odr
     c                   read      etldata
     c                   enddo
     c                   eval      *inlr = *on
     c                   return

RPGGETL:

     Fetldata   ipe  e             disk
     f                                     rename(etldata:etlr)
     Fodata     o    e             disk
     f                                     rename(odata:odr)
     c                   write     odr

In addition, I created a straight SQL source member named TSTETL. It performs an INSERT from a SELECT of ETLDATA. It essentially does the same thing that the two RPG programs do, except that it will not be compiled -- I run it with RUNSQLSTM:
TSTETL:

INSERT INTO mylib/ODATA SELECT * FROM ETLDATA

For control, I have a CL program:
pgm

   dcl   &time        *char     6

   rtvsysval   QTIME  rtnvar( &time )

   sndpgmmsg ( 'Start  ' *cat &time )
   runsqlstm QSQLSESS TSTETL
   dspfd       odata  type( *MBR ) output( *PRINT )
   clrpfm      odata

   rtvsysval   QTIME  rtnvar( &time )

   sndpgmmsg ( 'Start2 ' *cat &time )
   call RPGGETLF
   dspfd       odata  type( *MBR ) output( *PRINT )
   clrpfm      odata

   rtvsysval   QTIME  rtnvar( &time )

   sndpgmmsg ( 'Middle ' *cat &time )
   call RPGGETL
   dspfd       odata  type( *MBR ) output( *PRINT )
   clrpfm      odata

   rtvsysval   QTIME  rtnvar( &time )

   sndpgmmsg ( 'End    ' *cat &time )

   return

endpgm

The CL captures times before and after each step. It does RUNSQLSTM first, then calls the two RPG programs. The ODATA file member is displayed each time to show how many rows have been added, then the member is cleared to prepare it for the next step. The times are displayed before and after each step.

I ran this on a small, fully PTFed model 170 at V5R3. If the code is run at later releases, the time differences may be a little different because the SQL processes continue to be improved. I ran it a number of times, sometimes running the steps in different orders and sometimes with different overrides in effect. I used an IDATA table with 680000 rows; the MDATA table selects 75% of those rows.

The times for the run with the worst SQL performance:
Start   215223
Start2  220239 10:16
Middle  221720 14:41
End     223205 14:45

Note that I appended the elapsed times; the CL itself only displays the QTIME values. In this run the SQL step took 10:16, while the two RPG programs took 14:41 and 14:45.

But that was the worst comparison between uncompiled SQL and compiled native I/O. SQL did better than that in every other run. Most runs showed the RPG taking more than 60% longer. In later releases, the time differences should increase even more.

In general, SQL will always outperform native I/O in this type of process, even when not compiled. A major difference is that an HLL program jumps back and forth between program instructions and calls into DB2 routines for every read and every write, while the SQL statement calls into the database routines once and then everything runs there.

In an ETL process, one area that makes a difference is the 'T'. If a transform can be done through basic data type conversions and/or standard SQL function calls, then it might all be contained within the view definition. Substrings, concatenations, CASE selections -- the transform is almost guaranteed to be more efficient in SQL.
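
As a made-up illustration (the view name, literal values, and output column names are invented, reusing the demo tables above), that kind of transform can live entirely in the view:

CREATE VIEW mylib/ETLDATA2 ( ID, SHORTNAME, AGEGROUP )
	AS
	SELECT I.ID,
	       SUBSTR(I.MYNAME, 1, 5) CONCAT '-' CONCAT CHAR(I.MYDATE, ISO),
	       CASE WHEN I.ID < 1000 THEN 'OLD' ELSE 'NEW' END
	FROM mylib/IDATA I WHERE I.ID IN (SELECT ID FROM mylib/MDATA);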

When program code is required, using blocked FETCH/blocked INSERT to transfer as many rows as possible will likely be the best choice. Perform any transformations between arrays in memory.

One last thing about the demo code -- this REXX procedure can be used to generate test rows into IDATA. It generates rows in a nested loop. For me, the outer loop generated a small set of ID values. The inner loop generates a set of rows for each ID value. The number of IDs and rows per ID can be adjusted as desired.
REXIDATA:

signal on error   name FAILURE
signal on failure
signal on syntax

RC = 0
mytbl = 'mylib/IDATA'
limit = 5000

sql_Stmt = 'INSERT INTO ' mytbl ' VALUES( ?, ?, ',
           'current date, current time, current timestamp )',
             ' with NC'
address '*EXECSQL',
    execsql 'PREPARE S1 FROM :sql_Stmt'

 do j = 1 to 4
   do i = 1 to limit
      V = 'Val' i
      address '*EXECSQL',
          execsql 'EXECUTE S1 USING :j, :V'
   end
 end

'commit'

exit


/* ---------------------------------------------------------------- */
FAILURE:

       trace off

       parse source system start srcmbr srcfile srclib
       say 'Unexpected error at line 'sigl' of REXX program ',
           srcmbr' in 'srclib'/'srcfile'. The exception ID is 'rc'.'

exit(right(rc,4))
/* ---------------------------------------------------------------- */
SYNTAX:

       trace off

       parse source system start srcmbr srcfile srclib
       say 'Syntax error at line 'sigl' of REXX program ',
           srcmbr' in 'srclib'/'srcfile'. The error code is 'rc'.',
           ' The description is :'
       say errortext(rc)

exit(rc)

REXX is flexible and easily changed. SQL is also an easy fit in it.

Tom