Community Pick: Many members of our community have endorsed this article.
Editor's Choice: This article has been selected by our editors as an exceptional contribution.

Recursive SQL in DB2 (Converting Columns to Rows)

Kent OlsenData Warehouse / Database Architect
CERTIFIED EXPERT
Published:
Recursive SQL in UDB/LUW (it really isn't that hard to do)

Recursive SQL is most often used to convert columns to rows or rows to columns.  A previous article described the process of converting rows to columns.  This article will build off of that and describe converting columns to rows.  A review of the original article may be helpful as some of the examples that will be used here will tie directly to the examples in the original article.


One of the common misperceptions about recursive SQL is that it requires a join operation.  While this is true to convert rows to columns it's absolutely false when converting columns to rows.  The very nature of the join operation is to put data from different rows (usually rows from different tables ) into the same row and that doesn't change just because the SQL is recursive.


The Problem (why can't I just use plain old SQL to solve this?)

It's often desirable to take data in one row and display it on multiple rows.  Standard SQL doesn't have a built-in mechanism to do that, but the flexibility of SQL makes it possible.  The power of recursive SQL makes it downright easy.

The simplest case is to just want a data item split and shown on two lines of a report.  The SUBSTR, LEFT and RIGHT functions can be used to split the string.  Writing the strings with a NEWLINE (or CR/LF pair) between them will display them on separate lines.

But what if the requirement is that the items must be on separate rows in a result set?  The data must actually be replicated so that it propagates to the correct row and filtered, if necessary, to select the correct item(s).  There are several ways to do this, but it can be inefficient and the handling of the data can get clumsy, especially if the data item is to be split over many rows.

As an example, consider a table that contains "name" and "address" columns.  If the requirement is to print mailing labels and the mailmerge application requires that name and address be on separate lines in the input file, the easiest solution may be to display "name" and "address" with a NEWLINE (or CR/LF pair) between them.  If the requirement is that the two items must be on separate rows in a result set it gets more challenging.  Possible non-recursive techniques to solve splitting column(s) into multiple rows are:
[step="" title=""]
 1)  Append the table to itself with UNION ALL, sort by the primary key, and use the CASE operator to select the "name" or "address" column depending on whether the row number is odd or even;  
SELECT CASE WHEN MOD (r, 2) = 1 THEN name ELSE address END AS text
                      (
                        SELECT ROW_NUMBER () OVER () AS r, t0.*
                        FROM
                        (
                          SELECT * FROM mytable
                          UNION ALL
                          SELECT * FROM mytable
                          ORDER BY pkey
                        ) t0
                      ) t1

Open in new window

2)  Use a Cartesian cross join to replicate the data and again use the CASE operator to select the correct column;  
SELECT CASE idx 
                               WHEN 1 THEN name 
                               WHEN 2 THEN address END AS text
                      FROM
                      (
                        VALUES (1), (2)
                      ) t0 (idx)
                      CROSS JOIN mytable
                      ORDER BY pkey, idx;

Open in new window

3)  Select the desired fields from the table using UNION ALL and a sort to structure them properly.  
SELECT text
                      FROM
                      (
                        SELECT pkey, 1 AS rn, name AS text FROM mytable
                        UNION ALL
                        SELECT pkey, 2 AS rn, address FROM mytable
                      ) t0
                      ORDER BY pkey, rn

Open in new window

[/step]
There are other variations of these queries that will produce the desired result, but they're all reasonably ugly and inefficient.

A solution that uses CROSS JOIN is the easiest to write and produces the smallest SQL.  But as with all of these solutions, the derived table from which the final selection will be made will contain as many copies of the table as the number of exploded rows.  (If the row is to be split into 5 rows the final selection will be made from a derived table with 5 copies of the original table.)

In a more complicated example, the "name" column is formatted as 'last, first' and the task is to put the first name, last name, and address on separate rows.  Using either of the solutions that we see above, the task is solvable by parsing the "name" field for both the first and last name and adding another UNION operator.
SELECT text
                      FROM
                      (
                        SELECT pkey, 1 AS rn, LTRIM (SUBSTR (name, LOCATE (',', name) + 1)) AS text FROM mytable
                        UNION ALL
                        SELECT pkey, 2 AS rn, SUBSTR (name, 1, LOCATE (',', name) - 1)) FROM mytable
                        UNION ALL
                        SELECT pkey, 3 AS rn, address FROM mytable
                      ) t0
                      ORDER BY pkey, rn

Open in new window


Parsing the name field for two tokens is nearly trivial, but what happens where there are a lot of data points stored in a single field?

Consider a table that records student attendance.  The attendance is recorded in one row per student per month, with attendance for each day recorded into a single string by concatenating a coded value.  (P-present, L-Late, EL-Excused Late, A-Absent, EA-Excused Absent.)  Note that the codes are different lengths so it's impossible to simply index into the string by position to get the attendance flag for a given day.  A typical record could be:

'P,P,P,,,P,EA,P,EA,EA,,,P,P,P,P,P,,,P'  etc...

Answering the question, "was the student present on the 17th of the month" with standard SQL that doesn't utilize a user defined function is possible, but it's incredibly cumbersome.  Therefore, I'm not going to write it out here.

Recursive SQL, however, makes it almost trivial to answer.  Simply convert the string to separate rows for each item in the string and check item 17.


The Solution (Wow!  It really works!)

The previous article on recursion in DB2 (link above) used the example of reconstructing a sentence from the stored words.  To set up the example we had to hard-code the data into INSERT statements to populate the table.  It would certainly be much more flexible if we could generate a query that would extract the words in a sentence and store them.  And by a clever coincidence, it fits nicely into the theme of this article.

First, we need a table to record the sentences.
CREATE TABLE sentences
                      (
                        snum      INTEGER GENERATED BY DEFAULT AS IDENTITY,
                        sentence  VARCHAR (200)
                      );

Open in new window


And we still need a table to contain each of the words.
CREATE TABLE words 
                      (
                        snum      INTEGER,        -- sentence number
                        wordnum   INTEGER,        -- word number in the sentence
                        word      VARCHAR (100)   -- word being saved
                      );

Open in new window


We also need something to parse.
INSERT INTO sentences (sentence) VALUES ('This is a multiword string');

Open in new window


Parsing the sentence using standard SQL is beyond the scope of this article.  It's just too long, too ugly, and too obtuse.

But a recursive solution is neat and concise.
WITH pquery (snum, wordnum, word, remainder)
                      AS
                      (
                        SELECT base.snum, 1, 
                          CASE WHEN LOCATE (' ', sentence) > 0 THEN 
                            SUBSTR (sentence, 1, LOCATE (' ', sentence) - 1)
                          ELSE
                            sentence
                          END word,
                          CASE WHEN locate (' ', sentence) > 0 THEN
                            LTRIM (SUBSTR (sentence, LOCATE (' ', sentence) + 1))
                          ELSE
                            NULL
                          END remainder
                        FROM sentences base 
                        
                        UNION ALL
                        
                        SELECT snum, wordnum + 1, 
                          CASE WHEN LOCATE (' ', remainder) > 0 then 
                            SUBSTR (remainder, 1, LOCATE (' ', remainder) - 1)
                          ELSE
                            remainder
                          END word,
                          CASE WHEN LOCATE (' ', remainder) > 0 THEN
                            LTRIM (SUBSTR (remainder, LOCATE (' ', remainder) + 1))
                          ELSE
                            NULL
                          END remainder
                        FROM pquery t0
                        WHERE t0.remainder IS NOT NULL
                      )
                      SELECT * 
                      FROM pquery pq;

Open in new window


Note that the first column is simply read from the table and passed to the result set.  The second column is an incrementing counter that generates the word number.  The third and fourth columns produce the current word and the unparsed text strings, but are mostly sanity checks to test for a single word sentence and the end of the sentence.

Whether there is 1 word or 1,000 words in the sentence, that one recursive query will handle it.


How It Works (I still feel a headache coming on)

The query has two parts.  The upper subquery selects the sentence from the table, sets the word number to 1, extracts the first word in the sentence, and generates a new string that begins after the extracted word.  It also contains sanity checks to test whether there is more than one string.  (It does not check for leading blanks, but we can solve that with a business rule that defines a sentence a starting with a non-blank.)

The lower subquery is the recursive part of the solution.  It takes the results of the previous subquery and selects the next word in the string, increments the word count, and sets the remainder string.  Selecting these values creates another row in the result set, which causes the recursion that will process the selected row.

pquery returns 4 columns.  The first 3 columns are part of the solution:  The sentence number, echoed from the sentences table; the word number, calculated as the sentence is parsed; and the word extracted from the sentence.  The last column is the portion of the original string that is yet to be parsed.  It may be more efficient to pass the index of the next starting location than the remaining string, but the SQL gets more complicated.

A quick check of the results shows the progression.
WITH ... SELECT * FROM pquery;
                      
                       snum wordnum word      remainder
                          1       1 This      is a multiword string
                          1       2 is        a multiword string
                          1       3 a         multiword string
                          1       4 multiword string
                          1       5 string    NULL

Open in new window


The upper subquery returns the first word (This) and builds the 'remainder' string.  The lower subquery operates on the 'remainder' string, selecting the next word and generating a new 'remainder' string.  The recursion occurs when the lower subquery select from pquery (the newly created row) and repeats the lower subquery.  


The Results (it really works!)

The recursive query that converted rows to columns generated intermediate results that needed to be filtered out.  This recursive query returns all of the desired words. It's a simple matter to save the words into the 'words' table.
WITH pquery (snum, wordnum, word, remainder)
                      AS
                      (
                        SELECT base.snum, 1, 
                          CASE WHEN LOCATE (' ', sentence) > 0 THEN 
                            SUBSTR (sentence, 1, LOCATE (' ', sentence) - 1)
                          ELSE
                            sentence
                          END word,
                          CASE WHEN locate (' ', sentence) > 0 THEN
                            LTRIM (SUBSTR (sentence, LOCATE (' ', sentence) + 1))
                          ELSE
                            NULL
                          END remainder
                        FROM sentences base 
                        
                        UNION ALL
                        
                        SELECT snum, wordnum + 1, 
                          CASE WHEN LOCATE (' ', remainder) > 0 then 
                            SUBSTR (remainder, 1, LOCATE (' ', remainder) - 1)
                          ELSE
                            remainder
                          END word,
                          CASE WHEN LOCATE (' ', remainder) > 0 THEN
                            LTRIM (SUBSTR (remainder, LOCATE (' ', remainder) + 1))
                          ELSE
                            NULL
                          END remainder
                        FROM pquery t0
                        WHERE t0.remainder IS NOT NULL
                      )
                      INSERT INTO words
                      SELECT snum, wordnum, word
                      FROM pquery pq
                      ORDER BY snum, wordnum;

Open in new window



Expanding the Results (I've still got lots of data to deal with!)

The example that we've used so far is very simple.  We have one sentence with a single starting value.  In the real world the table would likely contain a lot of sentences, so let's see what happens when the table contains multiple sentences.
INSERT INTO sentences (sentence) VALUES ('This  multiword string has variable   spacing');
                      INSERT INTO sentences (sentence) VALUES ('Short sentence');

Open in new window

WITH pquery (snum, wordnum, word, remainder)
                      AS
                      (
                        SELECT base.snum, 1, 
                          CASE WHEN LOCATE (' ', sentence) > 0 THEN 
                            SUBSTR (sentence, 1, LOCATE (' ', sentence) - 1)
                          ELSE
                            sentence
                          END word,
                          CASE WHEN locate (' ', sentence) > 0 THEN
                            LTRIM (SUBSTR (sentence, LOCATE (' ', sentence) + 1))
                          ELSE
                            NULL
                          END remainder
                        FROM sentences base 
                        
                        UNION ALL
                        
                        SELECT snum, wordnum + 1, 
                          CASE WHEN LOCATE (' ', remainder) > 0 then 
                            SUBSTR (remainder, 1, LOCATE (' ', remainder) - 1)
                          ELSE
                            remainder
                          END word,
                          CASE WHEN LOCATE (' ', remainder) > 0 THEN
                            LTRIM (SUBSTR (remainder, LOCATE (' ', remainder) + 1))
                          ELSE
                            NULL
                          END remainder
                        FROM pquery t0
                        WHERE t0.remainder IS NOT NULL
                      )
                      SELECT * 
                      FROM pquery pq;

Open in new window

 SNUM WORDNUM WORD      REMAINDER
                          1       1 This      is a multiword string
                          2       1 This      multiword string has variable   spacing
                          3       1 Short     sentence
                          1       2 is        a multiword string
                          2       2 multiword string has variable   spacing
                          3       2 sentence  NULL
                          1       3 a         multiword string
                          2       3 string    has variable   spacing
                          1       4 multiword string
                          2       4 has       variable   spacing
                          1       5 string    NULL
                          2       5 variable  spacing
                          2       6 spacing   NULL

Open in new window


The most obvious thing about the results is that it appears sorted by wordnum,snum.  Actually, it's not sorted at all.  The rows are listed in the order that they were generated.  Don't forget that SQL isn't linear.  It doesn't read a row and process it.  It reads all applicable rows and processes them.  pquery reads all of the rows from 'sentences' and generates all of the results of the upper subquery.  This generates the first three lines in the result set (the items where "wordnum" is 1).  The lower subquery then  processes all of the word 2 items, then word 3, etc. until no words remain.

The result is that we now generate and select all of the words in all of the sentences that are in the table.


Gotchas (I did that again?)

There are several places where it's easy to get tripped up using recursive SQL.  Stay mindful of the items mentioned here and you'll avoid most of the major pitfalls.

Select the correct starting row(s).  If the upper subquery doesn't select the correct initial value(s), the rest of the query is meaningless.  Always select the base item(s) in the upper subquery that will be built upon by the recursive queries.

Avoid infinite recursion.  If you structure your query so that there is no logical end point, DB2 will go along merrily joining the additional rows until it detects a stack overflow and throws an exception.  This is almost always caused by applying the wrong filter in the lower subquery.


Conclusion (I thought that he'd never shut up!)

This article raised the hypothetical situation where student attendance was stored one month per string, as shown below, and posed the challenge of determining if the student was present on the 17th of the month.

Using the query from this article it's easy to answer that question.  The only change that's required is to change the separator from a space to a comma.  But for completeness a couple of the variable names should change, too.  "snum" becomes the student number, "wordnum" should be changed to "day", and "word" changed to "code".

'P,P,P,,,P,EA,P,EA,EA,,,P,P,P,P,P,,,P'  etc...

Passing the attendance string through pquery and joining the results to the student record, the student's attendance on the 17th can be checked with this SQL:
WITH pquery (snum, day, code, remainder)
                      ...
                      SELECT pq.snum, CASE WHEN pq.code = 'P' THEN 'Present' ELSE 'Detention is in someone''s future' END AS status
                      FROM pquery pq
                      WHERE pq.snum = {student id}
                        AND pq.day = 17;

Open in new window



Good Luck
6
30,378 Views
Kent OlsenData Warehouse / Database Architect
CERTIFIED EXPERT

Comments (2)

Sean McSherrySenior Data Analyst

Commented:
7 years on and this article helped me with a problem that had been vexing me for a while.   Thanks, Kent!
Kent OlsenData Warehouse / Database Architect
CERTIFIED EXPERT

Author

Commented:
Thanks, Sean.

And welcome to Experts Exchange!  :)


Kent

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.