Comparing Results Between Two Tables - An ETL Dilemma

We've all been there. I know I have many times. We have many SSIS packages that are used primarily to create extract files that we send to various clients throughout the day/week/month/quarter/year. Many of them execute a stored procedure as the source of a data flow task and that procedure does quite a bit of work, resulting in the loading of a staging table, and then a SELECT query against the staging table.

This article isn't about SSIS, but rather confirming correct results between two tables, regardless of how they were populated, when you have the original, pre-code change table and a newly loaded post-code change table.

One day, there is a request to make a change. Perhaps in how one of the columns in the staging table is calculated. In the last case, that prompted me to write the code this article is centred around. Four different changes were requested. All involved CASE expressions in the SELECT statement that determined what the final value in the column would be.

Part of our verification process (which should be yours too) is to make sure that the changes that were made did not affect anything else. A case in point, you front load one of these column changes (let's call it ColumnA) and subsequently, that value is used to determine some other column (ColumnB) value. If the contents of columns change from the pre-update value, is that expected, or does it need to be handled as some kind of exception?

Sometimes, I can just compare pre- and post-conversion extract files, but most comparison programs will just tell you which rows are different, then it is up to you to peruse the row to determine what changed. This can be fun (tongue-in-cheek) if you have a row that is hundreds, if not thousands of characters in length.

My last task loaded a staging table that had almost 300 columns. Five of those columns made up a unique key into the data, in that any changes I made should have the same values in both the pre- and post-conversion tables. The remaining columns (with the exception of the four I changed) SHOULD have had the same values and the only changes I should see are within the four columns I updated.

I started to write code to identify changes between tables, and half-way through thought there has to be a better way. We have system tables and views that should make this task easier than typing CASE expressions, ISNULL() functions and JOIN conditions for a table with almost 300 columns.

The main purpose of the script was to identify columns that did not match up. As-is, it identifies all changes, including those you expect. With a slight change to the generated code, it could be used for changes outside the expected ones (by commenting out those columns in the query).

The script uses SQLCMD variables via :SETVAR mainly because the columns you expect to match on for a 1:1 row comparison are used subsequently in IN [and NOT IN] (column_list) statements. They could be hard-coded but that defeats the purpose of a generic script. Setting these values at the start of the script makes it possible to run against any two tables with minimal effort. Note, going forward, I am going to refer to these two tables as simply staging tables.

The variables that are set at the top of the script are KEY_COLUMNS (discussed above), TABLE_NAME_O and TABLE_NAME_N. These are the pre-conversion (TABLE_NAME_O) and post-conversion (TABLE_NAME_N) table names. Other than creating the JOIN condition during execution, all data is derived from INFORMATION_SCHEMA.COLUMNS from TABLE_NAME_O, as it is assumed both have the same schema. If not, then you may need to revise this script somewhat.

If you are adding new columns to the output and TABLE_NAME_N contains columns that are not contained in TABLE_NAME_O, this script will work, but the new columns will not be part of the generated query. If columns have been removed from TABLE_NAME_N that were in TABLE_NAME_O, that will cause a problem, and adding another SQLCMD variable for an EXCLUDED_COLUMNS list might be considered.

When executed, this script will produce a column list of the 'key' columns, plus a column named Differences.

Differences will be a list of one or more column names that did not match between the two tables. The actual differences are not displayed. Using the key column values, you would have to query the two staging tables to find the actual differences. You could also save the results to a temp table and JOIN it to each staging table to examine all the changes in a single query.

There are basically nine 'sections' to the script, each delineated by comments. The first is just the setup and establishes values for the SQLCMD variables and some SQL variables. Note that the KEY_COLUMNS value is set just as it would be inside the parenthesis of an IN () clause, and is enclosed in double quotes (").

An example would be:

:SETVAR KEY_COLUMNS "'column1','column2','column3'"

TABLE_NAME_O and TABLE_NAME_N are the staging table names, where TABLE_NAME_O contains the pre-code update values and TABLE_NAME_N contains the post-code update values. Some variables are then declared that are used within the script.

The next two lines (13-14) populates the variable @columnList using the column names contained in the SQLCMD variable KEY_COLUMNS and adding the calculated column [Differences] to the list.

The third section (line 25) creates the COALESCEd list of KEY_COLUMN columns used in the subquery.

The fourth, and most complex section (beginning on line 34) is basically one long string of CASE expressions that create a list of non-matching columns.

An example line of code might be something like:

+ CASE WHEN ISNULL(O.[FirstName], '') <> ISNULL(N.[FirstName], '') THEN 'FirstName' + ',' ELSE '' END

If FirstName is the same in both tables, then the value will be an empty string (''), else it will be the column name.

The next section (line 73) adds the FROM and FULL OUTER JOIN to the query.

The code starting on line 76 adds the JOIN conditions using the column names in the KEY_COLUMNS variable.

Line 104 'wraps up' the script by closing the subquery and adding a WHERE condition to only display rows with differences.

The code following line 104 PRINTs the contents of the @SQL variable. I swiped this code from the internet. It is used because a stand-alone PRINT statement is limited in how many characters it can print. Anything longer will be truncated.

Once the variable values have been set and the script is executed, the output in the Messages tab of your SSMS session can be copied and pasted into a new query window and executed. It is also possible to just comment out the last section of code (where the code is printed) and add an EXECUTE (@SQL), but then you don't have the finished script to "play around with."

You can use three or four-part naming on the table names, but the script must be run in the database that contains the table referenced by the variable TABLE_NAME_O.

Some modifications you might consider after generating the script:

If you only want to identify columns that you don't expect to change, comment out the CASE expression(s) pertaining to those you are expecting to change. For example, if the old table had a formatted phone number (e.g. (555) 555-5555) in a column named PHONE and you are changing the code to remove formatting, thus resulting in a new value of 5555555555, find the row in the script that reads:

+ CASE WHEN ISNULL(O.[PHONE], '') <> ISNULL(N.[PHONE], '') THEN 'PHONE' + ',' ELSE '' END

and comment it out. If you are looking for changes in specific columns, change the WHERE clause to:

WHERE Differences LIKE '%PHONE%' OR Differences LIKE '%ADDRESS%'

I suggest leaving the generic script as-is and work with the generated script to make such changes as you now have all your column names.

One caveat is if you change data types between extracts. For example, if the formatted PHONE value is VARCHAR but the new value is BIGINT, the generated script will error when executed. You could handle these on a case by case basis before you ran the script, but be aware of this potential issue.

Another potential issue is that the script will not handle all data types. It was written to handle those types we specifically use, but there could be a data type you use that the script will attempt to improperly convert in the ISNULL() functions. For example, an IMAGE data type will default to conversion to 0, which will raise an error. If you run into this issue, you can add another condition to the appropriate CASE expression.

Build Compare Script.sql

Comparing Results Between Two Tables - An ETL Dilemma

Comments (0)