How to use Slowly Changing Dimension Transformation in SSIS

Dung DinhData Platform Consultant
CERTIFIED EXPERT
Published:
Updated:
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.

Scenario

When you build a data-warehouse (DW) system, you have to manage and control what data was changed in your DW system.

For example: You are a retailer and building a data-warehouse (DW) system that support you to analyze data and make decision. In DW, you have a Customer dimensional table that contains your list of customers. Now, Kevin customer requests to change his address from New York city to Florida city in Customer Service system and we need also to update him in the DW. Kevin customer also did some transactions with $10,000 of total sales from your retail system when he lived in New York. So, how can we handle that change?

If you determine to overwrite address of Kevin customer, maybe we will have an issue while generating SALES report by Customer and City. SALES of New York will be decreased $10,000 while SALES of Florida will be increased $10,000. However, all transaction happened in New York City.

Instead, we will create new record for Kevin customer in Customer table. This record will contain new city information. This is to ensure we will have right reports by City but some data will be duplicated.

Another case, Kevin customer has changed his phone number. In this case, we do not want to analyze data by Phone Number so we can definitely overwrite Kevin's information.

Both of cases are examples of Slowly Changing Dimension (SCD) and they require us to implement a mechanism to handle every situation appropriately.

There are multiple ways to implement that in SQL server. Simplest way is using Slowly Changing Dimension in SSIS.


Dimension

Dimension is term in data management and data-warehouse. You have heard about Ralph Kimball model in data-warehouse, the dimension term is very popular. Click here to understand more detail about dimension.
 

What are SCD types?

In this article, I am going to introduce common three types although there are more than three types in real life.


SCD1
In this type, new information simply overwrites old information. In other words, no historical data. Come back above example, Kevin moved from New York to Florida. In my example, have following record.

CustomerKey             Name                      City
10000                            Kevin                        New York

After Kevin moved from New York to Florida, new information replaces old record.
CustomerKey             Name                      City
10000                           Kevin                        Florida

Advantages
This is the simplest way to implement SCD.
Disadvantages
All history lost. In this methodology, it is not possible to trace back historical data. 
 

SCD2

A new record is inserted so both of new record and old record are presented in our system. With this type, we need flags to identify what are old or new. For example:

CustomerKey             Name                      City
10000                            Kevin                        New York

After Kevin moved from New York to Florida, new information replaces old record and then.
CustomerKey             Name                      City
10000                            Kevin                        New York
10001                            Kevin                        Florida

Advantages
This allows us to keep all historical data
Disadvantages
Because new record is inserted so this can cause the size of table grow fast. We should consider in case number of rows in table is high.
 

SCD3

There will be two column. One identifies original value and another identifies current value.
CustomerKey             Name                      City
10000                            Kevin                        New York

After Kevin moved from New York to Florida, new information replaces old record and then.
CustomerKey             Name                      Original City              Current City
10000                            Kevin                        New York                     Florida

Advantages
This allows to keep the part of history
Size of table will not be increased because new data is updated.
Disadvantages 
This can not keep all historical data when data is changed more than once.


Using SCD in SSIS

Step 1: Create destination table named as DimCustomer
 
CREATE TABLE DimCustomer
                      (
                      CustomerKey nvarchar(100),
                      Name nvarchar(100),
                      City nvarchar(50)
                      )
                      GO
                      --  Insert data
                      INSERT INTO DimCustomer
                      VALUES('10000','Kevin','New York')

Open in new window


Step 2: Create SSIS new project named as SCDCustomer
Step 3: Drag and drop Data Flow task and double-click on this task
Step 4: Create example file to update DimCustomer table
CustomerKey    Name    City
10000                 Kevin     Florida
10001                 Marry    California

Step 5: Create a Flat-file connection and OLEDB connectionSCD_Connection.PNG
Step 6: Drag and drop Flat File Source into Data Flow area.

Step 7: Drag and drop Slowly Changing Dimension and double click to open Slowly Changing Dimension Wizard. Then click Next

Step 7.1
Connection Manager: OLEDBConnection
Table and view: DimCustomer
Set Key Type of Customer Key is Business Key

SCD_1.PNG
Select Business Key task is very important. It helps SCD component determine what records are existing in destination. Click Next

Step 8: At this step, we determine what types of SCD we use (SCD1, SCD2 or so on). However, SCD in SSIS only support SCD1 and SCD2 and another mode.
Fixed Attribute: Select this type when the value in a column should not change. Changes are treated as errors.
Changing Attribute: Select this type when changed values should overwrite existing values. This is the SCD1.
Historical Attribute: Select this type when changes in column values are saved in new records. Previous values are save in records marked as outdated. This is the SCD2.


In this article, I determined to use Historical Attribute (SCD2).
SCD_2.PNGClick Next.

Step 9: Because I used SCD2 so we need flags to identify what records are new or old. We have two ways to implement the flags.

  1. Use one column. It can be valued TRUE /FALSE or Current / Expired
  2. Use two columns with DATE data type (StartedDate and EndedDate)
 
ALTER TABLE DimCustomer ADD IsCurrent nvarchar(10)

Open in new window

Then update all existing records to Current

In this step, SCD component asks you how to handle new or records. Click on option Use a single column to show current and expired records.

Column to indicate current record: IsCurrent
Value when current: Current
Expiration value: Expired


SCD_3.PNGClick Next.

Step 10: Un-check Enable inferred member support. Click Next.

Step 11: Finish. There are two branches of SCD component. Historical Attribute Inserts Output branch is to insert new records if existing records change values and they are marked as expiration records. New Output to insert  new records if they are not existing.

SCD_4.PNGStep 12: Execute package.
Query table DimCustomer to ensure there is only one record for Kevin customer.
DimCustomer1.PNGExecute package
Final_Result.PNGNow we can see the result that Historical Attribute Inserts Output branch has 1 row to update and insert data of Kevin customer. New Output branch has 1 row to insert a new customer from the example file.

Query DimCustomer again.
DimCustomer2.PNG
The result as out expectation. The old record was updated Expired and new record was also inserted to indicate new value (Florida).
 

Limitations and Conclusions

Although Slowly Changing Dimension in data flow in SSIS is very useful. However, it has been designed for ease of use for small dimensions (tables) only. As above example, the output of updates are the T-SQL statements for SQL Server and they execute on each row in data flow. So there are some limitations in case using SCD component in SSIS.

  1. It can be very slowly for large dimension because there is no caching of lookup data when SCD must look for existing records in destination.
  2. It is used for SQL Server only
  3. It generates OLED Commands and execute row by rows instead of batch. This can cause performance issues.
Be carefully when you determine to use Slowly Changing Dimension in SSIS to solve the problems of changing data. Although it's very useful and easy for you to configure but the performance is the main concern if you want to process large dimensions. However, there are still alternative ways that can help you avoid performance issue
  1. Use third component party SSIS Dimension SCD Component
  2. Build your own approach by using T-SQL (INSERT, UPDATE, MERGE....) to work on batch data instead of each row as SSIS is doing.
 
1
6,808 Views
Dung DinhData Platform Consultant
CERTIFIED EXPERT

Comments (1)

Jim HornSQL Server Data Dude
CERTIFIED EXPERT
Most Valuable Expert 2013
Author of the Year 2015

Commented:
A very well written article on essential SSIS functionality dealing with record changes.  Voting Yes.

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.