Our community of experts have been thoroughly vetted for their expertise and industry experience.
Published:
Updated:
Browse All Articles > How to use Slowly Changing Dimension Transformation in SSIS
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Scenario
When you build a data-warehouse (DW) system, you have to manage and control what data was changed in your DW system.
For example: You are a retailer and building a data-warehouse (DW) system that support you to analyze data and make decision. In DW, you have a Customer dimensional table that contains your list of customers. Now, Kevin customer requests to change his address from New York city to Florida city in Customer Service system and we need also to update him in the DW. Kevin customer also did some transactions with $10,000 of total sales from your retail system when he lived in New York. So, how can we handle that change?
If you determine to overwrite address of Kevin customer, maybe we will have an issue while generating SALES report by Customer and City. SALES of New York will be decreased $10,000 while SALES of Florida will be increased $10,000. However, all transaction happened in New York City.
Instead, we will create new record for Kevin customer in Customer table. This record will contain new city information. This is to ensure we will have right reports by City but some data will be duplicated.
Another case, Kevin customer has changed his phone number. In this case, we do not want to analyze data by Phone Number so we can definitely overwrite Kevin's information.
Both of cases are examples of Slowly Changing Dimension (SCD) and they require us to implement a mechanism to handle every situation appropriately.
There are multiple ways to implement that in SQL server. Simplest way is using Slowly Changing Dimension in SSIS.
Dimension
Dimension is term in data management and data-warehouse. You have heard about Ralph Kimball model in data-warehouse, the dimension term is very popular. Click here to understand more detail about dimension.
What are SCD types?
In this article, I am going to introduce common three types although there are more than three types in real life.
SCD1 In this type, new information simply overwrites old information. In other words, no historical data. Come back above example, Kevin moved from New York to Florida. In my example, have following record.
CustomerKey Name City
10000 Kevin New York
After Kevin moved from New York to Florida, new information replaces old record.
CustomerKey Name City
10000 Kevin Florida
Advantages
This is the simplest way to implement SCD.
Disadvantages
All history lost. In this methodology, it is not possible to trace back historical data.
SCD2
A new record is inserted so both of new record and old record are presented in our system. With this type, we need flags to identify what are old or new. For example:
CustomerKey Name City
10000 Kevin New York
After Kevin moved from New York to Florida, new information replaces old record and then.
CustomerKey Name City
10000 Kevin New York
10001 Kevin Florida
Advantages
This allows us to keep all historical data
Disadvantages
Because new record is inserted so this can cause the size of table grow fast. We should consider in case number of rows in table is high.
SCD3
There will be two column. One identifies original value and another identifies current value.
CustomerKey Name City
10000 Kevin New York
After Kevin moved from New York to Florida, new information replaces old record and then.
CustomerKey Name Original City Current City
10000 Kevin New York Florida
Advantages
This allows to keep the part of history
Size of table will not be increased because new data is updated.
Disadvantages
This can not keep all historical data when data is changed more than once.
Using SCD in SSIS
Step 1: Create destination table named as DimCustomer
Step 2: Create SSIS new project named as SCDCustomer Step 3: Drag and drop Data Flow task and double-click on this task
Step 4: Create example file to update DimCustomer table
CustomerKey Name City
10000 Kevin Florida
10001 Marry California
Step 5: Create a Flat-file connection and OLEDB connection Step 6: Drag and drop Flat File Source into Data Flow area.
Step 7: Drag and drop Slowly Changing Dimension and double click to open Slowly Changing Dimension Wizard. Then click Next
Step 7.1 Connection Manager: OLEDBConnection
Table and view: DimCustomer
Set Key Type of Customer Key is Business Key
Select Business Key task is very important. It helps SCD component determine what records are existing in destination. Click Next
Step 8: At this step, we determine what types of SCD we use (SCD1, SCD2 or so on). However, SCD in SSIS only support SCD1 and SCD2 and another mode.
Fixed Attribute: Select this type when the value in a column should not change. Changes are treated as errors. Changing Attribute: Select this type when changed values should overwrite existing values. This is the SCD1. Historical Attribute: Select this type when changes in column values are saved in new records. Previous values are save in records marked as outdated. This is the SCD2.
In this article, I determined to use Historical Attribute (SCD2).
Click Next.
Step 9: Because I used SCD2 so we need flags to identify what records are new or old. We have two ways to implement the flags.
Use one column. It can be valued TRUE /FALSE or Current / Expired
Use two columns with DATE data type (StartedDate and EndedDate)
ALTER TABLE DimCustomer ADD IsCurrent nvarchar(10)
In this step, SCD component asks you how to handle new or records. Click on option Use a single column to show current and expired records.
Column to indicate current record: IsCurrent
Value when current: Current
Expiration value: Expired
Click Next.
Step 10: Un-check Enable inferred member support. Click Next.
Step 11: Finish. There are two branches of SCD component. Historical Attribute Inserts Output branch is to insert new records if existing records change values and they are marked as expiration records. New Output to insert new records if they are not existing.
Step 12: Execute package.
Query table DimCustomer to ensure there is only one record for Kevin customer.
Execute package
Now we can see the result that Historical Attribute Inserts Output branch has 1 row to update and insert data of Kevin customer. New Output branch has 1 row to insert a new customer from the example file.
Query DimCustomer again.
The result as out expectation. The old record was updated Expired and new record was also inserted to indicate new value (Florida).
Limitations and Conclusions
Although Slowly Changing Dimension in data flow in SSIS is very useful. However, it has been designed for ease of use for small dimensions (tables) only. As above example, the output of updates are the T-SQL statements for SQL Server and they execute on each row in data flow. So there are some limitations in case using SCD component in SSIS.
It can be very slowly for large dimension because there is no caching of lookup data when SCD must look for existing records in destination.
It is used for SQL Server only
It generates OLED Commands and execute row by rows instead of batch. This can cause performance issues.
Be carefully when you determine to use Slowly Changing Dimension in SSIS to solve the problems of changing data. Although it's very useful and easy for you to configure but the performance is the main concern if you want to process large dimensions. However, there are still alternative ways that can help you avoid performance issue
Our community of experts have been thoroughly vetted for their expertise and industry experience. Experts with Gold status have received one of our highest-level Expert Awards, which recognize experts for their valuable contributions.
Comments (1)
Commented: