I have a large data set and an SSIS package. How can I load this file using multiple threads?
The client provided a sample raw file of 5 million rows; in a production environment, the file will be larger. I had to think about how to load it in as little time as possible.
Loading such a file with good performance depends on many factors. I thought about table structure, indexes, storage, the ETL system, and so on. In this article, I would like to share how we can load a raw data file using multiple threads.
Microsoft provides many approaches for optimizing an ETL system. Starting with SQL Server 2014, Microsoft provided a data transformation component for Microsoft SQL Server Integration Services named Balanced Data Distributor (BDD)
that allows us to speed up the transformation process.
My test environment:
- Microsoft SQL Server 2014 - 12.0.2269.0 (X64) Jun 10 2015 03:35:45 Copyright (c) Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.3 <X64> (Build 10586: )
- SQL Server Data Tools for Visual Studio 2013
Install the Balanced Data Distributor component downloaded above
This step creates a database named Sales_Retail. This database has a PRIMARY filegroup and nine other filegroups. The purpose is to use the partitioned table feature of SQL Server Enterprise Edition.
IF NOT EXISTS (SELECT name FROM sys.databases WHERE name = N'Sales_Retail')
CREATE DATABASE [Sales_Retail]
 CONTAINMENT = NONE
 ON PRIMARY
( NAME = N'Sales_Retail', FILENAME = N'D:\DataSQL2014\Sales_Retail.mdf' , SIZE = 5120KB , MAXSIZE = UNLIMITED, FILEGROWTH = 10240KB ),
 FILEGROUP [Sales2007] ( NAME = N'Sales2007', FILENAME = N'D:\DataSQL2014\Sales2007.ndf' , SIZE = 10240KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2008] ( NAME = N'Sales2008', FILENAME = N'D:\DataSQL2014\Sales2008.ndf' , SIZE = 10240KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2009] ( NAME = N'Sales2009', FILENAME = N'D:\DataSQL2014\Sales2009.ndf' , SIZE = 10240KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2010] ( NAME = N'Sales2010', FILENAME = N'D:\DataSQL2014\Sales2010.ndf' , SIZE = 10240KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2011] ( NAME = N'Sales2011', FILENAME = N'D:\DataSQL2014\Sales2011.ndf' , SIZE = 10240KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2012] ( NAME = N'Sales2012', FILENAME = N'D:\DataSQL2014\Sales2012.ndf' , SIZE = 10240KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2013] ( NAME = N'Sales2013', FILENAME = N'D:\DataSQL2014\Sales2013.ndf' , SIZE = 73728KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2014] ( NAME = N'Sales2014', FILENAME = N'D:\DataSQL2014\Sales2014.ndf' , SIZE = 235520KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2015] ( NAME = N'Sales2015', FILENAME = N'D:\DataSQL2014\Sales2015.ndf' , SIZE = 1048576KB , MAXSIZE = 1048576KB , FILEGROWTH = 1024KB ),
 FILEGROUP [Sales2015_1] ( NAME = N'Sales2015_1', FILENAME = N'D:\DataSQL2014\Sales2015_1.ndf' , SIZE = 168960KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
 LOG ON
( NAME = N'Sales_Retail_log', FILENAME = N'D:\DataSQL2014\Sales_Retail_log.ldf' , SIZE = 634880KB , MAXSIZE = 2048GB , FILEGROWTH = 10240KB )
GO
ALTER DATABASE [Sales_Retail] SET RECOVERY SIMPLE
ALTER DATABASE [Sales_Retail] SET MULTI_USER
ALTER DATABASE [Sales_Retail] SET PAGE_VERIFY CHECKSUM
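After creating the database, a quick sanity check with the standard catalog views confirms that all the data filegroups and their files exist (a sketch; column aliases are my own):

```sql
-- List the filegroups and files of the new database
USE [Sales_Retail];
SELECT fg.name AS filegroup_name,
       df.name AS logical_file_name,
       df.physical_name
FROM sys.filegroups AS fg
JOIN sys.database_files AS df
  ON df.data_space_id = fg.data_space_id
ORDER BY fg.name;
```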
Create a table
Because we want a partitioned table, we first need to create a partition function and a partition scheme. To learn more about partitioned tables in SQL Server, see the Microsoft documentation.
Create a partition function
CREATE PARTITION FUNCTION SalesRetailRange (datetime)
-- Nine boundary values give ten partitions, one per data filegroup
AS RANGE LEFT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01',
    '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
    '2014-01-01', '2015-01-01')
Create a partition scheme
--- Create a partition scheme
CREATE PARTITION SCHEME SalesRetailScheme
AS PARTITION SalesRetailRange
TO ([Sales2007], [Sales2008], [Sales2009], [Sales2010], [Sales2011],
    [Sales2012], [Sales2013], [Sales2014], [Sales2015], [Sales2015_1])
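To verify that the function maps dates to the expected partitions, you can query it directly with the built-in $PARTITION function:

```sql
-- Returns the 1-based partition number that a given OrderDate would land in.
-- With RANGE LEFT, a boundary value itself belongs to the partition on its left.
SELECT $PARTITION.SalesRetailRange('2008-06-15') AS partition_number;
```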
Create a table
This table is partitioned by the OrderDate column. Whenever sales data is imported into this table, SQL Server uses the value of OrderDate to determine which partition each row is stored in.
IF OBJECT_ID('Sales') IS NOT NULL
DROP TABLE Sales
CREATE TABLE Sales
(
SalesOrderID int NOT NULL IDENTITY (1, 1) NOT FOR REPLICATION,
OrderDate datetime NOT NULL, -- Partition column
DueDate datetime NULL,
ShipDate datetime NULL,
Status tinyint NULL,
CreditCardApprovalCode nvarchar(15) NULL,
Comment nvarchar(128) NULL,
CarrierTrackingNumber nvarchar(25) NULL,
OrderQty nvarchar(50) NULL,
UnitPrice nvarchar(50) NULL,
UnitPriceDiscount nvarchar(50) NULL,
) ON SalesRetailScheme([OrderDate])
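Once data has been loaded, the distribution of rows across partitions can be checked with a simple aggregation over the partition function (a sketch; the alias names are my own):

```sql
-- Row count per partition of the Sales table
SELECT $PARTITION.SalesRetailRange(OrderDate) AS partition_number,
       COUNT(*) AS row_count
FROM Sales
GROUP BY $PARTITION.SalesRetailRange(OrderDate)
ORDER BY partition_number;
```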
Create a SSIS Package
- Create an SSIS project named Balanced_Data_Distribution
- Create an SSIS package named Load_Sales_With_BDD
- Create an OLE DB connection in the Connection Managers panel that connects to the Sales_Retail database
- Create a flat file connection named Sales that connects to the raw file. In my example, I generated 5 million rows of Sales data from the AdventureWorks2014 database.
- Drag and drop an Execute SQL Task component. This task truncates the Sales table before importing data from the raw file
- Drag and drop a Data Flow Task component and double-click it to go to the Data Flow tab
- Drag and drop a Flat File Source component named Sales
- Connection Manager tab
- Flat file connection managers: Sales
- Columns tab: select all columns
- Drag and drop the Balanced Data Distributor component. This allows us to split the input data into multiple flows in the Data Flow area.
- Drag and drop an OLE DB Destination named Load Sale 2007-2008
- Drag and drop an OLE DB Destination named Load Sale 2009-2010
- Drag and drop an OLE DB Destination named Load Sale 2011-2012
- Drag and drop an OLE DB Destination named Load Sale 2013-2014
- Create an SSIS package named Load_Sales_Without_BDD
- This step creates a new package that does not use the Balanced Data Distributor. It will look like this:
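The 5-million-row sample file mentioned above can be produced from AdventureWorks2014 and exported to a flat file (for example with bcp). The following is a sketch; the table and column names come from AdventureWorks2014, but the CROSS JOIN multiplier is my own assumption to inflate the row count:

```sql
-- Sketch: inflate AdventureWorks2014 sales data to roughly 5 million rows.
-- TOP (42) is an assumed multiplier; adjust it to reach the desired row count.
USE AdventureWorks2014;
SELECT soh.OrderDate,
       soh.DueDate,
       soh.ShipDate,
       soh.Status,
       soh.CreditCardApprovalCode,
       soh.Comment,
       sod.CarrierTrackingNumber,
       sod.OrderQty,
       sod.UnitPrice,
       sod.UnitPriceDiscount
FROM Sales.SalesOrderHeader AS soh
JOIN Sales.SalesOrderDetail AS sod
  ON sod.SalesOrderID = soh.SalesOrderID
CROSS JOIN (SELECT TOP (42) 1 AS n FROM sys.objects) AS multiplier;
```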
Execute the package
Test 1: In this case, we tested with DefaultBufferMaxRows = 100,000 and DefaultBufferSize = 100 MB
Elapsed time = 2 minutes 17 seconds
Test 2: In this case, we tested with DefaultBufferMaxRows = 10,000 and DefaultBufferSize = 10 MB
Elapsed time = 51 seconds
In theory, increasing the number of rows per buffer and the buffer size should make the package run faster. However, the results did not meet our expectations: more rows and a larger buffer led to worse performance. This was caused by the machine executing the package. In my example, I ran it on my laptop (Windows 10 Pro, Core i7 with 2 cores, 8 GB RAM), which could not write data fast enough when the buffer row count increased, so the I/O subsystem became the bottleneck.
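One way to confirm that the I/O subsystem is the bottleneck is to look at the file-level write stalls that SQL Server tracks in the sys.dm_io_virtual_file_stats DMV (a sketch; the alias names are my own):

```sql
-- Write latency per file of Sales_Retail; high average write stalls
-- indicate the storage cannot keep up with the load
SELECT mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(DB_ID('Sales_Retail'), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_write_stall_ms DESC;
```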
Test 3: In this case, we tested the Load_Sales_Without_BDD.dtsx package to compare with Load_Sales_With_BDD, with DefaultBufferMaxRows = 10,000 and DefaultBufferSize = 10 MB
Elapsed time = 55 seconds
Test 4: In this case, we tested the Load_Sales_Without_BDD.dtsx package again, but with DefaultBufferMaxRows = 100,000 and DefaultBufferSize = 100 MB
Elapsed time = 2 minutes and 27 seconds.
Obviously, there is no significant difference between loading with and without the Balanced Data Distributor in my testing. However, consider what happens when the raw data file grows to around 100 million rows or more and you can execute the package on a multi-core, multi-processor machine.
If the data set is large and requires several buffers to hold the data, the Balanced Data Distributor component can effectively process those buffers in parallel by using separate threads.
In this article, I demonstrated how to implement the Balanced Data Distributor component in SSIS. At this time, I cannot say it is the best way to split input data across multiple threads (flows), but it will be useful in some cases.