Today I will describe a subset of an ETL process I built. I think it could be useful for you, and it will be useful for me if I receive your feedback.
1. The Problem
I have customer data (sent by an application as text files to a specific file system share) that must be imported into a SQL Server 2005 database, more precisely into the Customers table.
Every Friday, all the data in the text file must be extracted, transformed and loaded into the SQL Server 2005 database. After the extraction, the file must be moved to another directory in the file system (…\ProcessedFiles).
The text file uses fixed-width columns, and one of the columns describes the action to take when the ETL updates the Customers table:
N (means a new customer to insert)
A (means update an existing customer)
D (means update/inactivate an existing customer)
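For illustration only, a hypothetical fixed-width layout might look like the fragment below (the real column names, widths and values are not shown in this post; in SSIS the Flat File connection manager defines the start position and width of each column):

```text
CustomerID  Name                 City            Action
0000000123  John Smith           Lisbon          N
0000000124  Maria Silva          Porto           A
0000000125  Carlos Santos        Faro            D
```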
2. The Solution
The data will be extracted from the text file into a staging table (preCustomer) without any transformations or rules (to reduce risk). Each row imported into this staging table has an extra column that is a foreign key to my audit information table. Next, I read all the records from the last extraction, transform them and load them into the Customer table.
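SQL Server 2005 has no MERGE statement, so the N/A/D actions can be applied as separate set-based statements. A minimal sketch, assuming hypothetical columns (CustomerID, Name, an IsActive flag) plus the AuditID foreign key mentioned above, where @LastAuditID identifies the last extraction batch:

```sql
-- Insert new customers (action N) from the latest extraction only.
INSERT INTO Customer (CustomerID, Name, IsActive)
SELECT p.CustomerID, p.Name, 1
FROM preCustomer p
WHERE p.[Action] = 'N'
  AND p.AuditID = @LastAuditID;

-- Update existing customers (action A).
UPDATE c
SET c.Name = p.Name
FROM Customer c
JOIN preCustomer p ON p.CustomerID = c.CustomerID
WHERE p.[Action] = 'A'
  AND p.AuditID = @LastAuditID;

-- Inactivate existing customers (action D): a soft delete, not a physical one.
UPDATE c
SET c.IsActive = 0
FROM Customer c
JOIN preCustomer p ON p.CustomerID = c.CustomerID
WHERE p.[Action] = 'D'
  AND p.AuditID = @LastAuditID;
```

In the actual package this logic lives in the data flow, and rows that break the business rules are redirected instead of applied.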
The end user, from an ASP application, will be able to correct or ignore the redirected rows, so the next time the SSIS package executes these errors will disappear. (Note: the warning and error rows are stored in the redirectCustomer table until the end user takes some action, correct or ignore.)
Logical Data Flow
There are several steps that would be useful to explain, but today I will focus only on the subset I find most interesting. Below you can see the logical data flow for SSIS package number 3, which transforms and loads the customer data from the preCustomer table into the Customer table.
And finally the ssis package...
(Note that the schema of ETL_Customer is identical to that of ETL_redirectCustomer; for this reason you can Union All the two sources in the pipeline.)
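In the package this union is a Union All component fed by two sources, but the equivalent T-SQL is a simple sketch like the one below (the Status column on redirectCustomer is a hypothetical flag marking rows the end user has already corrected):

```sql
-- New rows from the latest extraction...
SELECT CustomerID, Name, [Action]
FROM preCustomer
WHERE AuditID = @LastAuditID
UNION ALL
-- ...plus previously redirected rows the end user has corrected.
SELECT CustomerID, Name, [Action]
FROM redirectCustomer
WHERE Status = 'Corrected';
```

Because the two schemas are identical, the Union All component maps the columns one-to-one and the rest of the pipeline treats both origins the same way.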
3. Operational Risk
Remember that in a business intelligence project around 70% of the risk occurs in the ETL module. So pay attention to the ETL and always try to identify operational risks. I use the SSIS error logs only for truly unexpected errors; if you think there is a real probability that some task will fail, don't wait to see the error log, try to mitigate the error before it happens!
Operational risk is defined as the risk of loss resulting from inadequate or failed internal processes, people and systems or from external events. This definition includes legal risk, but excludes strategic and reputation risk.
The European Commission's Directive 2006/48/EC
4. Some best-practice tips
I selected some important tips that can be followed in order to implement a good ETL project using best practices.
- Always comment your project.
- Always create the logical control flow and data flow models, and always before implementing the SSIS package.
- For every document created, insert it into a template containing all the important information to share with all the teams involved in the project. It's very important that when you look at a document, you can quickly identify its subject and scope. For example, in the template I created for the Logical Data Flow, I added important information such as project, package, data flow, developer/team, date, version, purpose, reference…
- Use naming conventions for the tasks and components of the SSIS packages.
- Use the SSIS log provider to catch error and warning events.
- It can be useful to implement not only the SSIS log provider, but also some custom logging, using event handlers.
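As a sketch of such a custom log, assuming a hypothetical CustomSsisLog table: add an Execute SQL Task to the package's OnError event handler and map the SSIS system variables (System::PackageName, System::SourceName, System::ErrorDescription) to the statement parameters in the Parameter Mapping tab:

```sql
-- Hypothetical table for custom error logging.
CREATE TABLE CustomSsisLog (
    LogID            INT IDENTITY(1,1) PRIMARY KEY,
    PackageName      NVARCHAR(128),
    TaskName         NVARCHAR(128),
    ErrorDescription NVARCHAR(MAX),
    LoggedAt         DATETIME DEFAULT GETDATE()
);

-- Statement used by the Execute SQL Task in the OnError event handler;
-- the ? markers are mapped to System::PackageName, System::SourceName
-- and System::ErrorDescription.
INSERT INTO CustomSsisLog (PackageName, TaskName, ErrorDescription)
VALUES (?, ?, ?);
```

This gives you a log shaped exactly the way your operations team wants it, alongside whatever the standard log provider records.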
- Test your package! How?! Use SQL Data Generator to generate thousands of rows for your source data. This software is very good: for each column you can specify the set of possible values to randomize. (These values can come from a database, a text file…)
I hope this post is useful for you, and I also hope you can give me some feedback about it.