Requirements Document Template for an ETL Project

Jim HornSQL Server Data Dude
CERTIFIED EXPERT
Microsoft SQL Server database developer, architect, and author specializing in Business Intelligence, ETL, and Data Warehousing.
Published:
Updated:
Here's a requirements document template for an integration project (also known as Extract-Transform-Load or ETL) based on my development experience as an SQL Server Information Services (SSIS) developer over the years.

For a Requirements Document Template for a Reporting Project see my article here


I've been doing SSIS development off and on for over ten years now, have experienced the best and worst clients, business analysts, executive sponsors, and my own shining and less-than-shining moments I have seen many developers turn into ... the DEV Nazi!!


dev-nazi-no-soup-for-you.jpg

So to make sure that doesn't happen to you, here's a template for your ETL projects.


In scope for this article

  • A requirements document template designed for business analysts to cover most ETL projects.
  • Witty advice


Out of scope for this article

  • Requirements Elicitation, the practice of collecting project requirements of a system from users, customers and other stakeholders
  • Requirements documents specific to other types of projects, such as reporting and data warehousing
  • Any words of wisdom regarding data security
  • Design Documents, and issues that typically come up in design.
  • Estimating


So here we go.. 


Version History

A 'who changed what when' chronology of all changes, either using Word change tracking or lines like '8/1/15 Bob's changes per mutual agreement'. Keeps everyone honest when there are lots of changes, especially when there is disagreement as to what is a requirement and what is not, and if you’re ever in any situations where changes are coming fast and furious this is invaluable in managing an approved set of requirements. I’ve known many a business analyst and developer who became quickly overwhelmed by conflict between others on requirement demands. 


Purpose

Also known as project objective, business goals, business problem statement, and various other terms. A simple 'Here's why we're doing this' paragraph. The target audience is those that are likely to only read this paragraph, but this also gives the developer some design decision guidance.


In Scope / Out of Scope

Everybody LOVES this section! Okay, developers LOVE this section. In Scope is a summary of what's in the requirements. Out of Scope is usually a Top 10 list of things that are close but not in, and answers the often asked question 'Are we also getting this too?' This is a developer’s best line of defense against uncontrolled additions and changes in project deliverables, also known as scope creep, by false or unstated expectations. I've known more than a couple of clients that will negotiate effort, cost, and time, and then scope creep the hell out of a project in order to make themselves look better. Been there, dealt with that.   


107128744.jpg

At first scope creep can be handled professionally.


Just because person X told person Y a month ago that it’s in requirements, or this email two months ago said it’s in, or was assumed in an elevator conversation last week, or the client was told it can be handled even though we had no idea when we said that, or was mentioned on the golf course last year during preliminary negotiations doesn't mean that it’s in. If it was discussed and approved in a requirements meeting then it's in, otherwise it's out of scope. 


"But wait, we're a really small operation, and this isn't a big deal" you say? Fine, as long as you can roll with that, but the moment somebody has an requirement expectation that wasn't delivered that can change, forcing you to function as the gatekeeper of requirements in a more formal way. 


107130551.jpg

When scope creep get chaotic, this is me. Welcome to my world.


Service Level Agreement (SLA)

Is there a guarantee of performance that the company has negotiated with the client? If yes, then an initial design assessment needs to take place on whether this is a realistic expectation, as management will often negotiate revenue for performance and penalties for non-performance, and there could be considerable effect on scope and time in order to hit an SLA. This will also drive design as to feed reporting, as many large companies will need feedback on their SLA performance. 


I’ve been in a few situations where SLAs where negotiated such that processes would be completed by a time that was either not possible, or not possible given certain requirements, and needed to be handled in the design estimate. 


Dependencies

What data / processes / events is this project dependent on to be successful? I’ve seen plenty of integration projects put on hold or cancelled because…

 

  • Business persons could not agree on key terminology, such as 'what is a sale'. 
  • Other parts of the business have upstream processes that are not completed yet.
  • Data is not available.
  • Data was available but for a price, and the business was not willing to pay that price.
  • The source schema was not finalized so development could begin.
  • Sample data could not be created with reasonable confidence that it would be final so development could not begin.
  • The project is not funded. 


These expectations need to be identified and managed early in the project. Also some of these dependencies may not be known to a business analyst and need to be handled in design.


Data Sources

All data that goes into an ETL package:

  • Location of source of data:  databases, folder and file location, URL, Web Services task.
  • Security needed to gain access to this location. In large companies this is often handled by a separate group.
  • Are they change files that only represent changes to the previous load, or a 'full-load' file containing all available data?
  • Backup, such as ‘After the file has been processed move it to the x folder’
  • Backup file retention rules: Various legal requirements that the file be backed up for x days.
  • Name: Does the name vary based on client, customer, date created, etc. 
  • Any Personally Identifiable Information (PII)concerns that would prevent data from being materialized in a staging table, only used in a process and exported. 
  • How big is the data source?  In many Internet of Things (IoT)processes these can be gigs or bigger, so you might need to come up with a strategy for importing less than all rows in a timely manner for downstream processes to reflect useful data. 


Data Destinations (Target)

All data that goes out of an ETL package:

  • Location of destination databases: Server, Database, any access information.
  • Security needed to gain access to this location.
  • Are there any calculated values based on source data that need to be created? May not be in requirements but discovered in design.
  • Auditing columns such as created by, last updated date: Use the fields provided in the file vs. auto generated in the target database?


Source to Target Mapping

The heart of the ETL requirements document. This spells out the schema of the source(s) and destination(s) in the data feed: sections such as header and footer, how EVERY column maps between source and target, data types, calculated columns, hard-coded columns, business rule validations / acceptable values (greater than zero, date no earlier/later than, NULL values).

 

source-to-target2.jpg Source to Target Mapping grid


Actions for rows

  • If a row (say ID=12345, Name='Miriam Bizup) is in source file but not in the destination - INSERT the row.
  • Both source and target, but some values are different - UPDATE based on the defined primary key.
  • Both source and target, all values are the same - Usually no action, but sometimes UPDATE.
  • Row is in target, but not in source - Sometimes a DELETE, sometimes an UPDATE and set an 'IsActive' column to No and a date column  such as 'InactiveDate' with the current datetime.

There are some business analysts that cannot provide a source to target mapping, especially if they don’t have access to the data source, which means the developer has to figure this out themselves. If this is your situation then make sure that you’re communicating that you’re doing requirements gathering as well as development, and that some investigation will need to take place before you can give a reliable estimate. You also may have to state various assumptions in your requirements document on details that were not provided.


Business Rule Validation

One client I had wanted to make sure that if patient data came in unless there was a signed contract with the health plan, which meant there was a related row in a HEALTH_PLAN table. This is not an error, but a business rule validation that is handed to catch this data (and possibly stage it somewhere for later use), and then message various business units that this happened so that they can negotiate with that health plan.  


Another client was an airline and wanted to know if there was ever a flight that after eight hours of the flight leaving there was no data on it landing. Again not an error, but an event of interest to the business. 


If data fails a business rule validation, what action does the ETL take? Accept, accept with default values, not accept? Also be careful in that management and users will tend to call things errors that are not, so you'll need to be very specific as to if something is a true error, data the ETL feed never expected, or something the SSIS package was not designed to handle. 


Timing / Scheduling

  • When will the source file(s) be available? If multiple files is there a defined order of processing? 
  • Are these files full-load (meaning an entire set of data) or incremental (meaning only changes since the last time the file was generated)?
  • Are there any requirements for the timing of this project, such as ‘This data must be in location x by datetime y so that process z can occur with this new data’
  • How should the package behave if source files are expected at set times and they don't arrive?  Duplicate files? 


Notifications

After the feed runs, who should receive a message, and what is in the message, if…

  • The source files were missing.
  • The ETL job failed and returned an error? 
  • The ETL job ran successfully but threw an error?
  • The ETL job ran successfully but failed a business rule validation? For example, Customer sales must be for an existing customer, sales amounts cannot be a negative number, sales dates have to be for that day, etc. 
  • The ETL job ran successfully but failed a data quality validation? For example, Invalid state code such as CAN, Invalid zip codes, Invalid gender.
  • The ETL job ran successfully without errors? 


The end.  Thank you for reading my article, and please click the 'Good Article' button if it helped you. 



The material in this article was presented at SQL Saturday #682, St. Paul Minnesota, 2017.


8
11,335 Views
Jim HornSQL Server Data Dude
CERTIFIED EXPERT
Microsoft SQL Server database developer, architect, and author specializing in Business Intelligence, ETL, and Data Warehousing.

Comments (5)

aikimarkGet vaccinated; Social distance; Wear a mask
CERTIFIED EXPERT
Top Expert 2014

Commented:
nicely done, Jim

Commented:
Well written, sir!! An excellent guide.
Jim HornSQL Server Data Dude
CERTIFIED EXPERT
Most Valuable Expert 2013
Author of the Year 2015

Author

Commented:
Thanks.  And I see that we've connected on LinkedIn.

Commented:
Yes. I'm thinking that I'll need well-educated friends :)

Mike
Hi Jim,

Good Evening.


I am searching for the sample document given to the ETL developer by Business Analyst and I have found that your document suite for me. Could you please provide the sample document for the ETL developer and kindly let me know what are the inputs do we need to know to prepare the excel sheet document for specifications.


Thanks,
Srishitha.

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.