A Case for ETL Tools in Database Development

This article started out as an answer to a question asked here on EE on why someone would want to use ETL ("Extract, Transform, Load") tools rather than writing application code or using tools native to the database platforms. My response and this article are written from the perspective of a database guy.

I've seen a few ETL tools like the open source Talend; but the only tool I've actually used in a production environment is Informatica so that tool heavily influences my perception. I never became an expert with Informatica but I got to work with some people that were; and, having survived, I'm glad for the experience.

First a few things that I turned out to LOVE.

Self documenting Data Flow. Trying to answer the question "Where did you get this piece of information?" can be difficult (and that's being kind) with scripts, programs, stored procedures and so on. However, tracing through a workflow is easy. Of course, if the processing is complex with lots of data sources and many transformations it may still take awhile to decipher it all; but the flow itself is still essentially just connecting the dots.
Semi-agnostic processing. I absolutely detest platform agnostic development. BUT... tools like Informatica have a nice compromise to them. If you change platforms of your data sources, you must rewrite your source qualifiers, but... after that you're essentially done. So, your inputs can use SQL syntax, stored procedures, specialized web services, or anything else. Whatever is the best mechanism for your source you can use it. Once you have your data, the actual "T" transform process is the same all the way through the workflow until you get to your output targets. Here again, you'll create them to be specialized for the target platform; but your workflow remains logically identical. Of course, nice modular design in any programming language can duplicate this; but the point is, it's built in. You kind of have to go out of your way to not be nicely modular with an ETL tool.
Diagnostics and logging. Most applications I've seen aren't instrumented as well as they could be. The workflow monitor and logs are awesome tools that give a ton of information without the developer needing to do much. Of course, if your mapping calls a procedure or script and produces additional logs, you can include those for even more contextual information.

You might consider these the soft benefits but to me they were some of my favorites.

On a more technical side, I was impressed by these features:

High speed features for some platforms are "options", so you don't have to do a lot of special coding in order to get significant benefits if you know when to use them. For example, if you know you can take advantage of sql*loader style direct path inserts, you can simply turn on "Bulk" option for inserts and you've got them. If it turns out you need to observe constraints and triggers then switch to "Normal." It's literally as easy as toggling an option.
Flexible mappings. Informatica allows you to build transformations within its own framework with common aggregations like SUM, COUNT, MIN/MAX, etc. It also has more sophisticated features if needed. Even better though, when the tool comes up short, you can invoke external code, calling web services, stored procedures and other tools. I worked with some that would read from DB2 on a mainframe, flow data through a few internal transformations, call an Oracle stored procedure to process more data, invoke a Java program to alter others before flowing into other internal transformations before finally invoking a web service to export the data to an external system.
Simple parallel scaling. If it's reasonable to split your work into pieces and run multiple threads, scaling the degree of parallelization up or down is quite simple with an integration tool with little to no special coding.

Having said all that, I'm still not sold on Informatica as a universal tool for all problems. More often than not, I recommend NOT using it.   That philosophy is largely due to the number of similar platforms our applications use. That is, Oracle-to-Oracle, SQL Server-to-SQL Server, DB2-to-DB2, etc. These databases all have data migration mechanisms within them. These are efficient, often supporting full transaction integrity and, simply by their intrinsic nature, have fewer moving parts. Therefore they be more robust and easier to maintain. Something as simple as linked tables in SQL Server or queries across database links in Oracle can solve many (most?) business needs without the need for additional tools.

Furthermore, if your developers and support staff are already skilled in their use of a particular platform, then writing transformations in a different tool can be more costly in construction and maintenance.   You might have an expert in Informatica; but an expert in DB2's SQLPL can probably construct a stored procedure that is just as (if not more) efficient than an ETL mapping without incurring the cost of network traffic to move the data into and out of the ETL tool.

Also, a lot of ETL developers simply want to process files. That's fine if that's all you have to work with; but far too often I find developers resorting to file processing when there are better options. A simple example - rather than defining a data source by querying a database directly, a mapping invokes a script that unloads data from the database to a file, and then the file is imported into the ETL tool as the data source. This may seem like an obvious thing to avoid; but it happens. A common reason for this is to ensure an agnostic process with nothing in the ETL system invoking the source or target directly. Mentioning this is somewhat unfair because it's not the fault of the integration tool, it's the fault of the human that chose to use it incorrectly; but I mention it because it's a real problem with significant impacts on performance and resource consumption. Adding the extra steps of creating and moving a file rather than using native tools is more work.   It's more code, so more places for bugs to creep in. More steps means more places to break from system errors.   Writing a file consumes cpu and time. Whatever query is used to populate the file consumes resources on the source system that could have been used to directly feed the ETL mapping.

My argument against files with ETL is essentially my same argument against using an ETL tool at all: adding an external step to a system doesn't make the system more efficient.

But... this article is about why ETL tools can help and there are times when processing a file is appropriate.

As alluded to above, you may not have a native access method. It might not be appropriate (for security, auditing, financial or other reasons) to access the underlying database directly; you must go through an application's export process which creates a file by approved steps with any necessary logging and notifications. Or, if your source system is outside your company you may recieve and send files as your sole transfer mechanism to the external entity. So, obviously, if what you have are files, then you'll use files.
You may have files as non-transactional data repositories. A message log for example. While it is possible to ship and process individual messages on the fly, it can be more efficient to write them to a file as the source system generates them and then read and process them in bulk at a later time. This actually reduces the number of steps involved and is entirely appropriate and encouraged.
If you have sporadic communication windows a file may be a cheaper alternative for transfer than trying to maintain communication to a database or web service. Databases in particular can be chatty in their communication protocols, so it can be more efficient to use a file transfer protocol. Also, if the extraction step is expensive on the source system, then writing it to a file as a one-time event can be a resource saver if you must restart the transfer for any reason; by simply pulling the file again rather than executing a series of expensive SQL, procedures, programs or scripts to regenerate the same data.

I don't know if I've convinced you, but my time with ETL tools has convinced me they are a viable tool to have in my toolbox. I use my screwdrivers more often than I do my block plane; and even though it's obvious one tool is more valuable more often that doesn't mean my block plane is pointless or broken. It just means it's waiting for the right job.

Comments (1)

Jim Horn

SQL Server Data Dude

CERTIFIED EXPERT

Most Valuable Expert 2013

Author of the Year 2015

Commented: 2015-11-04

Excellent article, voted Yes.

A couple of thoughts based on SSIS experience:

Aggregates and sorts are referred to as blocking transformations, where the entire set of data needs to be loaded before a sum/count/sort can be performed, which would decrease a lot of the benefits of running streams of data.

File manipulation is not difficult in SSIS, especially with the third party tool Pragmatic Works' Task Factory, however most large shops I've worked in have a separate FTP/Feed Transport guy/team where that person was in charge of moving/encrypting files.