Debate on best methods for implementing ETL, data warehousing, and n-tier

Posted on 2013-02-05
Last Modified: 2014-05-12
My firm is in the middle of some debate.

We ETL millions of rows from thousands of flat files daily. Some of our databases are quite large (100-500 GB); most are around 100 GB or less.

We update and validate portfolio data for ~30M securities portfolios stored in approximately 50 databases on 5+ servers.

We also have other servers tasked with FTP and ETL, plus our application web servers, all backed by SQL Server 2008 R2.

A separate DB is implemented for each client, but many clients share a single DB server.

No consolidation of shared data is implemented, so there is some replication between DBs.

No typical data-warehouse or BI technologies are in place.

There are a couple of layers.
1-2. Flat files are typically downloaded via FTP and read using VB into SQL Server, where they are processed into a first-stage DB; text files are then exported and imported into a second-stage database for validation logic. Stages 1 and 2 should soon be merged.
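For comparison, a stage-1 load of a delimited flat file can often bypass row-by-row VB reads entirely by using T-SQL's BULK INSERT. This is only a sketch: the staging table, file path, and delimiter below are hypothetical and would need to match the actual feed.

```sql
-- Sketch of a stage-1 flat-file load; table name, path, and delimiter are hypothetical.
BULK INSERT dbo.RawPositions
FROM '\\ftphost\inbound\positions_20130205.txt'
WITH (
    FIELDTERMINATOR = '|',    -- adjust to the feed's actual delimiter
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,      -- skip a header row, if the feed has one
    TABLOCK                   -- allows minimally logged loads into a staging table
);
```

With TABLOCK and a simple-recovery staging database, large loads can be minimally logged, which matters at millions of rows per day.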

3. Data is then posted from stages 1-2 to the client database on a server.
Currently stored procedures do a lot of the work, but increasingly the transform/validation/final-posting operations are being moved into VB objects that use the .NET cache and lots of RAM. Data is read from the DB, manipulated, and written back, so the DB handles only simple SELECT/UPDATE/INSERT/DELETE SQL.

4. Data maintenance/utility scripts for cleansing are mainly stored procedures and SQL scripts scheduled outside of SQL Server via a homegrown VB app.

5. The application-layer and business (data) layer logic is mixed, but there is movement toward an n-tier architecture and toward moving business logic out of the database. A lot of complex calculation and query-intensive activity takes place here.

I see no problem moving the application to n-tier, but what about the ETL and transform logic?
What is the current thinking on this?

One side says to move the ETL transform and validation logic to this VB middle tier, add lots of RAM to support the .NET caching and the .NET SqlData classes backing the VB objects, and move the business logic to this middle tier as well.

The other side argues to let the application move to three (or n) tiers, BUT to push ETL onto the BI stack and add a data-warehouse implementation, so that cubing (to ease calculation in the business layer) and validation could take place during warehouse loading.

In essence: use a BI stack (IBM, Oracle, or MS) for loading, and a three-tier system for the application built on current warehousing technology.

Also note that very little insert/update activity is generated by the app.

Your thoughts?
Question by:dthoms000
LVL 39

Accepted Solution

Aaron Tomosky earned 168 total points
ID: 38859836
If you are dealing with flat files and FTP, getting that data into a DB will always be step one. From there, how it's organized really depends on your data and what you use it for, but you will probably have transform and load section(s). These could have different hardware requirements than your large, accessible DB.

Creating small local SSD storage or an NFS/iSCSI SAN can be done on the cheap; look at Intel S3700 drives as a starting place. Go with RAID 10 (or mirrored pairs if you use ZFS).

I don't have much personal experience with cubes and BI, but I constantly ETL disparate data into common aggregated databases. My problem is that the data is always a little different for each client, so steps 1-2 are transforming and loading it into a common format that the rest of the system can aggregate and work with.

Sorry for the "it depends" answer but it really does depend on your data and use of that data.
LVL 16

Assisted Solution

AlexPace earned 166 total points
ID: 38860088
I think the development environment, error handling, and logic branching are easier in VB.

However, the "generic programmer logic" method of munging large data sets can be MUCH less efficient (read: slower) than native set-based SQL Server methods. Optimization will also be easier on SQL Server. It is also easier to alter a stored procedure than a VB program: you don't need to recompile and redeploy it.
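To illustrate the set-based point with a hypothetical example (the table and column names here are invented for the sketch): a validation pass that in VB would loop over a DataReader row by row can often be a single UPDATE joining the stage table to reference data.

```sql
-- Hypothetical validation: flag stage rows whose security has no price for the as-of date.
UPDATE s
SET    s.ValidationStatus = 'MISSING_PRICE'
FROM   dbo.StagePositions AS s
LEFT JOIN dbo.SecurityPrices AS p
       ON  p.SecurityId = s.SecurityId
       AND p.PriceDate  = s.AsOfDate
WHERE  p.SecurityId IS NULL;
```

One set-based statement lets the optimizer pick a join strategy; a per-row loop in client code forces millions of round trips over the same data.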

If you are trying to get close to real time, I've had good luck with a Robo-FTP service that constantly polls a remote FTP server and downloads a source file as soon as it becomes available. When the download completes, it executes a SQL statement that kicks off the bulk import job. If the transfer fails, it retries automatically and resumes where it left off, provided the remote server supports that.
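The SQL statement such a post-download hook runs could be a stored procedure wrapping BULK INSERT. This is a sketch with invented names; dynamic SQL is used because BULK INSERT requires a literal file name rather than a variable, so the caller passing @FilePath must be trusted.

```sql
-- Hypothetical import procedure a download hook could call; all names are illustrative.
CREATE PROCEDURE dbo.usp_ImportPositionsFile
    @FilePath nvarchar(260)
AS
BEGIN
    DECLARE @sql nvarchar(max) =
        N'BULK INSERT dbo.RawPositions
          FROM ''' + REPLACE(@FilePath, N'''', N'''''') + N'''
          WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'', TABLOCK);';
    EXEC sys.sp_executesql @sql;   -- dynamic SQL: BULK INSERT cannot take a variable path
END;
```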
LVL 69

Assisted Solution

by:Scott Pletcher
Scott Pletcher earned 166 total points
ID: 38860179
SSIS has many built-in functions/tools to help with ETL.  To me it would make sense to take advantage of those tools.
