Debate on best methods for implementing ETL, Datawarehousing and n-tier
Posted on 2013-02-05
My firm is in the middle of some debate.
We do millions of ETL rows from 1000's of flat files daily. Some of our databases are quite large 100-500GB.
Most are around 100GB or less.
We update and validate portfolio data for ~30M securities portfolios stored in approx 50 Databases on 5+ servers.
We also have other servers tasked with ftp and ETL. And our application web-servers.
All backended by sql server 2008R2.
A separate db is implemented for each client but many clients are on a single db server.
No consolidation of shared data is implemented so there is some replication between db's.
No typical data-warehouse or BI technologies are in place.
There are a couple of layers.
1-2. flat files are typically ftp downloaded and read using vb into sql server and processed into a 1st stage DB
then text files are exported and then imported into a
2nd stage database for validation logic.
Stages 1 and 2 should soon be merged.
3. Data is then posted from the 1-2 stages to the client database on a server.
Currently stored procedures do a lot of the work but increasingly the transforms/validation/final posting
operations are being moved to vb objects utilizing .net cache and lots of RAM. Data is read from DB manipulated
and written back so the DB is just concerned with select/update/insert/delete simple sql.
4. data maintenance/utility scripts for cleansing are mainly sp's and sql scripts that are scheduled
out side of sql server via a home grown vb app.
5. the application/business layer/business (data) layer logic is mixed
but there is movement to n-tier architecture and moving Business Logic out of the database. A lot of complex calculation and query intensive activity is taking place here.
I see no problem moving the application to n-tier but what about the etl and transform logic.
What is the current thinking on this?
One side says to transfer the etl transform and validation logic to this VB middle tier and add
lots of ram to support the .net caching and .net sqldata classes supporting the VB objects.
Also moving the business logic the this middle tier.
Another side argues to let the application move to three (or n) tiers BUT push etl to the BI stack
and add a data-warehouse implementation so cubing (to ease calculation in the business layer) and validating could go on during warehouse loading.
In essence use the BI (IBM, Oracle or MS) stack for loading and a 3 tier system for the application based on
current warehousing technology.
Also note that very little insert/update activity is generated by the app.