We help IT Professionals succeed at work.

ETL issues

onlinerack used Ask the Experts™
I have taken over an ETL project that is having issues with delivery, specifically the data migration part.

I was wondering if someone experienced with ETL projects can provide an insight on:
- Overall framework on how to deliver ETL project.
- What measure to have in place to provide Quality Assurance
- Best approach to data migration strategy.

Your help is much appreciated.

Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Fractional CTO
Distinguished Expert 2018
ETL covers a massive terrain.

The more clarity/description you provide about how your code works + exact problems you're having will determine quality of answers you receive.

- Overall framework on how to deliver ETL project.

LAMP is sufficient. I'd stick with MariaDB or MySQL 8 for your database backend.

- What measure to have in place to provide Quality Assurance.

This is very simple + hardly anyone does this correctly.

Start your development with your test suite. Then run your test suite before + after coding changes to verify your code quality.

Take a look at any PERL module. You'll find a make test step. Some modules have 1000s of test steps.

This type of testing ensures code quality only increases over time.

- Best approach to data migration strategy.

No clue what this might mean. Maybe you're talking about the Extract + Load part of ELB.

To receive good answers for this question, you'll have to define what good might mean.

Good might mean speed of Extract or Transform or Load or all three.

This really relates to how data will be used. In other words, where you run your Transform step, which can either be associated with the Extract (backup) step or Load (restore) step.

Let's take a simple ELB activity I have scripts run every night for 1000s of WordPress Websites.

1) Extract - mysqldump extracts data into the root directory of a site.

2) Transform - the root directory is backed up using the tar --transform option, which transforms file names (rather than data), then zstd is used to compress data at near maximum settings to optimize file size.

3) Load - this step is very fast because zstd requires massive time for maximum compression + decompression tends to be faster than zip or xz.

For my ELB setup, I first copy the files in step #1 (using rsync) to a server who's only mission in life is to create backups, so doing an rsync means my extract stage is fairly fast, because rsync only clones new files + for existing files, runs a diff to only clone portions of files which have changed.

I mention the rsync part of my sequence because it's essential to create tarballs on non-production servers, to reduce CPU + Disk I/O load on production servers + also, more importantly, some tarballs are 100s of Gig (for the full, rather than incremental backups). If I were to make these tarballs on a production server, then copy them over the network to multiple backup storage locations, I'd saturate the network for production machines + interfere with high traffic sites.

So the mention of the rsync step suggests there are many other factors which live outside the ELB model, which require consideration during your App design phase.


Thank you David, your response has pointed me to the right direction and helped me set the approach needed.

Thanks again.