A typical ETL pipeline involves extracting data from an external source, transforming that data to a standardized data model, and then loading the data into a centralized database. Most steps in the ETL pipeline are simple transformations applied to a numeric field (for example, standardizing the unit of a measurement) or to a string field (for example, standardizing the way an address is formatted).
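Transform steps of this kind might look like the following sketch. The field names, units, and conversion factors are illustrative assumptions, not the actual pipeline's schema:

```python
# Hypothetical transform steps: standardize a measurement's unit and
# normalize an address string. Names and factors are illustrative only.

def standardize_weight_kg(value: float, unit: str) -> float:
    """Convert a weight measurement to kilograms."""
    factors = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
    return value * factors[unit]

def standardize_address(raw: str) -> str:
    """Normalize whitespace and casing in a free-text address."""
    return " ".join(raw.upper().split())

print(standardize_weight_kg(10, "lb"))               # weight in kg
print(standardize_address("  12 main st,\nspringfield "))
```

Each step takes one raw field and returns a standardized value, so steps can be composed and tested independently.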
Our client wanted to revamp the ETL pipeline for one of their databases. As we investigated the pipeline, we found circular dependencies. Some tables referenced themselves after they were created, which caused confusion and is poor practice in general. A well-designed ETL pipeline should be a sequence of batch processing jobs in which each table is created independently of any downstream tables.
We used diagrams to represent the ETL pipeline and to identify circular dependencies. In our redesign, we removed the circular dependencies and refactored the pipeline into a directed acyclic graph (DAG). A key property of any DAG is that it flows in one direction: no path loops back onto itself.
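The same check can be done programmatically. The sketch below, a standard application of Kahn's topological-sort algorithm (not the client's actual tooling), takes a map of each table to its upstream dependencies, produces a valid batch-processing order, and raises an error if any circular dependency remains. The table names are hypothetical:

```python
from collections import defaultdict

def batch_order(deps):
    """Return a valid build order for tables given {table: [upstream tables]}.

    Raises ValueError if the dependency graph contains a cycle,
    i.e. it is not a DAG.
    """
    indegree = {t: 0 for t in deps}          # number of unbuilt upstreams
    dependents = defaultdict(list)           # upstream -> downstream tables
    for table, upstreams in deps.items():
        for up in upstreams:
            indegree.setdefault(up, 0)
            indegree[table] += 1
            dependents[up].append(table)

    ready = [t for t, d in indegree.items() if d == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for down in dependents[t]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)

    if len(order) != len(indegree):
        raise ValueError("circular dependency detected")
    return order

# Hypothetical dependency map: each table lists the tables it reads from.
deps = {"customers": [], "orders": ["customers"], "report": ["orders", "customers"]}
print(batch_order(deps))   # every table appears after its upstreams
```

Because each table appears in the order only after all of its upstream tables, the result is exactly the sequence of independent batch jobs described above.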
Using the diagrams, our team refactored the ETL pipeline, reducing execution time while increasing transparency. We cut approximately 10,000 lines of code down to fewer than 1,000. The new pipeline was easier to maintain and produced a more accurate dataset. Our client saved upwards of 1,000 hours each year with the new, improved ETL pipeline.