One of the greatest challenges for a maturing engineering team is to continue building without breaking what has already been built. Software engineers need to verify that a new piece of code will not disrupt the existing codebase. And data engineers need to verify that additions or changes to one part of a data product do not disturb the rest of it.
Our client, a large consulting company, was having trouble maintaining a large database that collected billing and purchase data from several different sources. The database ran on SQL Server, and the ETL process was performed by a collection of SQL queries. The engineers who designed the database had since left the company, leaving the remaining team with only a partial understanding of how the ETL process worked.
One of the biggest problems faced by the engineering team was the unexplained dependencies between the tables in the database. Developers would try to modify one table, only to find that their change broke other tables which were created downstream. The ETL process was bulky and needed to be revamped, but the developers could not safely modify the ETL scripts without affecting the tables in the database.
Kvaltis used checksums and unit testing to detect changes in the datasets. Checksums were calculated before and after each code modification; any change in the checksum indicated a change in the dataset. It turns out there is a rather simple way to calculate checksums in SQL: convert each row in the table into an integer, then combine those integers into a single aggregate value that represents the state of the table.
In T-SQL, table checksums can be calculated with the following query. Note that collisions are possible: in some instances a value changes in the table but the checksum remains the same. In practice, the collision rate should be quite low (under 1%).
SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM database.dbo.table
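To make the mechanics concrete, here is a hypothetical Python analogue of that query, not the tooling Kvaltis actually deployed: each row is hashed to a 32-bit integer, and the row hashes are combined with XOR, which, like CHECKSUM_AGG, is order-independent. The table data and column values below are invented for illustration.

```python
import hashlib

def row_checksum(row):
    """Hash one row (a tuple of column values) to a 32-bit integer."""
    payload = "|".join(repr(v) for v in row).encode("utf-8")
    return int.from_bytes(hashlib.md5(payload).digest()[:4], "big")

def table_checksum(rows):
    """Combine row checksums with XOR: order-independent, like CHECKSUM_AGG."""
    agg = 0
    for row in rows:
        agg ^= row_checksum(row)
    return agg

# Invented sample data: the "after" snapshot changes a single value.
before = [(1, "acme", 99.50), (2, "globex", 12.00)]
after  = [(1, "acme", 99.50), (2, "globex", 12.99)]

assert table_checksum(before) == table_checksum(list(reversed(before)))  # row order is irrelevant
assert table_checksum(before) != table_checksum(after)                   # the edit is detected
```

Because XOR is commutative, the checksum is insensitive to row order, which is exactly what you want when comparing a table before and after an ETL run; the trade-off is the small collision risk noted above.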
Our team also wrote unit tests for each part of the ETL pipeline. We wrote SQL queries which returned expected values - for example, some queries were expected to return an empty set, and other queries were expected to return a non-empty set. We then created a Python framework to interpret and execute these SQL queries. The unit tests were run each time the ETL pipeline was executed, to verify a proper build of the data model.
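A minimal sketch of such a framework is shown below. This is not the client's actual code: the table name, query text, and test names are invented, and an in-memory SQLite database stands in for SQL Server. Each test pairs a SQL query with an expectation of "empty" (an integrity check that should find nothing) or "non_empty" (data that should have been loaded).

```python
import sqlite3

# Each test: (name, SQL query, expectation). "empty" means the query should
# return no rows; "non_empty" means it should return at least one row.
TESTS = [
    ("billing table was loaded", "SELECT 1 FROM billing LIMIT 1", "non_empty"),
    ("no negative amounts", "SELECT 1 FROM billing WHERE amount < 0", "empty"),
]

def run_tests(conn, tests):
    """Execute each check against the database and report (name, passed) pairs."""
    results = []
    for name, sql, expectation in tests:
        rows = conn.execute(sql).fetchall()
        passed = (len(rows) == 0) if expectation == "empty" else (len(rows) > 0)
        results.append((name, passed))
    return results

# Demo with an in-memory database standing in for the real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE billing (id INTEGER, amount REAL)")
conn.execute("INSERT INTO billing VALUES (1, 99.50), (2, 12.00)")
for name, passed in run_tests(conn, TESTS):
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

Keeping the checks as plain SQL paired with a simple expectation means data engineers can add a new test without touching the framework itself, and the whole suite can run automatically after every ETL execution.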
The unit tests and checksums empowered the data engineering team to modify the ETL pipeline with confidence. Using the unit tests, the developers were able to identify segments of the dataset which were missing or otherwise incorrect. We reduced the execution time of the ETL pipeline from 4 hours to 15 minutes, saving the client upwards of 1,000 hours each year.