ETL Verification: Do Your ETL Processes Do What You Think They Do?
What is now generally referred to as “data integration” is a set of disciplines that evolved from the methods used to populate the data systems powering business intelligence: extracting data from one or more operational systems, transferring it to a staging area for cleansing, consolidation, transformation, and reorganization, and then loading it into the target data warehouse. This process is usually referred to as ETL: extraction, transformation, and loading.
In the early days of data warehousing, the ETL scripts were, as one might politely say, “hand-crafted.” More colloquially, each script was custom-coded in relation to the originating source, the transformation tasks to be applied, and then the consolidation, integration, and loading. And despite the evolution of rule-driven and metadata-driven ETL tools that automate the development of ETL scripts, much time has been spent writing (and rewriting) data integration scripts to extract data from different sources, apply transformations, and then load the results into a target data warehouse or an analytical appliance.
As long as the data sources remain static, the existing ETL scripts are sufficient for loading data into a target system, and as long as those scripts have been adequately tested, they are probably trustworthy. But it would be rare to work in an environment without ongoing, if not continuous, change, and as the origins, volumes, and variety of source data sets grow, so does the complexity of the transformations. This can lead to an unpleasant situation: long-trusted ETL processes incrementally begin to generate unexpected results.
It is worth considering some of the reasons that long-time production ETL processes generate unexpected results. In our consulting practice, we have seen numerous environmental changes that have ramifications for existing ETL processes, such as:
- Changes in source data structures – An update to a data set changes the data type of one or more data elements, adds or removes data elements from the table, or realigns the relational structure, but the ETL scripts are not modified to accommodate the change in the source. In one example, a data element’s data type was changed to accommodate a data exchange requirement, which inadvertently caused one of the data transformations to execute incorrectly.
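One defense against this first scenario is for the ETL job to record the source schema it was written against and refuse to run when the live source no longer matches. A minimal sketch follows; the table columns and types are hypothetical, invented for illustration:

```python
# Hypothetical schema-drift guard: compare the columns/types the ETL script
# assumes against what the source actually delivers, and fail fast instead
# of silently mis-transforming.

EXPECTED_SCHEMA = {          # the schema the ETL script was coded against
    "customer_id": "INTEGER",
    "order_total": "DECIMAL(10,2)",
    "order_date":  "DATE",
}

def schema_drift(actual_schema):
    """Return a list of human-readable drift findings (empty list = no drift)."""
    findings = []
    for col, expected_type in EXPECTED_SCHEMA.items():
        if col not in actual_schema:
            findings.append(f"column {col} missing from source")
        elif actual_schema[col] != expected_type:
            findings.append(
                f"column {col}: expected {expected_type}, found {actual_schema[col]}"
            )
    for col in actual_schema:
        if col not in EXPECTED_SCHEMA:
            findings.append(f"unexpected new column {col} in source")
    return findings
```

If, as in the example above, the source team widens `order_total` to a character type for a data exchange, the guard reports the mismatch before any transformation runs against the changed data.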
- Changes in source data semantics – Adjustments to the meaning of source data elements have implications for the interpretation of the values and, consequently, for the transformations applied, but these changes are not communicated to the ETL development team. Such changes often lie hidden for some time until an incorrect value leads to questions about data warehouse results.
- Changes in reference data sets – Addition or removal of reference data values may have implications for existing transformations or may require the creation of new code for proper transformation. We have seen numerous if-then-else and case statements in code that are keyed off hard-coded reference data values. Unless the programmer is notified, there is no way to determine that an update to the code is required.
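The hard-coded reference data problem can be made visible rather than silent. The sketch below uses invented status codes; instead of an if/else chain that quietly falls through when a new code appears, the transformation looks codes up in an explicit mapping and records anything unknown for steward review:

```python
# Defensive handling of reference data baked into a transformation; the
# status codes and mapping are hypothetical illustrations.

STATUS_MAP = {               # reference values the code was written against
    "A": "Active",
    "S": "Suspended",
    "T": "Terminated",
}

def transform_status(code, unknown_log):
    """Map a source status code; record unmapped codes rather than guessing."""
    if code in STATUS_MAP:
        return STATUS_MAP[code]
    unknown_log.append(code)     # surfaces new reference values to the team
    return "UNMAPPED"

unknown = []
# "P" stands in for a newly added code the developer was never told about
results = [transform_status(c, unknown) for c in ["A", "T", "P"]]
```

A non-empty `unknown` log is exactly the notification the bullet above says the programmer otherwise never receives.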
- Introduction of unsanctioned or ungoverned steps in the production flow – Often, in order to expedite the delivery of a new report or analysis, developers bypass the standard system development lifecycle stages and introduce data dependencies that are not recognized as part of a production process. Because of this, failures in the materialization of the dependent data sets are not flagged.
This last bullet item is the most prevalent, and perhaps the most insidious. For example, in one organization, we saw hand-extracted desktop spreadsheets loaded into a collaboration and sharing tool, from which the file was then downloaded, transformed, and then loaded into a data warehouse. In another example, a sequence of process steps involved dropping transformed files in target directories that were to be “swept” by another process and loaded into a target data mart. However, if one of the early steps failed, the previous day’s files were not removed from the target directory, and the sweep process picked up those same files and loaded them for a second time into the data mart. In both of these cases, failures in some part of an ungoverned process could ultimately impact the transformation and loading of data into the data warehouse.
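The double-load failure in the sweep example is an idempotency problem, and one hedged way to address it is a manifest of files already loaded. In this sketch (file names and contents are invented), the sweeper keys each file by name and content hash, so a stale file left behind by a failed upstream step is never loaded twice:

```python
# Hypothetical idempotent "sweep" step: a manifest of (name, hash) pairs
# ensures a leftover file from a failed earlier run is skipped, not reloaded.

import hashlib

def sweep(directory_files, manifest):
    """directory_files: {filename: bytes}; manifest: set of (name, hash) pairs
    accumulated across runs. Returns only the files not yet loaded."""
    to_load = {}
    for name, content in directory_files.items():
        key = (name, hashlib.sha256(content).hexdigest())
        if key in manifest:
            continue          # already loaded once; do not double-load
        manifest.add(key)
        to_load[name] = content
    return to_load
```

On the first run the file is loaded and recorded; if an upstream failure leaves the same file in the directory the next day, the second sweep returns nothing new.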
The challenge, though, is that after the ETL scripts are tested and put into production, they are typically trusted to do what they were originally designed to do. So instead of continuous rigorous testing, verification of the ETL results is limited to sampling, parity checks (such as reasonableness with respect to aggregate sums or averages), or even manual review and scanning of target values.
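A parity check of the kind mentioned above can be sketched in a few lines: compare an aggregate computed over the source extract with the same aggregate over the loaded target, within a tolerance. The 0.5% tolerance is an illustrative assumption, and, as the surrounding discussion notes, passing such a check is a reasonableness signal, not full validation:

```python
# Simple aggregate parity/reasonableness check; the relative tolerance is an
# assumed illustrative value, not a recommendation.

def parity_check(source_values, target_values, rel_tolerance=0.005):
    """Return (passed, relative_difference) for the sum aggregate."""
    src_sum, tgt_sum = sum(source_values), sum(target_values)
    if src_sum == 0:
        return (tgt_sum == 0, 0.0)
    rel_diff = abs(src_sum - tgt_sum) / abs(src_sum)
    return (rel_diff <= rel_tolerance, rel_diff)
```

A dropped batch of records shows up as a large relative difference, while compensating errors (one value too high, another too low) can slip through, which is precisely why sampling and parity checks alone are not enough.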
Yet the types of scenarios listed earlier occur with some frequency. There is therefore a need for continuous monitoring of existing ETL activities to ensure that ETL fidelity does not degrade over time. But as in my previous discussion of production data reconciliation, parity checks, spot checks, and manual review are going to be insufficient and patently non-scalable, especially as data volumes grow and the variety of originating sources expands.
This suggests the need for a more comprehensive approach to ETL validation and verification that is:
- Comprehensive in its ability to provide validation of the entire spectrum of data sets conveyed from the source to the target;
- Scalable, so that all data values can be validated in their entirety;
- Automatable so that the validation is not dependent on allocation of staff resources for manual review; and
- Reflective of business rules, especially when there are transformations directly embedded in the ETL processes.
But most importantly, this automated solution must be integrated within a stewardship/governance framework that alerts data practitioners and ETL developers when a violation of expectations is identified. This is especially effective when the data rules used for validation can be applied at different stages of the ETL process flow, since that enables the data stewards to determine the point at which a flawed value was introduced, allowing them to isolate the root cause of the unexpected results. The result is the ability to eliminate the source of the problem, instead of relying on downstream data corrections that will ultimately lead to inconsistencies with the original sources.
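Applying the same rule at successive stages can be sketched minimally as follows. The stage names and the rule itself (order totals must be non-negative) are hypothetical; the point is that the first stage at which the rule fails localizes where the flaw was introduced:

```python
# Hedged sketch: run one data rule against snapshots from each ETL stage and
# report the first stage whose records violate it.

def first_failing_stage(stages, rule):
    """stages: ordered list of (stage_name, records); returns (name, violations)
    for the first failing stage, or None if every stage passes."""
    for name, records in stages:
        violations = [r for r in records if not rule(r)]
        if violations:
            return name, violations
    return None

rule = lambda rec: rec["order_total"] >= 0
stages = [
    ("extract",   [{"order_total": 120.0}, {"order_total": 30.0}]),
    ("transform", [{"order_total": 120.0}, {"order_total": -30.0}]),  # sign flipped here
    ("load",      [{"order_total": 120.0}, {"order_total": -30.0}]),
]
result = first_failing_stage(stages, rule)
```

Because the extract passes and the transform does not, a steward can attribute the flawed value to the transformation logic rather than the source, which is the root-cause isolation described above.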
In essence, an approach that automates ETL verification through validation of defined business rules sustains trust when ETL processes generate the appropriate results, yet provides immediate alerts and feedback when changes in the environment allow errors and flaws to sneak into the target systems.