Best Practices for Data Integration Testing Series – Instituting Good Practices for Data Testing
I have been asked by folks at Informatica to share some thoughts about best practices for data integration, and this is the first of a series on data testing.
It is rare, if not impossible, to develop software that is completely free of errors and bugs. Early in my career as a software engineer, I spent a significant amount of time on “bug duty” – working through the list of reported product errors one by one, identifying the cause of each bug, and then devising a plan for correcting the program so that the application error was eliminated. And the software development process is one that, over time, has been the subject of significant scrutiny in relation to product quality assurance.
In fact, the state of software quality and testing is quite mature. Well-defined processes have been accepted as general best practices, and there are organized methods for evaluating the capabilities and maturity of a software quality methodology. Yet given that every application is essentially a combination of programs applied to input data to generate output information, it is curious that the testing practices for data integration and sharing remain largely ungoverned manual procedures.
As we move into a more data-centric world, the fact that many organizations are still relying on manual data reviews or spot-checking is puzzling at best, and risky at worst. Manual data testing takes many different forms, such as eyeballing (scanning through record sets looking for anomalies), hand-coding SQL scripts to look for specific known issues, dumping data into desktop spreadsheets and doing file diffs, or even coding scripts and programs for each data set to do validation. These artifacts reflect an absence of process and methodology; they are not reusable, may not provide complete verification, are not auditable, and worse yet, may themselves be prone to errors.
As we hear more and more about newer technologies for big data, analytical appliances, data repurposing, or text analytics (to name a few), it is uncommon for enterprises to have an organized framework for assuring that the data assets modified or created by any process are consistent with end-user expectations, or even that the data was not inadvertently altered by a process failure somewhere along the line. More to the point: we need to develop mature processes and tools for automated, repeatable, and auditable data testing that can be applied in general practice.
To tell the truth, that is what motivated my transition from being a software engineer to being a data practitioner: the desire to formulate methods and tools for automatically testing data validity. The entire premise of my first book on data quality was predicated on the idea that one could define sets of business rules for data that could automatically be integrated into a rules engine for continuous data quality assurance.
Today, many tools provide some level of this capability, but it is valuable to consider the same ideas in a number of specific use cases for data reuse, sharing, exchange, or integration (which I will explore in upcoming notes over the next few months). In particular, consider the specific application of three key practices that can supplement and enhance your organization’s level of trust in shared and repurposed data while driving operational aspects of data stewardship and governance:
- Alignment of enterprise data rules: Application developers look only at the specifics of their program’s transformations of input to output within the confines of the immediate application requirements, and don’t consider holistic expectations for downstream data use. Yet there must be some assurance to the consumers of the application’s output that the data has not been warped in any way, shape, or form. This is particularly true for extract/transform/load (ETL) processes, in which there must be some verification that the transformations were applied appropriately. To enable any kind of data testing, there is a need for defining and managing business data rules that characterize end-user expectations for integrated data.
- Instituting data controls: One cannot verify that data sets remain consistent and valid unless data controls are introduced at various points along the data integration pathways. Data controls can examine consistency and validity at different levels of granularity. An example of a coarse-grained control compares the number of records in a data file before and after the data integration task, while more finely-grained controls verify that the business data rules are applied correctly at the record or even the data value level.
- Automated testing: This third practice is automating the data controls for validation. As the volumes of data absorbed, manipulated, integrated, and shared across the enterprise grow, eyeballing and manual reviews of (pseudo-)randomly selected data instances will prove to be non-scalable and unsatisfactory.
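To make the first two practices concrete, here is a minimal sketch in Python of business data rules expressed as named predicates, with a coarse-grained control (record counts) and a fine-grained control (rule checks at the value level). The rule definitions and field names are hypothetical illustrations, not drawn from any particular tool:

```python
# Business data rules: named predicates that encode end-user expectations.
# The specific rules and field names here are hypothetical examples.
rules = {
    "customer_id is present": lambda rec: rec.get("customer_id") not in (None, ""),
    "age is a plausible value": lambda rec: isinstance(rec.get("age"), int)
                                            and 0 <= rec["age"] <= 130,
}

def coarse_control(source_records, target_records):
    """Coarse-grained control: record counts must match across the step."""
    return len(source_records) == len(target_records)

def fine_controls(records):
    """Fine-grained control: apply every business rule to every record,
    returning (record index, rule name) pairs for each violation."""
    violations = []
    for i, rec in enumerate(records):
        for name, rule in rules.items():
            if not rule(rec):
                violations.append((i, name))
    return violations

source = [{"customer_id": "C1", "age": 34}, {"customer_id": "C2", "age": 41}]
target = [{"customer_id": "C1", "age": 34}, {"customer_id": "", "age": 200}]

print(coarse_control(source, target))  # counts match, so True
print(fine_controls(target))           # value-level violations in record 1
```

The point of naming each rule is auditability: when a control fires, the report identifies which expectation was violated and where, rather than leaving a reviewer to rediscover the problem by eyeballing.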
Combining the first practice (defined business data rules) with the third (an automated data testing harness) enables the broad deployment of consistent data controls, both to verify expectations and to identify and report any violations.
There are basically two abstract use cases that can benefit from automated data testing. The first is when input data sets are transformed into output data sets, and verification is required to ensure that the transformations were correctly applied, such as when ETL processes are updated or modified, or when new rules are added to a data integration process. The second is where the data sets need to remain identical, such as the data migrations necessary as legacy applications are retired and replaced with new systems.
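For the second use case, a simple identity check can be sketched as follows: fingerprint each record in a canonical form and compare the resulting multisets, so that any record inadvertently modified during migration is flagged even if record order changed. This is an illustrative sketch with made-up field names, not a description of any specific migration tool:

```python
import hashlib
import json
from collections import Counter

def record_fingerprint(rec):
    """Canonical fingerprint of a record: serialize with sorted keys,
    then hash, so field order within a record does not matter."""
    canonical = json.dumps(rec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def migration_identical(legacy_records, migrated_records):
    """Verify the migrated data set contains exactly the same records
    as the legacy one, regardless of record order."""
    return (Counter(record_fingerprint(r) for r in legacy_records)
            == Counter(record_fingerprint(r) for r in migrated_records))

legacy = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
migrated = [{"id": 2, "name": "Grace"}, {"id": 1, "name": "Ada"}]
print(migration_identical(legacy, migrated))  # True: same records, reordered
```

For the first use case (transformations that are supposed to change the data), the same harness applies, but the comparison runs the business data rules against the output rather than demanding byte-for-byte identity.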
A relatively good overview is provided in this recorded webinar from Informatica. In addition, as part of this series, I will be participating with Informatica in a number of short webinars over the next months to discuss some of the different use cases and environments in which automated data testing can provide benefits for data validation and verification, especially as heterogeneous data volumes increase in size and complexity.