Using Data Integration Testing for Reconciling Production Data Assets
Filed under: Business Intelligence, Data Analysis, Data Governance, Data Integration, Data Quality
In my last post, we started to discuss the need for fundamental processes and tools for institutionalizing data testing. While software development practice has embraced testing as a critical gating factor for the release of newly developed capabilities, that testing often centers on functionality, sometimes to the exclusion of a broad-based survey of the underlying data asset to ensure that values did not (or would not) incorrectly change as a result.
In fact, the need for testing existing production data assets goes beyond the scope of newly developed software. Modifications are constantly applied within an organization – acquired applications are upgraded, internal operating environments are enhanced and updated, additional functionality is turned on and deployed, hardware systems are swapped in and out, and internal processes change. Yet organizations are limited in their ability to verify that the interoperable components that create, touch, or modify data remain unaffected. The challenge of maintaining consistency across the application infrastructure can be daunting, let alone assuring consistency in the information results.
This presents an opportunity to consider the first of our use cases for data integration testing: assuring consistent performance with respect to existing production data assets.
The core theme is straightforward: defining a set of rules that specify consistency between two copies of what should be the same thing. As a simple example, your organization decides to transition from using one vendor’s database environment in favor of a different vendor’s database system. This requires a data migration, involving an extraction of the data from the current environment and loading that same data into the new environment. It is a no-brainer to expect that the data in the new target should be identical to the data in the soon-to-be-deprecated source.
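The core theme of source-to-target identity can be sketched in a few lines. The example below is a minimal, illustrative reconciliation: the table name, columns, and in-memory SQLite databases are all hypothetical stand-ins for the two vendors' database environments, and a real migration would compare far larger volumes through the rule-driven tooling discussed later.

```python
import sqlite3

# A minimal sketch of source-vs-target reconciliation after a migration.
# The "customer" table and its columns are hypothetical; a real migration
# would connect to the outgoing and incoming vendors' databases rather
# than to in-memory SQLite instances.

def load(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    return conn

def reconcile(source, target):
    """Compare every row in both copies, keyed by primary key.

    Returns a list of (key, source_row, target_row) discrepancies;
    an empty list means the two copies are identical.
    """
    src = {r[0]: r for r in source.execute("SELECT * FROM customer")}
    tgt = {r[0]: r for r in target.execute("SELECT * FROM customer")}
    return [
        (key, src.get(key), tgt.get(key))
        for key in src.keys() | tgt.keys()
        if src.get(key) != tgt.get(key)
    ]

rows = [(1, "Acme", 100.0), (2, "Globex", 250.5)]
source = load(rows)
target = load([(1, "Acme", 100.0), (2, "Globex", 999.9)])  # one corrupted row
print(reconcile(source, target))
```

Because every row participates in the comparison, the output is exactly the set of discrepancies that need analyst attention – nothing sampled, nothing assumed.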
But even in places that are aware of the need to do data testing, the practices in place are often uncoordinated and rely on manual efforts, data sampling, or customized SQL queries and extracts for comparison. The biggest risks of the uncoordinated approach fall into three categories:
- Resource-constrained – these customized methods are expensive to assemble, requiring dedicated staff to pull together the right queries and extracts. Yet the completeness of the tasks is limited when the data sizes are large (or increasingly, massive) and the resources used for testing (like desktop spreadsheets) have artificial limitations. The result is that the testing itself is likely to not be thorough, and may miss both general and edge-case expectations.
- Non-repeatable – When tests are specific to a single instance, there is no way to draw on accumulated corporate knowledge about subtle dependencies. Latent business rules for validity must be rediscovered each time, and there is little or no reuse of existing testing protocols.
- Ungoverned – The absence of a general, repeatable methodology not only leaves the door open for introducing errors during testing; the lack of a logging capacity to track which tests were performed, their results, and any associated remediation tasks exposes the migration to errors that often stay hidden until much later, when the impacts are far more insidious. With no audit trail, data practitioners confronting a future issue must essentially start from scratch to figure out where it was originally introduced.
The two biggest issues, in my opinion, are the need for scalability and coverage, and the need for oversight and governance. Both of these can be addressed through data testing automation.
The first step involves accepting the idea that all of the data needs to be reviewed for a production data migration or exchange. Sampling is just not enough, despite what the statisticians tell you. Defining a set of rules for consistency between the source and the target should therefore be a practical task, one that is simplified if the data models are also identical. However, most migrations also allow for some innovation in the underlying models to address the modernization or reengineering implied by the transition to a new environment. That means adding rules for the structural changes that still verify a consistent move. The increasing sizes of production data sets imply scalably comparing all corresponding data instances, and while the monotony of manual queries and review may lull your analysts into a mind-numbing stupor, rule-based automated testing will look at all the data and reduce the manual review to discrepancies that can be tied directly to defined rules.
The second step builds on the first to address oversight and governance, largely manifested in traceability of applied rules, logging of results, versioning of defined rules so that newly discovered consistency issues can be folded into future comparisons, and history (as well as trending across consecutive iterations of the data movement, if necessary). An audit trail that tracks consistency and validity by specific business rule establishes a record that can later be used to isolate latent data issues when they crop up.
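One way to picture that audit trail is as an append-only log, where each rule evaluation records a rule name, a rule version, the affected record, and a timestamp. The run identifier, rule name, and version string below are hypothetical, but the shape shows how a later investigation could trace a latent issue back to the test run and rule that first flagged it.

```python
import datetime
import json

# A sketch of an audit-trail entry for one rule evaluation. Versioning the
# rules means a future reader can tell which generation of a consistency
# check produced a given pass/fail record.

def log_result(log, run_id, rule_name, rule_version, record_key, passed):
    """Append one rule evaluation to the audit log."""
    log.append({
        "run": run_id,
        "rule": rule_name,
        "version": rule_version,
        "record": record_key,
        "passed": passed,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

audit_log = []
log_result(audit_log, "migration-run-01", "amount_rescaled", "v2", 7, False)
print(json.dumps(audit_log[0], indent=2))
```

Trending is then a query over this log: comparing pass rates for the same rule across consecutive runs shows whether remediation is converging.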
A database migration is just one use case; there are many other variations on the same theme, such as data movement from source systems into operational data stores, movements into alternate analytics platforms (such as Hadoop), or data exchanges. Defining the consistency rules should be straightforward (largely a question of verifying exact equality), but may include some levels of complexity when there are data format or type changes.
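The format-change complexity is worth a concrete illustration. Suppose, hypothetically, the source stored dates as `MM/DD/YYYY` strings while the target uses ISO `YYYY-MM-DD`; exact string equality would flag every row, so the consistency rule must normalize both sides before comparing.

```python
from datetime import datetime

# A hypothetical consistency rule for a type/format change: the source
# held 'MM/DD/YYYY' date strings, the target holds ISO 'YYYY-MM-DD'.
# The rule parses the source value and compares canonical forms instead
# of demanding exact string equality.

def dates_consistent(source_value, target_value):
    parsed = datetime.strptime(source_value, "%m/%d/%Y").date()
    return parsed.isoformat() == target_value

print(dates_consistent("03/09/2015", "2015-03-09"))  # → True
```

The same pattern applies to any representational change – character encodings, numeric precision, code-set translations – where the rule verifies equivalence of meaning rather than equality of bytes.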
Lastly, remember that this is production data – the data you run your business on. You don’t want to mess with that! Defined business rules and automated data validation can be used to identify unexpected production data results, and those same defined rules can help a data analyst find the root cause of the inconsistencies and remediate them.