Best Practices for Data Integration Testing Series – Instituting Good Practices for Data Testing

August 3, 2012 by
Filed under: Data Governance, Data Profiling, Data Quality, Metadata 

I have been asked by folks at Informatica to share some thoughts about best practices for data integration, and this is the first of a series on data testing.

It is rare, if not impossible, to develop software that is completely free of errors and bugs. Early in my career as a software engineer, I spent a significant amount of time on “bug duty” – the task of looking at the list of reported product errors and evaluating them one-by-one to try to identify the cause of the bug and then come up with a plan for correcting the program so that the application error is eliminated. And the software development process is one that, over time, has been the subject of significant scrutiny in relation to product quality assurance.

In fact, the state of software quality and testing is quite mature. Well-defined processes have been accepted as general best practices, and there are organized methods for evaluating software quality methodology capabilities and maturity. Yet when all applications are a combination of programs applied to input data to generate output information, it is curious that the testing practices for data integration and sharing remain largely ungoverned manual procedures.

As we move into a more data-centric world, the fact that many organizations are still relying on manual data reviews or spot-checking is puzzling at best, and risky at worse. Manual data testing takes many different forms, such as eyeballing (scanning through record sets looking for anomalies), hand-coded SQL scripts to look for specific known issues, dumping data into desktop spreadsheets and doing file diffs, or even coding scripts and programs for each data set to do validation. These artifacts reflect an absence of process and methodology; they are not reusable, may not provide complete verification, are not auditable, and worse yet, may themselves be prone to errors.

As we hear more and more about newer technologies for big data, analytical appliances, data repurposing, or text analytics (to name a few), it is uncommon for enterprise to have an organized framework for assurance that the data assets modified or created by any process are consistent with either the end-user expectations, or even that the data was not inadvertently modified as a result of a process failure somewhere along the line. More to the point: we need to develop mature processes and tools for automated, repeatable and auditable data testing that can be applied in general practice.

To tell the truth, that is what motivated my transition from being a software engineer to being a data practitioner: the desire to formulate methods and tools for automatically testing data validity. The entire premise of my first book on data quality was predicated on the idea that one could define sets of business rules for data that could automatically be integrated into a rules engine for continuous data quality assurance.

Today, many tools provide some level of this capability, but it is valuable to consider the same ideas in a number of specific use cases for data reuse, sharing, exchange, or integration (which I will explore in upcoming notes over the next few months). In particular, consider the specific application of three key practices that can supplement and enhance your organization’s level of trust in shared and repurposed data while driving operational aspects of data stewardship and governance:

  • Alignment of enterprise data rules: Application developers only look at the specifics of their program’s transformations of input to output within the confines of the immediate application requirements, but don’t consider holistic expectations for downstream data use. Yet there must be some assurance to the consumers of the application’s output that the data has not been warped in any way shape or form. This is particularly true for extract/transform/load (ETL) processes, in which there must be some verification that the transformations were applied appropriately. To enable any kind of data testing, there is a need for defining and managing business data rules that characterize end-user expectations for integrated data.
  • Instituting data controls: One cannot verify that data sets remain consistent or verified unless you introduce data controls at various points along the data integration pathways. Data controls can examine consistency and validity at different levels of granularity. An example of a coarse-grained control compares the number of records in a data file before and after the data integration task, while more finely-grained controls will verify that the business data rules are applied correctly at the record or even the data value level.
  • Automated testing: This third practice is automating the data controls for validation. As the volumes of data absorbed, manipulated, integrated, and shared across the enterprise grow, eyeballing and manual reviews of (pseudo-)randomly selected data instances will prove to be non-scalable and unsatisfactory.

Employing our first practice of defined business data rules within our third practice using an automated data testing harness enables the broad deployment of consistent data controls for verification of expectations as well as identifying and reporting on any violations.

There are basically two abstract use cases that can benefit from automated data testing. The first is when input data sets are transformed into output data sets, and verification is required to ensure that the transformations were correctly applied, such as when ETL processes are updated or modified, or when new rules are added to a data integration process. The second is where the data sets need to remain identical, such as the data migrations necessary as legacy applications are retired and replaced with new systems.

A relatively good overview is provided in this recorded webinar from Informatica. In addition, as part of this series, I will be participating with Informatica in a number of short webinars over the next months to discuss some of the different use cases and environments in which automated data testing can provide benefits for data validation and verification, especially as heterogeneous data volumes increase in size and complexity.



2 Comments on Best Practices for Data Integration Testing Series – Instituting Good Practices for Data Testing

    […] my last post, we started to discuss the need for fundamental processes and tools for institutionalizing data […]

  1. John on Fri, 2nd Nov 2012 1:51 PM
  2. Thanks for sharing the thoughts..very useful..

    My experience as an ETL tester below are my thoughts…
    Complete validation of whole test data or even testing the production data could miss defects. The reason is that the ETL development is usually done for new requirement/changes. So there are chances like the data required for testing some new transformations will not present in the test environment.. So in that way we could not tell even the production cut data have “Integrity”. Integrity (w.r.t testing) can be achieved only when the data is sufficient to test all the Business rules. Some times we need to consider even 100 records, which can cover all the business rules, are more consistent that One Billion records junk records.

    For most of the times scrambled production cut data is put into the testing environment due to security reasons(financial data/healthcare/HR System data ). This scrambling of data is inevitable due to security reasons and due to privacy laws in different countries. Even if we are testing One billion records with automation tools or with Minus queries, this will not help. (For E.g. In my current project we have 50 Million of records, but when we checked we could find 300 eligible records after filtering in some levels. So if we are automating/running minus to test the whole data, our actual testing is with just 300 records..In those records also some columns have default/null values for all records. So in order to validate those columns we need to manipulate data at source.

    Hence my thought is Test Data Identification is very important. This can be achieved by creating Test Data Identification document while Requirement gathering stage itself and should be update till execution stage.
    This will make more accountability for testing and test data. This will also help in test/test data auditing, which is an important step in Quality.

    Developing of automation tool will help. But in my experience using in-built automation tools/ Automation tools used for other requirements/projects will not help as business rules/Complexity table joins will vary.
    This can be avoid by developing framework tools suitable for the project need. ETL automation tool should only be used for the need…(means building automation tools/ providing training to wrong automation tools which is not productive, at the cost of Quality will not be a good idea!!).

    The automation tools can be used for Test data set-up, Test documentation, test execution, generating SQL queries. I have used some automation tools of HP, Informatica , but it rarely suits for project needs, even though there are success stories..).

    In my view Manual verification is getting less in ETL, if we are writing Minus queries and handling tools..(For E.g. Manual verification in excel can be avoid in very simple way by inserting data into database and use minus queries. Now most of the latest DB Query tools have this option..)

Tell me what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!