The Emerging Criticality of Automated Test Data Generation
A few years ago I was working on configuring a test for comparing data transformation and loading into a variety of target platforms. Essentially I was hoping to assess the comparative performance of different data management schemes (open source relational databases, enterprise versions of relational databases, columnar data stores, and other NoSQL-style schemes). But to do this, I had two constraints that I needed to overcome. The first was the need for a data set that was massive enough to really push the envelope when it came to evaluating different aspects of performance. The second was a little subtler: I needed the data set to exhibit certain data error and inconsistency characteristics that simulated a real-life scenario.
When I mentioned this to a colleague, he told me that he had programmed a small utility to generate millions of transaction records. And of course, I have enough programming experience to conjure up a similar engine that could spit out lots of randomized (and imaginary) transactions. While this addressed my first constraint, it did not touch upon the second one. The randomness is good for generating a lot of data, but becomes a drawback when you are looking for latent issues and dependencies that you’d like to test. The upshot is that simple approaches can help in generating a data set, but may not help in generating the data set that you really want to test.
This gap becomes much more serious in enterprise development projects in which production data is not available for either development or testing. Even where production data is made available, it may not contain anything that exercises a new product feature under test. In other words, there is a need to go beyond a simplistic approach and use a more sophisticated approach for generating test data.
So what would you look for in an automated test data generation tool? If I sat down and thought about it for long enough, I could probably come up with a really long laundry list. However, even a few minutes of noodling yields what I might call a hefty list of demands, namely that an automated test data generation tool must:
- Generate massive volumes of transaction data.
- Use and/or generate reference data used within transactions.
- Generate specific records that exhibit criteria to be tested within the application, such as compliance or non-compliance with value ranges, patterns, and other defined business rules.
- Generate data with randomly-inserted errors (such as incomplete records, misspelled values, non-compliant values).
- Generally exhibit the same value distributions as provided samples.
- Generate table keys (both primary keys and foreign keys) that can be tested for uniqueness and referential integrity.
- Protect potentially sensitive data values from accidental exposure across the dev/test/prod ecosystem.
- Include aggregation functionality, where total records summarize other generated records.
- Follow business logic to ensure that application functionality remains intact (e.g., a sales order must have at least one item).
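To make a few of these demands concrete, here is a minimal Python sketch of a generator that produces keyed order and item records, enforces the "at least one item" rule, derives order totals from the items, and optionally injects errors. It is an illustration under stated assumptions, not a description of any particular product: the `PRODUCTS` reference table, the field names, and the functions `generate_orders`, `corrupt`, and `is_valid` are all hypothetical.

```python
import random

# Hypothetical reference data: product id -> unit price in cents.
PRODUCTS = {"P-100": 999, "P-200": 2450, "P-300": 325}


def generate_orders(n_orders, error_rate=0.0, seed=0):
    """Return (orders, items) with unique keys and valid foreign keys."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    orders, items = [], []
    for order_id in range(1, n_orders + 1):
        total = 0
        # Business rule: every sales order has at least one item.
        for _ in range(rng.randint(1, 4)):
            product_id = rng.choice(sorted(PRODUCTS))
            quantity = rng.randint(1, 5)
            items.append({
                "item_id": len(items) + 1,  # primary key
                "order_id": order_id,       # foreign key to orders
                "product_id": product_id,   # foreign key to reference data
                "quantity": quantity,
            })
            total += quantity * PRODUCTS[product_id]
        # Aggregation: the order total summarizes its item records.
        orders.append({"order_id": order_id, "total_cents": total})
    # Corrupt a fraction of the item records so that the application's
    # validation logic has something to catch.
    for item in items:
        if rng.random() < error_rate:
            corrupt(item, rng)
    return orders, items


def corrupt(item, rng):
    """Inject one of three error types into an item record, in place."""
    kind = rng.choice(["missing", "misspelled", "out_of_range"])
    if kind == "missing":
        item["quantity"] = None                          # incomplete record
    elif kind == "misspelled":
        item["product_id"] = item["product_id"].lower()  # unknown code
    else:
        item["quantity"] = -item["quantity"]             # non-compliant value


def is_valid(item):
    """Check an item against the reference data and the quantity range."""
    quantity = item["quantity"]
    return (item["product_id"] in PRODUCTS
            and quantity is not None
            and quantity > 0)
```

Calling `generate_orders(1_000_000)` yields a clean data set for volume testing, while a call such as `generate_orders(1000, error_rate=0.05)` leaves a small share of item records broken for validation testing. Note that in this sketch the order totals are computed before corruption, so a corrupted item also disagrees with its order total, which is itself a condition worth testing.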
Fortunately, I had the opportunity to be briefed by Informatica about test data management in their upcoming 9.6 release. Apparently, they have been thinking the same thoughts about test data generation, since this release blends the types of capabilities I wished for in the above list with aspects of data protection employing encryption, masking, and data scrambling. In addition, I was told that the test data generator links with the PowerCenter metadata repository as well as the data profiling capabilities of their Data Quality tool. This means that the profiler can be used to accumulate knowledge about metadata within a data set to be modeled, as well as statistical information about data value distributions to guide test data generation. Lastly, data quality business rules can be used to guide the generation of specific instances that are to be subjected to testing and validation.
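The general idea of letting a profile guide generation can be sketched in a few lines of Python. This is not Informatica's implementation, about which I have no details beyond the briefing; it is a generic illustration in which the function names `profile` and `generate_like` and the sample values are my own assumptions.

```python
import random
from collections import Counter


def profile(values):
    """Return the relative frequency of each distinct value in a sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def generate_like(sample, n, seed=0):
    """Draw n synthetic values whose frequencies approximate the sample's."""
    freq = profile(sample)
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    return rng.choices(list(freq), weights=list(freq.values()), k=n)


# Hypothetical sample column: payment types observed in a provided extract.
sample = ["visa"] * 70 + ["amex"] * 20 + ["cash"] * 10
synthetic = generate_like(sample, 10_000)
```

Because this sketch only re-emits values that appear in the sample, it suits low-sensitivity categorical columns; a column holding sensitive values would need masking or substitution before anything like this is applied.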
My perception is that with a tool like Informatica’s Test Data Management, it should be possible for enterprises either to augment their existing test data with data that follows their business rules, including its ‘quirkiness’ and associated error conditions, or to generate test data from the ground up that simulates their production data.