If you have been following this series of articles about data validation and testing, you will (hopefully) have come to the conclusion that there are many scenarios in which large volumes of data are moved, using a variety of methods, and that in each of these scenarios the choices made in developing a data movement framework can introduce errors. One of our discussions (both in the article and in conversation with Informatica’s Ash Parikh) focused on data integration testing for production data sets, while another centered on verification of existing extraction/transformation/loading methods for data integration (you can listen to that conversation as well).
In practice, though, both of these cases are specific instances of the more general notion of migration. There are basically two kinds: data migrations and system migrations. A data migration moves data from one environment to another, similar environment, while a system migration transitions from one instance of an application to what is likely a completely different application.
Filed under: Data Governance, Data Integration, Data Quality
What is now generally referred to as “data integration” is a set of disciplines that evolved from the methods used to populate the data systems powering business intelligence: extracting data from one or more operational systems; transferring it to a staging area for cleansing, consolidation, transformation, and reorganization; and loading it into the target data warehouse. This process is usually referred to as ETL: extraction, transformation, and loading.
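To make the three stages concrete, here is a minimal sketch of the ETL pattern in Python. The record layout, field names, and cleansing rules are hypothetical illustrations for this post, not the API of any particular integration tool.

```python
def extract(source):
    """Extract: pull raw records from an operational system (here, a list)."""
    return list(source)

def transform(records):
    """Transform: cleanse and reorganize records in the staging area."""
    staged = []
    for rec in records:
        name = rec.get("customer_name", "").strip().title()
        if not name:  # illustrative cleansing rule: drop records missing a name
            continue
        staged.append({"customer_name": name,
                       "revenue": float(rec.get("revenue", 0))})
    return staged

def load(records, warehouse):
    """Load: append the consolidated records to the target warehouse table."""
    warehouse.extend(records)
    return warehouse

warehouse = []
raw = [{"customer_name": "  acme corp ", "revenue": "1200"},
       {"customer_name": "", "revenue": "99"}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # the blank-named record is dropped; the other is cleansed
```

Even in a toy pipeline like this, every stage embodies a choice (which records to drop, how to normalize names) that can silently introduce errors downstream, which is exactly why these pipelines need testing.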
In the early days of data warehousing, ETL scripts were, as one might politely say, “hand-crafted.” More colloquially, each script was custom-coded for its originating source, the transformation tasks to be applied, and the subsequent consolidation, integration, and loading. And despite the evolution of rule-driven and metadata-driven ETL tools that automate the development of ETL scripts, much time is still spent writing (and rewriting) data integration scripts to extract data from different sources, apply transformations, and then load the results into a target data warehouse or analytical appliance. Read more
As anticipated, as part of Apple’s recent release of iOS 6, the incumbent Google Maps application was replaced by Apple’s homegrown version. Excitement quickly degenerated into disappointment (at best) and anger (at worst) over the flaws in Apple’s version. As of this morning, a quick scan of Google News turned up almost 2,000 articles reflecting Apple’s mea culpa, culminating in a personal message from CEO Tim Cook stating:
“At Apple, we strive to make world-class products that deliver the best experience possible to our customers. With the launch of our new Maps last week, we fell short on this commitment. We are extremely sorry for the frustration this has caused our customers and we are doing everything we can to make Maps better.”
It looks like quite a firestorm over what is basically a data quality issue… Read more
Filed under: Business Intelligence, Data Analysis, Data Governance, Data Integration, Data Quality
In my last post, we started to discuss the need for fundamental processes and tools for institutionalizing data testing. While the software development practice has embraced testing as a critical gating factor for the release of newly developed capabilities, that testing often centers on functionality, sometimes to the exclusion of a broad-based survey of the underlying data asset to ensure that values did not (or would not) change incorrectly as a result.
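One way to make that broad-based survey repeatable is to fingerprint a data set before and after a change and assert that nothing drifted. The hashing scheme and record layout below are my own illustrative assumptions, not a prescribed method:

```python
import hashlib
import json

def fingerprint(records):
    """Compute an order-independent digest over a collection of records."""
    digests = sorted(
        hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()
        for rec in records
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

before = [{"id": 1, "balance": 100.0}, {"id": 2, "balance": 250.5}]
after  = [{"id": 2, "balance": 250.5}, {"id": 1, "balance": 100.0}]  # reordered only

# Same content in a different order yields the same fingerprint...
assert fingerprint(before) == fingerprint(after)

# ...while a silently changed value is caught immediately.
after[0]["balance"] = 250.75
assert fingerprint(before) != fingerprint(after)
```

A test like this says nothing about functionality; it only answers the question the functional suite skips: did the data itself change when it should not have?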
In fact, the need for testing existing production data assets goes beyond the scope of newly developed software. Modifications are constantly applied within an organization – acquired applications are upgraded, internal operating environments are enhanced and updated, additional functionality is turned on and deployed, hardware systems are swapped in and out, and internal processes may change. Yet organizations have only limited means of effectively verifying that the interoperable components that create, touch, or modify data are unaffected. Maintaining consistency across the application infrastructure is daunting enough, let alone assuring consistency in the information results. Read more
Filed under: Data Governance, Data Profiling, Data Quality, Metadata
I have been asked by folks at Informatica to share some thoughts about best practices for data integration; this is the first in a series on data testing.
It is rare, if not impossible, to develop software that is completely free of errors and bugs. Early in my career as a software engineer, I spent a significant amount of time on “bug duty” – working through the list of reported product errors one by one, trying to identify the cause of each bug and then devising a plan for correcting the program so that the application error was eliminated. Over time, the software development process has been the subject of significant scrutiny with respect to product quality assurance.
In fact, the state of software quality and testing is quite mature. Well-defined processes have been accepted as general best practices, and there are organized methods for evaluating the capabilities and maturity of a software quality methodology. Yet given that every application is a combination of programs applied to input data to generate output information, it is curious that testing practices for data integration and sharing remain largely ungoverned manual procedures. Read more
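What would a governed alternative to those manual procedures look like? As one sketch, an ungoverned check (someone eyeballing row counts after a load) can be replaced by an automated source-to-target reconciliation. The table contents and field names here are hypothetical examples, not any vendor’s interface:

```python
def reconcile(source_rows, target_rows, amount_field):
    """Compare record counts and totals of a numeric field between
    a source system and its integration target, returning a report."""
    report = {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "source_total": sum(r[amount_field] for r in source_rows),
        "target_total": sum(r[amount_field] for r in target_rows),
    }
    report["counts_match"] = report["source_count"] == report["target_count"]
    report["totals_match"] = report["source_total"] == report["target_total"]
    return report

source = [{"amount": 10.0}, {"amount": 20.0}]
target = [{"amount": 10.0}, {"amount": 20.0}]
report = reconcile(source, target, "amount")
assert report["counts_match"] and report["totals_match"]
```

Run on every load rather than on demand, a reconciliation like this turns an ad hoc manual procedure into exactly the kind of gating factor that software testing already takes for granted.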